It is best practice to set up a new directory each time we start a new project in R. To do so, complete the following steps:
library(tidyverse)
listings <- read_csv("data/Boston_listings.csv")
Most tidyverse
functions, including
read_csv
, output tibble objects. A dataset stored as a
tibble will return tbl
and tbl_df
in its
class
.
class(listings)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
A tbl_df
prints different details than a traditional
data.frame
.
head(listings)
## # A tibble: 6 × 36
## id listing_url neighbourhood neighbourhood_c… city state zipcode latitude
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 3781 https://www… East Boston East Boston Bost… MA 2128 42.4
## 2 5506 https://www… Roxbury Roxbury Bost… MA 2119 42.3
## 3 6695 https://www… Roxbury Roxbury Bost… MA 2119 42.3
## 4 6976 https://www… Roslindale Roslindale Bost… MA 2131 42.3
## 5 8789 https://www… Beacon Hill Downtown Bost… MA 2108 42.4
## 6 8792 https://www… Beacon Hill Downtown Bost… MA 2108 42.4
## # … with 28 more variables: longitude <dbl>, property_type <chr>,
## # room_type <chr>, accommodates <dbl>, bathrooms <dbl>, bedrooms <dbl>,
## # beds <dbl>, bed_type <chr>, price <dbl>, security_deposit <dbl>,
## # cleaning_fee <dbl>, guests_included <dbl>, extra_people <chr>,
## # minimum_nights <dbl>, maximum_nights <dbl>, number_of_reviews <dbl>,
## # first_review <chr>, last_review <chr>, review_scores_rating <dbl>,
## # review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>, …
In some cases, when using packages outside the
tidyverse
, we can avoid or fix problems by converting a
tbl_df
into a data.frame
with
as.data.frame
. Note that head
prints
all columns of the dataset and does not specify the
datatype of each column.
listings.df <- as.data.frame(listings)
class(listings.df)
## [1] "data.frame"
head(listings.df)
## id listing_url neighbourhood neighbourhood_cleansed
## 1 3781 https://www.airbnb.com/rooms/3781 East Boston East Boston
## 2 5506 https://www.airbnb.com/rooms/5506 Roxbury Roxbury
## 3 6695 https://www.airbnb.com/rooms/6695 Roxbury Roxbury
## 4 6976 https://www.airbnb.com/rooms/6976 Roslindale Roslindale
## 5 8789 https://www.airbnb.com/rooms/8789 Beacon Hill Downtown
## 6 8792 https://www.airbnb.com/rooms/8792 Beacon Hill Downtown
## city state zipcode latitude longitude property_type room_type
## 1 Boston MA 2128 42.36524 -71.02936 Apartment Entire home/apt
## 2 Boston MA 2119 42.32981 -71.09559 Guest suite Entire home/apt
## 3 Boston MA 2119 42.32994 -71.09351 Condominium Entire home/apt
## 4 Boston MA 2131 42.29244 -71.13577 Apartment Private room
## 5 Boston MA 2108 42.35919 -71.06265 Apartment Entire home/apt
## 6 Boston MA 2108 42.35850 -71.06201 Apartment Entire home/apt
## accommodates bathrooms bedrooms beds bed_type price security_deposit
## 1 2 1 1 1 Real Bed 125 0
## 2 2 1 1 1 Real Bed 145 0
## 3 4 1 1 2 Real Bed 169 NA
## 4 2 1 1 1 Real Bed 65 0
## 5 2 1 1 1 Real Bed 99 1000
## 6 2 1 1 1 Real Bed 154 1000
## cleaning_fee guests_included extra_people minimum_nights maximum_nights
## 1 75 2 $0.00 30 1125
## 2 40 1 $0.00 3 730
## 3 70 2 $8.00 3 730
## 4 0 1 $20.00 3 60
## 5 150 1 $0.00 14 365
## 6 250 1 $0.00 20 365
## number_of_reviews first_review last_review review_scores_rating
## 1 14 7/10/2015 7/2/2018 99
## 2 71 3/21/2009 9/3/2018 95
## 3 75 8/6/2009 8/26/2018 96
## 4 68 7/19/2009 7/31/2018 98
## 5 18 8/12/2014 8/27/2018 91
## 6 24 10/11/2009 8/15/2018 93
## review_scores_accuracy review_scores_cleanliness review_scores_checkin
## 1 10 10 10
## 2 10 10 10
## 3 10 10 10
## 4 10 9 10
## 5 9 9 10
## 6 10 9 10
## review_scores_communication review_scores_location review_scores_value
## 1 10 10 10
## 2 10 9 10
## 3 10 9 9
## 4 10 9 10
## 5 10 10 9
## 6 10 10 9
## cancellation_policy calculated_host_listings_count reviews_per_month
## 1 super_strict_30 2 0.36
## 2 strict_14_with_grace_period 6 0.61
## 3 strict_14_with_grace_period 6 0.68
## 4 moderate 1 0.61
## 5 strict_14_with_grace_period 11 0.36
## 6 strict_14_with_grace_period 11 0.22
Exploratory Data Analysis (EDA) is the process of exploring your data, focused on discovering patterns rather than statistical testing. R for Data Science breaks this down into three steps:
- Generate questions about your data.
- Search for answers by visualizing, transforming and modelling your data.
- Use what your learn to refine your questions and/or generate new questions.
This process can be repeated many times.
dim
, str
, and
summary
Before we start asking questions of our dataset, we need to determine the basic contents of the dataset.
dim(listings)
## [1] 5957 36
We have 36 variables (columns) across 5957 observations (rows).
In R, we can use square brackets, []
, to quickly select
a subset of rows and/or columns by position using numeric indices.
Unlike some programming languages, R starts counting positions at 1
instead of zero.
For example, if we want to look at row 5, column 3 of our listings data frame, we can do it like so:
listings[5,3]
## # A tibble: 1 × 1
## neighbourhood
## <chr>
## 1 Beacon Hill
There are a number of different ways we can use indexing to subset our data:
listings[,1] # column 1
listings[,7:10] # columns 7 through 10
listings[,c(1,8,11)] # columns 1, 8 and 11
listings[2,] # row 2
listings[3:9,] # rows 3 through 9
listings[c(7,10),] # rows 7 and 10
For a vector, we only need one number to describe position:
myvec <- c("a","b","c","d","e","f","g")
myvec
## [1] "a" "b" "c" "d" "e" "f" "g"
myvec[1]
## [1] "a"
myvec[4:7]
## [1] "d" "e" "f" "g"
myvec[c(1,3,7)]
## [1] "a" "c" "g"
str
and summary
(with Indexing)We can use str
and summary
to learn more
about each of these 36 variables. We’ll use indexing to get results for
only the first four columns with str
str(listings[,1:4])
## tibble [5,957 × 4] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:5957] 3781 5506 6695 6976 8789 ...
## $ listing_url : chr [1:5957] "https://www.airbnb.com/rooms/3781" "https://www.airbnb.com/rooms/5506" "https://www.airbnb.com/rooms/6695" "https://www.airbnb.com/rooms/6976" ...
## $ neighbourhood : chr [1:5957] "East Boston" "Roxbury" "Roxbury" "Roslindale" ...
## $ neighbourhood_cleansed: chr [1:5957] "East Boston" "Roxbury" "Roxbury" "Roslindale" ...
The equivalent way to do this using dplyr and the %>%
we learned last week would be:
listings %>%
select(id, listing_url, neighbourhood, neighbourhood_cleansed) %>%
str()
## tibble [5,957 × 4] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:5957] 3781 5506 6695 6976 8789 ...
## $ listing_url : chr [1:5957] "https://www.airbnb.com/rooms/3781" "https://www.airbnb.com/rooms/5506" "https://www.airbnb.com/rooms/6695" "https://www.airbnb.com/rooms/6976" ...
## $ neighbourhood : chr [1:5957] "East Boston" "Roxbury" "Roxbury" "Roslindale" ...
## $ neighbourhood_cleansed: chr [1:5957] "East Boston" "Roxbury" "Roxbury" "Roslindale" ...
We’ll use summary
with the 5th, 6th, and 12th
columns.
summary(listings[,c(5,6,12)])
## city state accommodates
## Length:5957 Length:5957 Min. : 1.00
## Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Median : 3.00
## Mean : 3.38
## 3rd Qu.: 4.00
## Max. :16.00
To get the same results using dplyr and the %>%, we could use:
listings %>%
select(city, state, accommodates) %>%
summary()
## city state accommodates
## Length:5957 Length:5957 Min. : 1.00
## Class :character Class :character 1st Qu.: 2.00
## Mode :character Mode :character Median : 3.00
## Mean : 3.38
## 3rd Qu.: 4.00
## Max. :16.00
A natural starting place for EDA is learning more about individual variables. We’re often interested in the relative frequency of different values of a given variable.
What types of beds are most popular in Boston AirBnBs?
table
table
provides a simple frequency table for a
categorical (usually factor or character) variable.
table(listings$bed_type)
##
## Airbed Couch Futon Pull-out Sofa Real Bed
## 31 5 32 19 5870
Clearly “Real Bed” is the most popular bed type. Interestingly, it appears that Airbeds are more popular than Couches and Pull-out Sofas combined!
geom_bar
We can also represent categorical variables with a bar chart, using
ggplot
and geom_bar
. By default,
geom_bar
uses the x
aesthetic and provides a
count of the frequency of each category across the dataset.
ggplot(data=listings,aes(x=bed_type))+
geom_bar(width=0.5)+
theme_bw()
Note: Calling fct_infreq
will quickly re-arrange our
categories to be plotted in frequency order. I’ve also used the
width
parameter for geom_bar
to make the bars
half as wide.
ggplot(data=listings,aes(x=fct_infreq(bed_type)))+
geom_bar(width=0.5)+
theme_bw()
What is the distribution of prices for AirBnB listings in Boston?
Unfortunately, with continuous variables, we usually have too many distinct values to make a table or bar chart useful.
ggplot(data=listings,aes(x=price))+
geom_bar()+
theme_bw()
geom_hist
A common way to overcome this problem is to bin values into ranges with a histogram.
ggplot(data=listings,aes(x=price))+
geom_histogram(bins=100)+
theme_bw()
It’s important to note that a histogram can look very different depending on how many ranges or “bins” we use.
## Warning: package 'gridExtra' was built under R version 4.2.2
We can plot transformations of the data directly.
For example, it’s common to use a logarithm tranformation with skewed
positive data. Let’s look at \(log_{10}(price)\) with
log10
.
ggplot(data=listings,aes(x=log10(price)))+
geom_histogram(bins=30)+
theme_bw()
geom_boxplot
The boxplot provides another convenient summary of a continuous
variable. geom_boxplot
requires a y
aesthetic
instead of an x
as in boxplot. Here I’ve used the
coord_flip
function at the end of my code to flip the X and
Y axes to make the output more comprable to our histograms above.
ggplot(data=listings,aes(y=price))+
geom_boxplot()+
theme_bw()+
coord_flip()
The box of the boxplot ranges from the 25th to the 75th percentile, and therefore contains half of the observed data. The line running down the middle of the box is the median.
The difference between the 25th and 75th percentiles, the width of the box, is known as the inter-quartile range (IQR).
The lines extending out to the left and right of the box represent the furthest extent of datapoints that are within 1.5 IQR of the box. Our lower, left-hand, whisker is much shorter becuase AirBnB prices cannot be negative.
Finally, the points, representing prices from roughly $500 up to roughly $4000 represent outliers.
Outliers are data that are “far” away from your other observations in terms of one or more variables. There are a number of frequently used cutoffs to help identify outliers in continuous variables.
However, there is no fixed rule as to when you should remove an observation you have identified as an outlier. In general, you want to remove observations that aren’t really in the population you’re studying. This includes values that may have been entered in error or produced by malfunctioning sensors or equipment. In other cases, a suprirsing, but accurate, outlying value, can be the most important result. Observations shouldn’t be removed just because they’re outliers.
Once we have a handle on single variables in our dataset, we can start hypothesizing about relationships between variables.
Do more expensive properties get more reviews?
The most common way to explore the interaction between two continuous
variables is with a scatterplot, using geom_point
.
listings %>%
filter(price<500) %>%
ggplot(aes(x=price,y=number_of_reviews))+
geom_point()+
theme_bw()
As in many large datasets, we have so many points in certain areas
that it can be hard to tell how many observations are actually
represented. We can get a better sense of how many observations are in a
given area by adding some transparency, using the alpha
aesthetic of geom_point
.
listings %>%
filter(price<500) %>%
ggplot(aes(x=price,y=number_of_reviews))+
geom_point(alpha=0.05)+
theme_bw()
For comparison, let’s zoom in on the grid where both price and number of reviews are less than 100:
It’s clear from our alpha
plot that most of the AirBnBs
with price under 100 also have less than 10 reviews.
Which room types can accommodate the most people?
We can use table
with two variables to get a quick
summary.
table(listings$room_type,listings$accommodates)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## Entire home/apt 37 994 591 921 506 378 95 157 26 39 8 26
## Private room 552 1194 125 137 24 10 0 2 1 0 0 0
## Shared room 47 28 4 9 0 0 0 0 0 0 0 0
##
## 13 14 15 16
## Entire home/apt 4 15 4 22
## Private room 0 0 0 0
## Shared room 0 0 0 1
We can represent this graphically in a number of ways. For example,
we can use geom_count
to plot points sized by frequency.
This has the benefit of avoiding overlaping points.
ggplot(data=listings, aes(x=accommodates,y=room_type))+
geom_count()+
theme_bw()
What happens if we use geom_point
instead?
Which room types are the most expensive?
Let’s continue looking at room types, but lets compare a continous
variable, prices, across the categories. Recall that we can use
group_by
and summarize
from last week to get a
summary table:
listings %>%
group_by(room_type) %>%
summarize(mean_price=mean(price),median_price=median(price))
## # A tibble: 3 × 3
## room_type mean_price median_price
## <chr> <dbl> <dbl>
## 1 Entire home/apt 237. 199
## 2 Private room 84.0 70
## 3 Shared room 80.2 62
However, we’ll usually get a better sense of the data with a visualization. When we have a reasonable number of categories, we can spread boxplots across categories by adding an aesthetic for the category.
ggplot(data=listings,aes(y=price,x=room_type))+
geom_boxplot()+
theme_bw()+
coord_flip()
We can control the order of boxplots with reorder
.
reorder(room_type,price,mean)
will reorder the categories
of room_type based on mean price. Check the help with
?reorder
for more information.
ggplot(data=listings,aes(y=price,x=reorder(room_type,price,mean)))+
geom_boxplot()+
theme_bw()+
coord_flip()
Unfortunately, for histograms, we already have counts on the y axis by default, so we’ll use facets instead. Facets allow you to create multiple plots - one for each level of a categorical variable.
ggplot(data=listings,aes(x=price))+
geom_histogram(bins=50)+
facet_grid(rows=vars(room_type))+
theme_bw()
To place several histograms in one overlapping plot, we should to
convert the bar representation into lines, to make overlaps easier to
see and understand. We’ll use the color
aesthetic to
separate our categories.
*geom_freqpoly
represents each histogram as a line plot
through the top of each bar. We’ll use the size
parameter
for geom_freqpoly
to make the lines a little thicker than
the default.
ggplot(data=listings,aes(x=price,color=room_type))+
geom_freqpoly(size=1)+
theme_bw()
geom_density
provides a “smoothed” and rescaled version
of each histogram. The area under each line is the same, so we lose
relative frequency between categories in favor of a better view of
relative frequency within each category.
bw
parameter controls the extent of smoothing,
similar to bins
for histograms. Try some other values to
see how the plot changes!ggplot(data=listings,aes(x=price,color=room_type))+
geom_density(bw=10,size=1)+
theme_bw()
table
and geom_bar
with
x
aestheticgeom_hist
with x
aestheticgeom_boxplot
with y
aestheticgeom_point
with
x
and y
aestheticstable
or
geom_count
with x
and y
aestheticsgeom_boxplot
with
y
(continuous) and x
(categorical) aesthetics
+ geom_histogram
with x
(continuous) aesthetic
and facet_grid
using the categorical variable +
geom_freqpoly
with x
(continuous) and
color
(categorical) aesthetic + geom_density
x
(continuous) and color
(categorical)
aestheticMake a scatterplot of review_scores_rating vs reviews_per_month.
Add a best fit linear regression line with geom_smooth
(check the help with ?geom_smooth
, or refer back to week
one)
Which neighborhoods have the highest median price? Which
neighborhoods have the highest variance or largest IQR? Hint: Use
neighbourhood_cleansed
for neighborhoods.
Load the built-in dataset mpg
with
data(mpg)
. Use ?mpg
to learn more about the
variables included. Come up with a hypothesis or expectation you have
about the data and check it with one of the methods used above.
Load the built-in dataset msleep
with
data(msleep)
. Use ?msleep
to learn more about
the variables included. Come up with a hypothesis or expectation you
have about the data and check it with one of the methods used
above.