This introductory workshop aims to cover:
Scientific reproducibility - if you repeat my experiment, you will get the same results as me. This is beyond the scope of this workshop (see here for an overview)
Computational reproducibility - if I give you my data and code, you will get the same results as me. We’ll focus on the basics of computational reproducibility today.
For your results to be reproducible, every step you take to manipulate the data should be explicitly documented in an R script.
This also means that you want to make sure that your results don’t have hidden/implicit dependencies in your R environment. There are two steps you can to take to prevent this.
Reconfigure RStudio not to save your workspace between sessions. (Tools > Global Options > General)
Restart RStudio from scratch and rerun the current script often.
Note: If you have learned R in the past, you may have been taught to use the expression rm(list = ls())
to clear your R environment. This will not actually return R to a blank slate! Although this code will delete objects in your R environment, it won’t reset global settings that have been changed. It is always safer and more effective to restart RStudio.
You can use the R function setwd()
to point to locations of data files, but this isn’t ideal for reproducible research. When you switch between computers, it is likely that the absolute directory path will change, so you will have to modify the path for each computer. Users of your code will not have the same absolute directory path that you have, so they will have to modify the path in the setwd()
function before they can use your code.
RStudio Projects provide a way to bundle together data, code, and other files together so that you can refer to files with relative directory paths, which creates a much more portable structure that won’t need to be modified between computers or users.
We’ll begin by loading the tidyverse library. This library includes readr, which we will use to read in our data file (for more details on readr, see Chapter 11 of the text).
RStudio has a useful keyboard shortcut for <- (the assignment operator): Alt + -, i.e. press the Alt key and the minus/hyphen key at the same time.
# Purpose: Exploratory data analysis of airbnb data
# Author: Nuvan Rathnayaka
# Date: September 2018
# Setup -----------------------------------------------------------
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Data Import ------------------------------------------------------
airbnb_dc <- read_csv("data/listings.csv") #Since I'm using an RStudio project, I only have to specify the path relative to the top directory
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## property_type = col_character(),
## room_type = col_character(),
## accommodates = col_double(),
## bathrooms = col_double(),
## bedrooms = col_double(),
## price = col_double(),
## extra_people = col_double(),
## minimum_nights = col_double(),
## number_of_reviews = col_double(),
## review_scores_rating = col_double(),
## cancellation_policy = col_character(),
## reviews_per_month = col_double()
## )
#If you have problems running the read_csv above, the following code should also work
#(Download data directly from the internet)
#airbnb_dc <- read_csv(url("https://unc-libraries-data.github.io/R-Open-Labs/week2_Workflow/data/listings.csv"))
In R, the # symbol indicates the beginning of a comment. Everything to the right of that is ignored by R, so you can write plain text to document what you’re doing, organize your code, and explain any tricky or unusual steps.
Something that I’ve glossed over is that you have options for how to name objects. For example, we called our imported data airbnb_dc, but we could have called it airbnbListings, or airbnb.listings
Having a consistent way you name objects and functions helps you remember what things are called. It’s especially useful when you’re working with others to agree upon a coding convention. It doesn’t particularly matter which one you choose, as long as you choose a convention and stick to it.
Example style guide: http://style.tidyverse.org/index.html
R is an object-oriented programming language. One way to understand how objects work in programming is to look at how objects work in real life.
We tend to group objects by their structure. For example, there are many objects we call cars, and they all tend to share similarities in structure - they have wheels, windows, doors, etc. An individual car such as my car or your car is an instance of this structure. Cars also have a set of properties such as make, model, year and color. Each individual car has some variation of these properties.
Objects in R work in a similar way.
In the diagram above, we are looking at three common data structures often used in R: vectors, matrices and data frames. We can create individual objects from these data structures, and those objects have their own properties. The way R defines what properties a given object has is through it’s class. You can think of a class as a contract between R and its users, defining the structure and properties of an object. People who develop R libraries take advantage of this system to reliably customize the behavior of functions.
Vectors are the most basic collection of items, often created with c()
. Depending on what you put into a vector, it also automatically decides a type. The most common types are: numeric (integer or double), character, and logical.
c(1,2,3,4) #Numeric (integer)
## [1] 1 2 3 4
c(1.4,2.7,3.2) #Numeric (double)
## [1] 1.4 2.7 3.2
c("cat","dog","fish") #Character
## [1] "cat" "dog" "fish"
c(TRUE,FALSE,TRUE) #Logical
## [1] TRUE FALSE TRUE
c("cat",1,FALSE) #Character - check other combinations yourself!
## [1] "cat" "1" "FALSE"
Vectors only have one dimension: length (accessible with the length
function.)
We’ll learn more about another special case of vectors, “factors”, as the semester goes on.
Matrices are two-dimensional generalizations of vectors. The dimensions of a matrix are the number of rows and the number of columns (in that order).
Like vectors, matrices have a single type, and we’ll most commonly see numeric, character, and logical types.
matrix(1:9,nrow=3,ncol=3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
#1:9 is equivalent to c(1,2,3,4,5,6,7,8,9)
Like a matrix, the data.frame has rows and columns, but each column can have its own different data type. This is the most common way to store datasets in R.
For example, our airbnb_dc
dataset has been read into a data.frame. We can check its type with class
.
class(airbnb_dc)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can see that airbnb_dc
has multiple classes! "tbl_df"
and "tbl"
indicate that this data.frame is also a “tibble”.
One way to extract a single column from a dataframe uses the $
operator with the column’s name.
class(airbnb_dc$property_type)
## [1] "character"
The dim()
function will print the dimensions (rows x columns) of your data frame.
head()
will print the first few rows of your data frame.
str()
is a very general purpose function that will display the structure of an object. This is especially useful for debugging.
# Some useful functions ----------------------------------------
dim(airbnb_dc)
## [1] 7788 14
head(airbnb_dc)
## # A tibble: 6 x 14
## id name property_type room_type accommodates bathrooms bedrooms price
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 4.87e6 Cozy~ Apartment Entire h~ 3 1 0 95
## 2 1.67e7 Larg~ Apartment Entire h~ 2 1 0 200
## 3 1.50e7 Newl~ Apartment Entire h~ 2 1 1 100
## 4 5.96e6 Spac~ Apartment Entire h~ 3 1 1 129
## 5 1.57e7 Luxu~ Townhouse Entire h~ 4 2.5 2 500
## 6 4.02e6 Dupo~ Apartment Private ~ 2 2.5 1 110
## # ... with 6 more variables: extra_people <dbl>, minimum_nights <dbl>,
## # number_of_reviews <dbl>, review_scores_rating <dbl>,
## # cancellation_policy <chr>, reviews_per_month <dbl>
str(airbnb_dc)
## tibble [7,788 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:7788] 4873463 16736650 14999877 5955860 15655208 ...
## $ name : chr [1:7788] "Cozy Pied-a-Terre, the Heart of DC" "Large, welcoming studio in a central location" "Newly Renovated Apartment in the Heart of DC" "Spacious Dupont Circle Condo" ...
## $ property_type : chr [1:7788] "Apartment" "Apartment" "Apartment" "Apartment" ...
## $ room_type : chr [1:7788] "Entire home/apt" "Entire home/apt" "Entire home/apt" "Entire home/apt" ...
## $ accommodates : num [1:7788] 3 2 2 3 4 2 3 2 4 2 ...
## $ bathrooms : num [1:7788] 1 1 1 1 2.5 2.5 1 1 1 1 ...
## $ bedrooms : num [1:7788] 0 0 1 1 2 1 1 1 2 1 ...
## $ price : num [1:7788] 95 200 100 129 500 110 225 79 172 110 ...
## $ extra_people : num [1:7788] 10 0 0 50 0 15 20 0 0 10 ...
## $ minimum_nights : num [1:7788] 2 1 27 2 2 6 3 3 1 1 ...
## $ number_of_reviews : num [1:7788] 29 2 0 79 1 7 3 4 24 4 ...
## $ review_scores_rating: num [1:7788] 94 90 NA 85 100 97 100 89 88 100 ...
## $ cancellation_policy : chr [1:7788] "flexible" "flexible" "moderate" "flexible" ...
## $ reviews_per_month : num [1:7788] 1.01 0.55 NA 3.13 1 0.23 0.82 0.09 0.48 0.67 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. name = col_character(),
## .. property_type = col_character(),
## .. room_type = col_character(),
## .. accommodates = col_double(),
## .. bathrooms = col_double(),
## .. bedrooms = col_double(),
## .. price = col_double(),
## .. extra_people = col_double(),
## .. minimum_nights = col_double(),
## .. number_of_reviews = col_double(),
## .. review_scores_rating = col_double(),
## .. cancellation_policy = col_character(),
## .. reviews_per_month = col_double()
## .. )
We can use the table()
function to view counts/frequencies of categorical variables.
# Categorical Variables ----------------------------------------
table(airbnb_dc$property_type)
##
## Apartment Bed & Breakfast Boat Boutique hotel
## 4191 73 3 11
## Bungalow Cabin Castle Condominium
## 4 1 1 420
## Dorm Guest suite Guesthouse Hostel
## 25 11 17 2
## House In-law Loft Other
## 2480 12 33 31
## Serviced apartment Timeshare Townhouse Train
## 1 1 468 1
## Treehouse Villa
## 1 1
As we saw in week 1, you can also next functions within each other. Here, let’s generate a contingency table using table()
, then sort it by count using sort()
.
sort(table(airbnb_dc$property_type), decreasing=TRUE)
##
## Apartment House Townhouse Condominium
## 4191 2480 468 420
## Bed & Breakfast Loft Other Dorm
## 73 33 31 25
## Guesthouse In-law Boutique hotel Guest suite
## 17 12 11 11
## Bungalow Boat Hostel Cabin
## 4 3 2 1
## Castle Serviced apartment Timeshare Train
## 1 1 1 1
## Treehouse Villa
## 1 1
That works, but it’s a little hard to read. We could try breaking this into two steps and storing our intermediate output.
property_type_table <- table(airbnb_dc$property_type)
sort(property_type_table, decreasing = TRUE)
##
## Apartment House Townhouse Condominium
## 4191 2480 468 420
## Bed & Breakfast Loft Other Dorm
## 73 33 31 25
## Guesthouse In-law Boutique hotel Guest suite
## 17 12 11 11
## Bungalow Boat Hostel Cabin
## 4 3 2 1
## Castle Serviced apartment Timeshare Train
## 1 1 1 1
## Treehouse Villa
## 1 1
This works ok, but it does mean we’re creating an additional object, property_type_table
, and we could end up with lots of objects that are trivially different from each other if we do this too often. In the next lesson, we’ll introduce the pipe operator, which makes this kind of nested computation easier to read.
Suppose we want to know whether a listings price is greater than 500. We can create a logical type vector as follows (output suppressed for space):
airbnb_dc$price > 500 #Note that since we're not assigning this vector to a variable, R displays it, but doesn't save it as an object in the environment.
You can see other logical operators here.
We can feed this logical vector into the table()
function to get a count of how many listings have prices above $500.
table(airbnb_dc$price > 500)
##
## FALSE TRUE
## 7029 759
We can also use the table function to look at the joint counts of multiple categorical variables. For example, if we want to look at the how many properties have prices above $500 across different property types.
with(airbnb_dc, table(property_type, price > 500))
##
## property_type FALSE TRUE
## Apartment 3884 307
## Bed & Breakfast 72 1
## Boat 3 0
## Boutique hotel 11 0
## Bungalow 4 0
## Cabin 1 0
## Castle 1 0
## Condominium 356 64
## Dorm 25 0
## Guest suite 11 0
## Guesthouse 16 1
## Hostel 2 0
## House 2150 330
## In-law 12 0
## Loft 28 5
## Other 28 3
## Serviced apartment 1 0
## Timeshare 1 0
## Townhouse 420 48
## Train 1 0
## Treehouse 1 0
## Villa 1 0
R has built-in functions to compute the mean, standard deviation, min, median, and max.
# Continuous Variables ----------------------------------------
mean(airbnb_dc$price)
## [1] 247.1451
sd(airbnb_dc$price)
## [1] 385.8856
min(airbnb_dc$price)
## [1] 0
median(airbnb_dc$price)
## [1] 125
max(airbnb_dc$price)
## [1] 6000
R also has a built-in function to compute correlation coefficients. This let’s us explore the pairwise associations between continuous variables.
with(airbnb_dc, cor(price, number_of_reviews))
## [1] -0.1560021
There are some missing values in the review_scores_rating and reviews_per_month columns, even within the first six rows. We can check which values of review_scores_rating are NA with is.na()
# Continuous Variables ----------------------------------------
sum(is.na(airbnb_dc$review_scores_rating)) #summing a logical (TRUE/FALSE) vector counts the number of TRUE values
## [1] 2207
We can check all of our columns using R’s colSums
to sum all of the columns at the same time.
colSums(is.na(airbnb_dc))
## id name property_type
## 0 5 0
## room_type accommodates bathrooms
## 0 0 14
## bedrooms price extra_people
## 14 0 0
## minimum_nights number_of_reviews review_scores_rating
## 0 0 2207
## cancellation_policy reviews_per_month
## 0 2128
Handling missing data is an advanced topic that is a bit beyond the scope of this lesson. There is no good way to handle missing data, there are only bad and less bad ways.
One of the bad ways to handle missing data is what’s known as a complete case analysis. In this kind of analysis, you only keep observations (rows) where you have observed (non-missing) values for ALL columns.
R has built-in functions to facilitate this:
complete.cases
returns a logical vector indicating whether or not an entire row of a dataset contains any missing values. You can also use this function on a subset of columns to determine if only those columns are complete
na.omit
returns a new dataframe with only the complete cases (as identified above)
For most R functions, the default is to return NA is any of the values are NA:
mean(airbnb_dc$review_scores_rating)
## [1] NA
with(airbnb_dc, cor(review_scores_rating, reviews_per_month))
## [1] NA
Some functions, like the linear regression function lm()
, perform a complete-case analysis automatically:
lm(review_scores_rating~price, data=airbnb_dc)
##
## Call:
## lm(formula = review_scores_rating ~ price, data = airbnb_dc)
##
## Coefficients:
## (Intercept) price
## 94.230536 0.003191
When a function complains about missing values and fails, there’s usually an argument that can be modified to change it to a complete-case analysis:
mean(airbnb_dc$review_scores_rating, na.rm=TRUE)
## [1] 94.72406
with(airbnb_dc, cor(review_scores_rating, reviews_per_month, use="pairwise.complete.obs"))
## [1] 0.09933035
Suppose, we’ve done some computation, and we want to save our modified dataframe. We can use write_csv()
to save our dataset as comma separated values file.
airbnb_dc$price_gt_500 <- (airbnb_dc$price > 500)
write_csv(airbnb_dc, "data/modified_airbnb.csv")
Last week, we introduced ggplot2 for statistical graphics. This library implements the grammar of graphics, which provides a structured, modular way to think about plots. The three most commonly altered components are:
Let’s look at a few examples. We can look at a histogram of review scores. Here, the only aesthetic we need to set is the x location, and then we can request a histogram.
ggplot(airbnb_dc, aes(x=review_scores_rating)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2207 rows containing non-finite values (stat_bin).
To get a smoothed histogram (aka kernel density estimate), all we have to do is change to geom_density:
ggplot(airbnb_dc, aes(x=review_scores_rating)) + geom_density()
## Warning: Removed 2207 rows containing non-finite values (stat_density).
If we want to look at the distribution of reviews across different room types, we could add a group aesthetic for the room_type variable:
ggplot(airbnb_dc, aes(x=review_scores_rating, color=room_type)) + geom_density() + guides(color=guide_legend(title="Room Type")) + theme_bw()
## Warning: Removed 2207 rows containing non-finite values (stat_density).
Since, there a lot of overlap, we could use facets to create a subplot for each room_type:
ggplot(airbnb_dc, aes(x=review_scores_rating, color=)) + geom_density() + facet_wrap(~ room_type) + xlab("Review Score Rating") + theme_bw()
## Warning: Removed 2207 rows containing non-finite values (stat_density).
So far, we’ve looked a single continuous variable (review_scores_rating) as it related to a categorical variables (room_type). If we want to look at how two continuous variables are related, we could create a scatterplot by specifiying x and y aesthetics and using geom_point()
ggplot(data=airbnb_dc,aes(x=number_of_reviews,y=review_scores_rating))+
geom_point() + xlab("Review Scores Rating") + ylab("Number of Reviews") + theme_bw()
## Warning: Removed 2207 rows containing missing values (geom_point).
Again, we can use facets to look at how this varies across room_type:
ggplot(data=airbnb_dc,aes(x=number_of_reviews,y=review_scores_rating))+
geom_point() + xlab("Review Scores Rating") + ylab("Number of Reviews") + theme_bw() + facet_wrap(~ room_type)
## Warning: Removed 2207 rows containing missing values (geom_point).
Concise guide to basic computational reproducibility
Eight things you can do to make your open science more understandable
A detailed guide to reproducible workflows in collaborative settings
For more complicated projects, consider using make and docker