Scientific reproducibility - if you repeat my experiment, you will get the same results as me. This is beyond the scope of this workshop (see here for an overview)
Computational reproducibility - if I give you my data and code, you will get the same results as me. We’ll focus on the basics of computational reproducibility today.
Why does computational reproducibility matter?
Your most important collaborator is you six months from now.
We want Science (with a capital S) to be self-correcting. Since you’re human, you will make mistakes, so you want to make it as easy as possible for others to catch your errors (see the Reinhart and Rogoff Excel error and the Guinea worm wars for recent examples)
Your script is real, your environment is not
For your results to be reproducible, every step you take to manipulate the data should be explicitly documented in an R script.
This also means that you want to make sure that your results don’t have hidden/implicit dependencies in your R environment. There are two steps you can to take to prevent this.
Reconfigure RStudio not to save your workspace between sessions. (Tools > Global Options > General)
Restart RStudio from scratch and rerun the current script often.
Cmd/Ctrl + Shift + F10 to restart RStudio
Cmd/Ctrl + Shift + S to rerun the current script
Note: If you have learned R in the past, you may have been taught to use the expression rm(list = ls()) to clear your R environment. This will not actually return R to a blank slate! Although this code will delete objects in your R environment, it won’t reset global settings that have been changed. It is always safer and more effective to restart RStudio.
1: RStudio Projects
You can use the R function setwd() to point to locations of data files, but this isn’t ideal for reproducible research. When you switch between computers, it is likely that the absolute directory path will change, so you will have to modify the path for each computer. Users of your code will not have the same absolute directory path that you have, so they will have to modify the path in the setwd() function before they can use your code.
RStudio Projects provide a way to bundle together data, code, and other files together so that you can refer to files with relative directory paths, which creates a much more portable structure that won’t need to be modified between computers or users.
Configuration
File > New Project > New Directory > New Project
Create a new subdirectory called “data” inside this project to store the data.
Copy the Airbnb data (listings.csv) into this data directory
Create a new R Script for the data analysis in this project. (File > New File > R Script)
2: Import the data
We’ll begin by loading the tidyverse library. This library includes readr, which we will use to read in our data file (for more details on readr, see Chapter 11 of the text).
RStudio has a useful keyboard shortcut for <- (the assignment operator): Alt + -, i.e. press the Alt key and the minus/hyphen key at the same time.
# Purpose: Exploratory data analysis of airbnb data# Author: Nuvan Rathnayaka# Date: September 2018# Setup -----------------------------------------------------------library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Data Import ------------------------------------------------------airbnb_dc <-read_csv("data/listings.csv") #Since I'm using an RStudio project, I only have to specify the path relative to the top directory
Rows: 7788 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, property_type, room_type, cancellation_policy
dbl (10): id, accommodates, bathrooms, bedrooms, price, extra_people, minimu...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#If you have problems running the read_csv above, the following code should also work#(Download data directly from the internet)#airbnb_dc <- read_csv(url("https://unc-libraries-data.github.io/R-Open-Labs/week2_Workflow/data/listings.csv"))
Reminder: Comments
In R, the # symbol indicates the beginning of a comment. Everything to the right of that is ignored by R, so you can write plain text to document what you’re doing, organize your code, and explain any tricky or unusual steps.
Coding conventions
Something that I’ve glossed over is that you have options for how to name objects. For example, we called our imported data airbnb_dc, but we could have called it airbnbListings, or airbnb.listings
Having a consistent way you name objects and functions helps you remember what things are called. It’s especially useful when you’re working with others to agree upon a coding convention. It doesn’t particularly matter which one you choose, as long as you choose a convention and stick to it.
Example style guide: http://style.tidyverse.org/index.html
3: Objects and classes
R is an object-oriented programming language. One way to understand how objects work in programming is to look at how objects work in real life.
We tend to group objects by their structure. For example, there are many objects we call cars, and they all tend to share similarities in structure - they have wheels, windows, doors, etc. An individual car such as my car or your car is an instance of this structure. Cars also have a set of properties such as make, model, year and color. Each individual car has some variation of these properties.
Objects in R work in a similar way.
In the diagram above, we are looking at three common data structures often used in R: vectors, matrices and data frames. We can create individual objects from these data structures, and those objects have their own properties. The way R defines what properties a given object has is through it’s class. You can think of a class as a contract between R and its users, defining the structure and properties of an object. People who develop R libraries take advantage of this system to reliably customize the behavior of functions.
Vectors
Vectors are the most basic collection of items, often created with c(). Depending on what you put into a vector, it also automatically decides a type. The most common types are: numeric (integer or double), character, and logical.
c(1,2,3,4) #Numeric (integer)
[1] 1 2 3 4
c(1.4,2.7,3.2) #Numeric (double)
[1] 1.4 2.7 3.2
c("cat","dog","fish") #Character
[1] "cat" "dog" "fish"
c(TRUE,FALSE,TRUE) #Logical
[1] TRUE FALSE TRUE
c("cat",1,FALSE) #Character - check other combinations yourself!
[1] "cat" "1" "FALSE"
Vectors only have one dimension: length (accessible with the length function.)
We’ll learn more about another special case of vectors, “factors”, as the semester goes on.
Matrices
Matrices are two-dimensional generalizations of vectors. The dimensions of a matrix are the number of rows and the number of columns (in that order).
Like vectors, matrices have a single type, and we’ll most commonly see numeric, character, and logical types.
matrix(1:9,nrow=3,ncol=3)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
#1:9 is equivalent to c(1,2,3,4,5,6,7,8,9)
Data.frames
Like a matrix, the data.frame has rows and columns, but each column can have its own different data type. This is the most common way to store datasets in R.
For example, our airbnb_dc dataset has been read into a data.frame. We can check its type with class.
class(airbnb_dc)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
We can see that airbnb_dc has multiple classes! "tbl_df" and "tbl" indicate that this data.frame is also a “tibble”.
One way to extract a single column from a dataframe uses the $ operator with the column’s name.
Apartment Bed & Breakfast Boat Boutique hotel
4191 73 3 11
Bungalow Cabin Castle Condominium
4 1 1 420
Dorm Guest suite Guesthouse Hostel
25 11 17 2
House In-law Loft Other
2480 12 33 31
Serviced apartment Timeshare Townhouse Train
1 1 468 1
Treehouse Villa
1 1
As we saw in week 1, you can also next functions within each other. Here, let’s generate a contingency table using table(), then sort it by count using sort().
Apartment House Townhouse Condominium
4191 2480 468 420
Bed & Breakfast Loft Other Dorm
73 33 31 25
Guesthouse In-law Boutique hotel Guest suite
17 12 11 11
Bungalow Boat Hostel Cabin
4 3 2 1
Castle Serviced apartment Timeshare Train
1 1 1 1
Treehouse Villa
1 1
This works ok, but it does mean we’re creating an additional object, property_type_table, and we could end up with lots of objects that are trivially different from each other if we do this too often. In the next lesson, we’ll introduce the pipe operator, which makes this kind of nested computation easier to read.
Suppose we want to know whether a listings price is greater than 500. We can create a logical type vector as follows (output suppressed for space):
airbnb_dc$price >500#Note that since we're not assigning this vector to a variable, R displays it, but doesn't save it as an object in the environment.
We can feed this logical vector into the table() function to get a count of how many listings have prices above $500.
table(airbnb_dc$price >500)
FALSE TRUE
7029 759
We can also use the table function to look at the joint counts of multiple categorical variables. For example, if we want to look at the how many properties have prices above $500 across different property types.
R also has a built-in function to compute correlation coefficients. This let’s us explore the pairwise associations between continuous variables.
with(airbnb_dc, cor(price, number_of_reviews))
[1] -0.1560021
6: Missing data
There are some missing values in the review_scores_rating and reviews_per_month columns, even within the first six rows. We can check which values of review_scores_rating are NA with is.na()
# Continuous Variables ----------------------------------------sum(is.na(airbnb_dc$review_scores_rating)) #summing a logical (TRUE/FALSE) vector counts the number of TRUE values
[1] 2207
We can check all of our columns using R’s colSums to sum all of the columns at the same time.
Handling missing data is an advanced topic that is a bit beyond the scope of this lesson. There is no good way to handle missing data, there are only bad and less bad ways.
One of the bad ways to handle missing data is what’s known as a complete case analysis. In this kind of analysis, you only keep observations (rows) where you have observed (non-missing) values for ALL columns.
R has built-in functions to facilitate this:
complete.cases returns a logical vector indicating whether or not an entire row of a dataset contains any missing values. You can also use this function on a subset of columns to determine if only those columns are complete
na.omit returns a new dataframe with only the complete cases (as identified above)
For most R functions, the default is to return NA is any of the values are NA:
Suppose, we’ve done some computation, and we want to save our modified dataframe. We can use write_csv() to save our dataset as comma separated values file.
Last week, we introduced ggplot2 for statistical graphics. This library implements the grammar of graphics, which provides a structured, modular way to think about plots. The three most commonly altered components are:
aesthetics - the visual properties of objects on the plot, e.g x location, y location, color, size
geom - the geometric object used to represent your data, e.g. point, line, histogram
facets - let you create subplots for each level of a categorical variable
Let’s look at a few examples. We can look at a histogram of review scores. Here, the only aesthetic we need to set is the x location, and then we can request a histogram.
Warning: Removed 2207 rows containing non-finite outside the scale range
(`stat_density()`).
So far, we’ve looked a single continuous variable (review_scores_rating) as it related to a categorical variables (room_type). If we want to look at how two continuous variables are related, we could create a scatterplot by specifiying x and y aesthetics and using geom_point()
ggplot(data=airbnb_dc,aes(x=number_of_reviews,y=review_scores_rating))+geom_point() +xlab("Review Scores Rating") +ylab("Number of Reviews") +theme_bw()
Warning: Removed 2207 rows containing missing values or values outside the scale range
(`geom_point()`).
Again, we can use facets to look at how this varies across room_type:
ggplot(data=airbnb_dc,aes(x=number_of_reviews,y=review_scores_rating))+geom_point() +xlab("Review Scores Rating") +ylab("Number of Reviews") +theme_bw() +facet_wrap(~ room_type)
Warning: Removed 2207 rows containing missing values or values outside the scale range
(`geom_point()`).