beginR: Introduction

Author

University of North Carolina at Chapel Hill

Why R?

R is widely used in statistics and has become quite popular in the biomedical and social sciences over the last decade. R can be used as either a programming language (like Python or Java) or as a standard statistical package (like Stata, SAS, or SPSS).

R is open source and free to use for any purpose. This has clear advantages and a few disadvantages:

Advantages

Disadvantages

  • No commercial support
    • R is available free, as is. If something doesn’t exist or doesn’t work the way you think it should, you can either wait for someone else to (hopefully) build it or build it yourself.
    • No commercial hotline for help
    • Support and libraries from other users vary by popularity of method/area of study.

Learning objectives for the semester

R is a big tent that supports a variety of programming styles and tasks, so we can’t possibly cover all of it. We will focus instruction on using R for analyzing data. If you are interested in software development in R, you can use the unstructured time for help with that. We’ll also provide links to some great online resources for software development in R.

We are going to focus on the Tidyverse family of R packages for data analysis. The authors of the Tidyverse have also written a free ebook, R for Data Science. We will not follow this book exactly, but for each of our workshops we will announce the corresponding chapters from the book, so you can have an additional perspective and practice on what we are teaching.

This workshop does not aim to give a comprehensive introduction to all aspects of R.

We also hope to empower you to become a self-sufficient R user, so we’ll be showing you examples of how to troubleshoot and debug R code. Those of us who have used R regularly for years still troubleshoot, refer to documentation, and google error codes all the time!

Setup: R, R Studio

This introductory workshop aims to:

  • Install R and R Studio if you haven’t already
  • Briefly familiarize or re-familiarize you with some of the most important elements of R
    • RStudio interface
    • Tidyverse library
    • Graphics using ggplot2
    • Data types
  • R for Data Science Chapters 1-3

Installation

Mac Installation

Check your version number and make sure to download the correct version of R/RStudio.

Download R from https://cran.r-project.org/bin/macosx/

  • Choose the .pkg link (e.g. R-4.2.1.pkg as of this writing)

Download R Studio at https://posit.co/download/rstudio-desktop/#download

  • Skip to 2: Install RStudio

PC Installation

Download R from https://cran.r-project.org/bin/windows/base/

Download R Studio at https://posit.co/download/rstudio-desktop/#download

  • Skip to 2: Install RStudio

Optional:

  • Install Rtools from https://cran.r-project.org/bin/windows/Rtools/
    • Make sure to choose the Rtools version that matches your R installation
    • On the next page, click the link for the “Rtools<xx> installer” where <xx> is the appropriate version number.

R Studio orientation

Panes

R Studio shows four panes by default. The two most important for writing, testing, and executing R code are the console and the script editor.

Console (bottom left)

The console pane allows immediate execution of R code. This pane can be used to experiment with new functions and execute one-off commands, like printing sample data or exploratory plots.

Type next to the > symbol, then hit Enter to execute.

Script Editor (top left)

In contrast, code typed into the script editor pane does not automatically execute. The script editor provides a way to save the essential code for your analysis or project so it can be re-used or referred to for replication in the future. The editor also allows us to save, search, replace, etc. within an R ‘script.’ (This is much like the difference between using a Stata .do file and the main Stata window!)

The script should contain every necessary step in a data analysis; you should aim to be able to open and run a script in a new R session without relying on anything you’ve only run in the console. In contrast, the console is a great place to test code or run one-off checks of the data.
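
As a minimal sketch of what such a self-contained script might look like (the file name and the choice of summary are illustrative assumptions, not part of the workshop materials), the following runs from top to bottom in a fresh R session using only data that ships with R:

```r
# analysis.R (hypothetical file name)
# Everything the analysis needs is stated here, so the script runs in a
# fresh R session without relying on anything typed into the console.

# Load the example data that ships with R
data(airquality)

# Mean ozone by month; the formula interface drops rows with missing Ozone
ozone_by_month <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)

# Print the result so it appears in the console when the script is run
print(ozone_by_month)
```

A quick check such as head(airquality), by contrast, is the kind of one-off command that can stay in the console.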

R Studio provides a few convenient ways to send code from the script editor to be executed in the console pane.

  • The Run button at the top right of the script editor pane provides several options.
  • One of the most useful is the CTRL+Enter (PC) or CMD+Enter (Mac) shortcut to execute whichever lines are currently selected in the script editor
    • If no complete lines are selected, this will execute the line where the cursor is placed.

Other Panes

The other panes in R Studio make some coding tasks easier. Some users hide them, since they are less essential to writing code. Some of the most useful panes are:

  • Environment (top): Lists all objects and datasets defined in your current R session.

  • Plots (bottom): Displays plots generated by R code.

  • Help (bottom): Displays HTML-format help files for R packages and functions.

  • Files (bottom): Lists files in the current working directory (more on this later!)

Loading the tidyverse

First, we need to load the tidyverse package. (Note: the terms package and library are often used interchangeably, although strictly speaking a library is the folder where installed packages live.) You can install the package using the install.packages command or through the Packages tab in the lower right pane. Once you have it installed, you can load it with the following code:

#install.packages("tidyverse", dependencies=TRUE)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You only have to install a package once, but you’ll need to load it each time you start a new R session.

Note: install.packages("tidyverse", dependencies=TRUE) is a great candidate to run in the console window, since you only need to run it once and it won’t be needed again. library(tidyverse) has to be run before using the tidyverse package, so it is probably necessary in your script!
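
If you would rather have a script check for itself whether a package is present, one possible pattern is sketched below. The helper name is_installed is made up for this example; requireNamespace() is a base R function. The install line is commented out so that running the sketch does not trigger a download.

```r
# TRUE if the named package is installed, FALSE otherwise.
# `is_installed` is a hypothetical helper written for this sketch.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")   # TRUE: stats ships with every R installation

# Possible use at the top of a script (commented out here on purpose):
# if (!is_installed("tidyverse")) install.packages("tidyverse")
# library(tidyverse)
```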

A quick example

For simplicity, today we’ll be working with datasets that are included with Base R.

data()

From the Packages pane, we can click on the datasets package, which will take us to the help page:

Now we can read up on the available datasets.
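
The same list can also be pulled into an R object, which is handy for searching it. This is a small sketch using the value that data() returns; the variable name available is our own choice.

```r
# List the datasets that ship with the built-in "datasets" package
available <- data(package = "datasets")$results

# Each row describes one dataset; keep just the name and a short title
head(available[, c("Item", "Title")])

# Check that the dataset used in this workshop is in the list
"airquality" %in% available[, "Item"]
```

Typing ?airquality in the console opens the help page for that one dataset.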

Let’s start exploring the New York Air Quality dataset.

data(airquality)
head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
tail(airquality)
    Ozone Solar.R Wind Temp Month Day
148    14      20 16.6   63     9  25
149    30     193  6.9   70     9  26
150    NA     145 13.2   77     9  27
151    14     191 14.3   75     9  28
152    18     131  8.0   76     9  29
153    20     223 11.5   68     9  30
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               

Interlude: What is airquality? data()? head()? summary()?

Unlike most other popular statistical software, R is an object-oriented programming language, similar in that respect to languages such as Java, Python, Ruby, or C++.

airquality is an object. data(), head(), and summary() are functions. We’ll learn more about these as we go along, but for now, it may be useful to think of objects as nouns and functions as verbs.
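
One way to see the noun/verb distinction for yourself is to ask R what kind of thing each name refers to. The sketch below uses only base R; first_rows is a name chosen for the example.

```r
data(airquality)

class(airquality)     # "data.frame": the kind of object airquality is
class(head)           # "function"
is.function(summary)  # TRUE

# Functions (verbs) act on objects (nouns) and usually return new objects
first_rows <- head(airquality, n = 3)
class(first_rows)     # "data.frame"
nrow(first_rows)      # 3
```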

Plotting with ggplot2

ggplot(airquality, aes(x = Temp, y = Ozone)) + 
  geom_point()
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
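
The warning is ggplot2 telling us that it skipped rows it could not draw. A quick base R check, sketched here, shows where the number 37 comes from: it is the count of missing Ozone values reported by summary() earlier. The name plottable is our own.

```r
data(airquality)

# Count missing values in each plotted variable
sum(is.na(airquality$Ozone))  # 37, matching the warning
sum(is.na(airquality$Temp))   # 0

# Rows that can actually be drawn: both Temp and Ozone present
plottable <- airquality[!is.na(airquality$Ozone) & !is.na(airquality$Temp), ]
nrow(plottable)               # 153 - 37 = 116
```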

ggplot has a modular structure, so it’s easy to modify plots. Let’s add a best fit linear regression line.

ggplot(airquality, aes(x = Temp, y = Ozone)) + 
  geom_point() + geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

Let’s change that to a loess curve (a locally weighted version of linear regression).

ggplot(airquality, aes(x = Temp, y = Ozone)) + 
  geom_point() + geom_smooth(method="loess")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
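
To get a feel for what the two smoothers are doing, we can fit roughly the same models directly with base R. This is a sketch under the assumption that geom_smooth() is using its default settings; the temperatures in new_temps are arbitrary values picked for illustration.

```r
data(airquality)

# The straight line drawn by geom_smooth(method = "lm")
fit_lm <- lm(Ozone ~ Temp, data = airquality)
coef(fit_lm)   # intercept and slope of the fitted line

# The curve drawn by geom_smooth(method = "loess")
fit_loess <- loess(Ozone ~ Temp, data = airquality)

# Compare the two fits at a few temperatures
new_temps <- data.frame(Temp = c(60, 75, 90))
cbind(new_temps,
      lm    = predict(fit_lm, newdata = new_temps),
      loess = predict(fit_loess, newdata = new_temps))
```

Both lm() and loess() silently drop the rows with missing Ozone, which is the same thing the ggplot2 warnings are reporting.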

What if we want to plot the data over time? We’ll have to do a little work.

Creating a date variable

I don’t remember how to create dates in R, so let’s do a little searching.

Yes, I’m live googling in a class. The pros do it all the time. You should too.

This stackoverflow.com page suggests a solution.

First, we’ll create a new column in the airquality dataframe for the year.

airquality$Year <- "1973"

There’s a lot going on in that one line of code. Let’s dissect it.

  • We are using the $ operator. This is a way to access individual columns of a dataframe.

  • We are dealing with a brand new data type, character. This is indicated by the quotation marks around "1973".

  • We are using the assignment operator <-, which can be read as “assign the character value 1973 to the Year column of the airquality dataframe.”
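
The same three ideas can be tried out on a tiny made-up data frame, where it is easier to see what each step does. The data frame df below is invented for this sketch.

```r
# A small made-up data frame for illustration
df <- data.frame(Month = c(5, 5, 6), Day = c(1, 2, 1))

# Read an existing column with $
df$Month              # 5 5 6

# Create a new column with $ and <-; the single value is recycled to every row
df$Year <- "1973"
df$Year               # "1973" "1973" "1973"

# Quotation marks make the value character rather than numeric
class(df$Year)        # "character"
class(df$Month)       # "numeric"
```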

Now we can use the code snippet we found on stackoverflow.

airquality$Date <- as.Date(with(airquality, paste(Year, Month, Day, sep="-")), "%Y-%m-%d")

Let’s figure out why this works using R’s help system:

  • ? followed by a function name will bring up the documentation for that function
  • ?? will search for any functions related to the term supplied

Here we know that as.Date is a function, so we’ll use a single question mark.

?as.Date
#equivalent to help("as.Date")
#also try ??date

So as.Date() is a function that can take a character representation of a date and convert it into a Date object (or vice versa).
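
A short sketch of that conversion in both directions, using a single date typed in by hand (the variable name d is our own):

```r
# Character -> Date: the format string describes how the text is laid out
d <- as.Date("1973-5-1", format = "%Y-%m-%d")
class(d)                   # "Date"
d                          # "1973-05-01"

# Date -> character: format() goes the other way
format(d, "%Y/%m/%d")      # "1973/05/01"

# Dates support arithmetic, which plain character strings do not
d + 30                     # "1973-05-31"
```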

?with

with() is a function that allows us to reference elements of a data frame without having to type the data frame’s name repeatedly. This suggests we could get the same results as above with:

airquality$Date2 <- as.Date(paste(airquality$Year, airquality$Month, airquality$Day, sep="-"), "%Y-%m-%d")

We can check the dataframe viewer to see that the results are identical, or we can use the all.equal() function.
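
Here is one way that check might look. To keep the sketch self-contained it works on a fresh copy of the data called aq (a name chosen for this example) and rebuilds both date columns before comparing them.

```r
aq <- datasets::airquality   # a fresh copy, so the original is not modified
aq$Year <- "1973"

# Version 1: with() looks up Year, Month, and Day inside aq
aq$Date <- as.Date(with(aq, paste(Year, Month, Day, sep = "-")), "%Y-%m-%d")

# Version 2: spell out the data frame name each time
aq$Date2 <- as.Date(paste(aq$Year, aq$Month, aq$Day, sep = "-"), "%Y-%m-%d")

all.equal(aq$Date, aq$Date2)   # TRUE
identical(aq$Date, aq$Date2)   # TRUE
range(aq$Date)                 # "1973-05-01" "1973-09-30"
```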

?paste

paste() lets us stick a bunch of character values together.
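
A few small calls show what paste() produces, including the strings that were handed to as.Date() above. The variable names are ours.

```r
one_date <- paste("1973", 5, 1, sep = "-")
one_date                              # "1973-5-1"

# paste() is vectorized: shorter arguments are recycled
two_dates <- paste("1973", c(5, 6), 1, sep = "-")
two_dates                             # "1973-5-1" "1973-6-1"

# The default separator is a single space
city <- paste("New", "York")
city                                  # "New York"
```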

Finally, we can create a plot.

ggplot(airquality, aes(x=Date, y=ozone)) +
  geom_point() + geom_smooth(method="loess") + theme_bw()
Error in `geom_point()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'ozone' not found

Oops. Let’s try that again.

ggplot(airquality, aes(x=Date, y=Ozone)) +
  geom_point() + geom_smooth(method="loess") + theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
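
The error came from R being case-sensitive: ozone and Ozone are different names. When you hit an “object not found” error, checking the exact column names is a good first step, as in this base R sketch:

```r
data(airquality)

names(airquality)[1:6]            # "Ozone" "Solar.R" "Wind" "Temp" "Month" "Day"

"Ozone" %in% names(airquality)    # TRUE
"ozone" %in% names(airquality)    # FALSE: lowercase "o" is a different name

# A case-insensitive search can help track down the right spelling
grep("ozone", names(airquality), ignore.case = TRUE, value = TRUE)  # "Ozone"
```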

Other geoms: geom_boxplot() and geom_jitter()

ggplot(airquality, aes(x = Month, y = Ozone)) +
        geom_boxplot()
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Looks like I left out something geom_boxplot() needs. Time for google!

ggplot(airquality, aes(x = Month, y = Ozone, group=Month)) +
        geom_boxplot()
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Let’s fix up the x-axis labels.

airquality$MonthFac <- factor(airquality$Month,labels = c("May", "Jun", "Jul", "Aug", "Sep"))
ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
        geom_boxplot()
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).
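
factor() turns the month numbers into categories and attaches a label to each one. The code above relies on R sorting the distinct month numbers (5 through 9) and pairing them with the labels in order. The sketch below spells that pairing out with an explicit levels argument, using a short made-up vector rather than the full dataset.

```r
months <- c(5, 6, 9, 5)   # made-up month numbers for illustration

# Stating levels explicitly makes the number-to-label pairing visible
month_fac <- factor(months,
                    levels = 5:9,
                    labels = c("May", "Jun", "Jul", "Aug", "Sep"))

month_fac                 # May Jun Sep May
levels(month_fac)         # "May" "Jun" "Jul" "Aug" "Sep"
as.integer(month_fac)     # 1 2 5 1: position among the levels, not the month
```

Because the x variable is now categorical, geom_boxplot() no longer needs the group argument.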

Getting there, let’s try a few more tweaks.

ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
        geom_boxplot() + xlab("Month") + ylab("Mean Daily Ozone (ppb)") + ggtitle("New York City Air Quality")  + theme_bw()
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).

What if I want to see the original data and the boxplots?

ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
        geom_boxplot() + xlab("Month") + ylab("Mean Daily Ozone (ppb)") + ggtitle("New York City Air Quality")  + theme_bw() + geom_point()
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

That’s hard to read, so let’s try using geom_jitter()

ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
        geom_boxplot() + xlab("Month") + ylab("Mean Daily Ozone (ppb)") + ggtitle("New York City Air Quality")  + theme_bw() + geom_jitter()
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

Maybe I can make the jittering a bit more compact. Time for the R help system, which we can access through the RStudio Help pane.

ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
        geom_boxplot() + xlab("Month") + ylab("Mean Daily Ozone (ppb)") + ggtitle("New York City Air Quality")  + theme_bw() + geom_jitter(width=0.2)
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).
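
Since each version of this plot repeats most of the previous one, a possible shortcut is to store the unchanging parts in an object and add only the layer you are experimenting with. This is a sketch, assuming ggplot2 is installed; the names p, p_wide, and p_narrow are our own.

```r
library(ggplot2)
data(airquality)

airquality$MonthFac <- factor(airquality$Month,
                              labels = c("May", "Jun", "Jul", "Aug", "Sep"))

# Store the parts that do not change in an object...
p <- ggplot(airquality, aes(x = MonthFac, y = Ozone)) +
  geom_boxplot() +
  xlab("Month") + ylab("Mean Daily Ozone (ppb)") +
  ggtitle("New York City Air Quality") +
  theme_bw()

# ...then add only the layer being adjusted
p_wide   <- p + geom_jitter()
p_narrow <- p + geom_jitter(width = 0.2)

p_narrow   # typing the object's name draws the plot
```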

Facets

ggplot(airquality, aes(x=Temp, y=Ozone)) +
  geom_point() + geom_smooth(method = "lm") + facet_grid(. ~ Month) +
  xlab("Maximum Daily Temperature (degrees Fahrenheit)") + ylab("Mean Daily Ozone (ppb)") +
  ggtitle("New York City Air Quality") + theme_bw()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 37 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 37 rows containing missing values or values outside the scale range
(`geom_point()`).

Resources

Beginners

Review today’s material by reading R for Data Science Chapters 1-3. Try the exercises in Chapter 3.

Next week, we’ll be covering the material in Chapters 4, 6, and 8 if you’d like to look ahead!

Other Resources

Quick References

Tidyverse Cheatsheets

Quick-R for Base R

Useful R Packages:

CRAN Task Views

Quick list of useful packages

Advanced R Books:

ggplot2

Advanced R

R packages

Efficient R

Statistical Modeling Textbooks:

An Introduction to Statistical Learning

Regression Modeling Strategies

Statistical Rethinking

Misc:

https://github.com/ujjwalkarn/DataScienceR

Feedback

You’ll receive an automated email with feedback information after the workshop!