beginR: Data Transformations

Data and other downloads

Five Thousand Wine Reviews

Today

This workshop aims to cover:

  • Review of Previous Topics
    • Starting a New Project
    • Loading the tidyverse
    • Importing Data
  • Data Transformations
    • Filtering Data
    • Relational and Assignment Operators
    • Reordering Data (arrange)
    • Selecting Data
    • Renaming Columns
    • Adding new Variables (mutate)
    • Summarising Data
  • R for Data Science Chapter 5

Start a New Project in R

It is best practice to set up a new directory each time we start a new project in R. To do so, complete the following steps:

  1. Go to File > New Project > New Directory > New Project.
  2. Type in a name for your directory and click Browse. Be sure to pick a place for your directory that you will be able to find later.
  3. Go to Finder on Mac or File Explorer on PC and find the directory you just created.
  4. Inside your project directory, create a new folder called data.
  5. Download or copy the data file (5k_wine_reviews.csv) into the data folder.
  6. Go to File > Save As to give your R script a name and save it in your project directory.

Loading the tidyverse

The first few lines of our R script are made up of comments to document the script’s purpose, creator and date. Next, we use the library() function to load any packages that our script will need to run. In this workshop, we are using the tidyverse package for all of our projects.

# Purpose: Transforming Wine Review Data
# Author: Nuvan Rathnayaka
# Date: January 2019

# Setup -----------------------------------------------------------
library(tidyverse)

Importing Data

We use the read_csv() function to import our data.

# Data Import ------------------------------------------------------
reviews <- read_csv("data/5k_wine_reviews.csv")

When we use read_csv(), our data is stored as a type of object in R called a data frame. One of the nice things about data frames is that we can see a visual representation of them in RStudio. Take a look at the Environment tab in the upper right and click on the reviews object you just created.

The data will be represented as a table in a new tab.
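
To try read_csv() without the workshop file, the sketch below writes a tiny, made-up CSV to a temporary file and imports it. The file contents and the name toy are illustrative assumptions, not part of the wine data, and the show_col_types argument assumes readr 2.0 or later. glimpse() is one way to check column names and types from the Console.

```r
library(readr)
library(dplyr)

# Hypothetical three-line CSV written to a temporary file (not the workshop data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("country,points,price",
             "Chile,86,15",
             "Italy,87,NA"), tmp)

# show_col_types = FALSE only silences the column-type message
toy <- read_csv(tmp, show_col_types = FALSE)

# Print one line per column with its type and first values
glimpse(toy)
```

The text "NA" in the file is read as a missing value, which is why the price of the second row shows up as NA.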

Filtering

We have a fairly big data set. What if we want to focus on certain records? For example, let’s say I’m only interested in Chilean wine. I can use the filter() function to only look at wines with “Chile” as their country of origin.

filter(reviews, country == "Chile")
# A tibble: 194 × 10
   country description desig…¹ points price provi…² regio…³ title variety winery
   <chr>   <chr>       <chr>    <dbl> <dbl> <chr>   <chr>   <chr> <chr>   <chr> 
 1 Chile   White flow… Estate      86    15 Colcha… <NA>    Esta… Viogni… Estam…
 2 Chile   A berry ar… <NA>        86     9 Maule … <NA>    Sund… Merlot  Sunda…
 3 Chile   This is mu… Gran R…     85    22 Colcha… <NA>    Casa… Petit … Casa …
 4 Chile   Lightly he… Reserve     85    13 Maipo … <NA>    Tres… Pinot … Tres …
 5 Chile   Caramelize… Specia…     86    12 Rapel … <NA>    Ares… Carmen… Aresti
 6 Chile   A bright n… Single…     87    18 Leyda … <NA>    Leyd… Chardo… Leyda 
 7 Chile   This blend… Condes…     91    29 Aconca… <NA>    Cond… Red Bl… Conde…
 8 Chile   Dry, briar… 20 Bar…     91    32 Maipo … <NA>    Cono… Cabern… Cono …
 9 Chile   Dry, spicy… The 7t…     91    20 Loncom… <NA>    G7 2… Cabern… G7    
10 Chile   Haven't se… Unusua…     88    50 Maipo … <NA>    Terr… Red Bl… Terra…
# … with 184 more rows, and abbreviated variable names ¹​designation, ²​province,
#   ³​region_1
# ℹ Use `print(n = ...)` to see more rows

When we use the filter() function on our reviews object, it displays the results in our Console tab. However, these results have not been saved, only displayed. If we want to save our results so we can perform additional transformations on them later, we need to assign them to a new data frame. Here, we’ll name the data frame “chilean”.

chilean <- filter(reviews, country == "Chile")

Let’s say I also don’t have a lot of money to spend on wine, so I need to limit my data set to wines that cost $20 or less. We should edit our filter to include a condition on price.

chilean <- filter(reviews, country == "Chile", price <= 20)

Note that this time we are assigning the results to the same object we used before. This replaces the previous contents of that object with the new data. Now, if we click on the “chilean” data frame in our Environment tab, we can see our data set has been limited to wines from Chile that cost $20 or less.
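
Conditions separated by commas must all be true for a row to be kept. As a small sketch on made-up data (the toy tibble below is an assumption, not the workshop file), %in% matches any of several values and | means “or”:

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  country = c("Chile", "Chile", "Argentina", "Italy"),
  price   = c(15, 32, 18, NA)
)

# Comma = AND: from Chile or Argentina, and $20 or less
south_cheap <- filter(toy, country %in% c("Chile", "Argentina"), price <= 20)

# | = OR: from Chile, or $20 or less
# (rows where the condition evaluates to NA are dropped)
chile_or_cheap <- filter(toy, country == "Chile" | price <= 20)

nrow(south_cheap)     # 2
nrow(chile_or_cheap)  # 3
```

The Italian wine has no price, so neither filter keeps it.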

Relational and Assignment Operators

Our filter uses some symbols you may not be familiar with, such as == and <=. These are called relational operators. We use a relational operator to check the relationship between the two operands on either side of it.

Operator   Relationship Check
>          Is the left operand greater than the right operand?
<          Is the left operand less than the right operand?
>=         Is the left operand greater than or equal to the right operand?
<=         Is the left operand less than or equal to the right operand?
==         Is the left operand equal to the right operand?
!=         Is the left operand not equal to the right operand?

Above, you’ll notice that we use == to check if something is equal. This is different from defining something as equal. To set or assign a value, we use either = or <- instead of ==. For example:

Use                                    Meaning
filter(reviews, country == "Chile")    Check each review to see if the country is equal to “Chile”
age = 52                               Assign the value 52 to a variable called “age”
data <- survey_results                 Assign the contents of the object called “survey_results” to an object called “data”
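
Relational operators work on whole columns at once, returning one TRUE or FALSE per value. A minimal sketch in base R, using a made-up vector of prices:

```r
prices <- c(15, 25, 20)

prices <= 20   # TRUE FALSE TRUE
prices == 20   # FALSE FALSE TRUE
prices != 20   # TRUE TRUE FALSE

# Assignment stores the result of a comparison like any other value
is_cheap <- prices <= 20
sum(is_cheap)  # 2, because each TRUE counts as 1
```

This is what filter() relies on: it keeps the rows where the comparison is TRUE.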

Reordering Data (arrange)

Let’s arrange the rows of our data frame in a way that’s more informative. This time, we’ll just take a look at the results in our console rather than assigning them to an object.

First, I want to know which wines in my Chilean data set get the highest number of review points. The arrange() function allows us to reorder our data by a particular variable. Below, we’ll reorder by points.

arrange(chilean, points)
# A tibble: 154 × 10
   country description desig…¹ points price provi…² regio…³ title variety winery
   <chr>   <chr>       <chr>    <dbl> <dbl> <chr>   <chr>   <chr> <chr>   <chr> 
 1 Chile   Aromas of … Gran R…     80    19 Leyda … <NA>    Viña… Chardo… Viña …
 2 Chile   Fluffy, sw… Reserve     80    15 Maule … <NA>    Cuev… Chardo… Cueva…
 3 Chile   Jumpy on t… Grand …     81    20 Rapel … <NA>    Pura… Pinot … Pura 8
 4 Chile   Crisp but … <NA>        81    14 Maipo … <NA>    Cous… Sauvig… Cousi…
 5 Chile   Flat aroma… Reserv…     81    12 Maule … <NA>    Ovej… Rosé    Oveja…
 6 Chile   Funky and … <NA>        81    10 Centra… <NA>    Arau… Carmen… Arauco
 7 Chile   Rubbery, m… <NA>        81    10 Cachap… <NA>    Corn… Cabern… Corne…
 8 Chile   This is ed… Albamar     81    13 Casabl… <NA>    Will… Pinot … Willi…
 9 Chile   Smells veg… Selecc…     82    15 Maipo … <NA>    Viña… Sauvig… Viña …
10 Chile   Smells per… Hacien…     82    13 Centra… <NA>    Fran… Pinot … Franç…
# … with 144 more rows, and abbreviated variable names ¹​designation, ²​province,
#   ³​region_1
# ℹ Use `print(n = ...)` to see more rows

Well, that sort of worked. My wines are reordered by points, but in ascending order. To reorder them in descending order, we need to use another function inside the arrange function. As you learned last week, this is called a nested function.

arrange(chilean, desc(points))
# A tibble: 154 × 10
   country description desig…¹ points price provi…² regio…³ title variety winery
   <chr>   <chr>       <chr>    <dbl> <dbl> <chr>   <chr>   <chr> <chr>   <chr> 
 1 Chile   Dry, spicy… The 7t…     91    20 Loncom… <NA>    G7 2… Cabern… G7    
 2 Chile   Composed b… Gran R…     90    15 Colcha… <NA>    Carm… Carmen… Carmen
 3 Chile   Over the p… 1865 S…     90    19 Leyda … <NA>    San … Sauvig… San P…
 4 Chile   Toasty and… Gran R…     89    18 Maipo … <NA>    Tara… Carmen… Tarap…
 5 Chile   This dry-f… Maripo…     89    20 Loncom… <NA>    Gill… Syrah-… Gillm…
 6 Chile   Composed a… Envero…     88    15 Colcha… <NA>    Apal… Carmen… Apalt…
 7 Chile   Bisquertt … Casa L…     88    11 Colcha… <NA>    Viña… Merlot  Viña …
 8 Chile   Aromas of … Reserv…     88    10 Maipo … <NA>    De M… Carmen… De Ma…
 9 Chile   At first t… Estate…     88    12 Rapel … <NA>    Lapo… Merlot  Lapos…
10 Chile   With its h… Estate…     88    15 Maipo … <NA>    Carm… Cabern… Carmen
# … with 144 more rows, and abbreviated variable names ¹​designation, ²​province,
#   ³​region_1
# ℹ Use `print(n = ...)` to see more rows

Now we can see which wines have the highest number of points. But what if we want to reorder by multiple variables? For example, many wines with the same number of points have different prices. What if we wanted to reorder those wines by price as well, so we can see which wines have the highest reviews with the lowest prices? The arrange() function allows us to add multiple variables simply by using commas.

arrange(chilean, desc(points), price)
# A tibble: 154 × 10
   country description desig…¹ points price provi…² regio…³ title variety winery
   <chr>   <chr>       <chr>    <dbl> <dbl> <chr>   <chr>   <chr> <chr>   <chr> 
 1 Chile   Dry, spicy… The 7t…     91    20 Loncom… <NA>    G7 2… Cabern… G7    
 2 Chile   Composed b… Gran R…     90    15 Colcha… <NA>    Carm… Carmen… Carmen
 3 Chile   Over the p… 1865 S…     90    19 Leyda … <NA>    San … Sauvig… San P…
 4 Chile   Toasty and… Gran R…     89    18 Maipo … <NA>    Tara… Carmen… Tarap…
 5 Chile   This dry-f… Maripo…     89    20 Loncom… <NA>    Gill… Syrah-… Gillm…
 6 Chile   Aromas of … Reserv…     88    10 Maipo … <NA>    De M… Carmen… De Ma…
 7 Chile   Dark berry… <NA>        88    10 Centra… <NA>    Másc… Cabern… Másca…
 8 Chile   Bisquertt … Casa L…     88    11 Colcha… <NA>    Viña… Merlot  Viña …
 9 Chile   At first t… Estate…     88    12 Rapel … <NA>    Lapo… Merlot  Lapos…
10 Chile   Right off … Reserve     88    12 Maipo … <NA>    Casa… Cabern… Casa …
# … with 144 more rows, and abbreviated variable names ¹​designation, ²​province,
#   ³​region_1
# ℹ Use `print(n = ...)` to see more rows
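
The tie-breaking behaviour is easier to see on a handful of rows. The sketch below uses made-up wines (titles A to D are placeholders): rows are sorted by points from high to low, and wines with equal points are then sorted by price from low to high.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  title  = c("A", "B", "C", "D"),
  points = c(88, 91, 88, 90),
  price  = c(12, 20, 10, 15)
)

# Highest points first; ties on points are broken by lowest price
ranked <- arrange(toy, desc(points), price)
ranked$title  # "B" "D" "C" "A"
```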

Selecting Data (select)

Most real world data sets we want to use in our research are going to be messy, and we’ll need to go through a process of cleaning the data before we can start to explore and analyze it. The first steps in many data cleaning processes involve removing data that we don’t need.

Going back to our original reviews data frame, let’s imagine we are only interested in the country, province and variety of our wines. We could create a new object called “wine_types” using the select() function. We’ll also preview our new object by simply typing its name and running the script.

wine_types <- select(reviews, country, province, variety)
wine_types
# A tibble: 5,000 × 3
   country  province          variety           
   <chr>    <chr>             <chr>             
 1 Italy    Sicily & Sardinia White Blend       
 2 Portugal Douro             Portuguese Red    
 3 US       Oregon            Pinot Gris        
 4 US       Michigan          Riesling          
 5 US       Oregon            Pinot Noir        
 6 Spain    Northern Spain    Tempranillo-Merlot
 7 Italy    Sicily & Sardinia Frappato          
 8 France   Alsace            Gewürztraminer    
 9 Germany  Rheinhessen       Gewürztraminer    
10 France   Alsace            Pinot Gris        
# … with 4,990 more rows
# ℹ Use `print(n = ...)` to see more rows

But what if we have a lot of columns in our dataset and only need to remove one or two? Look at our chilean data frame again.

chilean
# A tibble: 154 × 10
   country description desig…¹ points price provi…² regio…³ title variety winery
   <chr>   <chr>       <chr>    <dbl> <dbl> <chr>   <chr>   <chr> <chr>   <chr> 
 1 Chile   White flow… Estate      86    15 Colcha… <NA>    Esta… Viogni… Estam…
 2 Chile   A berry ar… <NA>        86     9 Maule … <NA>    Sund… Merlot  Sunda…
 3 Chile   Lightly he… Reserve     85    13 Maipo … <NA>    Tres… Pinot … Tres …
 4 Chile   Caramelize… Specia…     86    12 Rapel … <NA>    Ares… Carmen… Aresti
 5 Chile   A bright n… Single…     87    18 Leyda … <NA>    Leyd… Chardo… Leyda 
 6 Chile   Dry, spicy… The 7t…     91    20 Loncom… <NA>    G7 2… Cabern… G7    
 7 Chile   Composed a… Envero…     88    15 Colcha… <NA>    Apal… Carmen… Apalt…
 8 Chile   Bisquertt … Casa L…     88    11 Colcha… <NA>    Viña… Merlot  Viña …
 9 Chile   Clean and … Natura      87    11 Casabl… <NA>    Emil… Chardo… Emili…
10 Chile   Rooty and … Grey […     87    20 Maipo … <NA>    Vent… Cabern… Venti…
# … with 144 more rows, and abbreviated variable names ¹​designation, ²​province,
#   ³​region_1
# ℹ Use `print(n = ...)` to see more rows

You’ll notice it has two columns which are no longer useful for us. The “country” column is the same for every row, and the “region_1” column is missing data in every row. Let’s remove those columns using select(). This time, we’ll use another nested function.

chilean <- select(chilean, -c(country, region_1))

chilean
# A tibble: 154 × 8
   description                 desig…¹ points price provi…² title variety winery
   <chr>                       <chr>    <dbl> <dbl> <chr>   <chr> <chr>   <chr> 
 1 White flower, lychee and a… Estate      86    15 Colcha… Esta… Viogni… Estam…
 2 A berry aroma comes with c… <NA>        86     9 Maule … Sund… Merlot  Sunda…
 3 Lightly herbal strawberry … Reserve     85    13 Maipo … Tres… Pinot … Tres …
 4 Caramelized oak and vanill… Specia…     86    12 Rapel … Ares… Carmen… Aresti
 5 A bright nose with green a… Single…     87    18 Leyda … Leyd… Chardo… Leyda 
 6 Dry, spicy aromas of tobac… The 7t…     91    20 Loncom… G7 2… Cabern… G7    
 7 Composed and structured, w… Envero…     88    15 Colcha… Apal… Carmen… Apalt…
 8 Bisquertt usually does wel… Casa L…     88    11 Colcha… Viña… Merlot  Viña …
 9 Clean and honest up front,… Natura      87    11 Casabl… Emil… Chardo… Emili…
10 Rooty and leafy on the nos… Grey […     87    20 Maipo … Vent… Cabern… Venti…
# … with 144 more rows, and abbreviated variable names ¹​designation, ²​province
# ℹ Use `print(n = ...)` to see more rows

We use the c() function to combine elements and the - sign to indicate that we want those elements to be removed.

Sometimes, we’ll need to keep or remove a series of columns that are directly adjacent to each other in the data frame. In those circumstances, we can use the following syntax:

# Keep columns "title" through "winery"
select(reviews, title:winery)
# A tibble: 5,000 × 3
   title                                                          variety winery
   <chr>                                                          <chr>   <chr> 
 1 Nicosia 2013 Vulkà Bianco  (Etna)                              White … Nicos…
 2 Quinta dos Avidagos 2011 Avidagos Red (Douro)                  Portug… Quint…
 3 Rainstorm 2013 Pinot Gris (Willamette Valley)                  Pinot … Rains…
 4 St. Julian 2013 Reserve Late Harvest Riesling (Lake Michigan … Riesli… St. J…
 5 Sweet Cheeks 2012 Vintner's Reserve Wild Child Block Pinot No… Pinot … Sweet…
 6 Tandem 2011 Ars In Vitro Tempranillo-Merlot (Navarra)          Tempra… Tandem
 7 Terre di Giurfo 2013 Belsito Frappato (Vittoria)               Frappa… Terre…
 8 Trimbach 2012 Gewurztraminer (Alsace)                          Gewürz… Trimb…
 9 Heinz Eifel 2013 Shine Gewürztraminer (Rheinhessen)            Gewürz… Heinz…
10 Jean-Baptiste Adam 2012 Les Natures Pinot Gris (Alsace)        Pinot … Jean-…
# … with 4,990 more rows
# ℹ Use `print(n = ...)` to see more rows

# Remove columns "title" through "winery"
select(reviews, -c(title:winery))
# A tibble: 5,000 × 7
   country  description                     desig…¹ points price provi…² regio…³
   <chr>    <chr>                           <chr>    <dbl> <dbl> <chr>   <chr>  
 1 Italy    Aromas include tropical fruit,… Vulkà …     87    NA Sicily… Etna   
 2 Portugal This is ripe and fruity, a win… Avidag…     87    15 Douro   <NA>   
 3 US       Tart and snappy, the flavors o… <NA>        87    14 Oregon  Willam…
 4 US       Pineapple rind, lemon pith and… Reserv…     87    13 Michig… Lake M…
 5 US       Much like the regular bottling… Vintne…     87    65 Oregon  Willam…
 6 Spain    Blackberry and raspberry aroma… Ars In…     87    15 Northe… Navarra
 7 Italy    Here's a bright, informal red … Belsito     87    16 Sicily… Vittor…
 8 France   This dry and restrained wine o… <NA>        87    24 Alsace  Alsace 
 9 Germany  Savory dried thyme notes accen… Shine       87    12 Rheinh… <NA>   
10 France   This has great depth of flavor… Les Na…     87    27 Alsace  Alsace 
# … with 4,990 more rows, and abbreviated variable names ¹​designation,
#   ²​province, ³​region_1
# ℹ Use `print(n = ...)` to see more rows

In both cases the : symbol means “through” and allows us to select a series of columns at once.
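
Keeping a range and dropping its complement are two routes to the same result. A small sketch on a one-row, made-up tibble (the values are placeholders):

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  country = "Chile", points = 86,
  title = "Wine A", variety = "Merlot", winery = "Winery A"
)

kept    <- select(toy, title:winery)         # keep a run of adjacent columns
dropped <- select(toy, -c(country, points))  # drop the other columns instead

names(kept)                            # "title" "variety" "winery"
identical(names(kept), names(dropped)) # TRUE
```

Which form to use is mostly a matter of which list is shorter to type.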

Renaming Columns

Renaming columns is easy with the rename() function. We write the new name first, then =, then the old name.

chilean <- rename(chilean, review_points = points)
chilean
# A tibble: 154 × 8
   description                desig…¹ revie…² price provi…³ title variety winery
   <chr>                      <chr>     <dbl> <dbl> <chr>   <chr> <chr>   <chr> 
 1 White flower, lychee and … Estate       86    15 Colcha… Esta… Viogni… Estam…
 2 A berry aroma comes with … <NA>         86     9 Maule … Sund… Merlot  Sunda…
 3 Lightly herbal strawberry… Reserve      85    13 Maipo … Tres… Pinot … Tres …
 4 Caramelized oak and vanil… Specia…      86    12 Rapel … Ares… Carmen… Aresti
 5 A bright nose with green … Single…      87    18 Leyda … Leyd… Chardo… Leyda 
 6 Dry, spicy aromas of toba… The 7t…      91    20 Loncom… G7 2… Cabern… G7    
 7 Composed and structured, … Envero…      88    15 Colcha… Apal… Carmen… Apalt…
 8 Bisquertt usually does we… Casa L…      88    11 Colcha… Viña… Merlot  Viña …
 9 Clean and honest up front… Natura       87    11 Casabl… Emil… Chardo… Emili…
10 Rooty and leafy on the no… Grey […      87    20 Maipo … Vent… Cabern… Venti…
# … with 144 more rows, and abbreviated variable names ¹​designation,
#   ²​review_points, ³​province
# ℹ Use `print(n = ...)` to see more rows

Above, we’ve changed the name of our “points” column to “review_points”.
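
rename() also accepts several new_name = old_name pairs in one call. A minimal sketch on made-up data:

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(points = c(86, 91), price = c(15, 20))

# One new_name = old_name pair per column to rename
toy <- rename(toy, review_points = points, USD = price)
names(toy)  # "review_points" "USD"
```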

Adding New Variables

My chilean data set has prices listed in USD, but suppose I also needed to know how many Euros each wine costs. Currently, the exchange rate is 0.85 Euros for every 1 USD. So, I just need to multiply the price column by 0.85, but how can I store the results of my currency conversion in a new column? The mutate() function makes that possible.

chilean <- mutate(chilean, EUR = price * 0.85)

Now I’m going to make several more changes to my data set.

# Rename the "price" column to "USD"
chilean <- rename(chilean, USD = price)

# Reorder the columns so that "USD" and "EUR" are adjacent
chilean <- select(chilean, description:review_points, USD, EUR, province:winery)

# Preview the results
chilean
# A tibble: 154 × 9
   description          desig…¹ revie…²   USD   EUR provi…³ title variety winery
   <chr>                <chr>     <dbl> <dbl> <dbl> <chr>   <chr> <chr>   <chr> 
 1 White flower, lyche… Estate       86    15 12.8  Colcha… Esta… Viogni… Estam…
 2 A berry aroma comes… <NA>         86     9  7.65 Maule … Sund… Merlot  Sunda…
 3 Lightly herbal stra… Reserve      85    13 11.0  Maipo … Tres… Pinot … Tres …
 4 Caramelized oak and… Specia…      86    12 10.2  Rapel … Ares… Carmen… Aresti
 5 A bright nose with … Single…      87    18 15.3  Leyda … Leyd… Chardo… Leyda 
 6 Dry, spicy aromas o… The 7t…      91    20 17    Loncom… G7 2… Cabern… G7    
 7 Composed and struct… Envero…      88    15 12.8  Colcha… Apal… Carmen… Apalt…
 8 Bisquertt usually d… Casa L…      88    11  9.35 Colcha… Viña… Merlot  Viña …
 9 Clean and honest up… Natura       87    11  9.35 Casabl… Emil… Chardo… Emili…
10 Rooty and leafy on … Grey […      87    20 17    Maipo … Vent… Cabern… Venti…
# … with 144 more rows, and abbreviated variable names ¹​designation,
#   ²​review_points, ³​province
# ℹ Use `print(n = ...)` to see more rows

Did I mention you can use select() to reorder columns as well? You can!
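
As an alternative to listing every column in select(), dplyr (version 1.0.0 or later) also provides relocate(), which moves named columns and leaves the rest in place. This is a sketch on made-up data, not the approach used above:

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  title  = c("Wine A", "Wine B"),
  USD    = c(15, 20),
  winery = c("Winery A", "Winery B")
)

toy <- mutate(toy, EUR = USD * 0.85)     # new column is added at the end
toy <- relocate(toy, EUR, .after = USD)  # move it next to USD
names(toy)  # "title" "USD" "EUR" "winery"
```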

Summarising Data

Now that we’ve done some cleaning of our data set, we can start to explore it by looking at basic summary statistics with the summarise() function. One thing to remember when using summarise() is that missing data can cause a problem: functions such as mean() return NA if any value is missing. In many cases we therefore need to tell R to drop missing values before calculating, by setting the na.rm argument to TRUE.

Let’s take a look at the mean review points for all of our Chilean wines. We’ll assign our result to a column named “mean_review_points”.

summarise(chilean, mean_review_points = mean(review_points, na.rm = TRUE))
# A tibble: 1 × 1
  mean_review_points
               <dbl>
1               85.9
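
The effect of na.rm can be checked on a short, made-up vector in base R:

```r
points <- c(86, NA, 90)

mean(points)                # NA: one missing value makes the result missing
mean(points, na.rm = TRUE)  # 88: the NA is dropped before averaging
```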

A more informative approach might be to look at the mean number of points by province. To do that, we first need to create a new object using the group_by() function.

by_prov <- group_by(chilean, province)
summarise(by_prov, mean_review_points = mean(review_points, na.rm = TRUE))
# A tibble: 20 × 2
   province          mean_review_points
   <chr>                          <dbl>
 1 Aconcagua Costa                 88  
 2 Aconcagua Valley                86.5
 3 Cachapoal Valley                84.6
 4 Casablanca Valley               85.5
 5 Central Valley                  84.7
 6 Chile                           85  
 7 Colchagua Costa                 84  
 8 Colchagua Valley                86.6
 9 Curicó Valley                   85.8
10 Elqui Valley                    88  
11 Leyda Valley                    85.7
12 Limarí Valley                   86  
13 Loncomilla Valley               87.5
14 Lontué Valley                   87  
15 Maipo Valley                    86.3
16 Marchigue                       87  
17 Maule Valley                    85.1
18 Peumo                           87  
19 Rapel Valley                    85.7
20 Rio Claro                       86  
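
A group mean is easier to interpret when we also know how many rows went into it. As a hedged sketch on made-up data, n() can be added to the same summarise() call to count the wines in each group (the column name n_wines is an illustrative choice):

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  province      = c("Maipo Valley", "Maipo Valley", "Leyda Valley"),
  review_points = c(88, 84, 90)
)

by_prov <- group_by(toy, province)
prov_summary <- summarise(by_prov,
                          mean_review_points = mean(review_points, na.rm = TRUE),
                          n_wines = n())
prov_summary
```

Here Leyda Valley has the higher mean, but it is based on a single wine.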

Now we want to arrange the results from highest to lowest review points, but that would require us to create another object to use arrange() on. Instead of creating many different objects that we don’t necessarily need to keep around, let’s use pipes.

Piping

Pipes use the %>% symbol to connect multiple functions without creating multiple objects. To use group_by(), summarise() and arrange() on our data all at once, we use the pipe like so:

chilean %>%
  group_by(province) %>%
  summarise(mean_review_points = mean(review_points, na.rm = TRUE)) %>%
  arrange(desc(mean_review_points))
# A tibble: 20 × 2
   province          mean_review_points
   <chr>                          <dbl>
 1 Aconcagua Costa                 88  
 2 Elqui Valley                    88  
 3 Loncomilla Valley               87.5
 4 Lontué Valley                   87  
 5 Marchigue                       87  
 6 Peumo                           87  
 7 Colchagua Valley                86.6
 8 Aconcagua Valley                86.5
 9 Maipo Valley                    86.3
10 Limarí Valley                   86  
11 Rio Claro                       86  
12 Curicó Valley                   85.8
13 Rapel Valley                    85.7
14 Leyda Valley                    85.7
15 Casablanca Valley               85.5
16 Maule Valley                    85.1
17 Chile                           85  
18 Central Valley                  84.7
19 Cachapoal Valley                84.6
20 Colchagua Costa                 84  

Shortcut for %>%:
CMD + SHIFT + m (Mac)
CTRL + SHIFT + m (PC)

Not only do the pipes prevent us from having to create lots of objects, they also allow us to arrange the code in a way that is easier to read. Notice that we also only needed to name our object once, at the beginning of the sequence, for it to be used in every function onward.
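
To see the readability difference, compare the same three steps written as nested functions and as a pipeline. This sketch uses made-up data; both versions produce the same rows in the same order.

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  province      = c("Maipo Valley", "Maipo Valley", "Leyda Valley"),
  review_points = c(88, 84, 90)
)

# Nested: read from the inside out
nested <- arrange(
  summarise(group_by(toy, province), mean_points = mean(review_points)),
  desc(mean_points)
)

# Piped: read from top to bottom
piped <- toy %>%
  group_by(province) %>%
  summarise(mean_points = mean(review_points)) %>%
  arrange(desc(mean_points))

identical(nested$province, piped$province)  # TRUE
```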

Let’s try another sequence of functions using the pipe to answer the following question: What is the highest price for each variety of wine in each province?

chilean %>%
  group_by(province, variety) %>%
  summarise(max_price = max(USD, na.rm = TRUE)) %>%
  arrange(province, variety, desc(max_price))
`summarise()` has grouped output by 'province'. You can override using the
`.groups` argument.
# A tibble: 78 × 3
# Groups:   province [20]
   province          variety                   max_price
   <chr>             <chr>                         <dbl>
 1 Aconcagua Costa   Chardonnay                       20
 2 Aconcagua Valley  Cabernet Sauvignon               19
 3 Aconcagua Valley  Sauvignon Blanc                  19
 4 Cachapoal Valley  Cabernet Sauvignon               18
 5 Cachapoal Valley  Cabernet Sauvignon-Merlot        10
 6 Cachapoal Valley  Carmenère                        12
 7 Cachapoal Valley  Syrah                            15
 8 Casablanca Valley Chardonnay                       15
 9 Casablanca Valley Gewürztraminer                   15
10 Casablanca Valley Merlot                           17
# … with 68 more rows
# ℹ Use `print(n = ...)` to see more rows
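
The message about grouped output appears because summarise() removes only the last grouping variable, so the result is still grouped by province. One way to avoid both the message and the leftover grouping, sketched here on made-up data, is the .groups argument mentioned in the message:

```r
library(dplyr)

# Made-up data for illustration only
toy <- tibble(
  province = c("Maipo Valley", "Maipo Valley", "Leyda Valley"),
  variety  = c("Merlot", "Merlot", "Chardonnay"),
  USD      = c(12, 18, 15)
)

max_prices <- toy %>%
  group_by(province, variety) %>%
  summarise(max_price = max(USD, na.rm = TRUE), .groups = "drop")

is_grouped_df(max_prices)  # FALSE
```

With .groups = "drop" the result is an ordinary, ungrouped tibble, so later steps such as arrange() treat all rows together.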

Exercises

Concerning our original “reviews” data set, use the functions we learned this week to answer the following questions:

  1. What are the price and review score of the most expensive French wine?
  2. How many countries have wines that cost more than $500?
  3. To which country and province should you travel to find the cheapest Tempranillo?
  4. You found a website that lets you order any wine for a flat shipping fee of $15! Find the total price for each wine that includes shipping.
  5. The following code produces errors. Correct them all:

reviews %>%
select == (Title, province, country, price) %>%
filter (country = Germany) %>
group_by (province) %>%
Summarise (mean_price == mean(price))