`caret`

You may need to install `caret` with `install.packages("caret")` before running the next block.

`library(caret)`

Predictive modeling, as the name implies, is mainly concerned with making good predictions without worrying about making inferences about how a population works (as in causal analysis). Remember that “correlation does not imply causation”, but correlation can help us make useful predictions.

`caret` is short for **C**lassification **A**nd **RE**gression **T**raining. The `caret` package provides a uniform interface for fitting 237 different models.

We’ll use the built-in dataset `Sacramento`, containing data on 932 home sales in Sacramento, CA over a five-day period.

```
data(Sacramento)
str(Sacramento)
```

```
## 'data.frame': 932 obs. of 9 variables:
## $ city : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
## $ zip : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
## $ beds : int 2 3 2 2 2 3 3 3 2 3 ...
## $ baths : num 1 1 1 1 1 1 2 1 2 2 ...
## $ sqft : int 836 1167 796 852 797 1122 1104 1177 941 1146 ...
## $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
## $ price : int 59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
## $ latitude : num 38.6 38.5 38.6 38.6 38.5 ...
## $ longitude: num -121 -121 -121 -121 -121 ...
```

First we’ll split our dataset into two parts:

- A `training` dataset we’ll use to **fit** our models.
- A `testing` dataset we’ll set aside for comparison after fitting the models. This helps avoid *overfitting* the `training` set by examining how each model fits unseen data.

```
set.seed(12345)
train.select <- createDataPartition(Sacramento$type, p = .8, list = FALSE)
training <- Sacramento[ train.select,]
testing <- Sacramento[-train.select,]
```

Many of the more complicated models we can fit with `caret` need to determine optimal settings for various “tuning” parameters. We can use repeated k-fold cross-validation (among other methods) to determine the best values for the tuning parameters within default ranges. In practice you may want to supply your own grid of possible tuning parameter values; see the `caret` documentation on model training and tuning for details.

```
fitControl <- trainControl(## 5-fold Cross Validation
                           method = "repeatedcv",
                           number = 5,
                           ## repeated ten times
                           repeats = 10)
```
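If you’d rather pick the candidate values yourself, `train` accepts a `tuneGrid` data frame with one column per tuning parameter. A minimal sketch, assuming `method = "ranger"` (its tuning parameters, per `getModelInfo("ranger")`, are `mtry`, `splitrule`, and `min.node.size`; the candidate values below are arbitrary):

```
## Build a grid of candidate tuning values (base R)
rf.grid <- expand.grid(mtry = c(2, 4, 6),
                       splitrule = "variance",
                       min.node.size = c(5, 10))
nrow(rf.grid) ## one row per candidate combination: 3 * 1 * 2 = 6

## Then pass it to train, e.g.:
## train(price ~ . - zip - city, data = training,
##       method = "ranger", trControl = fitControl, tuneGrid = rf.grid)
```

`train` will then evaluate every row of the grid with the resampling scheme in `fitControl` and keep the best combination.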

The `train` function is used to fit models. The full list of models is in `caret`’s “Available Models” documentation. You can get more information about a model and its tuning parameters with `getModelInfo(<model name>)`.

We’ll fit several example models, each attempting to predict home price from all of the other variables except zip code and city (those two have many unique levels and would complicate the models).

```
# Ordinary Least Squares
set.seed(8947) # ensures paired resampling later
lmfit <- train(price ~ . - zip - city, data = training,
               method = "lm",
               trControl = fitControl)

# Robust Linear Model
set.seed(8947) # ensures paired resampling later
robustfit <- train(price ~ . - zip - city, data = training,
                   method = "rlm",
                   trControl = fitControl,
                   verbose = FALSE)

# Random Forests
set.seed(8947) # ensures paired resampling later
rffit <- train(price ~ . - zip - city, data = training,
               method = "ranger",
               trControl = fitControl,
               verbose = FALSE)

# XGBoost (gradient boosted trees) - this is the slowest model!
set.seed(8947) # ensures paired resampling later
xgbfit <- train(price ~ . - zip - city, data = training,
                method = "xgbTree",
                trControl = fitControl,
                verbose = FALSE)

# Support Vector Machine with Linear Kernel
set.seed(8947) # ensures paired resampling later
svmfit <- train(price ~ . - zip - city, data = training,
                method = "svmLinear",
                trControl = fitControl,
                verbose = FALSE)
```

**Notes**

`caret` is just a wrapper for fitting models; it does not itself include functions to fit many of these models. When fitting a model you may see the following message:

```
1 package is needed for this model and is not installed. (<package-name>). Would you like to try to install it now?
1: yes
2: no
```

Press `1` and hit enter to install the package and fit the model. The package each model needs is listed in the “Available Models” list.
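If you’d rather avoid the interactive prompt, you can check whether a model’s backing package is installed ahead of time; a minimal base-R sketch (using `ranger` purely as an example package name):

```
pkg <- "ranger"  ## example: the package backing method = "ranger"
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Package '", pkg, "' is missing; run install.packages(\"", pkg, "\") first")
}
```

Running this before calling `train` means the fit won’t pause mid-script waiting for input.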

We’ve intentionally made sure a few things are consistent across our models to make comparisons easier:

- Each model has the same response variable, `price`.

To ensure we can compare between models with *resampling*:

- Each model is fit using the same `trControl = fitControl` setting.
- Each model has the seed set in the same way before each `train` call. I’ve used `set.seed(8947)`; the number `8947` is unimportant, it just needs to be consistent.

The `resamples` function collects each model’s performance across the cross-validation resamples (here 5 folds × 10 repeats = 50 resamples per model), so the models are compared on the same held-out folds. (With a different `trainControl` method, `caret` can instead estimate performance by sampling from the training set with replacement, a technique you may know as “bootstrapping”.)

`caret` gives us three different indices to compare these models:

- Mean Absolute Error (MAE)
- Root Mean Squared Error (RMSE)
- R Squared (see the `caret` documentation for a note on how it’s calculated)

These track how well the model fits the data in different ways. Without getting into the details of how they’re calculated, we’ll use the rules of thumb that:

- **Lower** errors (MAE and RMSE) are better
- **Higher** R-squared values are better
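As a concrete sketch of what these metrics measure, here they are computed by hand in base R on made-up observed and predicted values (for R Squared, `caret`’s `postResample` uses the squared correlation between observed and predicted by default):

```
obs  <- c(100, 150, 200, 250)  ## made-up observed prices
pred <- c(110, 140, 210, 230)  ## made-up predictions

mae  <- mean(abs(pred - obs))       ## Mean Absolute Error: 12.5
rmse <- sqrt(mean((pred - obs)^2))  ## Root Mean Squared Error
r2   <- cor(obs, pred)^2            ## R Squared as squared correlation

c(MAE = mae, RMSE = rmse, Rsquared = r2)
```

Note that RMSE penalizes large misses more heavily than MAE because the errors are squared before averaging.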

Note: `caret` provides different metrics (Kappa and accuracy) for classification (i.e. categorical outcome) tasks.

```
results <- resamples(list("OLS"=lmfit,"Random.Forest"=rffit,
"Robust.LM"=robustfit,"SVM"=svmfit,
"xgbTree"=xgbfit))
summary(results)
```

```
##
## Call:
## summary.resamples(object = results)
##
## Models: OLS, Random.Forest, Robust.LM, SVM, xgbTree
## Number of resamples: 50
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## OLS 52269.18 55485.30 58818.95 58441.30 60893.86 65450.93 0
## Random.Forest 49279.82 53766.83 55544.30 55867.46 57880.75 63629.87 0
## Robust.LM 51392.84 55473.35 58250.71 58233.69 60540.08 65537.82 0
## SVM 51448.00 55099.35 58216.62 58035.29 60747.92 65054.30 0
## xgbTree 48263.44 53865.97 55693.54 56146.84 58456.72 63451.97 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## OLS 65659.34 73381.47 79659.30 80078.43 84591.39 97446.75 0
## Random.Forest 64868.09 73644.40 77216.12 77427.25 81268.66 92233.42 0
## Robust.LM 64088.97 72812.25 79191.32 79970.74 84451.94 97072.50 0
## SVM 64760.41 73762.25 79001.49 80290.13 84719.63 97588.28 0
## xgbTree 62628.90 73273.09 77381.98 77926.27 82772.19 92879.58 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## OLS 0.4456435 0.6152435 0.6523681 0.6394884 0.6729048 0.7359156
## Random.Forest 0.5426827 0.6475436 0.6695347 0.6635814 0.6897644 0.7462610
## Robust.LM 0.4513706 0.6185191 0.6554366 0.6416072 0.6758294 0.7365679
## SVM 0.4508157 0.6173829 0.6555446 0.6416658 0.6755272 0.7400575
## xgbTree 0.5183405 0.6376842 0.6667406 0.6600252 0.6950497 0.7487522
## NA's
## OLS 0
## Random.Forest 0
## Robust.LM 0
## SVM 0
## xgbTree 0
```

We can also present these results in graphical form:

`bwplot(results, scales = list(relation = "free"))`

Remember that these results are across a number of resamples, hence the boxplots and not a single value per model!

The random forest and xgbTree models seem to be doing well here, but neither clearly outperforms the other.

Let’s revisit our `testing` data.

We can use our models to generate predictions with `predict`, then compare their performance with `postResample`.

```
lm.test     <- predict(lmfit, testing)
robust.test <- predict(robustfit, testing)
rf.test     <- predict(rffit, testing)
xgb.test    <- predict(xgbfit, testing)
svm.test    <- predict(svmfit, testing)
test.results <- rbind(
  "LM"            = postResample(pred = lm.test,     obs = testing$price),
  "Robust"        = postResample(pred = robust.test, obs = testing$price),
  "Random Forest" = postResample(pred = rf.test,     obs = testing$price),
  "SVM"           = postResample(pred = svm.test,    obs = testing$price),
  "xgbTree"       = postResample(pred = xgb.test,    obs = testing$price)
)
print(test.results)
```

```
## RMSE Rsquared MAE
## LM 75489.91 0.6212181 53912.04
## Robust 74880.00 0.6205356 52911.52
## Random Forest 72432.66 0.6525539 51233.31
## SVM 75551.70 0.6187436 52421.61
## xgbTree 72000.05 0.6566042 49903.68
```

Which model seems to do best?