# Cross-Validation

## Dec 6, 2017 00:00 · 1034 words · 5 minutes read Machine Learning

I’ll be co-presenting a series on machine learning at the Hawaii Machine Learning Meetup at the beginning of 2018, which will include topics such as lessons learned in Kaggle competitions, where I will expand on the code I shared here. If this topic piques your interest, then please feel free to sign up.

# Introduction

Cross-validation (CV) is used to evaluate a model’s predictive performance and is preferred over in-sample measures of fit like $R^2$ and calibration tests such as goodness-of-fit. In the case against $R^2$, it is widely understood that one can optimistically inflate $R^2$ by increasing the model’s degrees of freedom, which is easily achieved by adding more covariates. The result is an overfit model that looks good on paper but performs poorly in practice. In the case against goodness-of-fit tests such as the Hosmer-Lemeshow test, I have never felt comfortable with the arbitrary binning, and I have found the significance tests to yield inconsistent results when toggling the number of bins.

In regards to Kaggle competitions, relying on a local CV is almost always the better choice. The public leaderboard scores are based on a percentage of the full blinded test set. Many shared solutions may be tuned using this small subset of the test data, which means that the model may not generalize well over the full data set. I personally experienced the hard lessons of ignoring CV in a competition. I retreated into the mountains for several years in shame where I practiced CV self-discipline only returning to society after redeeming myself. OK. Things weren’t that dramatic, but relying on my local CV prevented another slide.

# k-fold cross-validation

In machine learning pop culture, CV often refers to k-fold designs. The $k$ denotes the number of roughly equal-sized subsamples, called folds. When fitting models, one fold is held out and the remaining $k-1$ folds are used to train the model. The model is then evaluated on the left-out fold. This process is repeated until the model has been evaluated on all $k$ folds. Although CV increases processing time, it brings several benefits: the ability to compare models using a quantifiable metric; a way to detect when a model may be overfit; and a way to quantify the prediction quality of a blend of several predictions.
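The fold loop just described can be sketched in a few lines of Python. The data, the seed, and the least-squares "model" below are hypothetical stand-ins chosen only to make the loop runnable, not the competition setup:

```python
import numpy as np

# Hypothetical toy data; stand-ins for a real training set.
rng = np.random.default_rng(808)
X = rng.normal(size=100)
y = 2 * X + rng.normal(scale=0.5, size=100)

k = 5
# Random fold assignment, analogous to R's sample(1:k, n, replace=TRUE)
folds = rng.integers(1, k + 1, size=len(y))

mse = []
for i in range(1, k + 1):
    tr, te = folds != i, folds == i
    # "Fit" on the k-1 training folds: least-squares slope through the origin
    slope = (X[tr] @ y[tr]) / (X[tr] @ X[tr])
    # Evaluate on the held-out fold
    mse.append(float(np.mean((y[te] - slope * X[te]) ** 2)))

cv_score = float(np.mean(mse))  # one quantifiable metric across all k folds
```

Averaging the per-fold errors gives a single score that can be compared across models without ever touching a test set.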

The example below was conducted using R, but it can be easily replicated in Python. There are also many packages for both R and Python that split the data into folds and/or include cross-validation preparation into the pipeline. The purpose for doing this method manually, however, is to gain a conceptual understanding of how folds are prepared and used in model fitting.

```r
library(data.table)
library(xgboost)
library(knitr)
library(MLmetrics)
```

## Data

The data can be downloaded from Kaggle’s Porto Seguro’s Safe Driver Prediction competition.

```r
path_dat = 'D:\\Dropbox\\kaggle\\Portobello\\data\\input' # replace with your dir
train <- fread(sprintf("%s/train.csv", path_dat))
## Read 595212 rows and 59 (of 59) columns from 0.108 GB file in 00:00:05
test <- fread(sprintf("%s/test.csv", path_dat))
## Read 892816 rows and 58 (of 58) columns from 0.160 GB file in 00:00:06
```

## Assigning folds

Here the sample() function is used to assign the fold index. A seed is set to ensure reproducibility. An examination of the fold column shows $k=5$ folds.

```r
cv_folds = 5

set.seed(808)
train[, fold := sample(1:cv_folds, nrow(train), replace=TRUE)]

kable(head(train[, c(1, ncol(train), 2), with=FALSE], 10), format="markdown")
```
| id | fold | target |
|---:|-----:|-------:|
|  7 |    5 |      0 |
|  9 |    1 |      0 |
| 13 |    2 |      0 |
| 16 |    4 |      0 |
| 17 |    5 |      0 |
| 19 |    3 |      0 |
| 20 |    1 |      0 |
| 22 |    5 |      0 |
| 26 |    2 |      0 |
| 28 |    2 |      1 |

## Setting the loss function

```r
# custom loss function
normalizedGini <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- Gini(as.numeric(preds), as.numeric(labels))
  return(list(name = "Gini", value = err, higher_better = TRUE))
}
```
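For a conceptual picture of what this metric computes, here is a Python sketch of the normalized Gini (the function names are my own, and this is an illustration rather than MLmetrics’ implementation): Gini is the area between the cumulative-gains curve of the predictions and the diagonal, and normalizing divides by the score of a perfect ordering.

```python
import numpy as np

def gini(actual, pred):
    a = np.asarray(actual, dtype=float)
    p = np.asarray(pred, dtype=float)
    n = len(a)
    # Sort targets by prediction, descending; stable on ties via the index key
    order = np.lexsort((np.arange(n), -p))
    a = a[order]
    # Area between the cumulative-gains curve and the diagonal
    return np.cumsum(a).sum() / a.sum() / n - (n + 1) / (2 * n)

def normalized_gini(actual, pred):
    # Scale by the Gini of a perfect ordering, so 1.0 means a perfect ranking
    return gini(actual, pred) / gini(actual, actual)
```

A perfect ranking scores 1.0 and a perfectly reversed ranking scores -1.0, which is why `higher_better=TRUE` is set in the R version above.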

## Iterating over folds

A k-fold cross-validation design is used here with eXtreme Gradient Boosting. The data set is split into dtrain and dtest, where training is done on the dtrain subset; dtest is the held-out fold, which is used to evaluate training performance. After finding the best number of rounds, the model is saved so that it can be called later. For this example, however, the number of rounds is capped at 100 so the example runs quickly. This procedure is repeated once for each fold.

```r
xtr = list()

f <- setdiff(names(train), c('id', 'target', 'fold'))

for (i in 1:cv_folds) {
  # train set is split by fold assignment
  dtrain <- xgb.DMatrix(data = as.matrix(train[train$fold != i, ..f]),
                        label = train[train$fold != i, ]$target)
  dtest  <- xgb.DMatrix(data = as.matrix(train[train$fold == i, ..f]),
                        label = train[train$fold == i, ]$target)

  watchlist <- list(train = dtrain, test = dtest)

  xtr[[i]] <- xgb.train(data                 = dtrain,
                        watchlist            = watchlist,
                        objective            = "reg:logistic",
                        eta                  = 0.1,
                        nrounds              = 100,
                        feval                = normalizedGini,
                        maximize             = TRUE,
                        early_stopping_round = 10,
                        verbose              = FALSE)
}
```

## Evaluate

The xtr object contains five models, each used to predict on its respective held-out fold. In addition to evaluating the models, the out-of-fold predictions on the training set are collected in xtr.score, which can be stored for later use in stacking or blending with other predictions. Although blending is ultimately applied to test predictions, the same blend applied to the training predictions can be scored against the labels to evaluate the blend’s quality.

```r
xtr.score <- train$target

for (i in 1:cv_folds) {
  xtr.score[train$fold == i] <- predict(xtr[[i]], as.matrix(train[train$fold == i, ..f]))
}

print(MLmetrics::NormalizedGini(xtr.score, train$target))
## [1] 0.2769083
```
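The blending idea can be made concrete with a small Python sketch. The labels, the out-of-fold prediction arrays, the 50/50 weights, and the `log_loss` helper below are all made up for illustration (log loss stands in for the Gini metric used above); the point is that each candidate, including the blend, can be scored against the same training labels:

```python
import numpy as np

# Made-up labels and out-of-fold predictions from two hypothetical models
y = np.array([1, 0, 1, 0, 1, 0])
oof_a = np.array([0.8, 0.3, 0.6, 0.4, 0.7, 0.2])
oof_b = np.array([0.7, 0.2, 0.9, 0.1, 0.6, 0.3])

def log_loss(y_true, p):
    # Binary cross-entropy, clipped for numerical safety
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# A 50/50 blend of the two sets of out-of-fold predictions
blend = 0.5 * oof_a + 0.5 * oof_b

# Score every candidate against the same training labels
scores = {name: log_loss(y, p) for name, p in
          [("a", oof_a), ("b", oof_b), ("blend", blend)]}
```

Because the blended predictions are out-of-fold, the score of the blend is an honest estimate of how the same weights would behave on the test predictions.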

# Conclusion

In general, preparing a CV design is important for evaluating model performance and quality. It provides a quantifiable benchmark for comparing models and for measuring improvement or degradation.