Cross Validation

"Is cross validation to model application what dating is to marriage?"

                                  Rawan Hassunah


One of two answers to any data science question, I have come to realize, is cross validation. If you're wondering, the other one is: it depends.

 

Probably the biggest mistake a data scientist can make when training a model is not checking whether it has been overfit or underfit to the data. Overfitting means that the model mistakes noise for signal. Think of a scatter plot where the best-fit line passes through each and every data point - an extreme example that is unrealistic and almost certainly wrong. The other extreme is completely underfitting the data, which means the model did not pick up on any of the signal in our data.

 

Three Phases of Machine Learning

Training: train the model on training data.

Validation: determine how well the model has been trained and estimate its output accuracy.

Application: apply the model on data it hasn't seen before.

 

Train & Test Split

One of the first things you learn in data science is that when presented with a dataset, you should split it into a training set and a testing set, and never ever touch the testing data until you have chosen and tuned your model.
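As a minimal sketch, here is what that split looks like with sklearn's train_test_split (the iris data and the 20% test size are my own illustrative choices, not prescribed by the article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a sample dataset: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Hold out 20% of the data as the test set. It stays untouched
# until the model has been chosen and tuned on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
```

Setting random_state makes the split reproducible, which matters when you want to compare models on exactly the same held-out data.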

 

The way you choose and tune your model is by running it, calculating whichever scoring metric you choose to use (generally stick with only one if you're choosing between models - and make sure it's the one that makes sense!), and comparing across models. This is quite intuitive, but how do you make sure you're not overfitting or underfitting to your data? Ding, ding, ding - the answer is cross validation!

 

Cross Validation

In k-fold cross validation, a model is tested k times to determine how well it performs on data it has never seen before; each time, it is trained and then tested on a different subset of the data.

5-fold Cross Validation

 

Example

I loaded up one of scikit-learn's (sklearn) sample datasets, the iris dataset. I used a simple decision tree classifier to determine the type of iris (there are three options) based on four features: sepal length, sepal width, petal length, and petal width. I used sklearn's cross_val_score function, which by default was 3-fold, to compute the mean accuracy score of my model. My mean cross validation score was approximately 0.92 - not bad!
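A sketch of that experiment might look like the following. Note that newer versions of scikit-learn default to 5-fold, so cv=3 is passed explicitly here to match the article; the random_state value is my own choice, and the exact score will vary with it:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Iris: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(random_state=0)

# 3-fold cross validation: returns one accuracy score per fold.
scores = cross_val_score(clf, X, y, cv=3)
print(scores.mean())
```

The mean of the three fold accuracies is the single number reported above, and the spread of the individual scores gives a rough sense of how stable the model is across subsets.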