Model evaluation using cross-validation

In this notebook, we still use numerical features only.

Here we discuss the practical aspects of assessing the generalization performance of our model via cross-validation instead of a single train-test split.

Data preparation#

First, let’s load the full adult census dataset.

import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We now drop the target from the data we will use to train our predictive model.

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=target_name)

Then, we select only the numerical columns, as seen in the previousnotebook.
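
A minimal sketch of that selection, assuming the four numerical columns used in the previous notebooks of this course (the data_numeric name is reused below):

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
data_numeric = data[numerical_columns]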

We can now create a model using the make_pipeline tool to chain the preprocessing and the estimator in every iteration of the cross-validation.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())

The need for cross-validation#

In the previous notebook, we split the original data into a training set and a testing set. The score of a model in general depends on the way we make such a split. One downside of doing a single split is that it does not give any information about this variability. Another downside, in a setting where the amount of data is small, is that the data available for training and testing would be even smaller after splitting.

Instead, we can use cross-validation. Cross-validation consists of repeating the procedure such that the training and testing sets are different each time. Generalization performance metrics are collected for each repetition and then aggregated. As a result we can assess the variability of our measure of the model’s generalization performance.

Note that there exist several cross-validation strategies, each of which defines how to repeat the fit/score procedure. In this section, we use the K-fold strategy: the entire dataset is split into K partitions. The fit/score procedure is repeated K times, where at each iteration K - 1 partitions are used to fit the model and 1 partition is used to score. The figure below illustrates this K-fold strategy.

[Figure: illustration of the K-fold cross-validation strategy]

Note

This figure shows the particular case of the K-fold cross-validation strategy. For each cross-validation split, the procedure trains a clone of the model on all the red samples and evaluates the score of the model on the blue samples. As mentioned earlier, there is a variety of different cross-validation strategies. Some of these aspects will be covered in more detail in future notebooks.

Cross-validation is therefore computationally intensive because it requires training several models instead of one.
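
To make the procedure concrete, here is a minimal sketch of one K-fold evaluation written by hand, using scikit-learn’s KFold splitter and clone with the model and data_numeric defined above (the cross_validate helper introduced next automates exactly this bookkeeping):

from sklearn.base import clone
from sklearn.model_selection import KFold

cv = KFold(n_splits=5)
manual_scores = []
for train_indices, test_indices in cv.split(data_numeric):
    # Fit a fresh clone of the pipeline on the K - 1 training partitions...
    fold_model = clone(model)
    fold_model.fit(data_numeric.iloc[train_indices], target.iloc[train_indices])
    # ...and score it on the single held-out partition.
    manual_scores.append(
        fold_model.score(data_numeric.iloc[test_indices], target.iloc[test_indices])
    )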

In scikit-learn, the function cross_validate allows us to perform cross-validation, and you need to pass it the model, the data, and the target. Since there exist several cross-validation strategies, cross_validate takes a parameter cv which defines the splitting strategy.

%%time
from sklearn.model_selection import cross_validate

model = make_pipeline(StandardScaler(), LogisticRegression())
cv_result = cross_validate(model, data_numeric, target, cv=5)
cv_result
CPU times: user 475 ms, sys: 251 ms, total: 726 ms
Wall time: 411 ms
{'fit_time': array([0.05962992, 0.05806112, 0.05891657, 0.05685925, 0.05641007]),
 'score_time': array([0.01352239, 0.0138371 , 0.01350832, 0.01330352, 0.01314974]),
 'test_score': array([0.79557785, 0.80049135, 0.79965192, 0.79873055, 0.80456593])}

The output of cross_validate is a Python dictionary, which by default contains three entries:

  • (i) the time to train the model on the training data for each fold, fit_time

  • (ii) the time to predict with the model on the testing data for each fold, score_time

  • (iii) the default score on the testing data for each fold, test_score.

Setting cv=5 created 5 distinct splits to get 5 variations for the training and testing sets. Each training set is used to fit one model which is then scored on the matching test set. The default strategy when setting cv=int is the K-fold cross-validation where K corresponds to the (integer) number of splits (for classification targets such as ours, scikit-learn actually uses a stratified variant of K-fold that preserves the class proportions in each fold). Setting cv=5 or cv=10 is a common practice, as it is a good trade-off between computation time and stability of the estimated variability.
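
Equivalently, the splitting strategy can be made explicit by passing a splitter object as cv instead of an integer. A short sketch, using a shuffled KFold as one possible choice (the cv_result_shuffled name is just for illustration):

from sklearn.model_selection import KFold

cv_result_shuffled = cross_validate(
    model, data_numeric, target, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)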

Note that by default the cross_validate function discards the K models that were trained on the different overlapping subsets of the dataset. The goal of cross-validation is not to train a model, but rather to estimate approximately the generalization performance of a model that would have been trained on the full training set, along with an estimate of the variability (uncertainty on the generalization accuracy).

You can pass additional parameters to sklearn.model_selection.cross_validate to collect additional information, such as the training scores of the models obtained on each round, or even return the models themselves instead of discarding them. These features will be covered in a future notebook.
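
For instance, a sketch using two documented options of cross_validate (the cv_result_extra name is just for illustration):

cv_result_extra = cross_validate(
    model,
    data_numeric,
    target,
    cv=5,
    return_train_score=True,  # adds a "train_score" array to the output
    return_estimator=True,    # adds an "estimator" entry with the K fitted models
)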

Let’s extract the scores computed on the test fold of each cross-validation round from the cv_result dictionary and compute the mean accuracy and the variation of the accuracy across folds.

scores = cv_result["test_score"]
print(
    "The mean cross-validation accuracy is: "
    f"{scores.mean():.3f} ± {scores.std():.3f}"
)
The mean cross-validation accuracy is: 0.800 ± 0.003

Note that by computing the standard deviation of the cross-validation scores, we can estimate the uncertainty of our model’s generalization performance. This is the main advantage of cross-validation and can be crucial in practice, for example when comparing different models to figure out whether one is better than the other or whether our measures of the generalization performance of each model are within the error bars of one another.
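
As an illustration of such a comparison (a sketch, not part of the original analysis), one could cross-validate a simple baseline and check whether its scores fall within the error bars of our pipeline:

from sklearn.dummy import DummyClassifier

# Baseline that always predicts the most frequent class.
dummy_result = cross_validate(
    DummyClassifier(strategy="most_frequent"), data_numeric, target, cv=5
)
dummy_scores = dummy_result["test_score"]
print(f"Baseline accuracy: {dummy_scores.mean():.3f} ± {dummy_scores.std():.3f}")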

In this particular case, only the first 2 decimals seem to be trustworthy. If you go up in this notebook, you can check that the performance we get with cross-validation is compatible with the one from a single train-test split.

Notebook recap#

In this notebook we assessed the generalization performance of our model via cross-validation.
