Cross-validation framework — Scikit-learn course (2024)

In the previous notebooks, we introduced some concepts regarding the evaluation of predictive models. While this section may be slightly redundant, we intend to go into more detail about the cross-validation framework.

Before we dive in, let’s focus on the reasons for always having training and testing sets. Let’s first look at the limitation of using a dataset without keeping any samples out.

To illustrate the different concepts, we will use the California housing dataset.

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target

In this dataset, the aim is to predict the median value of houses in an area in California. The features collected are based on general real-estate and geographical information.

Therefore, the task to solve is different from the one shown in the previous notebook: the target to be predicted is a continuous variable rather than a discrete one. This task is called regression.

Thus, we will use a predictive model specific to regression and not to classification.

print(housing.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
data
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
... ... ... ... ... ... ... ... ...
20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 -121.09
20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 -121.21
20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 -121.22
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 -121.32
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 -121.24

20640 rows × 8 columns

To simplify future visualization, let’s transform the prices from units of hundreds of thousands of dollars (100 k$) to thousands of dollars (k$).

target *= 100
target
0        452.6
1        358.5
2        352.1
3        341.3
4        342.2
         ...
20635     78.1
20636     77.1
20637     92.3
20638     84.7
20639     89.4
Name: MedHouseVal, Length: 20640, dtype: float64

Note

If you want a deeper overview regarding this dataset, you can refer to the Appendix - Datasets description section at the end of this MOOC.

Training error vs testing error#

To solve this regression task, we will use a decision tree regressor.

from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state=0)
regressor.fit(data, target)
DecisionTreeRegressor(random_state=0)


After training the regressor, we would like to know its potential generalization performance once deployed in production. For this purpose, we use the mean absolute error, which gives us an error in the native unit, i.e. k$.

from sklearn.metrics import mean_absolute_error

target_predicted = regressor.predict(data)
score = mean_absolute_error(target, target_predicted)
print(f"On average, our regressor makes an error of {score:.2f} k$")
On average, our regressor makes an error of 0.00 k$

We get a perfect prediction with no error. This is too optimistic and almost always reveals a methodological problem when doing machine learning.

Indeed, we trained and predicted on the same dataset. Since our decision tree was fully grown, every sample in the dataset is stored in a leaf node. Therefore, our decision tree fully memorized the dataset given during fit and thus made no error when predicting.
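To make the memorization concrete, here is a small check added for illustration (it is not part of the original notebook): a fully grown tree is expected to have roughly one leaf per training sample, with fewer leaves only when samples share identical feature values or targets. It assumes the regressor fitted on the full dataset above is still available.

# Illustrative check (an addition): a fully grown tree stores (almost) one
# training sample per leaf, which explains the zero training error above.
print(f"Number of samples: {len(data)}")
print(f"Number of leaves:  {regressor.get_n_leaves()}")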

This error computed above is called the empirical error or training error.

Note

In this MOOC, we will consistently use the term “training error”.

We trained a predictive model to minimize the training error but our aim is to minimize the error on data that has not been seen during training.

This error is also called the generalization error or the “true” testing error.

Note

In this MOOC, we will consistently use the term “testing error”.

Thus, the most basic evaluation involves:

  • splitting our dataset into two subsets: a training set and a testing set;

  • fitting the model on the training set;

  • estimating the training error on the training set;

  • estimating the testing error on the testing set.

So let’s split our dataset.

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0
)

Then, let’s train our model.

regressor.fit(data_train, target_train)
DecisionTreeRegressor(random_state=0)


Finally, we estimate the different types of errors. Let’s start by computing the training error.

target_predicted = regressor.predict(data_train)
score = mean_absolute_error(target_train, target_predicted)
print(f"The training error of our model is {score:.2f} k$")
The training error of our model is 0.00 k$

We observe the same phenomenon as in the previous experiment: our model memorized the training set. However, we now compute the testing error.

target_predicted = regressor.predict(data_test)
score = mean_absolute_error(target_test, target_predicted)
print(f"The testing error of our model is {score:.2f} k$")
The testing error of our model is 47.28 k$

This testing error is actually about what we would expect from our model if it were used in a production environment.

Stability of the cross-validation estimates#

When doing a single train-test split we don’t get any indication of the robustness of the evaluation of our predictive model: in particular, if the test set is small, this estimate of the testing error will be unstable and won’t reflect the “true error rate” we would have observed with the same model on an unlimited amount of test data.

For instance, we could have been lucky when we did our random split of our limited dataset and isolated some of the easiest cases to predict in the testing set just by chance: the estimation of the testing error would be overly optimistic, in this case.

Cross-validation allows estimating the robustness of a predictive model by repeating the splitting procedure. It will give several training and testing errors and thus some estimate of the variability of the model generalization performance.

There are different cross-validation strategies; for now we are going to focus on one called “shuffle-split”. At each iteration of this strategy we:

  • randomly shuffle the order of the samples of a copy of the full dataset;

  • split the shuffled dataset into a train and a test set;

  • train a new model on the train set;

  • evaluate the testing error on the test set.

We repeat this procedure n_splits times. Keep in mind that the computational cost increases with n_splits.
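To make the procedure concrete, here is a minimal sketch, added for illustration and not part of the original notebook, of what a few shuffle-split iterations look like when written by hand with ShuffleSplit (cross_validate, used below, automates exactly this loop; cv_sketch is an illustrative name):

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

# Manual shuffle-split loop with 3 iterations, for illustration only.
cv_sketch = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
for train_indices, test_indices in cv_sketch.split(data):
    model = DecisionTreeRegressor(random_state=0)
    model.fit(data.iloc[train_indices], target.iloc[train_indices])
    error = mean_absolute_error(
        target.iloc[test_indices], model.predict(data.iloc[test_indices])
    )
    print(f"Testing error on this split: {error:.2f} k$")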

[Figure: diagram of the shuffle-split cross-validation strategy]

Note

This figure shows the particular case of the shuffle-split cross-validation strategy using n_splits=5. For each cross-validation split, the procedure trains a model on all the red samples and evaluates the score of the model on the blue samples.

In this case we will set n_splits=40, meaning that we will train 40 models in total and all of them will be discarded: we just record their generalization performance on each variant of the test set.

To evaluate the generalization performance of our regressor, we can use sklearn.model_selection.cross_validate with a sklearn.model_selection.ShuffleSplit object:

from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit

cv = ShuffleSplit(n_splits=40, test_size=0.3, random_state=0)
cv_results = cross_validate(
    regressor, data, target, cv=cv, scoring="neg_mean_absolute_error"
)

The results cv_results are stored in a Python dictionary. We will convert it into a pandas dataframe to ease visualization and manipulation.

import pandas as pd

cv_results = pd.DataFrame(cv_results)
cv_results
fit_time score_time test_score
0 0.143051 0.003248 -46.909797
1 0.141657 0.003086 -46.421170
2 0.139043 0.002605 -47.411089
3 0.140519 0.002532 -44.319824
4 0.137265 0.002517 -47.607875
5 0.139470 0.002631 -45.901300
6 0.140000 0.002571 -46.572767
7 0.140633 0.002936 -46.194585
8 0.141011 0.002797 -45.590236
9 0.141972 0.002917 -45.727998
10 0.139513 0.002813 -49.325285
11 0.140003 0.002930 -47.433377
12 0.138902 0.002630 -46.899316
13 0.136996 0.002626 -46.413821
14 0.138215 0.002670 -46.727109
15 0.139554 0.003009 -44.254324
16 0.139238 0.002761 -48.042372
17 0.141658 0.003108 -43.026746
18 0.138718 0.002733 -46.176363
19 0.139775 0.003221 -47.662623
20 0.139451 0.003236 -44.451056
21 0.140207 0.003451 -46.173780
22 0.142535 0.003238 -45.795231
23 0.142743 0.002887 -46.166307
24 0.139133 0.002837 -46.360169
25 0.140266 0.003587 -46.968612
26 0.140109 0.003556 -46.325623
27 0.139976 0.003207 -46.522054
28 0.140647 0.003104 -47.415111
29 0.139968 0.003032 -46.050461
30 0.139763 0.003060 -46.182242
31 0.141241 0.003094 -45.305162
32 0.140289 0.002985 -44.359681
33 0.140270 0.002911 -46.829014
34 0.140017 0.002916 -46.648786
35 0.139999 0.003327 -45.653002
36 0.140152 0.003149 -46.864559
37 0.138818 0.003084 -47.420250
38 0.138378 0.003185 -47.352148
39 0.139128 0.003022 -47.102818

Tip

A score is a metric for which higher values mean better results. On the contrary, an error is a metric for which lower values mean better results. The parameter scoring in cross_validate always expects a function that is a score.

To make it easy, all error metrics in scikit-learn, like mean_absolute_error, can be transformed into a score to be used in cross_validate. To do so, you need to pass a string of the error metric with an additional neg_ string at the front to the parameter scoring; for instance scoring="neg_mean_absolute_error". In this case, the negative of the mean absolute error is computed, which is equivalent to a score.
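As a quick sanity check, added here for illustration and not part of the original notebook, we can verify on the held-out test set from the earlier split that the "neg_mean_absolute_error" scorer is simply the negated mean absolute error. This assumes the regressor fitted on data_train above is still available.

from sklearn.metrics import get_scorer, mean_absolute_error

# The scorer returned for "neg_mean_absolute_error" negates the error so that
# "higher is better", matching the convention expected by cross_validate.
scorer = get_scorer("neg_mean_absolute_error")
print(scorer(regressor, data_test, target_test))
print(-mean_absolute_error(target_test, regressor.predict(data_test)))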

Let us revert the negation to get the actual error:

cv_results["test_error"] = -cv_results["test_score"]

Let’s check the results reported by the cross-validation.

cv_results.head(10)
fit_time score_time test_score test_error
0 0.143051 0.003248 -46.909797 46.909797
1 0.141657 0.003086 -46.421170 46.421170
2 0.139043 0.002605 -47.411089 47.411089
3 0.140519 0.002532 -44.319824 44.319824
4 0.137265 0.002517 -47.607875 47.607875
5 0.139470 0.002631 -45.901300 45.901300
6 0.140000 0.002571 -46.572767 46.572767
7 0.140633 0.002936 -46.194585 46.194585
8 0.141011 0.002797 -45.590236 45.590236
9 0.141972 0.002917 -45.727998 45.727998

We get timing information to fit and predict at each cross-validation iteration. Also, we get the test score, which corresponds to the testing error on each of the splits.

len(cv_results)
40

We get 40 entries in our resulting dataframe because we performed 40 splits. Therefore, we can show the testing error distribution and thus have an estimate of its variability.

import matplotlib.pyplot as plt

cv_results["test_error"].plot.hist(bins=10, edgecolor="black")
plt.xlabel("Mean absolute error (k$)")
_ = plt.title("Test error distribution")

[Figure: histogram of the test error distribution (mean absolute error in k$)]

We observe that the testing error is clustered around 47 k$ and ranges from 43 k$ to 50 k$.

print( "The mean cross-validated testing error is: " f"{cv_results['test_error'].mean():.2f} k$")
The mean cross-validated testing error is: 46.36 k$
print( "The standard deviation of the testing error is: " f"{cv_results['test_error'].std():.2f} k$")
The standard deviation of the testing error is: 1.17 k$

Note that the standard deviation is much smaller than the mean: we could summarize that our cross-validation estimate of the testing error is 46.36 ± 1.17 k$.

If we were to train a single model on the full dataset (without cross-validation) and then later had access to an unlimited amount of test data, we would expect its true testing error to fall close to that region.

While this information is interesting in itself, it should be contrasted with the scale of the natural variability of the target vector in our dataset.

Let us plot the distribution of the target variable:

target.plot.hist(bins=20, edgecolor="black")
plt.xlabel("Median House Value (k$)")
_ = plt.title("Target distribution")

[Figure: histogram of the target distribution (median house value in k$)]

print(f"The standard deviation of the target is: {target.std():.2f} k$")
The standard deviation of the target is: 115.40 k$

The target variable ranges from close to 0 k$ up to 500 k$, with a standard deviation of around 115 k$.

We notice that the mean estimate of the testing error obtained by cross-validation is a bit smaller than the natural scale of variation of the target variable. Furthermore, the standard deviation of the cross-validation estimate of the testing error is even smaller.
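For a rough sense of scale, here is a small comparison added for illustration (not part of the original notebook) of the mean cross-validated error to the spread of the target:

# Ratio of the mean cross-validated error to the target's standard deviation:
# roughly 0.4 given the values reported above (46.36 k$ vs 115.40 k$), i.e. the
# typical prediction error is well below the natural spread of house values.
print(f"Error / target std: {cv_results['test_error'].mean() / target.std():.2f}")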

This is a good start, but not necessarily enough to decide whether the generalization performance is good enough to make our prediction useful in practice.

We recall that our model makes, on average, an error of around 47 k$. With this information and looking at the target distribution, such an error might be acceptable when predicting houses with a value of 500 k$. However, it would be an issue for a house with a value of 50 k$. Thus, this indicates that our metric (Mean Absolute Error) is not ideal.

We might instead choose a metric relative to the target value to predict: the mean absolute percentage error would have been a much better choice.
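As a hedged sketch, not part of the original notebook, the same cross-validation could be scored with the mean absolute percentage error by swapping the scoring string (the "neg_mean_absolute_percentage_error" scorer is available in recent scikit-learn versions; cv_results_mape is an illustrative name):

# Same ShuffleSplit cross-validation as above, scored with the (negated)
# mean absolute percentage error instead of the mean absolute error.
cv_results_mape = cross_validate(
    regressor, data, target, cv=cv, scoring="neg_mean_absolute_percentage_error"
)
print(f"Mean absolute percentage error: {-cv_results_mape['test_score'].mean():.1%}")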

But in all cases, an error of 47 k$ might be too large to automatically use our model to tag house values without expert supervision.

More detail regarding cross_validate#

During cross-validation, many models are trained and evaluated. Indeed, each element of the arrays in the output of cross_validate is the result of one of these fit/score procedures. To make it explicit, it is possible to retrieve the fitted model for each of the splits/folds by passing the option return_estimator=True to cross_validate.

cv_results = cross_validate(regressor, data, target, return_estimator=True)
cv_results
{'fit_time': array([0.1656003 , 0.16090298, 0.16149712, 0.16149282, 0.15634871]),
 'score_time': array([0.00256753, 0.00256181, 0.00258613, 0.00303054, 0.00268364]),
 'estimator': [DecisionTreeRegressor(random_state=0),
  DecisionTreeRegressor(random_state=0),
  DecisionTreeRegressor(random_state=0),
  DecisionTreeRegressor(random_state=0),
  DecisionTreeRegressor(random_state=0)],
 'test_score': array([0.26291527, 0.41947109, 0.44492564, 0.23357874, 0.40788361])}
cv_results["estimator"]
[DecisionTreeRegressor(random_state=0),
 DecisionTreeRegressor(random_state=0),
 DecisionTreeRegressor(random_state=0),
 DecisionTreeRegressor(random_state=0),
 DecisionTreeRegressor(random_state=0)]

The five decision tree regressors correspond to the five decision trees fitted on the different folds. Having access to these regressors is handy because it allows us to inspect the internal fitted parameters of these regressors.
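For instance, as an illustrative addition that is not part of the original notebook, we could look at the feature importances learned by each fitted tree, using the estimators stored in the cv_results dictionary above:

# Inspect each cross-validated tree: report its most important feature.
for fold_idx, estimator in enumerate(cv_results["estimator"]):
    top_feature = data.columns[estimator.feature_importances_.argmax()]
    print(f"Fold {fold_idx}: most important feature is {top_feature}")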

If you are only interested in the test score, scikit-learn provides a cross_val_score function. It is identical to calling the cross_validate function and selecting the test_score only (as we did extensively in the previous notebooks).

from sklearn.model_selection import cross_val_score

scores = cross_val_score(regressor, data, target)
scores
array([0.26291527, 0.41947109, 0.44492564, 0.23357874, 0.40788361])

Summary#

In this notebook, we saw:

  • the necessity of splitting the data into a train and test set;

  • the meaning of the training and testing errors;

  • the overall cross-validation framework with the possibility to study generalization performance variations.
