Model Evaluation

Book

This Jupyter notebook follows the book Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido and uses material from its GitHub repository and from the working files of the training course Advanced Machine Learning with scikit-learn. Excerpts taken from the book are displayed in italics.

The contents of this Jupyter notebook correspond to the following part of the book Introduction to Machine Learning with Python:

  • Chapter 5 “Model Evaluation and Improvement”: pp. 251 to 270

Python

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import mglearn
# import matplotlib.cbook
# import warnings
# warnings.filterwarnings("ignore",category=matplotlib.cbook.mplDeprecation)

Introduction

Typical Procedure So Far: Example

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# create a synthetic dataset
X, y = make_blobs(random_state=0)

# split data and labels into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Instantiate a model and fit it to the training set
logreg = LogisticRegression(solver='lbfgs', multi_class='auto').fit(X_train, y_train)

# evaluate the model on the test set
print(f"Test set score: {logreg.score(X_test, y_test):.2f}")
Test set score: 0.88

Monte Carlo Simulation

Statistics of the test set score over many random train/test splits: the spread of the scores shows how strongly the result depends on the particular split.

runs = 500
lr = LogisticRegression(solver='lbfgs', multi_class='auto')
my_scores = []

for k in range(runs):
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    my_scores.append(lr.fit(X_train, y_train).score(X_test, y_test))

plt.figure(figsize=(6,4))
plt.hist(my_scores, bins = 7)
plt.xlabel("test set scores")
plt.grid(True)
print(f"Mean test set score   : {np.mean(my_scores):.2f}")
print(f"Std. of test set score: {np.std( my_scores):.2f}")
Mean test set score   : 0.90
Std. of test set score: 0.06

Cross-validation

Method

mglearn.plots.plot_cross_validation()
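
The procedure can be sketched by hand: split the sample indices into k blocks, then train on k-1 of them and evaluate on the held-out block, once per block. A minimal sketch with numpy (scikit-learn's cross_val_score, used below, automates all of this):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# recreate the synthetic dataset from above
X, y = make_blobs(random_state=0)

k = 5
folds = np.array_split(np.arange(len(X)), k)  # k consecutive index blocks

scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    model = LogisticRegression(solver='lbfgs', multi_class='auto')
    scores.append(model.fit(X[train_idx], y[train_idx])
                       .score(X[test_idx], y[test_idx]))
print(scores)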

Cross-validation in scikit-learn

from sklearn.datasets        import load_iris
from sklearn.linear_model    import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
logreg = LogisticRegression(solver='liblinear', multi_class='auto')

scores = cross_val_score(logreg, iris.data, iris.target, cv=5)
print(f"Cross validation test scores: {scores}")
Cross validation test scores: [1.         0.96666667 0.93333333 0.9        1.        ]
print(f"Average of cross-validation test score: {scores.mean():.2f}")
print(f"St.dev. of cross-validation test score: {scores.std():.2f}")
Average of cross-validation test score: 0.96
St.dev. of cross-validation test score: 0.04

Benefits of Cross-Validation

  • less dependence on a single random split of the data
  • Each sample is in a test set exactly once (verified in the sketch after this list).
  • Having multiple splits of the data also provides some information about how sensitive the model is to the selection of the training dataset: the average and standard deviation of the cross-validation scores.
  • When using, for example, 10-fold cross-validation, we use nine-tenths of the data (90%) to fit the model. More data will usually result in more accurate models.
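
That each sample lands in a test set exactly once can be checked directly: the k test index sets of a k-fold split partition the data. A small sketch with 100 dummy samples:

import numpy as np
from sklearn.model_selection import KFold

test_indices = np.concatenate(
    [test for _, test in KFold(n_splits=5).split(np.zeros(100))])
# 100 test indices in total, all of them distinct
print(len(test_indices), len(np.unique(test_indices)))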

Warnings:

  • increased computational cost
  • Cross-validation is not a way to build a model that can be applied to new data: cross-validation does not return a model (see the sketch below).
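
If the per-fold models or the fit times are of interest, cross_validate can return them via its return_estimator option; the model that is actually applied to new data is then fit on all available training data. A sketch, reusing logreg and iris from above:

from sklearn.model_selection import cross_validate

res = cross_validate(logreg, iris.data, iris.target, cv=5, return_estimator=True)
print(res['test_score'])      # same scores as cross_val_score
print(res['estimator'][0])    # model fitted on the first training fold

# the final model for new data is fit on the full dataset
final_model = logreg.fit(iris.data, iris.target)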

Stratified K-Fold cross-validation and other strategies

from sklearn.datasets import load_iris
iris = load_iris()
print(f"Iris labels:\n{iris.target}")
Iris labels:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]

As the simple k-fold strategy fails here, scikit-learn does not use it for classification, but rather uses stratified k-fold cross-validation. In stratified cross-validation, we split the data such that the proportions between classes are the same in each fold as they are in the whole dataset.

mglearn.plots.plot_stratified_cross_validation()

For regression, scikit-learn uses the standard k-fold cross-validation by default.
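
What stratification does can be checked by counting the class labels in each test fold; a short sketch comparing plain and stratified 3-fold splits on the iris data:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

for cv in [KFold(n_splits=3), StratifiedKFold(n_splits=3)]:
    print(type(cv).__name__)
    for _, test_idx in cv.split(iris.data, iris.target):
        # class counts in this test fold
        print(np.bincount(iris.target[test_idx], minlength=3))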

More control over cross-validation

No Stratification

from sklearn.model_selection import KFold

kfold = KFold(n_splits=3)
print(f"Cross-validation scores:\n{cross_val_score(logreg, iris.data, iris.target, cv=kfold)}")
Cross-validation scores:
[0. 0. 0.]
Each of the three folds contains samples of only one class, so the model is always evaluated on a class it never saw during training, hence the scores of 0. Passing an integer to cv instead lets scikit-learn choose the default strategy, which is stratified k-fold for classifiers:

cross_val_score(logreg, iris.data, iris.target, cv=3)
array([0.96, 0.96, 0.94])

Explicit Stratification

from sklearn.model_selection import StratifiedKFold

skfold = StratifiedKFold(n_splits=3)
print(f"Cross-validation scores:\n{cross_val_score(logreg, iris.data, iris.target, cv=skfold)}")
Cross-validation scores:
[0.96 0.96 0.94]

Shuffling Data

Instead of stratifying the folds, one can shuffle the data to remove the ordering of the samples by label.

kfold = KFold(n_splits=5, shuffle=True, random_state=0)

print(f"Cross-validation scores:\n{cross_val_score(logreg, iris.data, iris.target, cv=kfold)}")
Cross-validation scores:
[0.96666667 0.9        0.96666667 0.96666667 0.93333333]

Leave-One-Out cross-validation

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(logreg, iris.data, iris.target, cv=loo)
print(f"Number of cv iterations: {len(scores)}")
print(f"Mean accuracy: {scores.mean():.2f}")
Number of cv iterations: 150
Mean accuracy: 0.95
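
Leave-one-out is k-fold cross-validation with k equal to the number of samples, so every test set contains a single sample; this is expensive on large datasets but can give better estimates on very small ones. A quick check of the number of splits:

from sklearn.model_selection import KFold

print(loo.get_n_splits(iris.data))                    # one split per sample
print(KFold(n_splits=len(iris.data)).get_n_splits())  # the equivalent k-fold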

Shuffle-Split cross-validation

mglearn.plots.plot_shuffle_split()

You can use integers for train_size and test_size to use absolute sizes for these sets, or floating-point numbers to use fractions of the whole dataset.

from sklearn.model_selection import ShuffleSplit

shuffle_split = ShuffleSplit(test_size=.5, train_size=.5, n_splits=10, random_state=0)
scores = cross_val_score(logreg, iris.data, iris.target, cv=shuffle_split)
print(f"Cross-validation scores:\n{scores}")
Cross-validation scores:
[0.84       0.93333333 0.90666667 1.         0.90666667 0.93333333
 0.94666667 1.         0.90666667 0.88      ]
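
With integers instead of fractions, the split sizes are absolute sample counts. A sketch with arbitrary counts (50 training and 50 test samples per iteration; the remaining 50 samples are simply left out):

shuffle_split_abs = ShuffleSplit(train_size=50, test_size=50, n_splits=4, random_state=0)
for train_idx, test_idx in shuffle_split_abs.split(iris.data):
    print(len(train_idx), len(test_idx))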

Cross-validation with groups

Groups in the data are very common, for example in medical applications: you might have multiple samples from the same patient but want to generalize to new patients. Similarly, in speech recognition, you might have multiple recordings of the same speaker in your dataset but are interested in recognizing the speech of new speakers.

mglearn.plots.plot_group_kfold()

from sklearn.model_selection import GroupKFold

# create synthetic dataset
X, y = make_blobs(n_samples=12, random_state=0)

# assume the first three samples belong to the same group,
# then the next four etc.
groups = [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]

print(f'target values:    {y}')
print(f'group membership: {groups}')
target values:    [1 0 2 0 0 1 1 2 0 2 2 1]
group membership: [0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3]
scores = cross_val_score(logreg, X, y, groups=groups, cv=GroupKFold(n_splits=3))
print(f"Cross-validation scores:\n{scores}")
Cross-validation scores:
[0.75       0.8        0.66666667]
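
That GroupKFold never splits a group can be verified directly: in every iteration, the groups appearing in the test set are disjoint from those in the training set. A small sketch using the data above:

import numpy as np

groups = np.array(groups)
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups):
    print(f"train groups: {np.unique(groups[train_idx])}, "
          f"test groups:  {np.unique(groups[test_idx])}")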