This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
- We use the MNIST dataset in this chapter = 70K small images of handwritten digits. ← “Hello world” of ML.
- Download from OpenML.org. ← use sklearn.datasets.fetch_openml
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)
# data contains images -> dataframe isn't suitable, so as_frame=False
X, y = mnist.data, mnist.target
X.shape # (70000, 784)
sklearn.datasets contains 3 types of functions:
- fetch_* functions, such as fetch_openml(), to download real-life datasets.
- load_* functions to load small toy datasets (no need to download).
- make_* functions to generate fake datasets.
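As a quick illustration of the load_* and make_* families, here is a minimal sketch. It uses load_digits() (a small bundled digits dataset, not MNIST) and make_classification(); the sizes passed to make_classification() are arbitrary choices for the example.

```python
from sklearn.datasets import load_digits, make_classification

# load_*: small toy dataset bundled with Scikit-Learn (no download needed)
digits = load_digits()
X_digits, y_digits = digits.data, digits.target
print(X_digits.shape)  # (1797, 64): 1,797 images of 8x8 pixels

# make_*: generate a synthetic dataset (sizes here are arbitrary)
X_fake, y_fake = make_classification(n_samples=100, n_features=5, random_state=42)
print(X_fake.shape)  # (100, 5)
```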
- 70K images, 784 features. Each image = 28x28 pixels.
- Plot an image
import matplotlib.pyplot as plt

def plot_digit(image_data):
    # each row of X is a flat vector of 784 pixels -> reshape to 28x28
    image = image_data.reshape(28, 28)
    plt.imshow(image, cmap="binary")
    plt.axis("off")

some_digit = X[0]
plot_digit(some_digit)
plt.show()
- MNIST from fetch_openml() is already split into a training set (first 60K, already shuffled) and a test set (last 10K).
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
- The training set is already shuffled ← good for cross-validation (all folds will be similar).
Let’s simplify the problem - “detect only the number 5” ← binary classifier (2 classes, 5 or non-5).
A good place to start is a stochastic gradient descent (SGD, or stochastic GD) classifier ← SGDClassifier ← it deals with training instances independently, one at a time ← handles large datasets efficiently and is well suited for online training.
from sklearn.linear_model import SGDClassifier

# targets are strings in MNIST from fetch_openml, hence '5'
y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

sgd_clf.predict([some_digit])
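To make the "online training" remark concrete, here is a minimal sketch using partial_fit(), which updates the model one mini-batch at a time. It runs on load_digits() as a small stand-in for MNIST, and the batch size is an arbitrary assumption of mine, not something from the book.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Stand-in data: the small bundled digits dataset (8x8 images), not MNIST
X, y = load_digits(return_X_y=True)
y_5 = (y == 5)  # binary target: True for 5s, False otherwise

sgd_clf = SGDClassifier(random_state=42)

# Feed the data in mini-batches, as if it arrived as a stream.
# The batch size of 200 is an arbitrary choice for this sketch.
batch_size = 200
for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y_5[start:start + batch_size]
    # the full list of classes must be known from the first call on
    sgd_clf.partial_fit(X_batch, y_batch, classes=np.array([False, True]))

preds = sgd_clf.predict(X[:10])
print(preds)
```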
Evaluating a classifier is often significantly trickier than evaluating a regressor!
Use cross_val_score() ← uses k-fold cross-validation.
from sklearn.model_selection import cross_val_score

cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
Wow, over 95% accuracy with SGD, but is that actually good? → Let’s try DummyClassifier ← it classifies every single image as the most frequent class (non-5); then use cross_val_score → about 90% accuracy! Why? Because only about 10% of the images are 5s! ← If you always guess that an image is not a 5, you’re right about 90% of the time!
→ Accuracy isn’t the preferred measure for classifiers, especially with skewed datasets (some classes are much more frequent than others). ← use a confusion matrix (CM) 👈 My note: Confusion matrix & f1-score.
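A minimal sketch of the dummy baseline, run on load_digits() as a small stand-in for MNIST (it also has roughly 10% of 5s). The explicit strategy="most_frequent" is my choice to make the behavior obvious.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)  # small stand-in dataset, not MNIST
y_5 = (y == 5)  # roughly 10% positives

# Always predicts the most frequent class (non-5), ignoring the pixels entirely
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_scores = cross_val_score(dummy_clf, X, y_5, cv=3, scoring="accuracy")
print(dummy_scores)  # each fold is close to 0.90
```

Any real classifier has to be judged against this baseline, not against 0%.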
Sometimes you need more control over the cross-validation process than cross_val_score() offers, so you implement it yourself ← use StratifiedKFold to perform stratified sampling (folds that preserve the percentage of samples for each class).
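A hand-rolled cross-validation loop could look roughly like this. It is a sketch on load_digits() (a stand-in for MNIST); the accuracy computation inside the loop is where you would plug in any custom measure.

```python
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold

X, y = load_digits(return_X_y=True)  # small stand-in dataset, not MNIST
y_5 = (y == 5)

sgd_clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3)  # add shuffle=True if the data is not already shuffled

fold_accuracies = []
for train_index, test_index in skfolds.split(X, y_5):
    clone_clf = clone(sgd_clf)  # fresh, untrained copy for each fold
    clone_clf.fit(X[train_index], y_5[train_index])
    y_pred = clone_clf.predict(X[test_index])
    # fraction of correct predictions on the held-out fold
    fold_accuracies.append((y_pred == y_5[test_index]).mean())

print(fold_accuracies)
```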
General idea: count the number of times instances of class A are classified as class B. Eg. to check how many times the classifier confuses 8s and 0s, check row#8, col#0 of the CM.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
cm = confusion_matrix(y_train_5, y_train_pred)
cross_val_predict() returns the predictions made on each test fold. The resulting cm:
array([[53892,   687],
       [ 1891,  3530]])
CM. Each row → actual class, each column → predicted class.
TN = True Negative, FP = False Positive = Type I Error, FN = False Negative, TP = True Positive.
→ A perfect classifier would only have TP and TN (only predict right 5 and non-5 or FP = FN = 0)!
So, a more concise metric: look at the accuracy of the positive predictions ← precision ← How many of what we predict are right?
But what if we always make negative predictions, except for the single positive we’re most sure of → precision would be 1/1 = 100%? ← This classifier isn’t useful because it ignores all but one positive instance. → precision should be used with another metric named recall (also called sensitivity or true positive rate, TPR).
Recall = ratio of positive instances that are correctly detected by the classifier. ← Do we miss something?
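Plugging the four counts from the confusion matrix above into the two definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), gives the following (a small worked sketch; the variable names are mine):

```python
# Counts taken from the confusion matrix above
tn, fp = 53892, 687
fn, tp = 1891, 3530

precision = tp / (tp + fp)  # of all predicted 5s, how many are real 5s?
recall = tp / (tp + fn)     # of all real 5s, how many were detected?

print(f"precision = {precision:.4f}")  # precision = 0.8371
print(f"recall    = {recall:.4f}")     # recall    = 0.6512
```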
Scikit-learn gives precision_score and recall_score to compute precision and recall.
from sklearn.metrics import precision_score, recall_score

precision_score(y_train_5, y_train_pred) # 0.8370879772350012
recall_score(y_train_5, y_train_pred) # 0.6511713705958311
When it claims an image represents a 5, it is correct only 83.7% of the time. Moreover, it only detects 65.1% of the 5s.
It’s convenient to combine precision and recall into a single metric called F1 score. It’s the harmonic mean of them.
from sklearn.metrics import f1_score

f1_score(y_train_5, y_train_pred) # 0.7325171197343846
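For reference, the harmonic mean written out, with the same TP/FP/FN notation as the confusion matrix:

```latex
F_1 = \frac{2}{\dfrac{1}{\text{precision}} + \dfrac{1}{\text{recall}}}
    = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}
    = \frac{TP}{TP + \dfrac{FN + FP}{2}}
```

With the counts from the confusion matrix above: 3530 / (3530 + (1891 + 687) / 2) = 3530 / 4819 ≈ 0.7325. Unlike the regular mean, the harmonic mean gives much more weight to low values, so one weak metric drags the F1 score down.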
The F1 score is high only if both recall and precision are high. However, it’s not always the metric you want: in some contexts you mostly care about precision, and in others you really care about recall.
- Care about precision: detecting videos that are safe for kids → a good classifier = keeps only safe videos ↔ high precision ↔ fewer “wrongly accepted” (FP) videos reach children. We don’t care if recall is low in this case (lots of good videos will be rejected, but that’s not a problem).
- Care about recall: detecting shoplifters in surveillance images → a good classifier = all shoplifters get caught ↔ high recall ↔ fewer “allowed to pass” (FN). We don’t care if precision is low (some false alerts, but we won’t miss the bad guys).
→ Unfortunately, increasing precision reduces recall and vice versa. ← precision/recall trade-off.
- Higher recall (lower threshold) → we don’t miss 5s but we allow many not-5s there. Conversely, higher precision (higher threshold) → there aren’t many not-5s but we miss many 5s (lower recall).
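- So which threshold should be used? → Strategy 1: plot precision and recall against the threshold (Figure 3-5).
- Strategy 2: to select a precision/recall trade-off → plot precision against recall.
- The choice of precision/recall trade-off depends on your project!
- Example: search for the lowest threshold that gives you at least 90% precision.
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)  # from sklearn.metrics
idx_for_90_precision = (precisions >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]
- At 90% precision, recall drops to about 48% ← for many applications, that wouldn’t be great.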
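The threshold mechanics can be sketched on a tiny hand-made example. The labels and scores below are made up for illustration (not MNIST); only precision_recall_curve(), precision_score() and recall_score() are real Scikit-Learn functions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, precision_score, recall_score

# Hand-made toy labels and decision scores (not MNIST), just to show the mechanics
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])

precisions, recalls, thresholds = precision_recall_curve(y_true, y_scores)

# precisions/recalls have one more element than thresholds, so drop the last one
idx_for_90_precision = (precisions[:-1] >= 0.90).argmax()
threshold_for_90_precision = thresholds[idx_for_90_precision]

# Instead of calling predict(), compare the scores with the chosen threshold
y_pred_90 = (y_scores >= threshold_for_90_precision)
print(threshold_for_90_precision)          # 0.7
print(precision_score(y_true, y_pred_90))  # 1.0
print(recall_score(y_true, y_pred_90))     # 0.5
```

One caveat: argmax() returns 0 when no threshold reaches the target precision, so it is worth checking that precisions[idx_for_90_precision] really is at least 0.90 before trusting the result.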
- ROC = Receiver Operating Characteristic. ← common tool used with binary classifiers.
- Specificity (true negative rate, TNR): what fraction of the actual negatives are correctly classified as negative? ← Used when we care about TN and want to avoid false alarms (FP), e.g. a drug test.
- ROC plots TPR (true positive rate = sensitivity = recall) against FPR (false positive rate = 1 − specificity).
- Use roc_curve.
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# %matplotlib inline  # Jupyter only

# y_test: true labels; y_pred_prob: scores of the positive class (both assumed to exist already)
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# create plot
plt.plot(fpr, tpr, label='ROC curve')
plt.plot([0, 1], [0, 1], 'k--') # Dashed diagonal
plt.show()
- Trade-off: the higher the recall (TPR), the more false positives (higher FPR) the classifier produces.
- A good classifier stays as far away from the dotted line (random classifier) as possible (toward the top-left corner) → Measure the area under the curve (AUC) ← roc_auc_score
- Perfect classifier will have AUC = 1 (fit the rectangle).
- The purely random classifier (dotted line) will have AUC = 0.5.
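A minimal sketch of roc_curve() and roc_auc_score() on hand-made toy labels and scores (made up for illustration, not MNIST):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hand-made toy labels and scores (not MNIST)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1])
y_scores = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.65, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)
print(round(auc, 4))  # 0.8889 (32 of the 36 positive/negative pairs are ranked correctly)

# A classifier that ranks every positive above every negative gets AUC = 1
perfect_auc = roc_auc_score(y_true, y_true.astype(float))
print(perfect_auc)  # 1.0
```

Reading AUC as "the fraction of positive/negative pairs ranked in the right order" also explains the 0.5 of a random classifier: it gets each pair right half of the time.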
- Use the precision/recall (PR) curve ← when the positive class is rare or when you care more about FP than FN.
- Otherwise, use ROC.
- For example, Figure 3-7 displays a satisfactory ROC, but the PR curve suggests there is room for model enhancement (the curve could really be closer to the top-right corner).
- precision_recall_curve() expects labels and scores for each instance, but RandomForestClassifier doesn’t have a decision_function() method. ← use the probability of the positive class (from predict_proba()) as the score.
forest_clf = RandomForestClassifier(random_state=42)  # from sklearn.ensemble
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method="predict_proba")
- These are estimated probabilities, not actual probabilities (they can be systematically too high or too low) ← not so good ← use sklearn.calibration to calibrate these estimations.
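A sketch of the predict_proba route, run on load_digits() as a small stand-in for MNIST. n_estimators=50 is an arbitrary choice of mine to keep it fast.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)  # small stand-in dataset, not MNIST
y_5 = (y == 5)

forest_clf = RandomForestClassifier(n_estimators=50, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X, y_5, cv=3, method="predict_proba")

# One column per class, in the order [False, True]; column 1 is the positive class
y_scores_forest = y_probas_forest[:, 1]
precisions_forest, recalls_forest, thresholds_forest = precision_recall_curve(
    y_5, y_scores_forest
)
print(y_probas_forest[:2])  # each row sums to 1
```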
- Multiclass classifiers = Multinomial classifiers = distinguish between more than 2 classes.
- You can perform multiclass classification (MC) with multiple binary classifiers (BC).
- one-versus-the-rest (OvR) or one-versus-all (OvA) strategy: instead of classifying 10 classes (0 to 9) directly, we train 10 BC, one for each digit (0-detector, 1-detector,…). Then, pick the class whose BC outputs the highest score. ← Most BC algorithms prefer this.
- one-versus-one (OvO) strategy: train a BC for every pair of digits (0vs1, 0vs2,…, 1vs2,…), i.e. N × (N − 1) / 2 = 45 classifiers for 10 classes. Then the class winning the most duels is the class of the image. ← Advantage: each BC is trained only on the part of the training set containing its 2 classes. ← SVMs like this because they scale poorly with the size of the training set.
- Scikit-Learn auto detects which strategy to use for the chosen BC.
from sklearn.svm import SVC

svm_clf = SVC(random_state=42)
svm_clf.fit(X_train[:2000], y_train[:2000]) # y_train, not y_train_5
- ☝ Sometimes, simply scaling the inputs improves the results (discussed in Chap 2).
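To force a particular strategy instead of relying on the default, Scikit-Learn provides the OneVsRestClassifier and OneVsOneClassifier wrappers. A minimal sketch on load_digits() (10 classes, used here as a small stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes, small stand-in for MNIST

# Force a strategy explicitly instead of relying on the default
ovr_clf = OneVsRestClassifier(SVC(random_state=42)).fit(X, y)
ovo_clf = OneVsOneClassifier(SVC(random_state=42)).fit(X, y)

print(len(ovr_clf.estimators_))  # 10: one binary classifier per class
print(len(ovo_clf.estimators_))  # 45: one per pair of classes, 10 * 9 / 2

# One score per class; the predicted class is the one with the highest score
scores = ovr_clf.decision_function(X[:1])
print(scores.shape)  # (1, 10)
```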
Assuming you have a promising model, we'll explore ways to enhance it by analyzing its errors.
Plot the confusion matrix of the predictions ← a color diagram of the CM is much easier to analyze.
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler

X_train_scaled = StandardScaler().fit_transform(X_train.astype("float64"))
y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
plt.rc('font', size=9) # extra code - make the text smaller
ConfusionMatrixDisplay.from_predictions(
    y_train, y_train_pred,
    sample_weight=(y_train_pred != y_train),  # optional: errors only -> Fig 3-10
    normalize="true", values_format=".0%",    # optional: normalize by row -> Fig 3-9 right
)
plt.show()
From Figure 3-9: most images are on the main diagonal, which indicates good results. However, the diagonal cell at row #5, col #5 looks darker than the others, which could mean more errors on 5s or simply fewer 5s in the dataset. Solution: normalize the CM (divide each value by the total number of images in its row). Result: only 82% of the 5s are classified correctly.
If you look carefully, you will notice that many digits have been misclassified as 8s, but this is not immediately obvious from this diagram. ← make the errors stand out by putting zero weight on the correct predictions (Figure 3-10).
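What the two options do is easier to see on raw numbers. A minimal sketch with confusion_matrix() on hand-made toy labels (three made-up classes, not MNIST):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made toy labels with three classes (not MNIST)
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 2, 1, 2, 2, 2, 2, 0])

# normalize="true" divides each row by the number of instances of that class
cm_norm = confusion_matrix(y_true, y_pred, normalize="true")
print(cm_norm)
# [[0.75 0.   0.25]
#  [0.   0.5  0.5 ]
#  [0.25 0.   0.75]]

# Zero weight on correct predictions keeps only the errors
errors_only = confusion_matrix(y_true, y_pred, sample_weight=(y_pred != y_true))
print(errors_only)
# [[0 0 1]
#  [0 0 1]
#  [1 0 0]]
```

After row normalization the rare class 1 (only two instances) is directly comparable with the others, and in the errors-only matrix the column for class 2 stands out as the most common wrong answer.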
→ Think more of reducing the false 8s:
- Gather more training data for digits that look like 8s (but are not) ← so the classifier learns to distinguish them from real 8s.
- Engineer new features, e.g. an algo for counting the number of closed loops (8 has 2, 6 has 1, 5 has none).
We can boost our training dataset via data augmentation, which tweaks images, like shifting or rotating. Other methods are also viable.
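A minimal sketch of augmentation by shifting, in plain NumPy. The helper shift_image() and the tiny 3x3 "images" are my own illustration, not code from the book; for MNIST you would first reshape each row to 28x28.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2D image by dx columns and dy rows, filling vacated pixels with 0."""
    h, w = image.shape
    shifted = np.zeros_like(image)
    # Matching source and destination slices for rows (dy) and columns (dx)
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

image = np.zeros((3, 3), dtype=int)
image[1, 1] = 1
print(shift_image(image, dx=1, dy=0))
# [[0 0 0]
#  [0 0 1]
#  [0 0 0]]

# Augment a (tiny, made-up) training set with four one-pixel shifts per image
X_small = np.stack([image, image.T])
y_small = np.array([0, 1])
shifts = [(1, 0), (-1, 0), (0, 1), (0, -1)]
shifted_sets = [
    np.stack([shift_image(img, dx, dy) for img in X_small]) for dx, dy in shifts
]
X_aug = np.concatenate([X_small] + shifted_sets)
y_aug = np.tile(y_small, 1 + len(shifts))  # labels repeat in the same order
print(X_aug.shape, y_aug.shape)  # (10, 3, 3) (10,)
```

Only the training set should be augmented; the test set stays untouched so the evaluation remains honest.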
- A classifier can output multiple classes for each instance.
- Eg: face-recognition classifier: detects multiple faces in an image. ← it outputs [True, False, True] for Alice, Bob, Charlie in the image ← multilabel classification (outputs multiple binary tags)
- Eg: KNeighborsClassifier to assign each image in MNIST 2 labels: large (7, 8 or 9) and odd.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= '7')  # targets are strings, hence '7'
y_train_odd = (y_train.astype('int8') % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

knn_clf.predict([some_digit])
- To evaluate, one way: measure F1 score of each label and then compute the average score.
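A sketch of that evaluation, on load_digits() as a small stand-in for MNIST. The average parameter of f1_score() controls how the per-label scores are combined.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)  # targets are integers here, unlike MNIST's strings
y_multilabel = np.c_[y >= 7, y % 2 == 1]  # labels: [large, odd]

knn_clf = KNeighborsClassifier()
y_pred_multilabel = cross_val_predict(knn_clf, X, y_multilabel, cv=3)

# average="macro": unweighted mean of the per-label F1 scores
# average="weighted": weights each label by its number of positive instances
macro_f1 = f1_score(y_multilabel, y_pred_multilabel, average="macro")
print(macro_f1)
```

Macro averaging treats all labels as equally important; if labels with more positive instances should count more, average="weighted" is the simple alternative.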
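ClassifierChain (in sklearn.multioutput) arranges binary classifiers into a chain, where each model predicts using the input features plus the predictions of the models that come before it in the chain. ← lets a classifier that doesn't natively support multilabel classification capture dependencies between labels.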
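A minimal sketch of a classifier chain with SVC as the base model, on load_digits() as a stand-in for MNIST. The subset sizes are arbitrary choices of mine.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.multioutput import ClassifierChain
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # small stand-in dataset, not MNIST
y_multilabel = np.c_[y >= 7, y % 2 == 1].astype(int)  # labels: [large, odd]

# cv=3: each model in the chain is trained on out-of-fold predictions
# of the previous models instead of on the true labels
chain_clf = ClassifierChain(SVC(), cv=3, random_state=42)
chain_clf.fit(X[:1000], y_multilabel[:1000])

chain_preds = chain_clf.predict(X[1000:1005])
print(chain_preds)  # one row per image, one column per label
```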
- Multioutput-multiclass classification = multioutput classification.
- A generalization of multilabel classification: each label can be multiclass (has more than 2 possible values).
- Eg: a system that removes noise from images. Output: multilabel (one label per pixel) and each label can have multiple values (0 to 255).
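The denoising idea can be sketched with KNeighborsClassifier, which supports multioutput targets. This runs on load_digits() (8x8 images, pixel values 0 to 16) as a small stand-in for MNIST; the noise level and the train/test split are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, _ = load_digits(return_X_y=True)  # 8x8 images, pixel values 0..16 (MNIST uses 0..255)
X_clean = X.astype(int)

# Inputs: noisy images. Targets: the clean images (one label per pixel).
rng = np.random.default_rng(42)
X_noisy = X_clean + rng.integers(0, 5, size=X_clean.shape)  # noise level is arbitrary

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_noisy[:1500], X_clean[:1500])

denoised = knn_clf.predict(X_noisy[1500:1505])
print(denoised.shape)  # (5, 64): one predicted value per pixel
```

Each of the 64 output columns is its own multiclass problem, which is exactly what makes this multioutput rather than just multilabel.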