Reading: Hands-On ML - Chap 2: End-to-End ML Project

Anh-Thi Dinh
⚠️
This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.

Information

List of notes for this book


📔
Jupyter notebook for this chapter: on Github, on Colab, on Kaggle.

Main steps

In this chapter you will work through an example project end to end.

Working with Real Data

In this chapter, we use the California Housing Prices dataset (or download it from the author’s repository).
Fig 2-1. California housing prices
This data includes metrics such as the population, median income, and median housing price for each block group (called “district” for short).

Look at the Big Picture

Your model should learn from this data → predict the median housing price in any district.
You should pull out this ML project checklist (Appendix A in the book) for each project.

Frame the Problem

Ask questions to decide which methods to use.
  • Question: What exactly is the business objective? (Building a model isn’t the final goal.) → Business objective: decide whether it’s worth investing in a given area.
    • Fig 2-2. A machine learning pipeline for real estate investments
  • Question: What does the current solution look like (if any)? ← it gives a reference for performance → Prices are currently estimated manually by experts. ← Their estimates were often off by more than 30%.
Pipeline = a sequence of data processing components (a data pipeline). Each component is fairly self-contained and can be handled by a different team, which makes the whole architecture quite robust.
  • Question: What kind of training supervision will the model need (supervised, unsupervised, semi-supervised, self-supervised, or reinforcement)? Is it classification, regression, or something else? Should we use batch learning or online learning?
    • supervised ← model trained with labeled examples.
    • multiple regression ← predict a value, use multiple features.
    • univariate regression ← predict a single value for each district. If we want to predict multiple values → multivariate regression.
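    • Batch learning ← no continuous flow of incoming data, no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory.
    • If the data were huge → split the batch learning work across multiple servers (eg. using the MapReduce technique) or use online learning.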

Select a Performance Measure

  • A typical measure for regression: Root Mean Square Error (RMSE)
    • It corresponds to the Euclidean norm (or $\ell_2$ norm, noted $\lVert \cdot \rVert_2$ or just $\lVert \cdot \rVert$).
      It is more sensitive to outliers than the MAE below. If outliers are exponentially rare (bell-shaped curve data) → RMSE performs very well!
  • If there are many outlier districts, we can use the Mean Absolute Error (MAE, also called the average absolute deviation)
    • It corresponds to the $\ell_1$ norm, noted $\lVert \cdot \rVert_1$, also called the Manhattan norm (it measures the distance between two points in a city if you can only travel along orthogonal city blocks).
  • $\ell_k$ norm of a vector $\mathbf{v}$ containing $n$ elements: $\lVert \mathbf{v} \rVert_k = \left( |v_1|^k + |v_2|^k + \dots + |v_n|^k \right)^{1/k}$
    • $\ell_0$ = number of nonzero elements in the vector.
      $\ell_\infty$ = maximum absolute value in the vector.
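The measures above can be checked numerically. The following is a minimal sketch with NumPy, assuming made-up labels and predictions (the arrays are hypothetical, not taken from the housing data). It shows that RMSE and MAE are just rescaled ℓ2 and ℓ1 norms of the error vector, and that a single large error inflates the RMSE much more than the MAE.

```python
import numpy as np

# Hypothetical labels and predictions (in dollars); the last prediction is a large outlier error.
y_true = np.array([200_000.0, 150_000.0, 300_000.0, 250_000.0])
y_pred = np.array([210_000.0, 140_000.0, 290_000.0, 150_000.0])

errors = y_pred - y_true
m = len(errors)

rmse = np.sqrt(np.mean(errors ** 2))  # root of the mean squared error
mae = np.mean(np.abs(errors))         # mean of the absolute errors

# The same quantities expressed with vector norms of the error vector.
rmse_from_norm = np.linalg.norm(errors, ord=2) / np.sqrt(m)
mae_from_norm = np.linalg.norm(errors, ord=1) / m

l0 = np.count_nonzero(errors)   # l0 "norm": number of nonzero elements
l_inf = np.max(np.abs(errors))  # l-infinity norm: maximum absolute value

print(f"RMSE = {rmse:.0f}, MAE = {mae:.0f}, l0 = {l0}, l_inf = {l_inf:.0f}")
```

Because of the squaring, the single 100k error dominates the RMSE, while the MAE weights all errors equally.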

Check the Assumptions

It is beneficial to communicate with other teams in the pipeline to understand the assumptions regarding the overall problems. If there are any changes, adjust your methods accordingly to adapt to them.

Get the Data

It’s time for the code. Check the official Jupyter notebooks here. In this chapter, we run these notebooks using Google Colab at this link.
Thi: I skip some sections related to the usage of Google Colab and Jupyter Notebook. The code in this note is just snippets.

Take a Quick Look at the Data Structure

Fig 2-6. Top 5 rows ← housing.head()
→ 10 attributes.
  • 20,640 instances → fairly small by ML standards
  • total_bedrooms has missing values.
  • ocean_proximity isn’t numeric ← it’s categorical attribute (check Fig 3.
housing.info()
Fig 2-7. housing.describe()
Check with histograms.
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(12, 8))  # one histogram per numerical attribute
save_fig("attribute_histogram_plots")  # extra code: save_fig() is a helper defined in the book's notebook
plt.show()
Fig 2-8. A histogram for each numerical attribute: number of instances (y-axis) vs value range (x-axis, in bins).
Some remarks:
  • median_income isn’t expressed in plain US dollars. → ask the team that collected the data → it’s scaled (1 unit ≈ $10,000) and capped (values above 15 and below 0.5 are lumped into the boundary bins). ← Thi: the values pile up there.
  • housing_median_age & median_house_value were capped too. The cap on median_house_value is a real problem because it is the target we want to predict. → ask the client team whether they need precise predictions beyond $500k → if so, collect proper labels for the capped districts (>$500k) or remove those districts.
  • Attributes have very different scales → (later) feature scaling.
  • Many histograms are skewed right → harder for some algorithms to detect patterns → transform them to be more symmetrical / bell-shaped.

Create a Test Set

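  • Why now? → Your brain is an amazing pattern detection system → if you look at the test set, you may spot a seemingly interesting pattern and choose a model accordingly → overfitting (by you) ← called data snooping bias.
  • Check the code in the notebook. We set aside 20% of the data for the test set.
  • Problem: If we split randomly → the test set changes at each run → over time, you (or your algorithms) get to see the whole dataset.
    • Solution: save the test set on the 1st run and reload it in subsequent runs, OR set the random seed ← weakness: both break the next time you fetch an updated dataset.
    • → Better: use a unique, immutable id of each instance (eg. compute its hash) to decide whether it goes in the test set → the test set stays consistent across multiple runs, even after the dataset is refreshed.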
      from zlib import crc32

      import numpy as np

      def is_id_in_test_set(identifier, test_ratio):
          # An instance goes to the test set if its hash falls in the lowest test_ratio fraction of the 32-bit range
          return crc32(np.int64(identifier)) < test_ratio * 2**32

      def split_data_with_id_hash(data, test_ratio, id_column):
          ids = data[id_column]
          in_test_set = ids.apply(lambda id_: is_id_in_test_set(id_, test_ratio))
          return data.loc[~in_test_set], data.loc[in_test_set]  # (train set, test set)
  • Above code needs an “id” column → check and use attributes to generate a consistent id for each instance! Eg. longitude and latitude.
  • Use scikit-learn
    • from sklearn.model_selection import train_test_split

      train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
      There are more methods in sklearn.model_selection.
  • Stratified sampling: the population is divided into subgroups called strata, and a representative sample is taken from each stratum to ensure the test set represents the entire population.
    • Eg. when surveying people, make sure the female/male ratio in the sample matches the ratio in the whole population.
    • Suppose experts told you that median_income (Fig 2-8) is very important for predicting prices → make sure the test set is representative of the various income categories in the whole dataset.
    • Use pd.cut() to bucket a continuous attribute into categories (the strata).
    • Fig 2-10. Sampling bias comparison of stratified versus purely random sampling. → The test set generated using stratified sampling has income category proportions almost identical to those of the full dataset.
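Stratified sampling as described above can be sketched as follows. This is a minimal example on synthetic data, not the notebook's code: `housing_demo` and its log-normal `median_income` column are made up for illustration, while the bin edges follow the income categories used in the book. It relies on the `stratify` argument of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: a single right-skewed median_income column.
rng = np.random.default_rng(42)
housing_demo = pd.DataFrame(
    {"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5_000)}
)

# Bucket the continuous attribute into 5 income categories (the strata).
housing_demo["income_cat"] = pd.cut(
    housing_demo["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Stratified split: each income category keeps (almost) the same share in both sets.
strat_train, strat_test = train_test_split(
    housing_demo,
    test_size=0.2,
    stratify=housing_demo["income_cat"],
    random_state=42,
)

overall = housing_demo["income_cat"].value_counts(normalize=True).sort_index()
in_test = strat_test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = (overall - in_test).abs().max()  # largest difference in category proportions
print(pd.DataFrame({"overall": overall, "stratified test": in_test}))
```

Once the split is done, the helper `income_cat` column is usually dropped again so that it does not leak into the features.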

Explore and Visualize the Data to Gain Insights

  • Make sure to put the test set aside and explore only the training set.
  • If the training set is very large, you can sample a smaller exploration set just for exploring the data.
  • Make a copy to work with: housing = strat_train_set.copy().

Visualizing Geographical Data

Use a scatterplot to visualize all districts.
housing.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)
plt.show()
Fig 2-12. alpha=0.2 to better see the density.
housing.plot(
    kind="scatter", x="longitude", y="latitude", grid=True,
    s=housing["population"] / 100, label="population",  # circle radius = population
    c="median_house_value", cmap="jet", colorbar=True,  # color = price
    legend=True, sharex=False, figsize=(10, 7)
)
plt.show()
California housing prices: red is expensive, blue is cheap, larger circles indicate areas with a larger population
 

Look for Correlations

  • Compute the standard correlation coefficient (Pearson’s r) between every pair of attributes.
  • The correlation coefficient ranges from -1 to 1. ~1 → strong positive correlation. ~-1 → strong negative correlation. ~0 → no linear correlation. ← The correlation coefficient only measures linear correlations, so it may completely miss nonlinear relationships.
  • Use Pandas scatter_matrix() to check the correlation between attributes → plot every numerical attribute against every other numerical attribute.
    • Fig 2-14. This scatter matrix plots every numerical attribute against every other numerical attribute, plus a histogram of each numerical attribute’s values on the main diagonal (top left to bottom right)
  • From Fig 2-14, the most promising attribute to predict the median house value is the median income. → zoom in on it:
      • The correlation is quite strong.
      • Points aren’t too dispersed.
      • The price cap is clearly visible as a horizontal line at $500k.
      • There are other “less obvious” horizontal lines around $450k, $350k, $280k ← you may want to remove these districts to prevent your algorithm from learning to reproduce these data quirks.
Fig 2-16. Standard correlation coefficient of various datasets. Source.
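To illustrate that Pearson's r only captures linear relationships, here is a small sketch on synthetic data (the DataFrame `demo` and its columns are made up for this example). On the real training set, the equivalent call would presumably be `housing.corr(numeric_only=True)["median_house_value"]`, assuming a recent pandas version.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)

demo = pd.DataFrame({
    "x": x,
    "linear": 3.0 * x + rng.normal(scale=0.1, size=x.size),     # strong positive linear link
    "negative": -2.0 * x + rng.normal(scale=0.1, size=x.size),  # strong negative linear link
    "quadratic": x ** 2,  # fully determined by x, but not linearly
})

# Pearson's r of every column with x, sorted from most positive to most negative.
corr_with_x = demo.corr()["x"].sort_values(ascending=False)
print(corr_with_x)
```

The `quadratic` column depends entirely on `x`, yet its coefficient is close to 0, which is why a scatter matrix is worth inspecting in addition to the numbers.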

Experiment with Attribute Combinations

  • If some attributes have a skewed-right distribution → transform them (eg. computing their logarithm or square root).
  • Last thing before preparing the data for the pipeline → try out various attribute combinations.
  • eg: rooms per household = #rooms / #households, bedrooms ratio = #bedrooms / #rooms, people per household = population / #households,…
→ Found that bedrooms_ratio is useful (its correlation with median_house_value = -0.256397) → houses with a lower bedroom/room ratio tend to be more expensive.
housing["bedrooms_ratio"] = housing["total_bedrooms"] / housing["total_rooms"]

Prepare the Data for Machine Learning Algorithms

  • You should write functions to perform this step instead of doing it manually!
  • You should also separate the predictors and the labels, since you don’t necessarily want to apply the same transformations to the predictors and the target values

Clean the Data

  • Most ML algorithms cannot work with missing features. Eg. total_bedrooms has missing values → 3 options:
      1. Remove the corresponding districts.
      2. Remove the whole attribute.
      3. Set the missing values to some value (zero, mean, median,…) ← called imputation, see sklearn.impute
      → With Pandas, use dropna(), drop(), fillna() respectively.
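The three options can be sketched on a tiny made-up DataFrame (`demo` is hypothetical, not the real housing data). The last part shows the Scikit-Learn route with `SimpleImputer`, which has the advantage of storing the learned medians so they can be reused on the validation set, the test set, and new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up sample with one missing total_bedrooms value.
demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
})

option1 = demo.dropna(subset=["total_bedrooms"])  # 1. drop the districts with missing values
option2 = demo.drop("total_bedrooms", axis=1)     # 2. drop the whole attribute
median = demo["total_bedrooms"].median()
option3 = demo["total_bedrooms"].fillna(median)   # 3. impute with the median

# Scikit-Learn equivalent of option 3: the medians are learned in fit() and stored.
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(demo)  # NumPy array without missing values
print(imputer.statistics_)             # learned median of each column
```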

Handling Text and Categorical Attributes

  • ocean_proximity → text attribute → check → categorical attribute
  • Most ML algos prefer to work with numbers → convert categories to numbers.
 
 
  • Suppose we use OrdinalEncoder to convert categories to 0, 1, 2, 3,… ← Issue: ML algorithms will assume that 2 nearby values are more similar than 2 distant values! ← That may be fine if the categories are ordered, like “bad” < “average” < “good” < “excellent”, but not for ocean_proximity!
    • Solution: use one-hot encoding = create one binary attribute per category! ← the output of OneHotEncoder is a SciPy sparse matrix (a very efficient representation for matrices that contain mostly zeros ← saves memory and speeds up computations).
  • Pandas also has pd.get_dummies(df), which does the same kind of conversion as Scikit-Learn’s OneHotEncoder, but we prefer the latter because it remembers which categories it was trained on.
    • OneHotEncoder detects unknown categories and raises an exception (or ignores them with handle_unknown="ignore"), whereas get_dummies() cannot (it silently creates a new column)!
  • If a categorical attribute has a large number of categories → one-hot encoding produces many input features, which may slow down training and degrade performance → replace the categorical attribute with useful numerical features related to the categories.
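The difference between `OneHotEncoder` and `pd.get_dummies()` can be seen on a toy example. The two small DataFrames below are made up; the second one contains a category that was not present when the encoder was fitted.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"ocean_proximity": ["INLAND", "NEAR BAY", "INLAND", "NEAR OCEAN"]})
new = pd.DataFrame({"ocean_proximity": ["<2H OCEAN", "INLAND"]})  # "<2H OCEAN" was never seen in training

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
train_1hot = encoder.fit_transform(train)    # SciPy sparse matrix
new_1hot = encoder.transform(new).toarray()  # same 3 columns as during training

# get_dummies() only looks at the data it is given, so the columns change with the input.
dummies_new = pd.get_dummies(new)

print(encoder.categories_[0])      # categories remembered from training
print(new_1hot)
print(list(dummies_new.columns))
```

The encoder keeps the column layout learned during training, whereas `get_dummies()` produces columns that no longer match what a trained model expects.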

Feature Scaling and Transformation

  • One of the most important transformations you need to apply to your data is feature scaling.
  • ML algorithms usually don’t perform well when numerical attributes have very different scales.
    • Eg. #rooms has a much larger range than median income → without scaling, most models will be biased toward #rooms and largely ignore median income.
  • 2 common ways: min-max scaling (or normalization) & standardization (z-score normalization)
    • min-max scaling: values are shifted and rescaled to [0,1] (or another range) ← MinMaxScaler
    • standardization: $z = (x - \mu) / \sigma$ ($\mu$ is the mean of $x$, $\sigma$ is $x$’s standard deviation). It doesn’t restrict values to a specific range but it’s much less affected by outliers. ← StandardScaler
  • Warning: never call fit() or fit_transform() on anything other than the training set ← once the scaler is fitted, use its transform() on the other sets (validation set, test set, new data).
  • If a feature has a heavy tail (ie. values far from the mean are not exponentially rare) → shrink the heavy tail first, then scale.
    • Heavy tail to the right ← replace the feature with its square root.
    • Really long and heavy tail, such as a power law distribution ← replace the feature with its logarithm ← Fig 2-17.
    • Another approach: bucketizing the feature (chopping its distribution into roughly equal-sized buckets, replacing each feature value with the index of the bucket it belongs to)
An example of a heavy-tailed distribution. Source.
An example power-law graph that demonstrates ranking of popularity. Source.
Fig 2-17. Transforming a feature to make it closer to a Gaussian distribution.
  • A feature has a multimodal distribution (i.e., with two or more clear peaks, called modes) → Strategies:
    • Method 1: Bucketize it, but this time treating the bucket IDs as categories, rather than as numerical values.
      Method 2: Add a feature for each mode, representing the similarity between the housing median age and that mode, using a radial basis function (RBF). The most common type of RBF is the Gaussian RBF, where the output value decays exponentially as the input value moves away from the fixed point, eg. $\exp(-\gamma (x - 35)^2)$ for the similarity between the housing age $x$ and 35.
      The hyperparameter $\gamma$ (gamma) determines how quickly the similarity measure decays as $x$ moves away from 35.
      Fig 2-18. Gaussian RBF feature measuring the similarity between the housing median age and 35
      from sklearn.metrics.pairwise import rbf_kernel
      age_simil_35 = rbf_kernel(housing[["housing_median_age"]], [[35]], gamma=0.1)
  • The target values may also need to be transformed. ← Then the model predicts transformed values → use the transformer’s inverse_transform() method to map predictions back to the original scale.
    • Simpler: use TransformedTargetRegressor ← give it a regression model & a label transformer, then fit it on the training set with the unscaled labels. After that, just use .predict() as normal.
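The scaling rules above (fit on the training set only, then reuse the fitted scaler) and the target transformation can be sketched together. The data below is synthetic: `X_train` plays the role of a single income-like feature and `y_train` is an exactly linear, made-up label, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(42)
X_train = rng.uniform(0.5, 15.0, size=(200, 1))  # stand-in for a median_income-like feature
y_train = 40_000.0 * X_train[:, 0] + 50_000.0    # made-up, exactly linear labels
X_new = np.array([[0.0], [20.0]])                # new data outside the training range

# Fit the scalers on the training set only, then reuse them on any other data.
std_scaler = StandardScaler().fit(X_train)
X_train_std = std_scaler.transform(X_train)
min_max_scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_new_minmax = min_max_scaler.transform(X_new)   # can fall outside [0, 1] for unseen extremes

# TransformedTargetRegressor scales the labels before training
# and applies inverse_transform() to the predictions automatically.
model = TransformedTargetRegressor(LinearRegression(), transformer=StandardScaler())
model.fit(X_train, y_train)
predictions = model.predict(X_new)               # already back on the original label scale
print(X_new_minmax.ravel(), predictions)
```

Note that min-max scaled values of new data may leave the [0, 1] range, which is expected when the scaler was fitted on the training set only.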

Custom Transformers

import numpy as np
from sklearn.preprocessing import FunctionTransformer

log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)
log_pop = log_transformer.transform(housing[["population"]])
A log-transformer
from sklearn.metrics.pairwise import rbf_kernel

rbf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[[35.]], gamma=0.1))
age_simil_35 = rbf_transformer.transform(housing[["housing_median_age"]])
A transformer that computes the same Gaussian RBF similarity measure as above
sf_coords = 37.7749, -122.41  # (latitude, longitude) of San Francisco
sf_transformer = FunctionTransformer(rbf_kernel, kw_args=dict(Y=[sf_coords], gamma=0.1))
sf_simil = sf_transformer.transform(housing[["latitude", "longitude"]])
How to add a feature that will measure the geographic similarity between each district and San Francisco
Custom transformers are useful to combine features too,
ratio_transformer = FunctionTransformer(lambda X: X[:, [0]] / X[:, [1]])  # column 0 divided by column 1
ratio_transformer.transform(np.array([[1., 2.], [3., 4.]]))
A transformer that computes the ratio between input features 0 and 1
If the transformer must be trainable (ie. learn parameters in fit() and use them in transform()), write your own class with fit() (which must return self), transform(), and fit_transform(). These are the only methods you need to implement.
  • Use TransformerMixin as a base class → you have fit_transform() for free.
  • Use BaseEstimator as a base class (and avoid using *args and **kwargs in constructor) → you have get_params(), set_params()
A custom transformer can (and often does) use other estimators in its implementation.
Check whether your custom estimator respects Scikit-Learn’s API by passing an instance to check_estimator().
ClusterSimilarity ← Transformer uses k-means to locate the clusters, then measures Gaussian RBF similarity between each district and all cluster centers. (Figure 2-19)
Figure 2-19. Gaussian RBF similarity to the nearest cluster center
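A trainable custom transformer along the lines of the ClusterSimilarity idea described above could look like the sketch below. It follows the description (k-means to find the centers, then Gaussian RBF similarity to each center), but it is an illustrative reconstruction rather than a verbatim copy of the notebook's class; `n_init=10` and the random `coords` are assumptions made for this example.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel


class ClusterSimilarity(BaseEstimator, TransformerMixin):
    """Sketch: similarity of each sample to k-means cluster centers."""

    def __init__(self, n_clusters=10, gamma=1.0, random_state=None):
        # No *args/**kwargs, so BaseEstimator can provide get_params()/set_params().
        self.n_clusters = n_clusters
        self.gamma = gamma
        self.random_state = random_state

    def fit(self, X, y=None, sample_weight=None):
        # Learned attributes end with an underscore, following the Scikit-Learn convention.
        self.kmeans_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                              random_state=self.random_state)
        self.kmeans_.fit(X, sample_weight=sample_weight)
        return self  # fit() must return self

    def transform(self, X):
        # One output column per cluster: Gaussian RBF similarity to that cluster center.
        return rbf_kernel(X, self.kmeans_.cluster_centers_, gamma=self.gamma)

    def get_feature_names_out(self, names=None):
        return [f"Cluster {i} similarity" for i in range(self.n_clusters)]


# Hypothetical (latitude, longitude) pairs, only for demonstration.
rng = np.random.default_rng(42)
coords = np.column_stack([rng.uniform(32, 42, 500), rng.uniform(-124, -114, 500)])

cluster_simil = ClusterSimilarity(n_clusters=5, gamma=1.0, random_state=42)
similarities = cluster_simil.fit_transform(coords)  # fit_transform() comes from TransformerMixin
print(similarities[:3].round(2))
```

Inheriting from `TransformerMixin` provides `fit_transform()`, and inheriting from `BaseEstimator` provides `get_params()` and `set_params()`, which is what makes the transformer tunable inside a grid search.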

Transformation Pipelines

  • Many transform steps need to be executed in order → scikit-learn has Pipeline to help.
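    • from sklearn.impute import SimpleImputer
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler

      num_pipeline = Pipeline([
          ("impute", SimpleImputer(strategy="median")),
          ("standardize", StandardScaler()),
      ])
      An example of using Pipeline
  • Pipeline takes a list of name/estimator pairs (2-tuples).
    • name = any unique string that doesn’t contain a double underscore (__)
    • estimator = all must be transformers (ie. must have fit_transform()) ← except the last one, which can be anything!
  • If you don’t want to name the transformers → use make_pipeline() instead.
    • from sklearn.pipeline import make_pipeline
      num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
  • Calling the pipeline’s fit() calls fit_transform() sequentially on all the transformers, passing the output of each call as input to the next, until it reaches the final estimator, on which it just calls fit(). The pipeline exposes the same methods as its final estimator. ← eg. in the code above, the pipeline acts as a transformer (its last estimator is StandardScaler)
  • Pipeline supports indexing, eg. pipeline[1] returns the second estimator in the pipeline.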
  • Use ColumnTransformer to have a single transformer → handle all columns (numerical, categorical…) ← apply appropriate transformer to each column.
    • from sklearn.compose import ColumnTransformer
      from sklearn.preprocessing import OneHotEncoder

      num_attribs = [...]  # names of the numerical columns
      cat_attribs = [...]  # names of the categorical columns

      cat_pipeline = make_pipeline(
          SimpleImputer(strategy="most_frequent"),
          OneHotEncoder(handle_unknown="ignore"))

      preprocessing = ColumnTransformer([
          ("num", num_pipeline, num_attribs),
          ("cat", cat_pipeline, cat_attribs),
      ])
      Use make_column_transformer and make_column_selector if you don’t want to name the transformers or list the columns by hand.
      from sklearn.compose import make_column_selector, make_column_transformer

      preprocessing = make_column_transformer(
          (num_pipeline, make_column_selector(dtype_include=np.number)),
          (cat_pipeline, make_column_selector(dtype_include=object)),
      )

Sum up

  • Missing values: numerical features ← median, categorical features ← most frequent category.
  • Categorical features ← one-hot encoded.
  • Ratio features are added ← bedrooms_ratio, rooms_per_house, people_per_house.
  • Cluster similarity features will also be added.
  • Features with a long tail will be replaced by their logarithm.
  • All numerical features will be standardized.

Select and Train a Model

Train and Evaluate on the Training Set

  • Try a simple linear regression first (use LinearRegression), then use mean_squared_error to measure the RMSE on the training set. ← result: RMSE ≈ 68k (whereas most districts’ median_house_value ranges between 120k and 265k) ← not very satisfying (of course)! ← underfitting! ← the features don’t provide enough info or the model isn’t powerful enough!
    • Options to fix underfitting: (1) more powerful model, (2) better features, (3) reduce constraints on the model.
  • Try DecisionTreeRegressor, a more powerful model that can find complex nonlinear relationships (Chapter 6). ← result: RMSE = 0 on the training set ← most likely badly overfitting! ← How to be sure? → split the training set into training/validation sets.

Better Evaluation Using Cross-Validation

  • Option 1: Use train_test_split() to split the training set into a smaller training set and a validation set → train on the smaller training set and evaluate on the validation set.
  • Option 2: Use k-fold cross-validation (cross_val_score) → randomly splits the training set into 10 nonoverlapping subsets (folds) → trains and evaluates the model 10 times (each time picking a different fold for validation and training on the 9 other folds) ← result for the decision tree: RMSE 66.8K±2K ← bad!
  • Remark: Scikit-Learn’s cross-validation expects a utility function (greater is better) rather than a cost function (lower is better) → the score is the negative of the RMSE, so flip its sign to get the RMSE.
  • If training error is low but validation error is high → overfitting!
  • Option 3: try another model, RandomForestRegressor (Chapter 7) = train many decision trees on random subsets of the features, then average their predictions. ← ensemble model, result: 47K±1K (much better) ← However, if we train the random forest on the whole training set and measure the RMSE on it → 17K (much lower than 47K) → there is still overfitting! ← Possible solutions: regularize the model, get more data,…
  • At this stage, try several models → goal: a shortlist (2 to 5) of promising models.
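The gap between training error and cross-validation error can be reproduced on synthetic data. The sketch below uses a made-up regression problem (`X`, `y` are not the housing data): an unconstrained decision tree memorizes the training set, but 10-fold cross-validation reveals its real error.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Made-up regression problem with noisy labels.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(500, 3))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=1.0, size=500)

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(X, y)

# RMSE on the data the tree was trained on: the tree simply memorizes it.
train_rmse = np.sqrt(mean_squared_error(y, tree_reg.predict(X)))

# 10-fold cross-validation: the scores are negative RMSEs, so flip the sign.
tree_rmses = -cross_val_score(
    tree_reg, X, y, scoring="neg_root_mean_squared_error", cv=10
)

print(f"training RMSE: {train_rmse:.3f}")
print(f"cross-validation RMSE: {tree_rmses.mean():.3f} ± {tree_rmses.std():.3f}")
```

A training RMSE near zero combined with a clearly higher cross-validation RMSE is the overfitting signature described in the list above.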

Fine-Tune Your Model

After having a shortlist, this stage is to fine-tune them!

Grid Search

  • You could tune hyperparameters manually until you find a great combination, but that’s tedious → use GridSearchCV instead (it searches for you)!
  • You tell it which hyperparameters to experiment with and which values to try out → it evaluates every combination using cross-validation.
  • TIP: Using a Scikit-Learn pipeline for preprocessing lets you adjust preprocessing and model hyperparameters simultaneously. If pipeline fitting is costly, set the pipeline's memory to a cache directory path.
  • Sample codes
    • from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import GridSearchCV
      from sklearn.pipeline import Pipeline

      full_pipeline = Pipeline([
          ("preprocessing", preprocessing),
          ("random_forest", RandomForestRegressor(random_state=42)),
      ])
      # "step__substep__param" names: this assumes preprocessing has a step named "geo" with an n_clusters hyperparameter
      param_grid = [
          {'preprocessing__geo__n_clusters': [5, 8, 10],
           'random_forest__max_features': [4, 6, 8]},
          {'preprocessing__geo__n_clusters': [10, 15],
           'random_forest__max_features': [6, 8, 10]},
      ]
      grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                                 scoring='neg_root_mean_squared_error')
      grid_search.fit(housing, housing_labels)
    • 2 dictionaries in param_grid → 3×3 + 2×3 = 15 combinations.
    • The pipeline is trained 3 times per combination (cv=3).
    • → Total: 3×15 = 45 rounds of training.
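The combination count can be double-checked without training anything, using Scikit-Learn's `ParameterGrid` helper on the same grid. This is only a counting sketch; no pipeline is fitted here.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"preprocessing__geo__n_clusters": [5, 8, 10],
     "random_forest__max_features": [4, 6, 8]},
    {"preprocessing__geo__n_clusters": [10, 15],
     "random_forest__max_features": [6, 8, 10]},
]

# ParameterGrid enumerates every combination GridSearchCV would try.
n_combinations = len(ParameterGrid(param_grid))  # 3*3 + 2*3
cv = 3
n_fits = n_combinations * cv                     # cross-validation fits only

print(n_combinations, n_fits)
```

With the default `refit=True`, GridSearchCV additionally retrains the best combination once on the whole training set after the search.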

Randomized Search

  • RandomizedSearchCV is often preferable, especially when the hyperparameter search space is large.
  • It evaluates a fixed number of combinations, selecting a random value for each hyperparameter at every iteration.
  • Each hyperparameter → provide either a list of possible values or a probability distribution.
  • There are also HalvingRandomSearchCV and HalvingGridSearchCV ← they use computational resources more efficiently ← Idea: in the first rounds, many candidates are evaluated with limited resources (eg. a small part of the training data); only the best candidates go on to the next rounds, where they get more resources.
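A randomized search can be sketched as follows. The data is synthetic and the search space is deliberately tiny so that the example runs quickly; `param_distribs`, `X`, and `y` are made up for illustration and are not the values used in the book.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Small made-up regression problem.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Each hyperparameter gets either a distribution or a list of values.
param_distribs = {
    "max_features": randint(low=1, high=6),  # random integers from 1 to 5
    "n_estimators": [10, 20, 30],            # sampled uniformly from the list
}

rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distribs,
    n_iter=5,  # number of random combinations to evaluate
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=42,
)
rnd_search.fit(X, y)

best_rmse = -rnd_search.best_score_  # flip the sign to get the RMSE
print(rnd_search.best_params_, round(best_rmse, 3))
```

The cost is controlled directly by `n_iter`, independently of how many hyperparameters are being explored.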

Ensemble Methods

  • Combine the models that perform best. “Many” is better than “individual”. Check more in Chapter 7.

Analyzing the Best Models and Their Errors

  • Now is also a good time to ensure that your model not only works well on average, but also on all categories of districts.

Evaluate Your System on the Test Set

  • You are ready to evaluate the final model on the test set.
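  • You also need to know how precise the error estimate from the test set is ← compute a 95% confidence interval for the generalization error using scipy.stats.t.interval().
  • If you did a lot of hyperparameter tuning, the performance on the test set is usually slightly worse than what you measured with cross-validation, because the system ended up fine-tuned to the validation data. Resist tweaking hyperparameters to make the test set numbers look better: such improvements are unlikely to generalize to new data.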
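The confidence interval computation can be sketched like this. The labels and predictions are simulated (`y_test` and `final_predictions` are made up, with an error standard deviation of about 40k), so only the mechanics are meaningful, not the numbers.

```python
import numpy as np
from scipy import stats

# Simulated test labels and predictions; on a real project these come from the final model.
rng = np.random.default_rng(42)
y_test = rng.uniform(100_000, 500_000, size=1_000)
final_predictions = y_test + rng.normal(scale=40_000, size=1_000)

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2

# t-interval around the mean squared error, then square root to get an interval for the RMSE.
low, high = np.sqrt(
    stats.t.interval(
        confidence,
        len(squared_errors) - 1,           # degrees of freedom
        loc=squared_errors.mean(),
        scale=stats.sem(squared_errors),   # standard error of the mean
    )
)
test_rmse = np.sqrt(squared_errors.mean())
print(f"RMSE = {test_rmse:.0f}, 95% CI = [{low:.0f}, {high:.0f}]")
```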

Launch, Monitor, and Maintain Your System

  • Save and load the model with joblib!
  • An example of how to deploy your model:
    • Figure 2-20. A model deployed as a web service and used by a web application
  • After deployment, you have to monitor both the system and the model. ← To check whether it still works well or needs to be improved.
  • You should probably automate the whole process as much as possible.
  • You should trigger alerts when something goes wrong.
  • Make sure you keep backups of the models and have a rollback process to a previous model. Keep backups of every model version, dataset version,…
  • ML involves a lot of infrastructure (MLOps - ML Operations) → Chapter 19.
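Saving and reloading a model with joblib can be sketched as below. The model is a trivial linear regression on made-up data and the file name `my_model.pkl` is arbitrary; in a real project you would dump the full fitted pipeline (preprocessing included) instead.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Trivial model on made-up data, standing in for the final fitted pipeline.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "my_model.pkl"  # arbitrary file name
    joblib.dump(model, model_path)               # save the fitted model
    reloaded = joblib.load(model_path)           # load it back, eg. in the production service

# The reloaded model must give the same predictions as the original.
same_predictions = np.allclose(model.predict(X), reloaded.predict(X))
print(same_predictions)
```

Any custom classes or functions the model relies on must be importable in the process that loads the file, and the library versions should match those used when saving.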

Try It Out!

  • Much of the work is in the data preparation step.
  • Understanding the overall process and mastering a few machine learning algorithms can be more beneficial than solely focusing on exploring advanced algorithms.
  • Kaggle is a good place for you to start an A-Z project.