Note for course DL 3: Structuring ML Projects

Anh-Thi Dinh
This is my note for the course Structuring Machine Learning Projects. The code snippets in this note are rewritten to be clearer and more concise.
This course gives you strategies for analyzing your problem so that you choose a direction that leads to better results.

Introduction to ML Strategy

Why ML strategy?

  • "ML strategy" = How to structure your ML project?
  • Ideas to improve your ML systems:
      1. Collect more data.
      2. Collect a more diverse training set.
      3. Train the algorithm longer with gradient descent.
      4. Try a different optimization algorithm (e.g. Adam).
      5. Try a bigger network.
      6. Try a smaller network.
      7. Try dropout.
      8. Add L2 regularization.
      9. Change the network architecture (activation functions, # of hidden units, etc.)
  • However, don't spend too much time on just one of the above ideas; you need to make sure you're going in the right direction!

Orthogonalization

  • In orthogonalization, you have some controls, but each control does a specific task and doesn't affect other controls.
  • Chain of assumptions in ML:
      1. Fit the training set well on the cost function (near human-level performance if possible).
    • If it's not achieved you could try a bigger network, another optimization algorithm (like Adam),...
      2. Fit the dev set well on the cost function.
    • If it's not achieved you could try regularization, a bigger training set,...
      3. Fit the test set well on the cost function.
    • If it's not achieved you could try a bigger dev set,...
      4. Perform well in the real world.
    • If it's not achieved you could try changing the dev set or the cost function,...

Setting up your goal

Single number evaluation metric

  • Advice: It's better and faster to set a single number evaluation metric for your project before you start it.
  • Example: instead of using both precision and recall, just use the F1 score. Check this note.
  • Dev set + single real number evaluation metric → enough to make a choice!
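A minimal sketch (plain Python, made-up precision/recall numbers) of how precision and recall collapse into one F1 number per classifier:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall -> one number to compare models."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifiers: (precision, recall)
classifiers = {"A": (0.95, 0.90), "B": (0.98, 0.85)}
for name, (p, r) in classifiers.items():
    print(name, round(f1_score(p, r), 4))
# A single F1 per classifier makes the choice immediate: pick the larger one.
```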

Satisficing and optimizing metrics

  • It's difficult to combine everything you care about into a single real number evaluation metric → set up (many) satisficing + (one) optimizing metrics.
    • Satisficing (use a threshold): passing the threshold is enough.
    • Optimizing: the more important one, e.g. accuracy!
  • Example: wake word detection ("Hey Siri"),
    • Accuracy: does it wake up when called? → optimizing.
    • False positives: it wakes up but we didn't call it → set the satisficing constraint to at most 1 false positive per day!
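A sketch of how the choice is then made, with hypothetical model statistics: maximize the optimizing metric among the models that pass the satisficing threshold.

```python
# Hypothetical models: (accuracy, false_positives_per_day)
models = {
    "small":  (0.90, 0.5),
    "medium": (0.94, 0.8),
    "large":  (0.95, 3.0),   # best accuracy, but too many false wake-ups
}

MAX_FALSE_POSITIVES_PER_DAY = 1  # satisficing metric: just has to pass this

# Keep only the models that satisfy the constraint, then optimize accuracy.
feasible = {k: v for k, v in models.items() if v[1] <= MAX_FALSE_POSITIVES_PER_DAY}
best = max(feasible, key=lambda k: feasible[k][0])
print(best)  # "medium": highest accuracy among the models passing the threshold
```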

Train/dev/test distributions

  • The way we set up the distributions of the train / dev / test sets can greatly impact how fast the team makes progress.
  • Dev set = development set = hold-out cross-validation set.
  • Advice: Make dev set and test set come from the same distribution!

Size of the dev and test sets

  • Old era (less data, < 100,000 examples): 70% train - 30% test, or something like that.
  • Now (big data): 98% train - 1% dev - 1% test.
  • Test set: make your test set big enough to give high confidence in the overall performance of your system.
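For instance, a rough sketch of the modern split, assuming 1,000,000 examples:

```python
n = 1_000_000                      # big-data regime
n_train = n * 98 // 100            # 980,000 examples for training
n_dev   = n // 100                 # 10,000: plenty to compare models on the metric
n_test  = n - n_train - n_dev      # 10,000: enough for a confident final estimate
print(n_train, n_dev, n_test)      # 980000 10000 10000
```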

When to change dev/test sets and metrics

  • Sometimes we put our target in the wrong place → we should change the metric!
  • Example: cat classification,
    • Algorithm A: 3% error but lets porn images through → the train/test metric prefers this one!
    • Algorithm B: 5% error but no porn → humans/users prefer this one!
  • This is actually an example of an orthogonalization where you should take a machine learning problem and break it into distinct steps:
      1. Figure out how to define a metric that captures what you want to do (place the target).
      2. Worry about how to actually do well on this metric (how to aim/shoot accurately at the target).
  • Conclusion: if doing well on your metric + dev/test set doesn't correspond to doing well in your application, change your metric and/or dev/test set.
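One way to "move the target", as in the porn-image example, is to re-weight the metric so that unacceptable mistakes count much more. A sketch with made-up labels and a hypothetical weight of 10 for porn images:

```python
def weighted_error(y_true, y_pred, is_porn, w_porn=10.0):
    """Misclassification error where porn images are penalized w_porn times more."""
    weights = [w_porn if p else 1.0 for p in is_porn]
    mistakes = [w * (yt != yp) for w, yt, yp in zip(weights, y_true, y_pred)]
    return sum(mistakes) / sum(weights)

# Tiny made-up batch: A shows a porn image as a cat, B makes one ordinary mistake.
y_true  = [1, 0, 1, 0]                      # 1 = cat, 0 = not cat
is_porn = [False, True, False, False]
print(weighted_error(y_true, [1, 1, 1, 0], is_porn))  # A: heavily penalized
print(weighted_error(y_true, [1, 0, 0, 0], is_porn))  # B: small penalty
```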

Comparing to human-level performance

Why human-level performance?

  • Reasons:
      1. ML algorithms now work better and are easier to build (than in the past) → but that alone is not enough → we need human-level performance (HLP) as a reference.
      2. The workflow of building an ML system is more efficient when you work on something that humans can also do.
  • Bayes error = best possible error (theory).
  • After surpassing HLP, progress slows down. Why?
      1. HLP is often very close to the Bayes optimal error (e.g. humans can recognize things even in blurry images).
      2. While you're below HLP, there are certain tools you can use to improve performance; those tools no longer apply once you surpass HLP.
  • So long as ML is worse than HLP, you can:
    • Get labeled data from human.
    • Gain insight from manual error analysis: why did a person get this right?
    • Better analysis of bias / variance.

Avoidable bias

  • Sometimes we don't want the algorithm to do TOO WELL on the training set (Bayes error may be far from 0%) → use HLP as a reference.
  • Example: cat recognition, 2 different scenarios with the same training and dev errors:
      1. Scenario A: big gap between human and training error → focus on reducing bias (bigger NN, train longer,...) → Underfitting!
        • Human error: 1%
        • Training error: 8%
        • Dev error: 10%
      2. Scenario B: small gap between human and training error → focus on reducing variance → Overfitting!
        • Human error: 7.5%
        • Training error: 8%
        • Dev error: 10%
  • Based on the human-level error → decide whether the training/dev errors are high or low → bias or variance reduction!
  • Gap between human & training error → Avoidable bias.
  • Gap between training & dev error → Variance!
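A small sketch that applies these two gaps to the scenarios above (errors as fractions):

```python
def diagnose(human_err, train_err, dev_err):
    avoidable_bias = train_err - human_err   # gap: human -> training
    variance = dev_err - train_err           # gap: training -> dev
    focus = "reduce bias" if avoidable_bias > variance else "reduce variance"
    return avoidable_bias, variance, focus

print(diagnose(human_err=0.010, train_err=0.08, dev_err=0.10))  # scenario A -> reduce bias
print(diagnose(human_err=0.075, train_err=0.08, dev_err=0.10))  # scenario B -> reduce variance
```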

Understanding human-level performance

  • Use the lowest (best) human error available as the human-level error, since it's the value closest to Bayes error!
  • The way we choose the human-level error can change whether we decide to work on bias or on variance reduction.
  • Use human level error as a proxy of Bayes error!

Surpassing human-level performance

  • When the training error drops below the human-level error, it's difficult to tell how much avoidable bias is left!
  • In some problems, deep learning has surpassed human-level performance, e.g. online advertising, product recommendation, loan approval → all structured data.
  • In natural perception tasks (speech recognition, computer vision,...), it's much harder for ML to surpass human-level performance.
  • In short:
    • Machine ≫ human → structured data.
    • Machine > one person → some natural perception tasks.
    • Machine > team of humans → the hardest case, natural perception tasks.

Improving your model performance

  • The two fundamental assumptions of supervised learning:
      1. You can fit the training set pretty well, i.e. you can achieve low avoidable bias.
      2. The training set performance generalizes pretty well to the dev/test set, i.e. the variance is not too bad.
  • To improve your deep learning supervised system follow these guidelines:
      1. Look at the difference between the human-level error and the training error → avoidable bias.
      2. Look at the difference between the training error and the dev/test error → variance.
      3. If the avoidable bias is large, you have these options:
          • Train a bigger model.
          • Train longer / use a better optimization algorithm (like Momentum, RMSprop, Adam).
          • Find a better NN architecture / do hyperparameter search.
      4. If the variance is large, you have these options:
          • Get more training data.
          • Regularization (L2, dropout, data augmentation).
          • Find a better NN architecture / do hyperparameter search.

Error Analysis

Carrying out error analysis

  • Error analysis = manually examining the mistakes that your algorithm is making, which can give you insight into what to do next.
  • Example: in cat recognition, several factors contribute to the errors → build an ERROR ANALYSIS table → evaluate multiple ideas in parallel:

    | Image      | Dog | Great Cats | Blurry | Instagram filters | Comments         |
    |------------|-----|------------|--------|-------------------|------------------|
    | 1          | ✓   |            |        |                   | Pitbull          |
    | 2          |     |            |        |                   |                  |
    | 3          |     |            |        |                   | Rainy day at zoo |
    | 4          |     |            |        |                   |                  |
    | ...        |     |            |        |                   |                  |
    | % of total | 8%  | 43%        | 61%    | 12%               |                  |

    We focus on Great Cats and Blurry (they account for most of the errors).
  • To carry out error analysis → take a set of mislabeled examples from the dev set → look at why each was misclassified (false positive or false negative, count the errors per category) → decide whether to create a new category.
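The counting itself is simple; a sketch with hypothetical hand-assigned tags for the mislabeled dev-set images:

```python
from collections import Counter

# Hypothetical tags attached, by hand, to the mislabeled dev-set examples.
mislabeled = [
    {"dog"}, {"great_cat", "blurry"}, {"blurry"}, {"instagram_filter"},
    {"great_cat"}, {"blurry", "great_cat"}, {"dog", "blurry"},
    # ... one entry per mislabeled image
]

counts = Counter(tag for tags in mislabeled for tag in tags)
total = len(mislabeled)
for tag, c in counts.most_common():
    print(f"{tag}: {100 * c / total:.0f}% of errors")
# The categories with the largest share set the ceiling on how much fixing them
# could help -> work on those first.
```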

Cleaning up incorrectly labeled data

  • In training: DL algo is robust to random errors → we can ignore them!
    • However, DL algo is LESS robust to systematic errors.
  • Solution: use the error analysis table to decide which types of error to focus on next (based on their fraction of the total errors).
  • (Recall) The purpose of dev set is to help you select between 2 classifier A and B.
  • If you decide to fix labels:
      1. Apply the same correction process to the dev and test sets to make sure they still come from the same distribution!
      2. Also examine examples your algorithm got right (not only the ones it got wrong) → otherwise you end up with an overly optimistic (overfitted) estimate!
      3. Train and dev/test may end up with slightly different distributions → it's less important to correct mislabeled examples in the training set (the algorithm is fairly robust to that)!

When starting a new project?

Advice: Build your first system quickly and then iterate!!
  1. Quickly set up dev/test sets + metric.
  2. Build the initial system quickly.
  3. Use bias/variance analysis and error analysis → prioritize the next step!

Mismatched training and dev/test set

Training & testing on different distribution

  • Example: training data (photos from the internet, 200K) vs dev & test data (photos from phones, 4K).
  • Shouldn't: shuffle all 204K photos and split them into train/dev/test!
  • Should:
    • Train = 200K (web) + 2K (mobile).
    • Dev = 1K (mobile), Test = 1K (mobile).
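A sketch of that split, assuming two hypothetical lists of file paths, `web_images` and `mobile_images`:

```python
import random

def split_mismatched(web_images, mobile_images, seed=0):
    """Put all web data + part of the mobile data in train; dev/test are mobile only."""
    rng = random.Random(seed)
    mobile = mobile_images[:]
    rng.shuffle(mobile)
    half = len(mobile) // 2                 # e.g. 2K of the 4K mobile photos go to train
    quarter = len(mobile) // 4
    train = web_images + mobile[:half]      # 200K web + 2K mobile
    dev = mobile[half:half + quarter]       # 1K mobile, same distribution as test
    test = mobile[half + quarter:]          # 1K mobile, the distribution we care about
    return train, dev, test
```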

Bias and Variance with mismatched data dist

  • Sometimes dev error > training error → (possibly) the data in the dev set is simply harder to predict than the data in the training set.
  • When going from the training error to the dev error, two things change at once:
      1. The algorithm saw the data in the training set but not in the dev set.
      2. The distribution of the data in the dev set is different!
  • IDEA: create a new "train-dev" set which has the same distribution as the training data but is not used for training.
  • Keys to be considered: Human error, Train error, Train-dev error, Dev error, Test error (see the sketch at the end of this subsection):
    • Avoidable bias = train - human.
    • Variance problem = train-dev - train
    • Data mismatch = dev - train-dev
    • Overfitting to dev set = test - dev
  • If there is a huge gap between dev & test error → you have over-tuned to the dev set → you may need a bigger dev set!
  • Example 1: A high variance problem! (train/train-dev → big, train-dev/dev → small)
    • Human error: 0%
    • Train error: 1%
    • Train-dev error: 9%
    • Dev error: 10%
  • Example 2: data mismatch problem (train/train-dev → small, train-dev/dev → big)
    • Human error: 0%
    • Train error: 1%
    • Train-dev error: 1.5%
    • Dev error: 10%
  • Example 3: avoidable bias problem (because training err is much worse than human level, others are small)
    • Human error: 0%
    • Train error: 10%
    • Train-dev error: 11%
    • Dev error: 12%
  • Example 4: avoidable bias problem AND data mismatch problem (human/train → big, train-dev/dev → big):
    • Human error: 0%
    • Train error: 10%
    • Train-dev error: 11%
    • Dev error: 20%
  • Remark: most of the time the errors increase as you go from human-level → train → train-dev → dev/test. However, if (sometimes) the dev/test error turns out lower than the train-dev error, rewrite all the above errors into a more general table (rows: general data vs dev/test-like data; columns: human level / error on examples trained on / error on examples not trained on):
    • Error table. Image from the course.
      By also measuring, by hand, the human-level error on the dev/test-like data (e.g. 6%), you can judge the dev/test error: if it is also around 6%, your system is in fact doing GOOD on that data.
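A sketch that turns the five error numbers into the four gaps listed above (here fed with Example 2's values):

```python
def decompose(human, train, train_dev, dev, test=None):
    gaps = {
        "avoidable bias": train - human,        # how much better a human still is
        "variance":       train_dev - train,    # same distribution, unseen data
        "data mismatch":  dev - train_dev,      # different distribution, unseen data
    }
    if test is not None:
        gaps["overfitting to dev set"] = test - dev
    return gaps

# Example 2 above: the dominant gap is data mismatch.
print(decompose(human=0.00, train=0.01, train_dev=0.015, dev=0.10))
```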

Addressing data mismatch

  • Addressing data mismatch (no guarantee it will work, but you can try):
    • Carry out manual error analysis → try to understand the differences between the training and dev/test errors.
    • Make the training data more similar to the dev/test set, or collect more data similar to the dev/test set.
  • Artificial data synthesis:
    • "The quick brown fox jumps over the lazy dog" → a short sentence that contains all the letters A-Z in English.
    • Create data manually by combining 2 different sources (e.g. clean audio + car noise). However, BE CAREFUL if one of the 2 sources is much smaller than the other: the network may overfit to the smaller one!
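A numpy sketch of the synthesis idea and of the overfitting risk (scaled-down random arrays standing in for real audio):

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled-down stand-ins: many clean speech clips, but only ONE noise clip available.
clean_speech_clips = [rng.standard_normal(1_600) for _ in range(100)]
car_noise_clips = [rng.standard_normal(1_600)]        # tiny pool of noise data

def synthesize(clean, noise_pool, noise_level=0.3):
    """Mix a clean clip with a randomly chosen noise clip from the pool."""
    noise = noise_pool[rng.integers(len(noise_pool))]
    return clean + noise_level * noise

synthetic_in_car_speech = [synthesize(c, car_noise_clips) for c in clean_speech_clips]
# Every synthesized clip reuses the same small amount of noise, so the network can
# overfit to that particular noise even though it all sounds fine to a human ear.
```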

Learning from multiple tasks

Transfer learning

  • IDEA: a network already trained on one task (task A) + not enough data for the current task (task B) → reuse the trained network for the current task.
    • Transfer learning. Image from the course.
  • To do transfer learning, delete the last layer of the NN and its weights, then (see the sketch at the end of this section):
      1. Option 1: if you have a small dataset, keep all the other weights fixed. Add a new last layer (or layers), initialize the new layer's weights, feed the new data to the NN, and learn only the new weights.
      2. Option 2: if you have enough data, you can retrain all the weights.
  • Pretraining = training on task A.
  • Fine-tuning = using pretrained weights + use new data to train task B.
  • This idea is useful because some of layers of trained NN contain helpful information for the new problem.
  • Transfer learning makes sense when (e.g. from A to B):
      1. Task A and task B have the same input X.
      2. You have a lot more data for task A than for task B.
      3. Low-level features from A could be helpful for learning B.
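A PyTorch sketch of Option 1; the "pretrained" network here is a tiny stand-in for an actually pretrained model, not the course's code:

```python
import torch.nn as nn

# Tiny stand-in for a network already trained on task A (e.g. generic image recognition).
pretrained = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 1000),                # original output layer for task A
)

# Option 1 (small dataset for task B): freeze everything, then swap the last layer.
for param in pretrained.parameters():
    param.requires_grad = False         # keep the pretrained weights fixed

pretrained[-1] = nn.Linear(16, 2)       # new, randomly initialized head for task B

# Only the new head's parameters require gradients, so only it gets trained.
trainable = [p for p in pretrained.parameters() if p.requires_grad]

# Option 2 (enough data for task B): skip the freezing loop and fine-tune all weights.
```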

Multi-task learning

  • 1 NN does several tasks at the same time, and each of these tasks hopefully helps all of the other tasks!
  • Example: Autonomous driving example → Detect several things (not only 1) at the same time like: pedestrians, other cars, stop signs, traffic lights,...
    • Multi-task learning. Image from the course.
  • We use a logistic (sigmoid) output per label in the last layer. It's DIFFERENT from softmax regression because here one image can carry several labels at the same time!
  • If some entries of Y are unclear (e.g. we don't know whether there is a traffic light or not), we sum the loss only over the labeled entries and just ignore the unclear ones (see the sketch at the end of this section)!
  • Multi-task learning makes sense when:
      1. Training on a set of tasks that could benefit from having shared lower-level features.
      2. Usually: the amount of data you have for each task is quite similar.
      3. You can train a big enough NN to do well on all the tasks.
  • In general (with ENOUGH DATA), multi-task learning gives better performance!
  • Other remarks:
    • Multi-task learning (usually) works well in object detection.
    • (Usually) transfer learning is USED MORE OFTEN (and works better in practice) than multi-task learning!
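A PyTorch sketch of the last layer's loss with the "ignore unclear labels" trick (hypothetical shapes; unknown "?" labels encoded as -1):

```python
import torch
import torch.nn as nn

num_tasks = 4                            # e.g. pedestrian, car, stop sign, traffic light
logits = torch.randn(8, num_tasks)       # stand-in for the network's last-layer output

# Labels per image: 1 = present, 0 = absent, -1 = unknown ("?" entries in Y).
labels = torch.randint(-1, 2, (8, num_tasks)).float()

# One sigmoid per task (NOT softmax): one image can contain several objects at once.
loss_fn = nn.BCEWithLogitsLoss(reduction="none")
mask = labels >= 0                                        # keep only labeled entries
raw_loss = loss_fn(logits, labels.clamp(min=0.0))         # clamped -1s are masked out anyway
loss = (raw_loss * mask).sum() / mask.sum().clamp(min=1)  # average over known labels only
```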

End-to-end deep learning

  • Some data processing systems require multiple stages of processing → end-to-end learning replaces all of them with a single NN.
  • Example: machine translation from English to French. In this case end-to-end works better than a pipeline of separate sub-problems because there is enough (x, y) data!
    • End-to-end learning. Image from the course.
  • Example: an automatic gate-opening (face recognition) system. In this case a pipeline of separate tasks is better than end-to-end:
      1. First detect and crop the face.
      2. Then identify the person from the cropped face.
  • When end-to-end works, it works very well!
  • Pros & Cons:
    • Pros:
      • Let the data speak.
      • Less hand-designing of components needed.
    • Cons:
      • May need a lot of data.
      • Excludes potentially useful hand-designing components.
  • If having enough data → can think of using end-to-end!
  • Advice: carefully choose what type of x → y mapping to learn, depending on which tasks you can get data for!