Reading: Hands-On ML - Quick notes (Chapter 10 — 13)

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!
⚠️
This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.


Chapter 10. Introduction to Artificial Neural Networks with Keras

  • The Perceptron: one of the simplest ANN architectures (ANN = Artificial Neural Networks)
    • Figure 10-4. TLU (threshold logic unit): an artificial neuron that computes a weighted sum of its inputs plus a bias term b (z = wᵀx + b), then applies a step function: h(x) = step(z). (See the tiny sketch at the end of this list.)
  • Most common step function is Heaviside step function, sometimes sign function is used.
  • How is a perceptron trained? → follows Hebb’s rule. “Cells that fire together, wire together” (the connection weight between two neurons tends to increase when they fire simultaneously.)
  • perceptrons have limits (e.g., they cannot solve the XOR problem) → use a multilayer perceptron (MLP)
  • perceptrons do not output a class probability → use logistic regression instead.
  • When an ANN contains a deep stack of hidden layers → deep neural network (DNN)
  • In the old days, computers weren't powerful enough → training MLPs was a problem, even when using gradient descent.
  • Backpropagation : an algo to minimize the cost function of MLPs.
    • Forward propagation: from X to compute the cost J
    • Backward propagation: compute the derivatives and optimize the parameters → update the parameters
    • → Read this note (DL course 1).
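  • A tiny numerical sketch of the TLU computation above (the numbers are my own illustration, not from the book):
    import numpy as np

    x = np.array([1.0, 2.0])    # inputs
    w = np.array([0.5, -0.2])   # connection weights
    b = 0.1                     # bias term
    z = w @ x + b               # weighted sum: 0.5*1.0 - 0.2*2.0 + 0.1 = 0.2
    y = np.heaviside(z, 0)      # step function → 1.0, since z >= 0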
From this, I've decided to browse additional materials to deepen my understanding of Deep Learning. I found that the book has become more generalized than I expected, so I'll explore other resources before returning to finish it.
  • It's important to initialize all the hidden layers' connection weights randomly!
  • Replace the step function in MLPs with the sigmoid function, because the sigmoid has a well-defined nonzero derivative everywhere!
  • The ReLU activation (the rectified linear unit function) is continuous but not differentiable at 0. In practice, it works very well and is fast to compute, so it has become the default.
  • Some popular activations
  • Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
  • Regression MLPs → use MLPs for regression tasks (e.g., Scikit-Learn's MLPRegressor)
  • gradient descent does not converge very well when the features have very different scales
  • softplus activation (a smooth variant of ReLU): softplus(z) = log(1 + exp(z))
  • If you are not satisfied with the training result: try tuning the training hyperparameters (e.g., the learning rate), then fine-tune the model hyperparameters (number of layers, number of neurons per layer, …)
  • 3 ways to build Keras model: Sequential API (clean and straightforward), Functional API (multiple inputs/outputs), Subclassing API (to build dynamic models)
    • model = tf.keras.Sequential([
          tf.keras.layers.Flatten(input_shape=[28, 28]),
          tf.keras.layers.Dense(300, activation="relu"),
          tf.keras.layers.Dense(100, activation="relu"),
          tf.keras.layers.Dense(10, activation="softmax")
      ])
      Sequential API
       
      Figure 10-13. Wide & Deep neural network
      input_wide = tf.keras.layers.Input(shape=[5])  # features 0 to 4
      input_deep = tf.keras.layers.Input(shape=[6])  # features 2 to 7
      norm_layer_wide = tf.keras.layers.Normalization()
      norm_layer_deep = tf.keras.layers.Normalization()
      norm_wide = norm_layer_wide(input_wide)
      norm_deep = norm_layer_deep(input_deep)
      hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
      hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
      concat = tf.keras.layers.concatenate([norm_wide, hidden2])
      output = tf.keras.layers.Dense(1)(concat)
      model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
      Functional API
  • Sequential API is clean and straightforward. Need more complex topologies/multiple inputs or outputs → functional API.
  • Functional API
    • One example of a nonsequential neural network is a Wide & Deep neural network.
    • Each Dense layer is created and called on the same line. This is a common practice
      • norm_deep = norm_layer_deep(input_deep)

        # instead of
        hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
        hidden1 = hidden_layer1(norm_deep)

        # do
        hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
    • Multiple outputs: You could train one neural network per task, but in many cases you will get better results on all tasks by training a single neural network with one output per task.
    • you may want to add an auxiliary output in a neural network architecture (see Figure 10-15) to ensure that the underlying part of the network learns something useful on its own, without relying on the rest of the network.
    • Figure 10-15. Handling multiple outputs, in this example to add an auxiliary output for regularization
    • Each output needs its own loss function:
      • model.compile(
            loss=("mse", "mse"), loss_weights=(0.9, 0.1),
            # ...
        )
  • Both the Sequential API and the Functional API are "declarative" (easy to debug), but the drawback is that they are "static" (a fixed graph of layers to use). If we need loops, varying shapes, conditional branching, … → need the Subclassing API (tf.keras.Model). (See the sketch at the end of this list.)
  • Using TensorBoard for Visualization
  • Fine tuning NN Hyperparameters
    • One option is to convert your Keras model to a Scikit-Learn estimator, and then use GridSearchCV or RandomizedSearchCV to fine-tune the hyperparameters, as you did in Chapter 2.
    • A better way: use the Keras Tuner library, a hyperparameter tuning library for Keras models (see the sketch at the end of this list).
    • Number of hidden layers
    • Transfer learning
    • Number of Neurons per hidden layer
      • In the past, the practice was to make the layers progressively smaller, but this turned out not to hold; using the same number of neurons in all hidden layers works just as well (and then there is only one hyperparameter to tune).
      • (From someone at Google:) Just use many neurons in the first layers and then shrink gradually; this avoids a layer failing to capture enough information, in which case the following layers can never recover the lost information no matter how many neurons they have.
      • In general you will get more bang for your buck by increasing the number of layers instead of the number of neurons per layer.
    • The learning rate is arguably the most important hyperparameter.
    • Optimizer → chap 11
    • Batch size
      • The batch size can have a significant impact on your model’s performance and training time.
      • Use the largest batch size that can fit in GPU RAM. ← weak point: large batch sizes often lead to training instabilities, especially at the beginning of training, and the resulting model may not generalize as well as a model trained with a small batch size
      • using small batches (from 2 to 32) was preferable because small batches led to better models in less training time
      • one strategy is to try using a large batch size, with learning rate warmup, and if training is unstable or the final performance is disappointing, then try using a small batch size instead.
    • The optimal learning rate depends on the other hyperparameters—especially the batch size
    • [1803.09820] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Chapter 11 — Training Deep Neural Networks

The Vanishing/Exploding Gradients Problems

  • The exploding gradients problem (mostly encountered in recurrent neural networks)
  • The usual fix combines activation and initialization techniques.
  • Glorot and He Initialization
  • ReLU isn't perfect → the dying ReLUs problem: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. → use leaky ReLU
  • Leaky ReLU
    • Setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak)
    • There is also randomized leaky ReLU (RReLU), where α is picked randomly during training → also acts as a regularizer, reducing overfitting.
    • parametric leaky ReLU (PReLU), where α is learned during training → PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
    • ReLU, leaky ReLU, and PReLU all suffer from the fact that they are not smooth functions: their derivatives abruptly change (at z = 0)
  • ELU (exponential linear unit), SELU (scaled ELU)
    • SELU activation function may outperform other activation functions for MLPs, especially deep ones. But it requires some conditions to happen (check page 469).
  • Self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem.
  • GELU, Swish, and Mish: they tend to outperform the other activation functions consistently on most tasks
    • Mish overlaps almost perfectly with Swish when z is negative, and almost perfectly with GELU when z is positive.
  • Which one to use?
    • ReLU remains a good default for simple tasks
    • Swish is probably a better default for more complex tasks
    • If you care a lot about runtime latency, then you may prefer leaky ReLU.
    • deep MLPs, give SELU a try
  • Batch Normalization
    • The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation simply zero-centers and normalizes each input, then scales and shifts the result. (See the sketch at the end of this list.)
    • if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set.
    • It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name “batch normalization”)
    • So during training, BN standardizes its inputs, then rescales and offsets them. Good! What about at test time? Well, it’s not that simple. Indeed, we may need to make predictions for individual instances rather than for batches of instances: in this case, we will have no way to compute each input’s mean and standard deviation.
    • batch normalization acts like a regularizer, reducing the need for other regularization techniques
    • Batch normalization does, however, add some complexity to the model. The neural network makes slower predictions due to the extra computations.
    • The total training time with BN is usually shorter even though each epoch is slower, because it converges in fewer epochs.
    • The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we just did). ← this is somewhat debated
    • Batch normalization has become one of the most-used layers in deep neural networks, especially deep convolutional neural networks.
  • Gradient Clipping
    • A common technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold
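  • A minimal sketch combining several of the fixes above (He initialization, leaky ReLU, batch normalization, and gradient clipping); the layer sizes are my own illustration, not the book's exact code:
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        # He initialization pairs well with ReLU-family activations
        tf.keras.layers.Dense(300, kernel_initializer="he_normal"),
        tf.keras.layers.LeakyReLU(0.2),      # leaky ReLU with α = 0.2
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(100, kernel_initializer="he_normal"),
        tf.keras.layers.LeakyReLU(0.2),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # clip gradients by norm (or use clipvalue=...) to fight exploding gradients
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)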

Reusing Pretrained Layers

  • Transfer learning ← it will not only speed up training considerably, but also require significantly less training data.
  • transfer learning will work best when the inputs have similar low-level features.
  • The more similar the tasks are, the more layers you will want to reuse (starting with the lower layers). For very similar tasks, try to keep all the hidden layers and just replace the output layer.
  • How many layers should you reuse? → Try freezing all the reused layers first, then unfreeze one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. (See the sketch at the end of this list.)
    • The more training data you have, the more layers you can unfreeze.
    • After unfreezing the reused layers, it is usually a good idea to reduce the learning rate
    • You must always compile your model after you freeze or unfreeze layers.
  • When a paper just looks too positive, you should be suspicious ← so many results in science can never be reproduced.
  • It turns out that transfer learning does not work very well with small dense networks
    • small → few patterns
    • dense → very specific patterns (not useful for other tasks)
  • Unsupervised Pretraining
    • Use it when you don't have much labeled training data and unfortunately cannot find a model trained on a similar task.
    • If you can gather plenty of unlabeled training data, you can try to use it to train an unsupervised model, such as an autoencoder or a generative adversarial network (GAN)
    • Unsupervised pretraining (today typically using autoencoders or GANs rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.
    • Figure 11-6. In unsupervised training, a model is trained on all data, including the unlabeled data, using an unsupervised learning technique, then it is fine-tuned for the final task on just the labeled data using a supervised learning technique; the unsupervised part may train one layer at a time as shown here, or it may train the full model directly
  • Pretraining on an Auxiliary Task
    • If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.
    • Self-supervised learning is when you automatically generate the labels from the data itself
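  • A minimal transfer-learning sketch of the freeze/unfreeze workflow above, assuming the pretrained model is a Sequential model saved as "my_pretrained_model.keras" (a hypothetical file name):
    import tensorflow as tf

    base_model = tf.keras.models.load_model("my_pretrained_model.keras")
    model = tf.keras.Sequential(base_model.layers[:-1])        # reuse all layers except the old output layer
    model.add(tf.keras.layers.Dense(1, activation="sigmoid"))  # new output layer for the new task

    for layer in model.layers[:-1]:
        layer.trainable = False                  # freeze all reused layers first
    model.compile(loss="binary_crossentropy",    # always compile after (un)freezing
                  optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3))
    # ... train for a few epochs, then unfreeze the top reused layers with a lower learning rate
    for layer in model.layers[-3:-1]:
        layer.trainable = True
    model.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4))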

Faster Optimizers

So far, to speed up training:
  1. A good initialization
  2. A good activation function
  3. Using batch normalization
  4. Reusing a pretrained network (transfer learning)
  5. A faster optimizer
  • Momentum optimization
    • Idea: bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity
    • vs regular gradient descent: small steps when the slope is gentle and big steps when the slope is steep, but it will never increase speed.
    • A typical momentum value is 0.9
    • momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum). In deep neural networks that don’t use batch normalization, the upper layers will often end up having inputs with very different scales, so using momentum optimization helps a lot.
    • one drawback of momentum optimization is that it adds yet another hyperparameter to tune.
  • Nesterov Accelerated Gradient (NAG)
    • Figure 11-7. Regular versus Nesterov momentum optimization: the former applies the gradients computed before the momentum step, while the latter applies the gradients computed after
  • AdaGrad
    • Figure 11-8. AdaGrad versus gradient descent: the former can correct its direction earlier to point to the optimum
    • elongated bowl problem: gradient descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley.
    • the idea: the algorithm could correct its direction earlier to point a bit more toward the global optimum
    • AdaGrad frequently performs well for simple quadratic problems
    • you should not use it to train deep neural networks (it may be efficient for simpler tasks such as linear regression, though)
  • RMSProp
    • AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations
    • this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around
  • Adam (adaptive moment estimation)
    • combines the ideas of momentum optimization and RMSProp: it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
    • The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. The learning rate often uses the default value η = 0.001.
    • three variants of Adam: AdaMax, Nadam, AdamW
      • In practice, this can make AdaMax more stable than Adam, but it really depends on the dataset, and in general Adam performs better.
  • Adaptive optimization methods: RMSProp, Adam, AdaMax, Nadam, and AdamW optimization
  • if you are disappointed by your model's performance when using adaptive optimization, try using NAG instead (see the sketch below)
(In the book's optimizer comparison table, * is bad, ** is average, and *** is good.)
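  • How these optimizers look in Keras (a sketch; the learning rates are just the usual defaults, and AdamW requires a recent TF version):
    import tensorflow as tf

    # momentum optimization / Nesterov accelerated gradient
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
    # RMSProp (rho is the decay rate of the moving average of squared gradients)
    optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
    # Adam and AdamW
    optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
    optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)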

Learning Rate Scheduling

  • LR too high → training may diverge, too low → take very long time to converge
  • Limit budget? → interrupt training before it has converged properly
    • Learning curves for various learning rates η
  • Power scheduling: first drops quickly, then more and more slowly
  • Exponential scheduling: keeps slashing the learning rate by a factor of 10 every s steps
  • Piecewise constant scheduling: constant learning rate for a number of epochs. This solution can work very well.
  • Performance scheduling: measure the validation error every N steps and reduce the learning rate when the error stops dropping (see the sketch after this list).
  • 1cycle scheduling (not built into Keras) ← can converge very quickly.
  • To sum up, exponential decay, performance scheduling, and 1cycle can considerably speed up convergence, so give them a try!
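  • A sketch of two of these schedules in Keras (the numbers are illustrative): exponential scheduling via a schedule object, and performance scheduling via a callback:
    import tensorflow as tf

    # exponential scheduling: multiply the learning rate by 0.1 every 20,000 steps
    lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01, decay_steps=20_000, decay_rate=0.1)
    optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule)

    # performance scheduling: halve the LR when the validation loss stalls for 5 epochs
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)
    # model.fit(..., callbacks=[reduce_lr])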

Avoiding Overfitting Through Regularization

  • Early stopping: already introduced in Chapter 10, one of the best regularization techniques.
  • l1 and l2 Regularization
  • Dropout: one of the most popular regularization techniques for deep neural networks (see the sketch at the end of this list).
      • many state-of-the-art neural networks use dropout, as it gives them a 1%–2% accuracy boost.
      • at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out”
      • The dropout rate p is typically 10%–50%: closer to 20%–30% in recurrent neural nets, and closer to 40%–50% in convolutional neural networks
      • Neurons trained with dropout cannot coadapt with their neighboring neurons; they have to be as useful as possible on their own.
      • In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).
    • Warning: Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So, make sure to evaluate the training loss without dropout (e.g., after training).
    • If you observe that the model is overfitting, you can increase the dropout rate and vice versa.
    • many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.
  • Monte Carlo (MC) Dropout:
    • MC dropout tends to improve the reliability of the model’s probability estimates. This means that it’s less likely to be confident but wrong, which can be dangerous: just imagine a self-driving car confidently ignoring a stop sign. It’s also useful to know exactly which other classes are most likely.
  • Max-Norm Regularization: can also help alleviate the unstable gradients problems (if you are not using batch normalization).
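  • A sketch pulling these techniques together (l2 penalty, max-norm, dropout, early stopping, and an MC-dropout prediction loop); the sizes, rates, and names like X_test are illustrative assumptions:
    import numpy as np   # used in the MC dropout snippet below
    import tensorflow as tf

    def reg_dense(units):
        # Dense layer with an l2 weight penalty and a max-norm constraint
        return tf.keras.layers.Dense(
            units, activation="relu",
            kernel_regularizer=tf.keras.regularizers.l2(0.01),
            kernel_constraint=tf.keras.constraints.max_norm(1.0))

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=[28, 28]),
        reg_dense(100),
        tf.keras.layers.Dropout(rate=0.2),   # dropout on the top hidden layers only
        reg_dense(100),
        tf.keras.layers.Dropout(rate=0.2),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

    early_stopping = tf.keras.callbacks.EarlyStopping(patience=10, restore_best_weights=True)
    # model.fit(..., callbacks=[early_stopping])

    # MC dropout at inference: keep dropout active (training=True) and average many passes
    # y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
    # y_proba = y_probas.mean(axis=0)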

Summary and Practical Guidelines

  • If the network is a simple stack of dense layers, then it can self-normalize (see the SELU conditions above).
  • Don’t forget to normalize the input features
  • Should try to use pretrained NN
  • Use unsupervised pretraining if you have a lot of unlabeled data
  • Use pretraining on an auxiliary task.
  • If you need a sparse model, use l1 regularization.
  • If you need a low-latency model → use a fast activation function.
  • If you have a risk-sensitive application → MC dropout.

Chapter 12 — Custom Models and Training with TensorFlow

I didn't read this chapter carefully. It mainly introduces TensorFlow's functions, how to use them, and how to customize them as you wish. It will be better to reread it later once I actually work with TensorFlow.
Below are a few main points from the chapter:
  • 95% of the use cases you will encounter will not require anything other than Keras
  • TF → Its core is very similar to NumPy, but with GPU support.
  • Its lowest-level operations are implemented in highly efficient C++.
  • There are also APIs for C++, Java, Swift, and JavaScript.
  • TensorFlow’s API revolves around tensors, which flow from operation to operation—hence the name TensorFlow.
  • Tensors play nice with NumPy.
  • NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit. → when you create a tensor from a NumPy array, make sure to set dtype=tf.float32.
  • tf.Tensor values are immutable → use tf.Variable if you need values that can be modified (see the small sketch at the end of this list).
  • Other data structures (Appendix C): sparse tensors, tensor arrays, ragged tensors, string tensors, sets, queues.
  • Customizing Models and Training Algorithms:
    • Custom Loss functions
    • Saving and Loading Models That Contain Custom Components
    • Custom Activation Functions, Initializers, Regularizers, and Constraints
    • Custom Metrics
    • Custom Layers
    • Custom Models
    • Losses and Metrics Based on Model Internals
    • Computing Gradients Using Autodiff
    • Custom Training Loops
  • TensorFlow Functions and Graphs
    • AutoGraph and Tracing
    • TF Function Rules
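  • A small taste of the points above (tensors vs. variables, and a custom loss function), a sketch only:
    import tensorflow as tf

    t = tf.constant([[1., 2., 3.], [4., 5., 6.]])   # immutable tensor (float32 by default)
    v = tf.Variable([[1., 2.], [3., 4.]])           # mutable
    v.assign(2 * v)                                 # in-place modification

    # a custom loss can just be a Python function passed to compile()
    def huber_like_loss(y_true, y_pred, threshold=1.0):
        error = y_true - y_pred
        is_small = tf.abs(error) < threshold
        squared_loss = tf.square(error) / 2
        linear_loss = threshold * tf.abs(error) - threshold**2 / 2
        return tf.where(is_small, squared_loss, linear_loss)

    # model.compile(loss=huber_like_loss, optimizer="nadam")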

Chapter 13 — Loading and Preprocessing Data with TensorFlow

Like Chapter 2, I only skimmed this chapter.
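Still, the core API of the chapter is the tf.data pipeline. A minimal sketch (toy data, my own example):
  import tensorflow as tf

  # slice → shuffle → batch → prefetch
  X = tf.range(10)
  dataset = tf.data.Dataset.from_tensor_slices(X)
  dataset = dataset.shuffle(buffer_size=5, seed=42).batch(3).prefetch(1)
  for batch in dataset:
      print(batch)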