Note for course DL 5: Sequence models

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

Week 1 - RNN (Recurrent Neural Networks)

Why sequence models?
  • 1 of the most exciting areas in DL
  • Models: RNN transforms speech recognition, NLP,...
  • Examples
    • Different/Equal length, X and/or Y is a sequence,...
Notations
  • t: index of the position of a word in the sequence, e.g. x^<t> is the t-th word
  • T_x (or T_y): length of the input (or output) sequence
  • (i): index of the training example, e.g. x^(i)<t>
  • Representing words → based on a Vocabulary (built from the words occurring in the sequences, or some already-built online vocabs) → a common vector of all words
    • Each word is represented by a one-hot vector based on the vocabulary vector (see the sketch after this list)
    • If some words are not in the vocab, we use "<UNK>" (Unknown)
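A minimal sketch of the one-hot representation, assuming a toy vocabulary (the words and the helper name `word_to_one_hot` are just for illustration):

```python
import numpy as np

# Toy vocabulary (real ones have ~10k-1M words), with "<UNK>" for unknown words.
vocab = ["a", "and", "harry", "potter", "visits", "africa", "in", "september", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def word_to_one_hot(word, word_to_index):
    """Return the one-hot column vector representing `word`."""
    vec = np.zeros((len(word_to_index), 1))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

x = [word_to_one_hot(w, word_to_index) for w in "Harry Potter visits Durian".split()]
print(x[0].T)   # one-hot for "harry"
print(x[-1].T)  # "durian" is not in the vocab -> mapped to "<UNK>"
```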
RNN Model
  • Why not use a standard network?
      1. Input and output can be different lengths in different examples (T_x, T_y can differ); even if you could pad everything to the max length, it's not a good representation!
      2. It doesn't share features learned across different positions of the text (ex: the word "Harry" at one position giving info about a person's name should help at other positions too)
          • Like in CNNs, something learnt from 1 part of the image can be generalized quickly to other parts of the image.
      3. Reduce #params in the model ← we don't want a very large input layer (with one-hot vectors)
  • RNN (Unidirectional)
    • at time step 2, it uses not only the input x^<2> but also the info from time step 1 (activation a^<1>)
    • - The version on the right is the rolled-up one, with the same meaning as the one on the left (it appears in some textbooks but is unclear/difficult to implement; Andrew doesn't use it in the course)
      - This is a "Unidirectional RNN", which means that we can only use the info of the previous words!!! → not very strong, ex:
      He said, "Teddy Roosevelt was a great President" → Teddy is a name of a person
      He said, "Teddy bears are on sale!" → Teddy is not a name of a person!
      - We use the notations W_ax, W_aa, W_ya (second index = what is multiplied, first index = what is computed)
      to indicate the params
  • Forward propagation
    • Use a^<t-1> and x^<t> to compute a^<t> = g(W_aa a^<t-1> + W_ax x^<t> + b_a) and ŷ^<t> = g(W_ya a^<t> + b_y) → sketch below
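A minimal numpy sketch of one forward time step in the course's notation (the sizes are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One time step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# Illustrative shapes: 10 input features, 5 hidden units, 10-word vocab for the output.
n_x, n_a, n_y = 10, 5, 10
rng = np.random.default_rng(0)
params = (rng.standard_normal((n_a, n_a)), rng.standard_normal((n_a, n_x)),
          rng.standard_normal((n_y, n_a)), np.zeros((n_a, 1)), np.zeros((n_y, 1)))
a, y = rnn_cell_forward(rng.standard_normal((n_x, 1)), np.zeros((n_a, 1)), *params)
print(a.shape, y.shape)  # (5, 1) (10, 1)
```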
Backpropagation through time (red arrows in below fig)
  • Going backward in time
Different types of RNNs
Language model and sequence generation
  • Speech recognition system → probability(sequence of words)
    • Output a 10K-way softmax (10K is the number of words in the dictionary/vocabulary) → prob of each word; whichever is highest → that one is the word the user said!
Sampling novel sequences
  • After training, we have the activations a^<t>; we then use them to sample a "novel" sequence → word-level RNN (based on the vocabulary) → sketch after this list
An important aspect to be explored, once a Language Model has been trained, is how well it can generate new or novel sequences.
  • Character-level language model → not used too much today
    • Pros: Don't worry about unknown words (words that don't appear in your vocabulary)
    • Cons: much longer sequences!! → more computationally expensive
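A minimal sketch of word-level sampling, assuming a trained step function with the same signature as `rnn_cell_forward` above and an "<EOS>" token in the vocab (both are assumptions for illustration):

```python
import numpy as np

def sample_sequence(step_fn, params, vocab, a0, eos="<EOS>", max_len=50, rng=np.random.default_rng(0)):
    """Sample a novel sequence: feed the previously sampled word back in as the next input."""
    n_x = len(vocab)                # in a language model the input is the one-hot of the previous word
    x = np.zeros((n_x, 1))          # first input is the zero vector
    a = a0
    words = []
    for _ in range(max_len):
        a, y_hat = step_fn(x, a, *params)
        idx = rng.choice(n_x, p=y_hat.ravel())   # sample from the softmax distribution
        words.append(vocab[idx])
        if vocab[idx] == eos:
            break
        x = np.zeros((n_x, 1))      # one-hot of the sampled word becomes the next input
        x[idx] = 1.0
    return words
```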
Vanishing gradients with RNNs
  • language can have very long-term dependencies, where a word much earlier can affect what needs to come much later in the sentence. Ex:
    • The cat, which already ate...., was full.
    • The cats, which ...................., were full.
    • → RNN is not good at capturing very long-term dependencies ← because of vanishing gradients
      → the basic RNN model has many local influences
  • There is also the problem of "exploding gradients" (gradients growing with the depth of the NN) → there are many NaN values in the output! ← solution: gradient clipping (rescale/clip the gradient vectors, see the sketch below)
    → vanishing gradient is much harder to solve!
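A minimal sketch of gradient clipping; element-wise clipping into [-maxValue, maxValue] is one common variant (rescaling by the global norm is another):

```python
import numpy as np

def clip_gradients(gradients, max_value=5.0):
    """Clip every gradient array element-wise into [-max_value, max_value] to avoid exploding gradients / NaNs."""
    return {name: np.clip(g, -max_value, max_value) for name, g in gradients.items()}

grads = {"dWaa": np.array([[12.0, -0.3], [0.7, -40.0]])}
print(clip_gradients(grads)["dWaa"])   # large values are clipped to +/- 5
```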
Gated Recurrent Unit (GRU) → solution for the "vanishing gradient" problem → capture much longer range dependencies
  • Compared with an RNN unit
  • Notations: c^<t> = memory cell; in the GRU, c^<t> = a^<t> (activation), but in the LSTM they're different!
  • Intuition: the gate Γ_u is (almost always) either 0 or 1 (by using a sigmoid). "u" stands for "update"
  • With Γ_u ≈ 0 → c^<t> ≈ c^<t-1> → the value is maintained through a very long sequence → helps solve the vanishing gradient problem! (green color in above fig) → sketch below
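A minimal numpy sketch of the (simplified) GRU step with just the update gate Γ_u (the full GRU also has a relevance gate Γ_r); parameter names and shapes are assumptions:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wc, Wu, bc, bu):
    """Simplified GRU: c~<t> = tanh(Wc[c<t-1>, x<t>] + bc), Γu = σ(Wu[c<t-1>, x<t>] + bu),
    c<t> = Γu * c~<t> + (1 - Γu) * c<t-1>.  In the GRU, a<t> = c<t>."""
    concat = np.concatenate([c_prev, x_t], axis=0)
    c_tilde = np.tanh(Wc @ concat + bc)
    gamma_u = sigmoid(Wu @ concat + bu)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev   # if Γu ≈ 0, the memory cell is kept
    return c_t
```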
Long Short Term Memory (LSTM)
  • It's even more powerful (more general) than the GRU. However, historically, the LSTM came first.
  • LSTM is the default first thing to try.
  • The paper is really difficult to read
  • We don't have c^<t> = a^<t> like in the GRU
  • Gates: update (Γ_u), forget (Γ_f) and output (Γ_o).
    The LSTM has 3 gates instead of 2 in the GRU → sketch below
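A minimal numpy sketch of one LSTM step with the three gates; parameter names and shapes are assumptions:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wc, Wu, Wf, Wo, bc, bu, bf, bo):
    """c~<t> = tanh(Wc[a<t-1>, x<t>] + bc);  c<t> = Γu * c~<t> + Γf * c<t-1>;
    a<t> = Γo * tanh(c<t>)  (so a<t> != c<t>, unlike the GRU)."""
    concat = np.concatenate([a_prev, x_t], axis=0)
    c_tilde = np.tanh(Wc @ concat + bc)
    gamma_u = sigmoid(Wu @ concat + bu)   # update gate
    gamma_f = sigmoid(Wf @ concat + bf)   # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)   # output gate
    c_t = gamma_u * c_tilde + gamma_f * c_prev
    a_t = gamma_o * np.tanh(c_t)
    return a_t, c_t
```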
Bidirectional RNN (BRNN) ← the models above are all unidirectional
  • At a point in time, it takes the info of both earlier and later in the sequence.
  • Come back to the example of Teddy:
    - He said, "Teddy Roosevelt was a great President" → Teddy is a name of a person
    - He said, "Teddy bears are on sale!" → Teddy is not a name of a person!
  • Acyclic graph: forward prop contains 2 directions (violet and green in the fig below)
    • Ex: To get ŷ^<t> from both sides:
      • From the forward activation a→^<t> (violet way)
      • From the backward activation a←^<t> (yellow way)
  • Cons: We DO need the entire sequence of data before you can make prediction anywhere.
    • Ex: speech recognition → wait for the person to stop talking (so that we have the entire sentence) and then we can make the prediction → not so good in real time
    • → use other techniques!
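A minimal Keras sketch of a bidirectional LSTM making a per-word prediction (e.g. "is this word a person's name?"); the sequence length, embedding size and units are illustrative assumptions:

```python
import tensorflow as tf

# Illustrative sizes: sequences of 30 time steps, each a 300-d word embedding.
inputs = tf.keras.Input(shape=(30, 300))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
outputs = tf.keras.layers.Dense(2, activation="softmax")(x)   # per word: name / not a name
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```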
Deep RNNs
  • Stack multiple layers of RNN together!
  • Notation: a^[l]<t> = activation of layer l at time step t
  • For a normal NN → many layers means deep
  • For an RNN → 3 layers is a lot!
  • Sometimes, instead of outputting ŷ^<t> directly, we connect to some normal NN layers like in the fig below (sketch after this list).
→ We don't see deep RNNs very often because of their computational cost!
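A minimal Keras sketch of a deep RNN: 3 stacked GRU layers (already "a lot"), with some normal dense layers on top of each time step; all sizes are illustrative assumptions:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 300))                    # variable-length sequences of 300-d inputs
x = tf.keras.layers.GRU(64, return_sequences=True)(inputs)    # layer [1]
x = tf.keras.layers.GRU(64, return_sequences=True)(x)         # layer [2]
x = tf.keras.layers.GRU(64, return_sequences=True)(x)         # layer [3]
x = tf.keras.layers.Dense(32, activation="relu")(x)           # "normal NN" applied at each time step
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.summary()
```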

Week 2 - NLP & Word Embeddings

Introduction to Word Embeddings
  • Word Representation → word embedding
    • 1 of the most important ideas in NLP
    • If we use one-hot vectors, it's difficult for the ML algo to generalize between words because of their representation in the vocabulary,
      eg. Apple (456) is very far away from Orange (6257) → cannot generalize from "I want a glass of orange
      juice" to "I want a glass of apple juice"
      ← because the inner product between any 2 one-hot vectors is 0.
    • → We use Featurized representation instead!
      300 dimensional vector or 300 dimensional embedding
    • We can embed 300 D → 2 D for visualizing ← using t-SNE or UMAP
  • Using word embeddings
    • There are already "pre-trained" word embeddings, trained on very large text corpora from the internet (~1B words, up to 100B words)
      → apply them to your task with a much smaller set of, say, 100K words.
      → allows you to carry out
      Transfer Learning (learn from 1B words and transfer to 100K)
      → (optional) continue to finetune the word embeddings with new data
      → Use BRNN (bidirectional RNN) instead of simple RNN.
    • Word embeddings have a relation to face encoding
      • The words "embedding" and "encoding" are used interchangeably
  • Properties of word embeddings
    • It can help with analogy reasoning
      • "sim" means "similarity"
    • It's better to find analogies in the original 300-D space (without t-SNE). After using something like t-SNE to embed into a smaller dimension → the similarities are not guaranteed to hold.
    • Cosine similarity → common way to measure the similarity between 2 word embeddings
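A minimal sketch of cosine similarity and analogy reasoning; the toy 4-d embeddings below are made up just for illustration (real ones are ~300-d and learned):

```python
import numpy as np

def cosine_similarity(u, v):
    """cos(u, v) = u.v / (||u|| ||v||)"""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

E = {
    "man":   np.array([ 1.0, 0.0, 0.2, 0.1]),
    "woman": np.array([-1.0, 0.0, 0.2, 0.1]),
    "king":  np.array([ 0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([-0.9, 0.8, 0.1, 0.0]),
}

# Analogy "man is to woman as king is to ?": find w maximizing sim(e_w, e_king - e_man + e_woman).
target = E["king"] - E["man"] + E["woman"]
best = max((w for w in E if w != "king"), key=lambda w: cosine_similarity(E[w], target))
print(best)   # queen
```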
  • Embedding matrix
    • When you implement an algo to learn word embedding → you end up with an Embedding matrix
    • In practice, we use an "Embedding layer" (a lookup) instead of the matrix multiplication above ← more efficient! → sketch below
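A minimal sketch of the two equivalent views, E @ one-hot vs a direct column lookup (an "Embedding layer" is essentially the lookup); sizes and the index for "orange" are illustrative:

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((emb_dim, vocab_size))   # embedding matrix (300, 10000)

j = 6257                                         # index of "orange" (illustrative)
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1.0    # one-hot vector

e_from_matmul = E @ o_j                          # (300, 1) but costs a full matrix product
e_from_lookup = E[:, [j]]                        # same vector, just a column lookup
print(np.allclose(e_from_matmul, e_from_lookup)) # True
```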
Learning Word Embedding → some concrete algos: Word2Vec & GloVe
  • Learning word embeddings (go from more complicated but more intuitive → simpler)
  • Word2Vec Skip-gram model → the context is any nearby 1 word, eg. "orange" or "glass" or "my",...
    • Take 1 context word, skip some words in between → predict the target word
    • Cons: computational cost of the softmax ← because of the sum over the whole vocab in the denominator → use hierarchical softmax (see the sketch after this list)
    • In practice, we don't sample the context word c uniformly: different heuristics balance common words (the, of, a, and, to,...) with uncommon (but more important) words (orange, apple, durian,...)
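A minimal sketch of the skip-gram softmax p(t | c) = exp(θ_t · e_c) / Σ_j exp(θ_j · e_c); the sum over the whole vocab in the denominator is the expensive part mentioned above (all names and sizes are illustrative):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, emb_dim)) * 0.01      # context embeddings e_c
Theta = rng.standard_normal((vocab_size, emb_dim)) * 0.01  # target parameters θ_t

def p_target_given_context(c_idx, t_idx):
    logits = Theta @ E[c_idx]              # one dot product per word in the vocab -> costly
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t_idx]

print(p_target_given_context(c_idx=6257, t_idx=4834))  # ~1/10000 for random params
```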
  • Negative Sampling → modified learning problem → much cheaper to train, so you can scale to a much bigger training set
    • positive example: orange (context) - juice (word) → 1 (target)
    • negative example: orange (context) - kind (word, chosen randomly from the vocab) → 0 (target)
    • turn it into a standard binary classification problem (1 vs 0)
    • How do you choose negative examples?
      • Sample based on "how often words appear in the corpus" → cons: very high representation of "the, of, and,..."
      • use the uniform 1/|V| (V = vocab) → not representative of the corpus
      • Usually use P(w_i) = f(w_i)^{3/4} / Σ_j f(w_j)^{3/4}, something in between → sketch below
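A minimal sketch of that heuristic sampling distribution for picking negative examples; the word frequencies below are made up for illustration:

```python
import numpy as np

words = ["the", "of", "and", "orange", "apple", "durian"]
freq = np.array([5.0e6, 3.0e6, 2.5e6, 1.2e4, 1.0e4, 8.0e2])   # illustrative corpus counts f(w)

p = freq ** 0.75
p /= p.sum()                    # between "pure frequency" and "uniform 1/|V|"

rng = np.random.default_rng(0)
negatives = rng.choice(words, size=5, p=p)   # sample k negative words for one positive example
print(dict(zip(words, p.round(3))), negatives)
```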
  • GloVe word vectors (= Global Vectors for word representation)
    • has some momentum in the NLP community; not used as much as the word2vec/skip-gram models, but it has its enthusiasts (and it's simpler)
    • X_ij → how often words i and j appear close to each other (i, j play the role of c, t ↔ context and target words)
    • What GloVe does is minimize Σ_{i,j} f(X_ij) (θ_i^T e_j + b_i + b'_j − log X_ij)², where f(X_ij) is a weighting term with f(0) = 0
Applications using Word Embeddings
  • Sentiment Classification → based on a piece of text, tell whether someone likes/dislikes something
    • Challenge → you may not have a huge labeled training set for it
    • Simple sentiment classification model (average the word embeddings → softmax; see the sketch after this list)
      • Cons: ignores word order, e.g. the sentence in the fig is negative but contains many "good" → use an RNN instead!
    • RNN for sentiment classification
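A minimal sketch of the simple "average the embeddings, then softmax" model (the one that ignores word order); the embeddings, vocab and classes below are made-up illustrations:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_sentiment(sentence, E, word_to_index, W, b):
    """Average the word embeddings of the sentence, then apply one softmax layer (e.g. 1-5 stars)."""
    idxs = [word_to_index[w] for w in (t.strip(",.:;!").lower() for t in sentence.split())
            if w in word_to_index]
    avg = E[:, idxs].mean(axis=1)            # (emb_dim,) - word order is completely ignored
    return softmax(W @ avg + b)              # (num_classes,)

emb_dim, num_classes, vocab = 300, 5, {"good": 0, "not": 1, "service": 2, "lacking": 3}
rng = np.random.default_rng(0)
E = rng.standard_normal((emb_dim, len(vocab)))
W, b = rng.standard_normal((num_classes, emb_dim)) * 0.01, np.zeros(num_classes)
print(predict_sentiment("Good for nothing: service lacking, not good", E, vocab, W, b))
```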
  • Debiasing word embeddings
    • "bias" → not the "bias" in "bias variant" → means "gender, ethnicity, sexual...bias
    • The problem of bias in word embeddings:
    • Addressing bias in word embeddings
      • More explanation:
        - step 1: identify the bias direction and the non-bias direction
        - step 2: neutral words are projected to remove the bias ("babysitter" and "doctor" are projected onto the non-bias axis so that "babysitter"→female, "doctor"→male disappears)
        - step 3: equalize the distances: e.g. "babysitter" is closer to "grandmother" → not right → make it equidistant from "grandmother" and "grandfather"
      • How you decide which words to neutralize? (step 2)
        • train the classifier → figure out what words are definitional or not ← a linear classifier can tell you
        • most words in English are not definitional (like babysitter and doctor) → sketch below
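A minimal sketch of step 2 (neutralize): remove the component of a non-definitional word along the bias direction g; step 3 (equalize) additionally makes pairs like "grandmother"/"grandfather" equidistant from that axis. The vectors below are toy values for illustration:

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g: e - (e.g / ||g||^2) * g."""
    return e - (e @ g) / (g @ g) * g

g = np.array([1.0, 0.0, 0.0])            # toy "gender" direction, e.g. e_woman - e_man
e_babysitter = np.array([0.7, 0.3, 0.1]) # toy embedding with an unwanted gender component
e_debiased = neutralize(e_babysitter, g)
print(e_debiased, e_debiased @ g)        # [0. 0.3 0.1] and 0.0 -> no gender component left
```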
 

Week 3 - Sequence models & Attention mechanism

Various sequence to sequence architectures
  • Basic Sequence to sequence model ← translation
    • Encoder network: (RNN, GRU, LSTM) reads the French words 1 word at a time → outputs a vector representing the input sentence
    • Decoder network: takes the output of the encoder → outputs the translation 1 word at a time
    • This model works: given enough pairs of French-English sentences → works well
    • An architecture very similar to the above also works for image captioning ← describe an image
      • Use a ConvNet (eg. a pretrained AlexNet in the fig) → instead of the softmax, we feed it into an RNN
  • Picking the most likely sentence
    • The similarity between sequence-to-sequence and language model (week 1)
      • Consider machine translation as building a conditional language model
        • The language model gives the probability of a sentence and can generate novel sentences.
        • Machine translation model: 2 parts - encoder (green), decoder (violet), where the decoder looks like a language model
    • When you use this model for machine translation, you don't sample at random from this distribution! → instead, you want to maximize P(y^<1>, ..., y^<T_y> | x) ← using Beam Search!!!
  • Beam Search
    • Why not Greedy Search?
      • Greedy maximizes each word one at a time, one after another, instead of maximizing the whole sequence
      • For the French sentence "Jane visite l'Afrique en Septembre" → Beam Search gives the upper (better) translation, Greedy Search gives the lower one → not very accurate!
      • Another reason: the number of possible sentences is huge → exact word-by-word search is not feasible → an approximate search algorithm is better!
    • Beam Search algo (video explains; sketch after this list)
      • Step 1: Use a "beam width" (eg. B=3) → for the 1st word of the output → keep the 3 most likely words
      • Step 2: For each of the above 3 words → consider the possible 2nd words → keep the 3 most likely pairs of the first 2 words
    • If B=1 → Beam Search becomes Greedy Search!
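A minimal sketch of beam search over log-probabilities, assuming a hypothetical `next_word_log_probs(prefix)` function that the decoder would provide (returning {word: log P(word | x, prefix)}); with beam_width=1 it reduces to greedy search:

```python
def beam_search(next_word_log_probs, beam_width=3, max_len=10, eos="<EOS>"):
    """Keep the `beam_width` best partial translations at each step (by summed log-probability)."""
    beams = [([], 0.0)]                                  # (prefix, sum of log probs)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:             # finished hypotheses stay as they are
                candidates.append((prefix, score))
                continue
            for word, logp in next_word_log_probs(prefix).items():
                candidates.append((prefix + [word], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]
```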
  • Refinements to Beam Search
    • Length normalization
      • Sometimes we maximize a product of probabilities P(y^<1>|x)·P(y^<2>|x,y^<1>)·... which becomes very small (numerical underflow) → we maximize the sum of log probabilities instead!!! (see the snippet after this list)
        • maximizing the log gives the same result as maximizing the original product (log is monotonically increasing); dividing by T_y^α (α ≈ 0.7) normalizes for length
    • Beam width → the larger the width, the more possibilities considered, the better the result → but the more computationally expensive your algo is
      • Try B = 1→3→10, 100, 1000, 3000 → be careful in production / commercial systems
    • Unlike BFS (Breadth First Search) or DFS (Depth First Search), Beam Search runs faster but is not guaranteed to find the exact maximum.
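A minimal sketch of the length-normalized score (1 / T_y^α) Σ_t log P(y^<t> | x, y^<1..t-1>) used to re-rank the finished beam-search hypotheses; the per-word log-probs below are made-up numbers:

```python
def normalized_score(log_probs, alpha=0.7):
    """log_probs: list of per-word log P(y<t> | x, y<1..t-1>) for one candidate translation."""
    T_y = len(log_probs)
    return sum(log_probs) / (T_y ** alpha)   # alpha=1: full normalization, alpha=0: none

short = [-1.0, -1.2]                  # 2-word candidate, raw sum -2.2
long_ = [-0.7, -0.8, -0.7, -0.8]      # 4-word candidate, raw sum -3.0 (loses without normalization)
print(normalized_score(short), normalized_score(long_))   # with normalization, the long one wins
```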
  • Error analysis in beam search ← what if beam search makes a mistake?
    • What fraction of errors due to Beam Search or RNN model?
      • If Beam Search? → increase the beam width
      • If RNN? → deeper analysis: regularization, more training data, different network architecture,...
  • Bleu Score ← multiple English translations can be equally good for a French sentence → how do we evaluate?
    • The Bleu score gives you an automatic way to evaluate your algo → speeds up development
    • "bleu" = bilingual evaluation understudy
    • paper is readable
    • Unigrams,
      bigrams → (in general) n-grams: measure the modified (clipped) precision of the candidate's n-grams against the references → sketch below
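A minimal sketch of the modified (clipped) n-gram precision that Bleu is built on; the real Bleu score combines the precisions for n = 1..4 plus a brevity penalty:

```python
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    """Count candidate n-grams, but clip each count by the max count seen in any reference."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(modified_precision("the the the the the the the".split(), refs, 1))  # 2/7, not 7/7
```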
  • Attention model intuition → look at a part of the sentence at a time
    • A modification of encoder-decoder → attention model makes all of this work much better → 1 of the most influential ideas in deep learning
    • the longer the sentence → the lower the Bleu score (for the basic encoder-decoder) ⇒ because it's difficult for the NN to memorize the whole sentence
    • How much should you pay attention to a piece of the sentence
    • α^<t,t'> tells you, when you're trying to generate the t-th English word, how much attention you should pay to the t'-th French word ⇒ this allows, at every time step, to look only within a local window of the French sentence to pay attention to when generating a specific English word.
  • Attention model ← how to implement? → this video
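A minimal numpy sketch of one decoder step of attention: energies e^<t,t'> are scored from the previous decoder state and each encoder activation, α^<t,t'> is their softmax, and the context is the α-weighted sum of the encoder activations. In the course the energies come from a small dense network; a plain dot-product score is used here only to keep the sketch short, and all sizes are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(a_enc, s_prev, score_fn):
    """a_enc: (Tx, n_a) encoder activations; s_prev: (n_s,) previous decoder state.
    Returns (context, alphas) where context = sum_t' alpha<t,t'> * a<t'>."""
    energies = np.array([score_fn(s_prev, a_t) for a_t in a_enc])   # e<t,t'> for t' = 1..Tx
    alphas = softmax(energies)                                      # attention weights, sum to 1
    context = alphas @ a_enc                                        # (n_a,)
    return context, alphas

Tx, n_a = 6, 4
rng = np.random.default_rng(0)
a_enc, s_prev = rng.standard_normal((Tx, n_a)), rng.standard_normal(n_a)
ctx, alphas = attention_context(a_enc, s_prev, score_fn=lambda s, a: float(s @ a))
print(alphas.round(2), ctx.shape)
```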
Speech recognition - audio data
  • Speech recognition ← how sequence-to-sequence model applied to audio data
    • audio (x-axis=time, y-axis=air pressure) → frequency spectrogram (x-axis=time, y-axis=frequency, colors=amount of energy) ← needs a preprocessing step.
    • Speech recognition systems used to be built using "phonemes"
      • phonemes = in linguistics, the smallest unit of speech that distinguishes one word sound from another. Phonemes are the elements on which (older) computer speech systems were based.
    • datasets → academic (300h, 3000h), commercial (100,000h)
    • Using Attention model
    • Using CTC cost (Connectionist temporal classification)
      • The number of time steps is really large! (eg. 10s of audio at 100 Hz → 1000 inputs) → #inputs is large → but the output doesn't have that many characters! → CTC lets the network output repeated characters plus a special "blank" token, then collapses them
  • Trigger word detection systems (like Alexa, Google Home, Apple Siri, Baidu DuerOS)
    • Simple approach: the training label is 0 everywhere and set to 1 for a few time steps right after the trigger word is said