Why sequence models?
- One of the most exciting areas in deep learning (DL)
- Models: RNNs have transformed speech recognition, NLP, ...
- Examples
Notations
- t (as in x^{<t>}): index of the position of a word in the sequence
- T_x, T_y: lengths of the input and output sequences
- (i) (as in x^{(i)}): index of the training example
- Representing words → based on a vocabulary (built from the words occurring in the sequences, or taken from an existing online vocabulary) → a common vector over all words
- Each word is represented by a one-hot vector over the vocabulary (sketched below)
- If a word is not in the vocab, we use "<UNK>" (unknown)
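A minimal sketch of this one-hot encoding, assuming a toy vocabulary (the words and helper name are just for illustration):

```python
import numpy as np

# Toy vocabulary; real systems use 10k-1M words plus an "<UNK>" token.
vocab = ["a", "and", "harry", "of", "orange", "potter", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a one-hot column vector; unknown words map to <UNK>."""
    vec = np.zeros((len(word_to_index), 1))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

x = [one_hot(w, word_to_index) for w in "Harry Potter and the orange".split()]
```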
RNN Model
- Why not use a standard network?
- Inputs and outputs can have different lengths in different examples (T_x^{(i)} ≠ T_x^{(j)}); you could pad everything to the max length, but it's not a good representation!
- A standard network doesn't share features learned across different positions of the text (e.g. the word "Harry" appearing at one position and at other positions both give information about a person's name)
- Like in a CNN, something learned from one part of the image can be generalized quickly to other parts of the image.
- Reduce the number of parameters in the model ← we don't want a very large input layer (with one-hot vectors)
- RNN (Unidirectional)
- at time step 2, it uses not only the input x^{<2>} but also the information from time step 1 (activation a^{<1>})
- Forward propagation
- Use a^{<t-1>} and x^{<t>} to compute a^{<t>} and ŷ^{<t>} (see the sketch below)
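A minimal numpy sketch of one forward step, following the standard updates a^{<t>} = tanh(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a) and ŷ^{<t>} = softmax(W_ya a^{<t>} + b_y); the sizes and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# Illustrative sizes: n_a hidden units, n_x = n_y = vocabulary size for one-hot inputs/outputs.
n_a, n_x, n_y = 5, 7, 7
rng = np.random.default_rng(0)
Waa, Wax, Wya = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), rng.normal(size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                       # a<0> = vector of zeros
for t in range(3):                           # unroll a few time steps
    x_t = np.zeros((n_x, 1)); x_t[t] = 1.0   # dummy one-hot input for step t
    a, y_hat = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)
```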
Backpropagation through time (red arrows in the figure below)
- Going backward in time
Different types of RNNs
Language model and sequence generation
- Speech recognition system → probability(sequence of words)
- Output a 10K-way softmax (10K is the number of words in the dictionary, i.e. the corpus) → probability of each word; whichever is highest → that is the word the user said!
Sampling novel sequences
- After training, we have the activations a^{<t>}; we then use them to sample a "novel" sequence (see the sketch after this list) → word-level RNN (based on the vocabulary)
- Character-level language model → not used too much today
- Pros: no need to worry about unknown words (words that do not appear in your vocabulary)
- Cons: much longer sequences!! → computationally expensive
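A minimal sketch of the sampling loop, assuming a hypothetical `step_fn` that wraps one forward step of the trained model (RNN/GRU/LSTM):

```python
import numpy as np

def sample_sequence(step_fn, vocab_size, a0, eos_index, max_len=50, seed=0):
    """Sample a word index from the model's softmax output at each step and feed it back in.
    `step_fn(x_t, a_prev) -> (y_hat, a_t)` stands for one forward step of the trained model."""
    rng = np.random.default_rng(seed)
    indices, a, x = [], a0, np.zeros((vocab_size, 1))
    for _ in range(max_len):
        y_hat, a = step_fn(x, a)
        idx = rng.choice(vocab_size, p=y_hat.ravel())   # sample (don't argmax) to get "novel" text
        indices.append(idx)
        if idx == eos_index:
            break
        x = np.zeros((vocab_size, 1)); x[idx] = 1.0     # sampled word becomes the next input
    return indices
```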
Vanishing gradients with RNNs
- Language can have very long-term dependencies, where a word that appears much earlier can affect what needs to come much later in the sentence. Ex:
- The cat, which already ate...., was full.
- The cats, which ...................., were full.
→ a basic RNN is not good at capturing very long-term dependencies ← because of vanishing gradients
→ the basic RNN model is dominated by local influences
- There is also the problem of "exploding gradients" (gradients growing with the depth of the network) → you see many NaN values in the output! ← solution: gradient clipping (rescale gradient vectors that exceed a threshold)
→ vanishing gradient is much harder to solve!
Gated Recurrent Unit (GRU) → a solution to the "vanishing gradient" problem → captures much longer-range dependencies
- Compared with an RNN unit
- Notation: c^{<t>} = memory cell; in the GRU, c^{<t>} = a^{<t>} (the activation), but in the LSTM they're different!
- Intuition: the gate Γ_u is (close to) either 0 or 1 (by using a sigmoid). "u" stands for "update"
- With Γ_u ≈ 0 → c^{<t>} ≈ c^{<t-1>} → the memory cell is maintained through a very long sequence → helps solve the vanishing gradient problem! (green color in the figure above; the update rules are sketched below)
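A sketch of the (full) GRU update rules in numpy, with the relevance gate Γ_r included; the weight shapes are assumptions, not taken from the course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wu, Wr, Wc, bu, br, bc):
    """One (full) GRU step; here a<t> = c<t>. Weight shapes: W* is (n_c, n_c + n_x)."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                              # update gate; ~0 keeps old memory
    gamma_r = sigmoid(Wr @ concat + br)                              # relevance gate (full GRU)
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)  # candidate memory
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev               # gated update: long-range info survives
    return c_t
```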
Long Short Term Memory (LSTM)
- It's even more powerful (more general) than the GRU. Historically, though, the LSTM came first.
- LSTM is default first thing to try.
- The original paper is really difficult to read
- We don't have the case a^{<t>} = c^{<t>} as in the GRU
- Gates: update (Γ_u), forget (Γ_f) and output (Γ_o).
→ The LSTM has 3 gates instead of the GRU's 2 (see the sketch below).
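A matching sketch of one LSTM step, again with assumed weight shapes; note that a^{<t>} and c^{<t>} are now different:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    """One LSTM step with separate forget, update, and output gates (so c<t> != a<t>)."""
    concat = np.vstack([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate memory
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # memory cell
    a_t = gamma_o * np.tanh(c_t)                 # hidden activation
    return a_t, c_t
```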
Bidirectional RNN (BRNN)
- At a point in time, it takes information from both earlier and later in the sequence.
- Come back to the example of Teddy:
- He said, "Teddy Roosevelt was a great President" → Teddy is a name of a person
- He said, "Teddy bears are on sale!" → Teddy is not a name of a person!
- Acyclic graph: forward prop runs in 2 directions (violet and green in the figure below)
- Ex: To get ŷ^{<t>} from both sides (see the equation after this list):
- From the forward activation, computed from the earlier inputs (violet way)
- From the backward activation, computed from the later inputs (yellow way)
- Cons: you DO need the entire sequence of data before you can make a prediction anywhere.
- Ex: speech recognition → wait for the person to stop talking (so that we have the entire sentence) and only then make the prediction → not so good in real time
→ use other techniques!
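In the course's notation, the prediction at each time step combines both activations:

```latex
\hat{y}^{<t>} = g\left(W_y\left[\overrightarrow{a}^{<t>},\ \overleftarrow{a}^{<t>}\right] + b_y\right)
```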
Deep RNNs
- Stack multiple layers of RNNs together!
- Notation: a^{[l]<t>} = activation of layer l at time step t
- For normal NN → many layers means deep
- For RNN → 3 layers is a lot!
- Sometimes, instead of outputting ŷ^{<t>} directly, we connect the RNN output to some normal (feed-forward) layers, like in the figure below.
→ We don't see very deep RNNs often because of their computational cost!
Introduction to Word Embeddings
- Word Representation → word embedding
- One of the most important ideas in NLP
- If we use one-hot vectors, it's difficult for an ML algorithm to generalize between words because of their representation in the vocabulary,
e.g. Apple (index 456) is very far from Orange (index 6257) → it cannot generalize from "I want a glass of orange juice" to "I want a glass of apple juice"
← because the inner product between any 2 one-hot vectors is 0.
→ We use a featurized representation (word embedding) instead! (see the sketch below)
- We can embed 300-D → 2-D for visualization ← using t-SNE or UMAP
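A small illustration of why embeddings help where one-hot vectors don't: the cosine similarity between two distinct one-hot vectors is always 0, while featurized vectors of related words are close. The 4-D feature values below are made up purely for illustration:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot: any two distinct words have inner product 0, so "apple" tells you nothing about "orange".
apple_oh = np.zeros(10000); apple_oh[456] = 1.0
orange_oh = np.zeros(10000); orange_oh[6257] = 1.0
print(cosine(apple_oh, orange_oh))      # 0.0

# Featurized vectors (made-up 4-D features, e.g. [fruit, royal, gender, food]):
apple_e  = np.array([0.95, 0.01, 0.00, 0.90])
orange_e = np.array([0.97, 0.00, 0.01, 0.88])
king_e   = np.array([0.02, 0.93, -0.90, 0.05])
print(cosine(apple_e, orange_e))        # close to 1 -> similar words
print(cosine(apple_e, king_e))          # close to 0 -> unrelated words
```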
- Using word embeddings
- There are already word embeddings "pre-trained" on very large text corpora from the internet (~1B words, up to 100B words)
→ apply them to your task, which may have a much smaller dataset of ~100K words.
→ this allows you to carry out transfer learning (learn from 1B words and transfer to 100K)
→ (optional) continue to fine-tune the word embeddings with your new data
→ Use a BRNN (bidirectional RNN) instead of a simple RNN.
- Word embeddings have a relation to face encoding
- The words "embedding" and "encoding" are used interchangeably
- Properties of word embeddings
- It can help with analogy reasoning
- Analogies are best found in the original embedding space (without t-SNE). After using something like t-SNE to map to a lower dimension, the similarity/parallelogram relationships are not guaranteed to hold. (A sketch of analogy completion is below.)
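A minimal sketch of analogy completion by cosine similarity; the tiny 3-D embedding table is invented purely to show the mechanics:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by maximizing cos(e_b - e_a + e_c, e_w)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Tiny made-up embedding table (real vectors are ~50-300 D).
E = {
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "king":  np.array([ 1.0, 0.9, 0.1]),
    "queen": np.array([-1.0, 0.9, 0.1]),
}
print(complete_analogy("man", "woman", "king", E))   # -> "queen"
```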
- Embedding matrix
- When you implement an algorithm to learn word embeddings → you end up with an embedding matrix E; multiplying E by the one-hot vector o_j selects the embedding e_j of word j (see the sketch below)
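A sketch of that relationship, E · o_j = e_j, and of the column lookup that frameworks actually use (random E and an illustrative index):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size))      # embedding matrix, learned during training

j = 6257                                        # e.g. the index of "orange" in the vocabulary
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1.0   # one-hot vector for word j

e_j = E @ o_j                                   # definition: E * o_j picks out column j
assert np.allclose(e_j, E[:, [j]])              # in practice frameworks just do this column lookup
```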
Learning Word Embedding → some concrete algorithms: Word2Vec & GloVe
- Learning word embeddings (we go from more complicated but more intuitive algorithms → simpler ones)
- Word2Vec skip-gram model → the context is any single nearby word, e.g. "orange" or "glass" or "my", ...
- Take a context word, skip some words in between → that gives the target word
- Cons: the computational cost of the softmax ← because of the sum over the whole vocabulary in the denominator → use a hierarchical softmax
- In practice, we sample the context c with different heuristics for common words (the, of, a, and, to, ...) vs. less common (but more informative) words (orange, apple, durian, ...). See the softmax sketch below.
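A sketch of the skip-gram softmax p(t | c) = exp(θ_tᵀ e_c) / Σ_j exp(θ_jᵀ e_c), showing where the expensive vocabulary-wide sum appears; parameter names and shapes are assumptions:

```python
import numpy as np

def skipgram_softmax_prob(theta, e_c, t):
    """p(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c).
    theta: (vocab_size, emb_dim) output parameters; e_c: (emb_dim,) context embedding.
    The sum over the whole vocabulary is the expensive part that motivates
    hierarchical softmax and negative sampling."""
    logits = theta @ e_c                     # one score per vocabulary word
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]
```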
- Negative sampling → a modified learning problem → makes it possible to train on much bigger training sets
- positive sampling: orange (context) - juice (word) → 1 (target)
- negative sampling: orange (context) - kind (word, chosen randomly from vocab) → 0 (target)
- How do you choose negative examples?
- Sample based on how often words appear in the corpus (P(w_i) proportional to the frequency f(w_i)) → cons: very high representation of "the, of, and, ..."
- use a uniform distribution (1 over the vocabulary size) → not representative of the corpus
- Usually use something in between: P(w_i) = f(w_i)^{3/4} / Σ_j f(w_j)^{3/4} (see the sketch below)
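A sketch of that heuristic sampling distribution, with toy counts and an assumed word order:

```python
import numpy as np

def negative_sampling_distribution(word_freqs):
    """P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4): a compromise between sampling
    by raw frequency and sampling uniformly."""
    f = np.asarray(word_freqs, dtype=float)
    weights = f ** 0.75
    return weights / weights.sum()

# Toy counts: the very frequent "the" is down-weighted relative to its raw frequency.
print(negative_sampling_distribution([5000, 300, 20]))   # e.g. counts of "the", "orange", "durian"
```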
- GloVe word vectors (= Global Vectors for word representation)
- has some momentum in the NLP community; not used as much as the Word2Vec / skip-gram models, but it has its enthusiasts
- X_{ij} → how often words i and j appear close to each other (i, j play the roles of c, t ↔ context and target words)
- What GloVe does is minimize Σ_{i,j} f(X_{ij}) (θ_iᵀ e_j + b_i + b'_j − log X_{ij})², where f(X_{ij}) is a weighting term that is 0 when X_{ij} = 0
Applications using Word Embeddings
- Sentiment classification → take a piece of text and tell whether someone likes/dislikes something
- Challenge → you may not have a huge labeled training set for it
- Simple sentiment classification model: average the word embeddings of the text, then a softmax (sketched below)
- RNN for sentiment classification: feed the sequence of embeddings into an RNN → takes word order into account
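A sketch of the simple model, assuming an `embeddings` lookup table and a learned softmax layer (W, b):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def simple_sentiment_model(words, embeddings, W, b):
    """Average the word embeddings of the review, then apply a single softmax layer.
    This ignores word order ("completely lacking in good taste" can fool it);
    feeding the embedding sequence into an RNN fixes that."""
    avg = np.mean([embeddings[w] for w in words], axis=0)
    return softmax(W @ avg + b)              # e.g. probabilities over 1-5 star ratings
```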
- Debiasing word embeddings
- "bias" → not the "bias" in "bias variant" → means "gender, ethnicity, sexual...bias
- The problem of bias in word embeddings:
- Addressing bias in word embeddings
- How you decide which words to neutralize? (step 2)
- train a classifier → figure out which words are definitional or not ← a linear classifier can tell you
- most words in English are not definitional (like babysitter and doctor); the neutralize step for such words is sketched below
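A sketch of the neutralize step, assuming a bias direction g (e.g. g = e_he − e_she):

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g,
    keeping only the part orthogonal to the bias axis."""
    e_bias_component = (e @ g) / (g @ g) * g   # projection of e onto g
    return e - e_bias_component
```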
Various sequence to sequence architectures
- Basic sequence-to-sequence model ← machine translation
- Encoder network: (RNN, GRU, LSTM) takes the French words as input → 1 word at a time → outputs a vector that represents the input sentence
- Decoder network: takes the output of the encoder → generates the translation 1 word at a time → outputs the English sentence
- This model works: given enough pairs of French-English sentences → it works well
- An architecture very similar to the above also works for image captioning ← describing an image
- Picking the most likely sentence
- The similarity between sequence-to-sequence and language model (week 1)
- Consider machine translation as building a conditional language model
- A language model gives the probability of a sentence and can generate novel sentences.
- Machine translation model: 2 parts - encoder (green) and decoder (violet) - where the decoder looks like a language model
- When you use this model for machine translation, you do not sample from this distribution at random! → instead, you want to maximize P(y^{<1>}, ..., y^{<T_y>} | x) ← using beam search!!!
- Beam Search
- Why not Greedy Search?
- It maximizes each word one at a time (max this word, then the next) instead of the whole sequence jointly
- Another reason: the number of possible sentences is huge → an exact word-by-word search is not feasible → an approximate search is better!
- Beam Search algo (video explains)
- Step 1: With a "beam width" (e.g. B = 3) → for the 1st output word (given the French input) → keep the 3 most likely words
- Step 2: For each of those 3 words → consider the possible 2nd words → keep the 3 most likely (1st word, 2nd word) pairs overall, and so on
- If B = 1 → beam search becomes greedy search! (a sketch of the algorithm is below)
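A minimal beam search sketch, assuming a hypothetical `step_fn` that returns the decoder's next-word distribution for a given prefix:

```python
import numpy as np

def beam_search(step_fn, eos, B=3, max_len=20):
    """Keep the B partial translations with the highest total log-probability at each step.
    `step_fn(prefix) -> np.array of P(next word | x, prefix)` stands for the decoder network."""
    beams = [([], 0.0)]                                    # (prefix, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:               # finished hypothesis, keep as-is
                candidates.append((prefix, score))
                continue
            probs = step_fn(prefix)
            top = np.argsort(probs)[-B:]                   # only the B best extensions matter
            candidates += [(prefix + [int(w)], score + np.log(probs[w])) for w in top]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]
```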
- Refinements to Beam Search
- Length normalization
- Sometimes we maximize a product of probabilities that becomes numerically tiny (underflow) → we maximize the sum of logs instead!!!
- arg max_y Π_t P(y^{<t>} | x, y^{<1>}, ..., y^{<t-1>}) gives the same result as arg max_y Σ_t log P(y^{<t>} | x, y^{<1>}, ..., y^{<t-1>}); for length normalization, divide the sum by T_y^α (e.g. α ≈ 0.7) so longer sentences are not unfairly penalized
- Beam width → the larger the width, the more possibilities considered, the better the result → but the more computationally expensive your algorithm is
- Try B = 1 → 3 → 10; 100, 1000, 3000 → be careful in production / commercial settings
- Unlike BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum.
- Error analysis in beam search ← what if beam search makes a mistake?
- What fraction of errors is due to beam search vs. the RNN model? Compare P(y*|x) (the human translation) with P(ŷ|x) (the algorithm's output): if P(y*|x) > P(ŷ|x), beam search is at fault; otherwise blame the RNN model.
- If Beam Search? → increase the beam width
- If the RNN? → deeper error analysis, regularization, more training data, a different network architecture, ...
- BLEU score ← what if multiple English translations are equally good for a French sentence?
- The BLEU score gives you an automatic way to evaluate your algorithm → speeds up development
- "BLEU" = bilingual evaluation understudy
- paper is readable
- Attention model intuition → look at one part of the sentence at a time
- A modification of the encoder-decoder → the attention model makes all of this work much better → one of the most influential ideas in deep learning
- the longer the sentence → the lower the BLEU score ⇒ because it's difficult for the network to memorize the whole sentence
- How much you should pay attention to a piece of a sentence
- Attention model ← how to implement it? → this video (a sketch of the attention weights is below)
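A sketch of computing one context vector from the attention scores; the alignment-network scores `scores_t` are assumed given:

```python
import numpy as np

def attention_context(a, scores_t):
    """alpha<t,t'> = softmax over t' of the scores e<t,t'>,
    then context<t> = sum_t' alpha<t,t'> * a<t'>.
    `a`: (Tx, n_a) encoder activations; `scores_t`: (Tx,) scores from the small alignment network."""
    alphas = np.exp(scores_t - scores_t.max())
    alphas /= alphas.sum()                      # attention weights sum to 1
    return alphas @ a                           # weighted sum of encoder activations
```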
Speech recognition - audio data
- Speech recognition ← how sequence-to-sequence models are applied to audio data
- audio (x-axis = time, y-axis = air pressure) → spectrogram (x-axis = time, y-axis = frequency, colors = amount of energy) ← a needed preprocessing step.
- Speech recognition usually uses "phonemes"
- phonemes = In linguistics, the smallest unit of speech that distinguishes one word sound from another. Phonemes are the elements on which computer speech is based.
- datasets → academic (300h, 3,000h), commercial (100,000h)
- Using Attention model
- Using CTC cost (Connectionist temporal classification)
- The number of input time steps is really large! (e.g. 10 s of audio at 100 Hz → 1000 inputs) → the number of inputs is large → but the output (the transcript) is much shorter than that! (see the collapse-rule sketch below)
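A sketch of the CTC output-collapsing rule (collapse repeated characters, then drop blanks); the example string is illustrative:

```python
def ctc_collapse(output, blank="_"):
    """CTC decoding rule: collapse repeated characters, then drop blanks,
    so a long frame-by-frame output maps to a short transcript."""
    collapsed, prev = [], None
    for ch in output:
        if ch != prev:              # keep only the first of each run of repeats
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq"))   # -> "the q"
```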
- Trigger word detection systems (like Alexa, Google Home, Apple Siri, Baidu DuerOS)