Why sequence models?
- One of the most exciting areas in deep learning (DL)
- Models: RNNs have transformed speech recognition, NLP, ...
- Examples
Notations
- t (as in x^{<t>}): index of the position of a word in the sequence
- T_x, T_y: lengths of the input and output sequences
- (i) (as in x^{(i)}): index of the training example
- Representing words → based on a vocabulary (built from the words occurring in the sequences, or taken from an existing online vocabulary) → a common vector over all words
- Each word is represented by a one-hot vector over the vocabulary (sketched below)
- If a word is not in the vocab, we use "<UNK>" (unknown)
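A minimal sketch of this one-hot encoding, assuming a toy vocabulary (the words and helper name are just for illustration):

```python
import numpy as np

# Toy vocabulary; real systems use 10k-1M words plus an "<UNK>" token.
vocab = ["a", "and", "harry", "of", "orange", "potter", "<UNK>"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word, word_to_index):
    """Return a one-hot column vector; unknown words map to <UNK>."""
    vec = np.zeros((len(word_to_index), 1))
    idx = word_to_index.get(word.lower(), word_to_index["<UNK>"])
    vec[idx] = 1.0
    return vec

x = [one_hot(w, word_to_index) for w in "Harry Potter and the orange".split()]
```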
RNN Model
- Why not use a standard network?
- Inputs and outputs can have different lengths in different examples (T_x^{(i)} ≠ T_x^{(j)}); you could pad everything to the max length, but it's not a good representation!
- A standard network doesn't share features learned across different positions of the text (e.g. the word "Harry" appearing at one position and at other positions both give information about a person's name)
- Like in a CNN, something learned from one part of the image can be generalized quickly to other parts of the image.
- Reduce the number of parameters in the model ← we don't want a very large input layer (with one-hot vectors)
- RNN (Unidirectional)
- at time step 2, it uses not only the input x^{<2>} but also the information from time step 1 (activation a^{<1>})
- Forward propagation
- Use a^{<t-1>} and x^{<t>} to compute a^{<t>} and ŷ^{<t>} (see the sketch below)
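A minimal numpy sketch of one forward step, following the standard updates a^{<t>} = tanh(W_aa a^{<t-1>} + W_ax x^{<t>} + b_a) and ŷ^{<t>} = softmax(W_ya a^{<t>} + b_y); the sizes and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_cell_forward(x_t, a_prev, Waa, Wax, Wya, ba, by):
    """One step: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba), y_hat<t> = softmax(Wya a<t> + by)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    y_hat_t = softmax(Wya @ a_t + by)
    return a_t, y_hat_t

# Illustrative sizes: n_a hidden units, n_x = n_y = vocabulary size for one-hot inputs/outputs.
n_a, n_x, n_y = 5, 7, 7
rng = np.random.default_rng(0)
Waa, Wax, Wya = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x)), rng.normal(size=(n_y, n_a))
ba, by = np.zeros((n_a, 1)), np.zeros((n_y, 1))

a = np.zeros((n_a, 1))                       # a<0> = vector of zeros
for t in range(3):                           # unroll a few time steps
    x_t = np.zeros((n_x, 1)); x_t[t] = 1.0   # dummy one-hot input for step t
    a, y_hat = rnn_cell_forward(x_t, a, Waa, Wax, Wya, ba, by)
```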
Backpropagation through time (red arrows in the figure below)
- Going backward in time
Different types of RNNs
Language model and sequence generation
- Speech recognition system → probability(sequence of words)
- Output a 10K-way softmax (10K is the number of words in the dictionary, i.e. the corpus) → probability of each word; whichever is highest → that is the word the user said!
Sampling novel sequences
- After training, we have the activations a^{<t>}; we then use them to sample a "novel" sequence (see the sketch after this list) → word-level RNN (based on the vocabulary)
- Character-level language model → not used too much today
- Pros: no need to worry about unknown words (words that do not appear in your vocabulary)
- Cons: much longer sequences!! → computationally expensive
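A minimal sketch of the sampling loop, assuming a hypothetical `step_fn` that wraps one forward step of the trained model (RNN/GRU/LSTM):

```python
import numpy as np

def sample_sequence(step_fn, vocab_size, a0, eos_index, max_len=50, seed=0):
    """Sample a word index from the model's softmax output at each step and feed it back in.
    `step_fn(x_t, a_prev) -> (y_hat, a_t)` stands for one forward step of the trained model."""
    rng = np.random.default_rng(seed)
    indices, a, x = [], a0, np.zeros((vocab_size, 1))
    for _ in range(max_len):
        y_hat, a = step_fn(x, a)
        idx = rng.choice(vocab_size, p=y_hat.ravel())   # sample (don't argmax) to get "novel" text
        indices.append(idx)
        if idx == eos_index:
            break
        x = np.zeros((vocab_size, 1)); x[idx] = 1.0     # sampled word becomes the next input
    return indices
```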
Vanishing gradients with RNNs
- Language can have very long-term dependencies, where a word that appears much earlier can affect what needs to come much later in the sentence. Ex:
- The cat, which already ate...., was full.
- The cats, which ...................., were full.
→ a basic RNN is not good at capturing very long-term dependencies ← because of vanishing gradients
→ the basic RNN model is dominated by local influences
- There is also the problem of "exploding gradients" (gradients growing with the depth of the network) → you see many NaN values in the output! ← solution: gradient clipping (rescale gradient vectors that exceed a threshold)
→ vanishing gradient is much harder to solve!
Gated Recurrent Unit (GRU) → a solution to the "vanishing gradient" problem → captures much longer-range dependencies
- Compared with an RNN unit
- Notation: c^{<t>} = memory cell; in the GRU, c^{<t>} = a^{<t>} (the activation), but in the LSTM they're different!
- Intuition: the gate Γ_u is (close to) either 0 or 1 (by using a sigmoid). "u" stands for "update"
- With Γ_u ≈ 0 → c^{<t>} ≈ c^{<t-1>} → the memory cell is maintained through a very long sequence → helps solve the vanishing gradient problem! (green color in the figure above; the update rules are sketched below)
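A sketch of the (full) GRU update rules in numpy, with the relevance gate Γ_r included; the weight shapes are assumptions, not taken from the course code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_forward(x_t, c_prev, Wu, Wr, Wc, bu, br, bc):
    """One (full) GRU step; here a<t> = c<t>. Weight shapes: W* is (n_c, n_c + n_x)."""
    concat = np.vstack([c_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)                              # update gate; ~0 keeps old memory
    gamma_r = sigmoid(Wr @ concat + br)                              # relevance gate (full GRU)
    c_tilde = np.tanh(Wc @ np.vstack([gamma_r * c_prev, x_t]) + bc)  # candidate memory
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev               # gated update: long-range info survives
    return c_t
```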
Long Short Term Memory (LSTM)
- It's even more powerful (more general) than the GRU. Historically, though, the LSTM came first.
- LSTM is default first thing to try.
- The original paper is really difficult to read
- We don't have the case a^{<t>} = c^{<t>} as in the GRU
- Gates: update (Γ_u), forget (Γ_f) and output (Γ_o).
→ The LSTM has 3 gates instead of the GRU's 2 (see the sketch below).
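A matching sketch of one LSTM step, again with assumed weight shapes; note that a^{<t>} and c^{<t>} are now different:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_forward(x_t, a_prev, c_prev, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    """One LSTM step with separate forget, update, and output gates (so c<t> != a<t>)."""
    concat = np.vstack([a_prev, x_t])
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate memory
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # memory cell
    a_t = gamma_o * np.tanh(c_t)                 # hidden activation
    return a_t, c_t
```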
Bidirectional RNN (BRNN)
- At a point in time, it takes information from both earlier and later in the sequence.
- Come back to the example of Teddy:
- He said, "Teddy Roosevelt was a great President" → Teddy is a name of a person
- He said, "Teddy bears are on sale!" → Teddy is not a name of a person!
- Acyclic graph: forward prop runs in 2 directions (violet and green in the figure below)
- Ex: To get ŷ^{<t>} from both sides (see the equation after this list):
- From the forward activation, computed from the earlier inputs (violet way)
- From the backward activation, computed from the later inputs (yellow way)
- Cons: you DO need the entire sequence of data before you can make a prediction anywhere.
- Ex: speech recognition → wait for the person to stop talking (so that we have the entire sentence) and only then make the prediction → not so good in real time
→ use other techniques!
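In the course's notation, the prediction at each time step combines both activations:

```latex
\hat{y}^{<t>} = g\left(W_y\left[\overrightarrow{a}^{<t>},\ \overleftarrow{a}^{<t>}\right] + b_y\right)
```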
Deep RNNs
- Stack multiple layers of RNNs together!
- Notation: a^{[l]<t>} = activation of layer l at time step t
- For normal NN → many layers means deep
- For RNN → 3 layers is a lot!
- Sometimes, instead of outputting ŷ^{<t>} directly, we connect the RNN output to some normal (feed-forward) layers, like in the figure below.
→ We don't see very deep RNNs often because of their computational cost!
Introduction to Word Embeddings
- Word Representation → word embedding
- One of the most important ideas in NLP
- If we use one-hot vectors, it's difficult for an ML algorithm to generalize between words because of their representation in the vocabulary,
e.g. Apple (index 456) is very far from Orange (index 6257) → it cannot generalize from "I want a glass of orange juice" to "I want a glass of apple juice"
← because the inner product between any 2 one-hot vectors is 0.
→ We use a featurized representation (word embedding) instead! (see the sketch below)
- We can embed 300-D → 2-D for visualization ← using t-SNE or UMAP
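A small illustration of why embeddings help where one-hot vectors don't: the cosine similarity between two distinct one-hot vectors is always 0, while featurized vectors of related words are close. The 4-D feature values below are made up purely for illustration:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# One-hot: any two distinct words have inner product 0, so "apple" tells you nothing about "orange".
apple_oh = np.zeros(10000); apple_oh[456] = 1.0
orange_oh = np.zeros(10000); orange_oh[6257] = 1.0
print(cosine(apple_oh, orange_oh))      # 0.0

# Featurized vectors (made-up 4-D features, e.g. [fruit, royal, gender, food]):
apple_e  = np.array([0.95, 0.01, 0.00, 0.90])
orange_e = np.array([0.97, 0.00, 0.01, 0.88])
king_e   = np.array([0.02, 0.93, -0.90, 0.05])
print(cosine(apple_e, orange_e))        # close to 1 -> similar words
print(cosine(apple_e, king_e))          # close to 0 -> unrelated words
```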
- Using word embeddings
- There are already word embeddings "pre-trained" on very large text corpora from the internet (~1B words, up to 100B words)
→ apply them to your task, which may have a much smaller dataset of ~100K words.
→ this allows you to carry out transfer learning (learn from 1B words and transfer to 100K)
→ (optional) continue to fine-tune the word embeddings with your new data
→ Use a BRNN (bidirectional RNN) instead of a simple RNN.
- Word embeddings have a relation to face encoding
- The words "embedding" and "encoding" are used interchangeably
- Properties of word embeddings
- It can help with analogy reasoning
- Analogies are best found in the original embedding space (without t-SNE). After using something like t-SNE to map to a lower dimension, the similarity/parallelogram relationships are not guaranteed to hold. (A sketch of analogy completion is below.)
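A minimal sketch of analogy completion by cosine similarity; the tiny 3-D embedding table is invented purely to show the mechanics:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def complete_analogy(a, b, c, embeddings):
    """Solve 'a is to b as c is to ?' by maximizing cos(e_b - e_a + e_c, e_w)."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# Tiny made-up embedding table (real vectors are ~50-300 D).
E = {
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "king":  np.array([ 1.0, 0.9, 0.1]),
    "queen": np.array([-1.0, 0.9, 0.1]),
}
print(complete_analogy("man", "woman", "king", E))   # -> "queen"
```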
- Embedding matrix
- When you implement an algorithm to learn word embeddings → you end up with an embedding matrix E; multiplying E by the one-hot vector o_j selects the embedding e_j of word j (see the sketch below)
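A sketch of that relationship, E · o_j = e_j, and of the column lookup that frameworks actually use (random E and an illustrative index):

```python
import numpy as np

vocab_size, emb_dim = 10000, 300
rng = np.random.default_rng(0)
E = rng.normal(size=(emb_dim, vocab_size))      # embedding matrix, learned during training

j = 6257                                        # e.g. the index of "orange" in the vocabulary
o_j = np.zeros((vocab_size, 1)); o_j[j] = 1.0   # one-hot vector for word j

e_j = E @ o_j                                   # definition: E * o_j picks out column j
assert np.allclose(e_j, E[:, [j]])              # in practice frameworks just do this column lookup
```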
Learning Word Embedding → some concrete algorithms: Word2Vec & GloVe
- Learning word embeddings (we go from more complicated but more intuitive algorithms → simpler ones)
- Word2Vec skip-gram model → the context is any single nearby word, e.g. "orange" or "glass" or "my", ...
- Take a context word, skip some words in between → that gives the target word
- Cons: the computational cost of the softmax ← because of the sum over the whole vocabulary in the denominator → use a hierarchical softmax
- In practice, we sample the context c with different heuristics for common words (the, of, a, and, to, ...) vs. less common (but more informative) words (orange, apple, durian, ...). See the softmax sketch below.
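A sketch of the skip-gram softmax p(t | c) = exp(θ_tᵀ e_c) / Σ_j exp(θ_jᵀ e_c), showing where the expensive vocabulary-wide sum appears; parameter names and shapes are assumptions:

```python
import numpy as np

def skipgram_softmax_prob(theta, e_c, t):
    """p(t | c) = exp(theta_t . e_c) / sum_j exp(theta_j . e_c).
    theta: (vocab_size, emb_dim) output parameters; e_c: (emb_dim,) context embedding.
    The sum over the whole vocabulary is the expensive part that motivates
    hierarchical softmax and negative sampling."""
    logits = theta @ e_c                     # one score per vocabulary word
    logits -= logits.max()                   # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[t]
```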
- Negative sampling → a modified learning problem → makes it possible to train on much bigger training sets
- positive sampling: orange (context) - juice (word) → 1 (target)
- negative sampling: orange (context) - kind (word, chosen randomly from vocab) → 0 (target)
- How do you choose negative examples?
- Sample based on how often words appear in the corpus (P(w_i) proportional to the frequency f(w_i)) → cons: very high representation of "the, of, and, ..."
- use a uniform distribution (1 over the vocabulary size) → not representative of the corpus
- Usually use something in between: P(w_i) = f(w_i)^{3/4} / Σ_j f(w_j)^{3/4} (see the sketch below)
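A sketch of that heuristic sampling distribution, with toy counts and an assumed word order:

```python
import numpy as np

def negative_sampling_distribution(word_freqs):
    """P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4): a compromise between sampling
    by raw frequency and sampling uniformly."""
    f = np.asarray(word_freqs, dtype=float)
    weights = f ** 0.75
    return weights / weights.sum()

# Toy counts: the very frequent "the" is down-weighted relative to its raw frequency.
print(negative_sampling_distribution([5000, 300, 20]))   # e.g. counts of "the", "orange", "durian"
```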
- GloVe word vectors (= Global Vectors for word representation)
- has some momentum in the NLP community; not used as much as the Word2Vec / skip-gram models, but it has its enthusiasts
- X_{ij} → how often words i and j appear close to each other (i, j play the roles of c, t ↔ context and target words)
- What GloVe does is minimize Σ_{i,j} f(X_{ij}) (θ_iᵀ e_j + b_i + b'_j − log X_{ij})², where f(X_{ij}) is a weighting term that is 0 when X_{ij} = 0
Applications using Word Embeddings
- Sentiment classification → take a piece of text and tell whether someone likes/dislikes something
- Challenge → you may not have a huge labeled training set for it
- Simple sentiment classification model: average the word embeddings of the text, then a softmax (sketched below)
- RNN for sentiment classification: feed the sequence of embeddings into an RNN → takes word order into account
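A sketch of the simple model, assuming an `embeddings` lookup table and a learned softmax layer (W, b):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def simple_sentiment_model(words, embeddings, W, b):
    """Average the word embeddings of the review, then apply a single softmax layer.
    This ignores word order ("completely lacking in good taste" can fool it);
    feeding the embedding sequence into an RNN fixes that."""
    avg = np.mean([embeddings[w] for w in words], axis=0)
    return softmax(W @ avg + b)              # e.g. probabilities over 1-5 star ratings
```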
- Debiasing word embeddings
- "bias" → not the "bias" in "bias variant" → means "gender, ethnicity, sexual...bias
- The problem of bias in word embeddings:
- Addressing bias in word embeddings
- How you decide which words to neutralize? (step 2)
- train a classifier → figure out which words are definitional or not ← a linear classifier can tell you
- most words in English are not definitional (like babysitter and doctor); the neutralize step for such words is sketched below
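A sketch of the neutralize step, assuming a bias direction g (e.g. g = e_he − e_she):

```python
import numpy as np

def neutralize(e, g):
    """Remove the component of embedding e along the bias direction g,
    keeping only the part orthogonal to the bias axis."""
    e_bias_component = (e @ g) / (g @ g) * g   # projection of e onto g
    return e - e_bias_component
```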
Various sequence to sequence architectures
- Basic sequence-to-sequence model ← machine translation
- Encoder network: (RNN, GRU, LSTM) takes the French words as input → 1 word at a time → outputs a vector that represents the input sentence
- Decoder network: takes the output of the encoder → generates the translation 1 word at a time → outputs the English sentence
- This model works: given enough pairs of French-English sentences → it works well
- An architecture very similar to the above also works for image captioning ← describing an image
- Picking the most likely sentence
- The similarity between sequence-to-sequence and language model (week 1)
- Consider machine translation as building a conditional language model
- A language model gives the probability of a sentence and can generate novel sentences.
- Machine translation model: 2 parts - encoder (green) and decoder (violet) - where the decoder looks like a language model
- When you use this model for machine translation, you do not sample from this distribution at random! → instead, you want to maximize P(y^{<1>}, ..., y^{<T_y>} | x) ← using beam search!!!
- Beam Search
- Why not Greedy Search?
- It maximizes each word one at a time (max this word, then the next) instead of the whole sequence jointly
- Another reason: the number of possible sentences is huge → an exact word-by-word search is not feasible → an approximate search is better!
- Beam Search algo (video explains)
- Step 1: With a "beam width" (e.g. B = 3) → for the 1st output word (given the French input) → keep the 3 most likely words
- Step 2: For each of those 3 words → consider the possible 2nd words → keep the 3 most likely (1st word, 2nd word) pairs overall, and so on
- If B = 1 → beam search becomes greedy search! (a sketch of the algorithm is below)
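A minimal beam search sketch, assuming a hypothetical `step_fn` that returns the decoder's next-word distribution for a given prefix:

```python
import numpy as np

def beam_search(step_fn, eos, B=3, max_len=20):
    """Keep the B partial translations with the highest total log-probability at each step.
    `step_fn(prefix) -> np.array of P(next word | x, prefix)` stands for the decoder network."""
    beams = [([], 0.0)]                                    # (prefix, sum of log-probabilities)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:               # finished hypothesis, keep as-is
                candidates.append((prefix, score))
                continue
            probs = step_fn(prefix)
            top = np.argsort(probs)[-B:]                   # only the B best extensions matter
            candidates += [(prefix + [int(w)], score + np.log(probs[w])) for w in top]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams[0][0]
```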
- Refinements to Beam Search
- Length normalization
- Sometimes we maximize a product of probabilities that becomes numerically tiny (underflow) → we maximize the sum of logs instead!!!
- arg max_y Π_t P(y^{<t>} | x, y^{<1>}, ..., y^{<t-1>}) gives the same result as arg max_y Σ_t log P(y^{<t>} | x, y^{<1>}, ..., y^{<t-1>}); for length normalization, divide the sum by T_y^α (e.g. α ≈ 0.7) so longer sentences are not unfairly penalized
- Beam width → the larger the width, the more possibilities considered, the better the result → but the more computationally expensive your algorithm is
- Try B = 1 → 3 → 10; 100, 1000, 3000 → be careful in production / commercial settings
- Unlike BFS (Breadth-First Search) or DFS (Depth-First Search), beam search runs faster but is not guaranteed to find the exact maximum.
- Error analysis in beam search ← what if beam search makes a mistake?
- What fraction of errors is due to beam search vs. the RNN model? Compare P(y*|x) (the human translation) with P(ŷ|x) (the algorithm's output): if P(y*|x) > P(ŷ|x), beam search is at fault; otherwise blame the RNN model.
- If Beam Search? → increase the beam width
- If the RNN? → deeper error analysis, regularization, more training data, a different network architecture, ...
- BLEU score ← what if multiple English translations are equally good for a French sentence?
- The BLEU score gives you an automatic way to evaluate your algorithm → speeds up development
- "BLEU" = bilingual evaluation understudy
- paper is readable
- Attention model intuition → look at one part of the sentence at a time
- A modification of the encoder-decoder → the attention model makes all of this work much better → one of the most influential ideas in deep learning
- the longer the sentence → the lower the BLEU score ⇒ because it's difficult for the network to memorize the whole sentence
- How much you should pay attention to a piece of a sentence
- Attention model ← how to implement it? → this video (a sketch of the attention weights is below)
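A sketch of computing one context vector from the attention scores; the alignment-network scores `scores_t` are assumed given:

```python
import numpy as np

def attention_context(a, scores_t):
    """alpha<t,t'> = softmax over t' of the scores e<t,t'>,
    then context<t> = sum_t' alpha<t,t'> * a<t'>.
    `a`: (Tx, n_a) encoder activations; `scores_t`: (Tx,) scores from the small alignment network."""
    alphas = np.exp(scores_t - scores_t.max())
    alphas /= alphas.sum()                      # attention weights sum to 1
    return alphas @ a                           # weighted sum of encoder activations
```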
Speech recognition - audio data
- Speech recognition ← how sequence-to-sequence models are applied to audio data
- audio (x-axis = time, y-axis = air pressure) → spectrogram (x-axis = time, y-axis = frequency, colors = amount of energy) ← a needed preprocessing step.
- Speech recognition usually uses "phonemes"
- phonemes = In linguistics, the smallest unit of speech that distinguishes one word sound from another. Phonemes are the elements on which computer speech is based.
- datasets → academic (300h, 3,000h), commercial (100,000h)
- Using Attention model
- Using CTC cost (Connectionist temporal classification)
- The number of input time steps is really large! (e.g. 10 s of audio at 100 Hz → 1000 inputs) → the number of inputs is large → but the output (the transcript) is much shorter than that! (see the collapse-rule sketch below)
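A sketch of the CTC output-collapsing rule (collapse repeated characters, then drop blanks); the example string is illustrative:

```python
def ctc_collapse(output, blank="_"):
    """CTC decoding rule: collapse repeated characters, then drop blanks,
    so a long frame-by-frame output maps to a short transcript."""
    collapsed, prev = [], None
    for ch in output:
        if ch != prev:              # keep only the first of each run of repeats
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != blank)

print(ctc_collapse("ttt_h_eee___ ___qqq"))   # -> "the q"
```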
- Trigger word detection systems (like Alexa, Google Home, Apple Siri, Baidu DuerOS)