NLP-C2-W3: Autocomplete and Language Models

https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/3

N-gram and Sequence Probabilities

The probability of the N-gram (w_1, ..., w_n) is defined as the conditional probability

p(w_n | w_1, ..., w_{n-1}) = \frac{Count(w_1, ..., w_{n-1}, w_n)}{Count(w_1, ..., w_{n-1})}

Subsequently, we can approximate the sequence probability with bigrams as follows (higher-order N-grams follow the same structure):

p(w_1, ..., w_n) = p(w_1) \, p(w_2|w_1) \cdots p(w_n|w_{n-1})
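As a minimal sketch (the toy corpus and helper names below are illustrative, not from the course), bigram probabilities can be estimated directly from counts and chained together to score a sentence:

```python
from collections import Counter

# Toy tokenized corpus (illustrative)
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    # p(word | prev) = Count(prev, word) / Count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    # Chain rule with the bigram approximation:
    # p(w_1,...,w_n) = p(w_1) * p(w_2|w_1) * ... * p(w_n|w_{n-1})
    p = unigram_counts[sentence[0]] / sum(unigram_counts.values())
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigram_prob(prev, word)
    return p

print(bigram_prob("i", "like"))              # 1.0
print(sentence_prob(["i", "like", "tea"]))   # (2/9) * 1.0 * (2/3) ≈ 0.148
```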

In general, prepend N-1 start tokens <s> (so the first word has a full-length context) and append one end token </s>. This fixes:

  1. Getting the correct count for N-gram probability calculation

  2. Making the probabilities of all possible sentences of all possible lengths sum to 1, rather than having the probabilities sum to 1 separately for each fixed sentence length (this enables generalization across lengths)
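For example, a trigram model (N = 3) pads each sentence with two start tokens and one end token; a small sketch:

```python
def pad_sentence(tokens, n):
    # N-1 start tokens give the first word a full-length context;
    # a single end token lets the model assign probability to sentence endings.
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence(["the", "cat", "sat"], 3))
# ['<s>', '<s>', 'the', 'cat', 'sat', '</s>']
```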

Operationalization

Count Matrix:

  • rows: unique corpus (N-1)-grams

  • columns: unique corpus words

  • cell value: the count of N-gram (row, column)

Probability Matrix:

  • Divide each cell by its row sum
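A sketch of building both matrices for a bigram model with pandas (the toy sentences and variable names are illustrative):

```python
from collections import defaultdict
import pandas as pd

sentences = [["<s>", "i", "like", "tea", "</s>"],
             ["<s>", "i", "like", "coffee", "</s>"]]

counts = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        counts[prev][word] += 1          # row: (N-1)-gram, column: next word

count_matrix = pd.DataFrame(counts).T.fillna(0)                   # rows = (N-1)-grams
prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)  # divide by row sums

print(prob_matrix.loc["like"])  # p(tea|like) = 0.5, p(coffee|like) = 0.5
```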

Other considerations

  • use log probabilities to convert multiplication into summation (this avoids numerical underflow)

  • use <UNK> to mark out-of-vocabulary words (see the sketch after this list)

    • the vocabulary can be constructed through

      • minimum word frequency

      • maximum vocabulary size, keeping the most frequent words

      • expertise / requirements (e.g.: no swear words)
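A sketch combining these points, with an assumed frequency threshold and toy tokens: build a closed vocabulary by minimum frequency, replace out-of-vocabulary words with <UNK>, and sum log probabilities instead of multiplying raw probabilities:

```python
import math
from collections import Counter

tokens = ["i", "like", "tea", "i", "like", "coffee", "you", "drink", "tea"]

min_freq = 2  # assumed threshold
vocab = {w for w, c in Counter(tokens).items() if c >= min_freq}

def close_vocabulary(sentence, vocab):
    # Replace out-of-vocabulary words with <UNK>
    return [w if w in vocab else "<UNK>" for w in sentence]

print(close_vocabulary(["i", "drink", "tea"], vocab))  # ['i', '<UNK>', 'tea']

# Work in log space: a product of many small probabilities underflows, a sum does not.
probs = [0.1, 0.02, 0.05]
log_prob = sum(math.log(p) for p in probs)  # same as log(0.1 * 0.02 * 0.05)
```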

Smoothing

  • Add-one smoothing (Laplacian smoothing): when calculating the probability, add 1 to the numerator count and add |V| (the vocabulary size) to the denominator

  • Add-k smoothing: add k to the numerator and k * |V| to the denominator (see the sketch after this list)

  • More advanced: Kneser-Ney smoothing, Good-Turing smoothing
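A minimal sketch of the add-k estimate (counts and vocabulary size are passed in as plain numbers here); setting k = 1 recovers Laplacian smoothing:

```python
def add_k_prob(ngram_count, context_count, vocab_size, k=1.0):
    # (Count(w_1..w_n) + k) / (Count(w_1..w_{n-1}) + k * |V|)
    return (ngram_count + k) / (context_count + k * vocab_size)

# An unseen N-gram still receives a small non-zero probability:
print(add_k_prob(ngram_count=0, context_count=10, vocab_size=1000, k=1.0))  # ≈ 0.00099
```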

Backoff

If an N-gram is missing, back off to the (N-1)-gram, then the (N-2)-gram, and so on, until a non-zero probability is reached.

  • "stupid" backoff: multiply the probability by a constant to discount

Interpolation

Define the N-gram probability as a weighted sum of the probabilities from all orders of N-gram, e.g.:

\hat{p}(w_3|w_1,w_2) = \lambda_1 \, p(w_3|w_1,w_2) + \lambda_2 \, p(w_3|w_2) + \lambda_3 \, p(w_3)

Note: all the lambdas should sum to 1.
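A sketch of the interpolation above with illustrative probability estimates and lambda values (any lambdas that sum to 1 would work):

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.7, 0.2, 0.1)):
    # Weighted sum of the trigram, bigram, and unigram estimates; lambdas sum to 1.
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Illustrative estimates for p(w3|w1,w2), p(w3|w2), p(w3):
print(interpolated_prob(0.5, 0.4, 0.1))  # 0.7*0.5 + 0.2*0.4 + 0.1*0.1 = 0.44
```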

Language Model Evaluation - Perplexity

The train/validation/test split is constructed either by splitting continuous text (e.g., articles) or by sampling random short sequences (e.g., parts of sentences).

Perplexity is defined as the probability of all test sentences (multiplied together) raised to the power of -1/m, where m is the number of words (not including start tokens, but including end tokens):

perplexity = p(s_1, ..., s_m)^{-1/m}

A good model has low perplexity, e.g., roughly 20-60 for English text (about 4.3-5.9 in log perplexity).
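A sketch of computing perplexity in log space (the log-probability helper is assumed, not a course function); exponentiating the negative average log probability is equivalent to the formula above:

```python
import math

def perplexity(log_prob_fn, test_sentences):
    # log_prob_fn(sentence) returns the log probability of one padded sentence
    # (assumed helper); m counts words including </s> but not <s>.
    total_log_prob = 0.0
    m = 0
    for sent in test_sentences:
        total_log_prob += log_prob_fn(sent)
        m += sum(1 for w in sent if w != "<s>")
    # perplexity = p(s_1, ..., s_m)^{-1/m} = exp(-total_log_prob / m)
    return math.exp(-total_log_prob / m)

def dummy_log_prob(sent):
    # Dummy model: p = 0.1 for every word (illustration only)
    return sum(math.log(0.1) for w in sent if w != "<s>")

print(perplexity(dummy_log_prob, [["<s>", "i", "like", "tea", "</s>"]]))  # 10.0
```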

Completed Notebook

Autocomplete

  • Calculating N-gram probabilities
