NLP-C2-W3: Autocomplete and Language Models
https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/3
N-gram and Sequence Probabilities
N-gram probability is defined as a conditional probability: P(w_N | w_1, ..., w_{N-1}) = C(w_1 ... w_N) / C(w_1 ... w_{N-1})
Subsequently, we can define the sequence probability under a bigram model as P(w_1 ... w_n) ≈ P(w_1) * P(w_2 | w_1) * ... * P(w_n | w_{n-1}); N-gram models follow the same structure, conditioning each word on the previous N-1 words
In general, prepend start tokens <s> until the first word has enough context (N-1 start tokens for an N-gram model) and append 1 end token </s>. This fixes:
Getting the correct count for N-gram probability calculation
Making the probabilities of all possible sentences of all possible lengths sum to 1, instead of having the probabilities sum to 1 separately for each fixed length (this enables generalization across sentence lengths); see the padding sketch after this list
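A minimal sketch of both ideas: bigram probabilities estimated from counts, plus <s>/</s> padding when scoring a sentence. The toy corpus and function names are illustrative, not the assignment's starter code.

```python
from collections import Counter

corpus = [["i", "like", "nlp"], ["i", "like", "cats"]]

# Pad each sentence: N-1 start tokens (one for a bigram model) and one end token.
padded = [["<s>"] + sent + ["</s>"] for sent in corpus]

bigram_counts = Counter()
prefix_counts = Counter()
for sent in padded:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        prefix_counts[prev] += 1

def bigram_prob(prev, word):
    # P(word | prev) = C(prev word) / C(prev)
    return bigram_counts[(prev, word)] / prefix_counts[prev]

def sentence_prob(sentence):
    # Chain of bigram probabilities over the padded sentence.
    sent = ["<s>"] + sentence + ["</s>"]
    p = 1.0
    for prev, word in zip(sent, sent[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob(["i", "like", "nlp"]))  # 0.5 on this toy corpus
```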
Operationalization
Count Matrix:
row: unique corpus (N-1)-grams
columns: unique corpus words
cell value: the count of N-gram (row, column)
Probability matrix
Divide each cell by its row sum
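A sketch of both matrices under the conventions above, with a dict-of-dicts standing in for a real matrix; the function name and toy data are mine, not the course's.

```python
from collections import defaultdict

def build_matrices(sentences, n=2):
    # Count matrix: rows are (N-1)-gram prefixes, columns are words.
    count_matrix = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(padded) - n + 1):
            prefix = tuple(padded[i:i + n - 1])   # the (N-1)-gram (row)
            word = padded[i + n - 1]              # the last word (column)
            count_matrix[prefix][word] += 1

    # Probability matrix: divide each cell by its row sum.
    prob_matrix = {}
    for prefix, row in count_matrix.items():
        row_sum = sum(row.values())
        prob_matrix[prefix] = {w: c / row_sum for w, c in row.items()}
    return count_matrix, prob_matrix

counts, probs = build_matrices([["i", "like", "nlp"], ["i", "like", "cats"]])
print(probs[("like",)])  # {'nlp': 0.5, 'cats': 0.5}
```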
Other considerations
use log probabilities to convert multiplication into summation (see the sketch after this list)
use <UNK> to mark out-of-vocabulary words
the vocabulary can be constructed through:
min word frequency
max vocab size determined by frequency
expertise / requirements (e.g.: no swear words)
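A minimal sketch of the log-sum and <UNK> tricks; the min-frequency threshold, helper names, and the bigram_prob callable are assumptions, not the course's API.

```python
import math

def build_vocab(word_counts, min_freq=2):
    # Keep only words that appear at least min_freq times in the training corpus.
    return {w for w, c in word_counts.items() if c >= min_freq}

def replace_oov(sentence, vocab):
    # Map every out-of-vocabulary word to the <UNK> token.
    return [w if w in vocab else "<UNK>" for w in sentence]

def log_sentence_prob(sentence, bigram_prob, vocab):
    # Summing log probabilities avoids numeric underflow from multiplying many
    # small numbers. Assumes bigram_prob was trained on <UNK>-replaced, smoothed
    # counts so every probability is non-zero.
    sent = ["<s>"] + replace_oov(sentence, vocab) + ["</s>"]
    return sum(math.log(bigram_prob(prev, word)) for prev, word in zip(sent, sent[1:]))
```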
Smoothing
Add-one smoothing (Laplacian smoothing): when calculating the probability, add 1 to the numerator count and add the vocabulary size V to the denominator count
Add-k smoothing: add k to the numerator and k * V to the denominator (see the sketch below)
More advanced: Kneser-Ney smoothing, Good-Turing smoothing
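A sketch of add-k smoothing for a bigram model, assuming count dictionaries like the ones built earlier; k = 1 recovers add-one (Laplacian) smoothing.

```python
def smoothed_bigram_prob(prev, word, bigram_counts, prefix_counts, vocab_size, k=1.0):
    # (C(prev word) + k) / (C(prev) + k * V): unseen bigrams now get a small,
    # non-zero probability instead of zero.
    return (bigram_counts.get((prev, word), 0) + k) / (prefix_counts.get(prev, 0) + k * vocab_size)
```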
Backoff
If an N-gram is missing, back off to the (N-1)-gram, then the (N-2)-gram, and so on, until a non-zero probability is reached.
"stupid" backoff: multiply the probability by a constant to discount
Interpolation
Define an N-gram probability as a weighted sum of the probabilities of all N-gram orders, e.g. for a trigram model: P_hat(w_n | w_{n-2} w_{n-1}) = lambda_1 * P(w_n | w_{n-2} w_{n-1}) + lambda_2 * P(w_n | w_{n-1}) + lambda_3 * P(w_n)
note: all lambdas should sum to 1 (see the sketch below)
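A minimal sketch of linear interpolation for a trigram model; the dict-based probability tables and the specific lambda values are assumptions.

```python
def interpolated_prob(w1, w2, w3, trigram_p, bigram_p, unigram_p,
                      lambdas=(0.7, 0.2, 0.1)):
    # Weighted sum of trigram, bigram, and unigram estimates; the lambdas must sum to 1.
    l1, l2, l3 = lambdas
    return (l1 * trigram_p.get((w1, w2, w3), 0.0)
            + l2 * bigram_p.get((w2, w3), 0.0)
            + l3 * unigram_p.get(w3, 0.0))
```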
Language Model Evaluation - Perplexity
Train-validation-test splits are constructed either by splitting continuous text (e.g.: articles) or by sampling random short sequences (e.g.: parts of sentences).
Perplexity is defined as the probability of the test set (all sentences multiplied together) raised to the power -1/m: PP(W) = P(s_1, s_2, ..., s_m)^(-1/m), where m is the total number of words in the test set (not including start tokens, but including end tokens)
A good model has low perplexity: roughly 20-60 for English text, or 5.3-5.9 in log perplexity.
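A sketch of perplexity for a bigram model, computed in log space to avoid underflow; the bigram_prob callable is an assumption standing in for any trained, smoothed model.

```python
import math

def perplexity(sentences, bigram_prob):
    log_prob_sum = 0.0
    m = 0
    for sentence in sentences:
        sent = ["<s>"] + sentence + ["</s>"]
        m += len(sent) - 1  # count </s> but exclude <s> from the word count
        for prev, word in zip(sent, sent[1:]):
            log_prob_sum += math.log(bigram_prob(prev, word))
    # exp(-(1/m) * log P(test set)) = P(test set)^(-1/m)
    return math.exp(-log_prob_sum / m)
```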
Completed Notebook
Calculating N-gram probabilities