NLP-C2-W3: Autocomplete and Language Models

https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/3

N-gram and Sequence Probabilities

The probability of the N-gram (w_1, ..., w_n) is defined as the conditional probability

p(w_n | w_1, ..., w_{n-1}) = \frac{Count(w_1, ..., w_{n-1}, w_n)}{Count(w_1, ..., w_{n-1})}

Subsequently, we can approximate the sequence probability with bigrams as follows (higher-order N-grams follow the same structure):

p(w_1, ..., w_n) = p(w_1) \, p(w_2|w_1) \cdots p(w_n|w_{n-1})
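As a minimal sketch (the toy corpus and helper names below are illustrative, not from the course), bigram probabilities can be estimated directly from counts and chained together to score a sentence:

```python
from collections import Counter

# Toy tokenized corpus (illustrative)
corpus = [["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)

def bigram_prob(prev, word):
    # p(word | prev) = Count(prev, word) / Count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    # Chain rule with the bigram approximation:
    # p(w_1,...,w_n) = p(w_1) * p(w_2|w_1) * ... * p(w_n|w_{n-1})
    p = unigram_counts[sentence[0]] / sum(unigram_counts.values())
    for prev, word in zip(sentence, sentence[1:]):
        p *= bigram_prob(prev, word)
    return p

print(bigram_prob("i", "like"))              # 1.0
print(sentence_prob(["i", "like", "tea"]))   # (2/9) * 1.0 * (2/3) ≈ 0.148
```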

In general, prepend N-1 start tokens <s> (so the first word has a full-length context) and append one end token </s>. This fixes:

  1. Getting the correct count for N-gram probability calculation

  2. Making the probabilities of all possible sentences of all possible lengths sum to 1, rather than having the probabilities sum to 1 separately for each fixed sentence length (this enables generalization across lengths)
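For example, a trigram model (N = 3) pads each sentence with two start tokens and one end token; a small sketch:

```python
def pad_sentence(tokens, n):
    # N-1 start tokens give the first word a full-length context;
    # a single end token lets the model assign probability to sentence endings.
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

print(pad_sentence(["the", "cat", "sat"], 3))
# ['<s>', '<s>', 'the', 'cat', 'sat', '</s>']
```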

Operationalization

Count Matrix:

  • rows: unique corpus (N-1)-grams

  • columns: unique corpus words

  • cell value: the count of N-gram (row, column)

Probability Matrix:

  • Divide each cell by its row sum
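A sketch of building both matrices for a bigram model with pandas (the toy sentences and variable names are illustrative):

```python
from collections import defaultdict
import pandas as pd

sentences = [["<s>", "i", "like", "tea", "</s>"],
             ["<s>", "i", "like", "coffee", "</s>"]]

counts = defaultdict(lambda: defaultdict(int))
for sent in sentences:
    for prev, word in zip(sent, sent[1:]):
        counts[prev][word] += 1          # row: (N-1)-gram, column: next word

count_matrix = pd.DataFrame(counts).T.fillna(0)                   # rows = (N-1)-grams
prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)  # divide by row sums

print(prob_matrix.loc["like"])  # p(tea|like) = 0.5, p(coffee|like) = 0.5
```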

Other considerations

  • use log probabilities to convert multiplication into summation (this avoids numerical underflow)

  • use <UNK> to mark out-of-vocabulary words (see the sketch after this list)

    • the vocabulary can be constructed through

      • minimum word frequency

      • maximum vocabulary size, keeping the most frequent words

      • expertise / requirements (e.g.: no swear words)
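A sketch combining these points, with an assumed frequency threshold and toy tokens: build a closed vocabulary by minimum frequency, replace out-of-vocabulary words with <UNK>, and sum log probabilities instead of multiplying raw probabilities:

```python
import math
from collections import Counter

tokens = ["i", "like", "tea", "i", "like", "coffee", "you", "drink", "tea"]

min_freq = 2  # assumed threshold
vocab = {w for w, c in Counter(tokens).items() if c >= min_freq}

def close_vocabulary(sentence, vocab):
    # Replace out-of-vocabulary words with <UNK>
    return [w if w in vocab else "<UNK>" for w in sentence]

print(close_vocabulary(["i", "drink", "tea"], vocab))  # ['i', '<UNK>', 'tea']

# Work in log space: a product of many small probabilities underflows, a sum does not.
probs = [0.1, 0.02, 0.05]
log_prob = sum(math.log(p) for p in probs)  # same as log(0.1 * 0.02 * 0.05)
```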

Smoothing

  • Add-one smoothing (Laplacian smoothing): when calculating the probability, add 1 to the numerator count and add |V| (the vocabulary size) to the denominator

  • Add-k smoothing: add k to the numerator and k * |V| to the denominator (see the sketch after this list)

  • More advanced: Kneser-Ney smoothing, Good-Turing smoothing
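A minimal sketch of the add-k estimate (counts and vocabulary size are passed in as plain numbers here); setting k = 1 recovers Laplacian smoothing:

```python
def add_k_prob(ngram_count, context_count, vocab_size, k=1.0):
    # (Count(w_1..w_n) + k) / (Count(w_1..w_{n-1}) + k * |V|)
    return (ngram_count + k) / (context_count + k * vocab_size)

# An unseen N-gram still receives a small non-zero probability:
print(add_k_prob(ngram_count=0, context_count=10, vocab_size=1000, k=1.0))  # ≈ 0.00099
```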

Backoff

If an N-gram is missing, back off to the (N-1)-gram, then the (N-2)-gram, and so on, until a non-zero probability is reached.

  • "stupid" backoff: multiply the probability by a constant to discount

Interpolation

Define the N-gram probability as a weighted sum of the probabilities from all orders of N-gram, e.g.:

\hat{p}(w_3|w_1,w_2) = \lambda_1 \, p(w_3|w_1,w_2) + \lambda_2 \, p(w_3|w_2) + \lambda_3 \, p(w_3)

Note: all the lambdas should sum to 1.
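A sketch of the interpolation above with illustrative probability estimates and lambda values (any lambdas that sum to 1 would work):

```python
def interpolated_prob(p_tri, p_bi, p_uni, lambdas=(0.7, 0.2, 0.1)):
    # Weighted sum of the trigram, bigram, and unigram estimates; lambdas sum to 1.
    l1, l2, l3 = lambdas
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

# Illustrative estimates for p(w3|w1,w2), p(w3|w2), p(w3):
print(interpolated_prob(0.5, 0.4, 0.1))  # 0.7*0.5 + 0.2*0.4 + 0.1*0.1 = 0.44
```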

Language Model Evaluation - Perplexity

The train/validation/test split is constructed either by splitting continuous text (e.g., articles) or by sampling random short sequences (e.g., parts of sentences).

Perplexity is defined as the probability of all test sentences (multiplied together) raised to the power of -1/m, where m is the number of words (not including start tokens, but including end tokens):

perplexity = p(s_1, ..., s_m)^{-1/m}

A good model has low perplexity, e.g., roughly 20-60 for English text (about 4.3-5.9 in log perplexity).
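A sketch of computing perplexity in log space (the log-probability helper is assumed, not a course function); exponentiating the negative average log probability is equivalent to the formula above:

```python
import math

def perplexity(log_prob_fn, test_sentences):
    # log_prob_fn(sentence) returns the log probability of one padded sentence
    # (assumed helper); m counts words including </s> but not <s>.
    total_log_prob = 0.0
    m = 0
    for sent in test_sentences:
        total_log_prob += log_prob_fn(sent)
        m += sum(1 for w in sent if w != "<s>")
    # perplexity = p(s_1, ..., s_m)^{-1/m} = exp(-total_log_prob / m)
    return math.exp(-total_log_prob / m)

def dummy_log_prob(sent):
    # Dummy model: p = 0.1 for every word (illustration only)
    return sum(math.log(0.1) for w in sent if w != "<s>")

print(perplexity(dummy_log_prob, [["<s>", "i", "like", "tea", "</s>"]]))  # 10.0
```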

Completed Notebook

Autocomplete

  • Calculating N-gram probabilities
