NLP-C2-W2: PoS Tagging and HMM

https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/2

Document Processing

Dealing with Unknown Words When Processing a Document

We can replace unknown words with different unknown tokens that reflect the word's surface features, e.g. "--unk_digit--", "--unk_punct--", etc.

Notebook practice
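
A minimal sketch of how such a replacement might look. The token names follow the note above; the rules and the `assign_unk` name are illustrative, not the notebook's exact code:

import string

def assign_unk(word):
    """Map an out-of-vocabulary word to a coarse unknown token
    based on its surface features (illustrative rules)."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in string.punctuation for ch in word):
        return "--unk_punct--"
    if any(ch.isupper() for ch in word):
        return "--unk_upper--"
    return "--unk--"

vocab = {"the", "cat", "sat"}
tokens = ["the", "cat", "sat", "on", "3rd", "mat!"]
processed = [w if w in vocab else assign_unk(w) for w in tokens]
# ['the', 'cat', 'sat', '--unk--', '--unk_digit--', '--unk_punct--']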

PoS Transition HMM

Setup:

Hidden nodes: part-of-speech tags, e.g. verb, noun

Observable nodes: actual words, e.g. "like", "use"
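
As a concrete picture of this setup (a toy example): the tag sequence is what the model must recover, while the word sequence is what it actually sees.

# The hidden tag sequence emits the observed word sequence
hidden_tags    = ["NN", "VBZ", "IN", "DT", "NN"]
observed_words = ["time", "flies", "like", "an", "arrow"]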

Smoothing in Calculating Transition Probabilities

Original transition probability ($t_i$ is the tag at position $i$):

\begin{align*} P(t_i|t_{i-1}) = \frac{Count(t_{i-1}, t_i)}{\sum_{j=1}^N Count(t_{i-1}, t_j)} \end{align*}

We can add smoothing to handle zero counts, which cause 1) a division-by-zero problem when a tag never appears as $t_{i-1}$ in the corpus, and 2) zero probabilities, which generalize poorly to unseen tag sequences. So we calculate the transition probability as follows:

\begin{align*} P(t_i|t_{i-1}) = \frac{Count(t_{i-1}, t_i) + \epsilon}{\sum_{j=1}^N Count(t_{i-1}, t_j) + N \cdot \epsilon} \end{align*}
  • $N$ is the total number of tags
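
A minimal sketch of this smoothed transition computation, assuming the transition counts are already collected in a nested dictionary (the counts layout and function name are illustrative, not the notebook's code):

def transition_prob(counts, prev_tag, tag, tags, epsilon=0.001):
    """P(tag | prev_tag) with add-epsilon smoothing.
    counts[prev][cur] holds Count(prev, cur); tags is the full tag set."""
    N = len(tags)
    numer = counts.get(prev_tag, {}).get(tag, 0) + epsilon
    denom = sum(counts.get(prev_tag, {}).get(t, 0) for t in tags) + N * epsilon
    return numer / denom

counts = {"NN": {"VB": 8, "NN": 2}}
tags = ["NN", "VB", "IN"]
transition_prob(counts, "NN", "IN", tags)  # ~1e-4: small but nonzero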

Smoothing in Calculating Emission Probabilities

Following the same principle, we can calculate emission probabilities as

\begin{align*} P(w_i|t_i) = \frac{Count(t_i, w_i) + \epsilon}{\sum_{j=1}^V Count(t_i, w_j) + V \cdot \epsilon} \end{align*}
  • $V$ is the total number of words in the vocabulary
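
The emission version follows the same pattern, with the vocabulary taking the place of the tag set (again a sketch with illustrative names; words are assumed pre-processed with the unknown tokens above):

def emission_prob(counts, tag, word, vocab, epsilon=0.001):
    """P(word | tag) with add-epsilon smoothing over a vocabulary of size V."""
    V = len(vocab)
    numer = counts.get(tag, {}).get(word, 0) + epsilon
    denom = sum(counts.get(tag, {}).get(w, 0) for w in vocab) + V * epsilon
    return numer / denom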

Deep Dive into Hidden Markov Models

Code

Counter

from collections import Counter

# Count character frequencies and return the three most common
Counter('abracadabra').most_common(3)
# [('a', 5), ('b', 2), ('r', 2)]

Completed Notebook

Part of Speech Tagging

  • Clear separation between pre-processing and the actual modeling

  • Viterbi algorithm implementation (see the sketch below)
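
For reference, a minimal log-space Viterbi sketch under the HMM setup above. The matrix layout and all names are assumptions, not the notebook's exact code:

import math

def viterbi(words, tags, trans_p, emit_p, init_p):
    """Most probable tag sequence for `words`.
    trans_p[s][t] = P(t|s), emit_p[t][w] = P(w|t), init_p[t] = P(t at start).
    All probabilities are assumed pre-smoothed, so every log below is finite."""
    # Initialization: best score for each tag at position 0, no back-pointer
    best = [{t: (math.log(init_p[t]) + math.log(emit_p[t][words[0]]), None)
             for t in tags}]
    # Forward pass: extend the best path ending in each tag
    for w in words[1:]:
        layer = {}
        for t in tags:
            prev, score = max(
                ((s, best[-1][s][0] + math.log(trans_p[s][t])) for s in tags),
                key=lambda x: x[1])
            layer[t] = (score + math.log(emit_p[t][w]), prev)
        best.append(layer)
    # Backward pass: follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for layer in reversed(best[1:]):
        path.append(layer[path[-1]][1])
    return list(reversed(path))

Working in log space avoids numeric underflow on long sentences, and the smoothing above guarantees no zero probabilities, so every log is finite.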
