NLP-C2-W2: PoS Tagging and HMM
https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/2
Document Processing
Dealing with Unknown Words When Processing a Document
We can replace unknown words with different unknown tokens, such as "--unk_digit--" and "--unk_punct--", chosen according to the word's surface features. Notebook practice
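A minimal sketch of feature-based unknown-token assignment; the rules and any token names beyond "--unk_digit--" and "--unk_punct--" are illustrative assumptions, not the notebook's exact logic:

```python
import string

def assign_unk(word):
    """Map an out-of-vocabulary word to a feature-based unknown token."""
    if any(ch.isdigit() for ch in word):
        return "--unk_digit--"
    if any(ch in string.punctuation for ch in word):
        return "--unk_punct--"
    if word and word[0].isupper():
        return "--unk_upper--"        # hypothetical: possibly a proper noun
    if word.endswith(("ing", "ed", "ize")):
        return "--unk_verb--"         # hypothetical: common verb suffixes
    return "--unk--"

print(assign_unk("42nd"))       # --unk_digit--
print(assign_unk("tax-free"))   # --unk_punct--
```

Grouping unknown words by features like these preserves some signal for the tagger instead of collapsing every out-of-vocabulary word into a single token.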
PoS Transition HMM
Hidden nodes: parts of speech, e.g., verb, noun
Observable nodes: actual words, e.g., "like", "use"
Smoothing in Calculating Transition Probabilities
Original transition probability ($t_i$ is the tag at position $i$):
$$
\begin{align*}
P(t_i|t_{i-1}) = \frac{Count(t_{i-1}, t_i)}{\sum_{j=1}^N Count(t_{i-1}, t_j)}
\end{align*}
$$

We can add smoothing to handle zero counts, which cause 1) division-by-zero errors in the probability calculation and 2) probabilities of exactly 0, which do not generalize well. So we calculate the transition probability as follows:
$$
\begin{align*}
P(t_i|t_{i-1}) = \frac{Count(t_{i-1}, t_i) + \epsilon}{\sum_{j=1}^N Count(t_{i-1}, t_j) + N \cdot \epsilon}
\end{align*}
$$

where $N$ is the total number of tags.
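As a concrete illustration, here is a minimal sketch of the smoothed transition computation; the toy tag set, counts, and variable names (`trans_counts`, `A`) are assumptions, not the notebook's exact code:

```python
import numpy as np

tags = ["NN", "VB", "O"]                     # toy tag set, N = 3
trans_counts = {("NN", "VB"): 8, ("NN", "NN"): 2, ("VB", "NN"): 6}

N = len(tags)
epsilon = 0.001
A = np.zeros((N, N))                         # A[i, j] = P(tags[j] | tags[i])

for i, t_prev in enumerate(tags):
    row_total = sum(trans_counts.get((t_prev, t), 0) for t in tags)
    for j, t in enumerate(tags):
        count = trans_counts.get((t_prev, t), 0)
        A[i, j] = (count + epsilon) / (row_total + N * epsilon)

print(A.sum(axis=1))   # each row sums to 1, and no entry is exactly 0
```

Note that even the "O" row, which has no observed counts at all, gets a valid uniform distribution instead of a division by zero.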
Smoothing in Calculating Emission Probabilities
Following the same principle, we can calculate emission probabilities as
$$
\begin{align*}
P(w_i|t_i) = \frac{Count(t_i, w_i) + \epsilon}{\sum_{j=1}^V Count(t_i, w_j) + V \cdot \epsilon}
\end{align*}
$$

where $V$ is the total number of words in the vocabulary.
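The emission matrix can be built with the same pattern, summing over the vocabulary instead of the tag set; again, the counts and names (`emit_counts`, `B`) are toy assumptions:

```python
import numpy as np

tags = ["NN", "VB", "O"]
vocab = ["like", "use", "--unk--"]           # toy vocabulary, V = 3
emit_counts = {("VB", "like"): 5, ("VB", "use"): 3, ("NN", "use"): 1}

N, V = len(tags), len(vocab)
epsilon = 0.001
B = np.zeros((N, V))                         # B[i, k] = P(vocab[k] | tags[i])

for i, t in enumerate(tags):
    row_total = sum(emit_counts.get((t, w), 0) for w in vocab)
    for k, w in enumerate(vocab):
        B[i, k] = (emit_counts.get((t, w), 0) + epsilon) / (row_total + V * epsilon)

print(B.sum(axis=1))   # each row sums to 1
```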
Deep Dive into Hidden Markov Models
Completed Notebook
Part of Speech Tagging
Clear structure in pre-processing and actual modeling
Viterbi algorithm implementation
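Relating to the Viterbi implementation above, here is a minimal log-space decoding sketch; the function signature, variable names, and the log-space formulation are my assumptions, not necessarily the notebook's exact approach. It assumes the transition matrix `A`, emission matrix `B`, and an initial tag distribution `pi` have already been built (e.g., with the smoothing shown earlier):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """obs: word indices; pi: (N,); A: (N, N); B: (N, V). Returns best tag path."""
    N, T = A.shape[0], len(obs)
    log_prob = np.full((N, T), -np.inf)      # best log-probability per (tag, step)
    back = np.zeros((N, T), dtype=int)       # backpointers for path recovery

    # Initialization: start distribution times emission of the first word.
    log_prob[:, 0] = np.log(pi) + np.log(B[:, obs[0]])

    # Forward pass: for each step, keep the best incoming transition per tag.
    for t in range(1, T):
        for j in range(N):
            scores = log_prob[:, t - 1] + np.log(A[:, j]) + np.log(B[j, obs[t]])
            back[j, t] = np.argmax(scores)
            log_prob[j, t] = scores[back[j, t]]

    # Backward pass: walk the backpointers from the best final state.
    path = [int(np.argmax(log_prob[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(back[path[-1], t])
    return path[::-1]
```

Working in log space turns products of many small probabilities into sums, avoiding numerical underflow on longer sentences; with the smoothed matrices, no entry is exactly 0, so `np.log` never sees a zero.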