NLP-C2-W1: Autocorrect

https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/1

Learning theme

  • Probabilistic models

Autocorrect with Minimum Edit Distance

Key Idea

  1. Identify a misspelled word

  2. Find strings nnedit distance away, normally 1 - 3

  3. Filter candidates (only restrict to correctly spelled word)

  4. Calculate wor probability p(w)=count(w)Vp(w) = \frac{count(w)}{V} where VVis the size of the corpus. A more advanced version would be to track two words co-occurring and use the word before a misspelled word to identify the correct autocorrect.

Find Minimum Edit Distance with Dynamic Programming

Given word xx (source) and yy (target), construct a matrix where the columns are alphabets of word xx and rows are alphabets of yy. Both should include a blank / starting line character. Cell i,ji,j means the minimum edit distance that goes from wx[:i]w_x[:i]to wy[:j]w_y[:j]. Notice here wx[:j]w_x[:j] refers to all the characters include ii. Therefore, the minimum edit distance problem can be decomposed into

D[i,j]=min{D[i1,j]+del_costD[i,j1]+ins_costD[i1,j1]+{rep_costwx[i]wy[j]0else \begin{equation*} D[i,j] = min \begin{cases} D[i-1,j] + del\_cost \\ D[i,j-1] + ins\_cost \\ D[i-1, j-1] + \begin{cases} rep\_cost & w_x[i] \neq w_y[j] \\ 0 & else \end{cases} \end{cases} \end{equation*}

We can look at example below transforming from "play" to "stay" with insert and delete cost 1 and replace cost of 2.

#stay#01234p12345l23456a34545y45654\begin{align*} \begin{matrix} & \# & s & t & a & y \\ \# & 0 & 1 & 2 & 3 & 4 \\ p & 1 & 2 & 3 & 4 & 5 \\ l & 2 & 3 & 4 & 5 & 6 \\ a & 3 & 4 & 5 & 4 & 5 \\ y & 4 & 5 & 6 & 5 & 4 \\ \end{matrix} \end{align*}

We can pick cell (a,s) as an example, which means we want to know the edit distance from "pla" to "s":

  1. If we are at cell (l, s), we know the edit distance from "pl" to "s". So we can do a deletion to go from "pla" to "pl", then use the knowledge to go from "pl" to "s"/

  2. If we are at cell (a, #), we know the edit distance from "pla" to "#". So we can do an insertion of "s" after going from "pla" to "#"

  3. If we are at cell (l, #), we know the edit distance from "pl" to "#". Since the target is "pla" to "s", and "a" is not same as "s", we have to do a replacement. Imagine if we are going from "pls" to "s", then it would be the same cost as going from "pl" to "#".

Backtracing

As we construct the table, we need to keep a pointer in each cell on how I get to the cell.

Completed Notebook

Autocorrection

  • Dynamic Programming

  • String Edit (deletion, insertion, replacement) as List Comprehension (very interesting)

  • Using Counter object

Last updated