NLP-C2-W1: Autocorrect
https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/1
Learning theme
Probabilistic models
Autocorrect with Minimum Edit Distance
Key Idea
Identify a misspelled word
Find strings edit distance away, normally 1 - 3
Filter candidates (only restrict to correctly spelled word)
Calculate wor probability where is the size of the corpus. A more advanced version would be to track two words co-occurring and use the word before a misspelled word to identify the correct autocorrect.
Find Minimum Edit Distance with Dynamic Programming
Given word (source) and (target), construct a matrix where the columns are alphabets of word and rows are alphabets of . Both should include a blank / starting line character. Cell means the minimum edit distance that goes from to . Notice here refers to all the characters include . Therefore, the minimum edit distance problem can be decomposed into
We can look at example below transforming from "play" to "stay" with insert and delete cost 1 and replace cost of 2.
We can pick cell (a,s) as an example, which means we want to know the edit distance from "pla" to "s":
If we are at cell (l, s), we know the edit distance from "pl" to "s". So we can do a deletion to go from "pla" to "pl", then use the knowledge to go from "pl" to "s"/
If we are at cell (a, #), we know the edit distance from "pla" to "#". So we can do an insertion of "s" after going from "pla" to "#"
If we are at cell (l, #), we know the edit distance from "pl" to "#". Since the target is "pla" to "s", and "a" is not same as "s", we have to do a replacement. Imagine if we are going from "pls" to "s", then it would be the same cost as going from "pl" to "#".
Backtracing
As we construct the table, we need to keep a pointer in each cell on how I get to the cell.
Completed Notebook
Dynamic Programming
String Edit (deletion, insertion, replacement) as List Comprehension (very interesting)
Using Counter object
Last updated