Negative and Positive Frequency Representation (W1 material)
Features: For each word, identify its positive and negative frequency by counting how many times it appears in positive and negative documents. The final feature vector for each document has length three: a bias unit (normally 1), the sum of the positive frequencies of its deduplicated words, and the sum of their negative frequencies.
Algorithm: Logistic regression on the features to classify as positive or negative sentiment.
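A minimal sketch of this pipeline, assuming `tweets` is a list of pre-tokenized documents, `labels` is a 0/1 array, and scikit-learn is available (`build_freqs` and `extract_features` are illustrative helper names, not from the original notes):

```python
from collections import defaultdict

import numpy as np
from sklearn.linear_model import LogisticRegression

def build_freqs(tweets, labels):
    """Count how often each word appears in positive (1) and negative (0) documents."""
    freqs = defaultdict(int)
    for tokens, y in zip(tweets, labels):
        for w in tokens:
            freqs[(w, y)] += 1
    return freqs

def extract_features(tokens, freqs):
    """Return [bias, sum of positive counts, sum of negative counts] over deduplicated words."""
    words = set(tokens)
    pos = sum(freqs.get((w, 1), 0) for w in words)
    neg = sum(freqs.get((w, 0), 0) for w in words)
    return np.array([1.0, pos, neg])

# freqs = build_freqs(train_tweets, train_labels)
# X = np.vstack([extract_features(t, freqs) for t in train_tweets])
# clf = LogisticRegression().fit(X, train_labels)
```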
Naive Bayes Frequency Ratio Representation
Features: For each word, derive its conditional probability given the positive and negative classes:
$$p(w_i \mid \text{pos}) = \frac{\text{freq}(w_i, \text{pos})}{\text{freq}(\text{pos})}$$

where $\text{freq}(w_i, \text{pos})$ is the total count of word $i$ in all positive text and $\text{freq}(\text{pos})$ is the total count of all words in all positive text.
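A direct translation of this definition, reusing the `(word, class)` counts from `build_freqs` above (a sketch; `word_prob` is an illustrative name):

```python
def word_prob(word, cls, freqs):
    """Unsmoothed p(word | class): count of the word in that class over all words in that class."""
    class_total = sum(count for (_w, y), count in freqs.items() if y == cls)  # freq(pos) or freq(neg)
    return freqs.get((word, cls), 0) / class_total
```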
Algorithm: Naive Bayes
$$I\left(\frac{p(\text{pos})}{p(\text{neg})} \prod_{i=1}^{n} \frac{p(w_i \mid \text{pos})}{p(w_i \mid \text{neg})} \geq 1\right)$$

where $\frac{p(\text{pos})}{p(\text{neg})}$ is the ratio between the number of positive and negative documents (the prior ratio).
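A sketch of this decision rule using the helpers above (assuming `n_pos_docs` and `n_neg_docs` are the document counts per class):

```python
def predict_naive_bayes(tokens, freqs, n_pos_docs, n_neg_docs):
    """Return 1 (positive) when the prior ratio times the product of likelihood ratios is >= 1."""
    score = n_pos_docs / n_neg_docs  # p(pos) / p(neg)
    for w in tokens:
        # word_prob can return 0 for a word seen in only one class,
        # which breaks the ratio; smoothing (below) fixes this.
        score *= word_prob(w, 1, freqs) / word_prob(w, 0, freqs)
    return int(score >= 1)
```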
Laplacian Smoothing: The features above suffer from multiplying or dividing by 0 when a word shows up in only one class. Laplacian smoothing replaces the conditional probability for word $i$ with the following definition.
$$p(w_i \mid \text{pos}) = \frac{\text{freq}(w_i, \text{pos}) + 1}{N_{\text{pos}} + V}$$

where $N_{\text{pos}}$ is the sum of word frequencies in the positive class and $V$ is the number of unique words in the entire corpus (both classes).
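The smoothed counterpart of `word_prob` from above (a sketch; it recomputes $V$ and $N$ on every call for clarity):

```python
def smoothed_word_prob(word, cls, freqs):
    """Laplacian-smoothed p(word | class): add 1 to the count and V to the denominator."""
    vocab_size = len({w for (w, _y) in freqs})                          # V: unique words across both classes
    n_cls = sum(count for (_w, y), count in freqs.items() if y == cls)  # N_pos or N_neg
    return (freqs.get((word, cls), 0) + 1) / (n_cls + vocab_size)
```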
Log-Sum Trick: The direct likelihood is a product of many small probabilities, so computing it risks numerical underflow. Taking the logarithm transforms the product into a sum, which is safer to compute.
Notice the threshold for the indicator function becomes 0 instead of 1.
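Concretely, the decision rule above becomes a sum of log ratios compared against 0:

$$I\left(\log\frac{p(\text{pos})}{p(\text{neg})} + \sum_{i=1}^{n} \log\frac{p(w_i \mid \text{pos})}{p(w_i \mid \text{neg})} \geq 0\right)$$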
Confidence Ellipse
A confidence ellipse is a 2-D generalization of a confidence interval. It draws an ellipse around points to capture a certain percentage of them. It can be drawn by following the matplotlib confidence-ellipse tutorial or with R packages.
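A minimal matplotlib sketch (not a built-in matplotlib function; the official gallery tutorial has a more careful version): compute the covariance of the points, take its eigendecomposition, and draw an ellipse whose axes span `n_std` standard deviations.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    """Draw an ellipse covering roughly n_std standard deviations of the 2-D points (x, y)."""
    cov = np.cov(x, y)
    vals, vecs = np.linalg.eigh(cov)                    # eigenvalues in ascending order
    angle = np.degrees(np.arctan2(*vecs[:, 1][::-1]))   # orientation of the major axis
    width, height = 2 * n_std * np.sqrt(vals[::-1])     # major/minor axis lengths
    ellipse = Ellipse((np.mean(x), np.mean(y)), width, height,
                      angle=angle, fill=False, **kwargs)
    return ax.add_patch(ellipse)

# x, y = np.random.multivariate_normal([0, 0], [[2, 1], [1, 2]], 500).T
# fig, ax = plt.subplots()
# ax.scatter(x, y, s=5)
# confidence_ellipse(x, y, ax, n_std=2.0, edgecolor='red')
# plt.show()
```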