NLP-C1-W2: Sentiment Analysis with Naive Bayes
https://www.coursera.org/learn/classification-vector-spaces-in-nlp/home/week/2
Learning Theme
Feature extraction from a document
Confidence ellipse for visualizing Naive Bayes
Negative and Positive Frequency Representation (W1 material)
Features: For each word, identify its positive and negative frequency by counting how many times it appears in positive and in negative examples. The final feature for a document is a vector of length three: a bias unit (normally 1), the sum of the deduped words’ positive frequencies, and the sum of their negative frequencies.
Algorithm: Logistic regression on these features to classify the document as positive or negative sentiment (a sketch follows below).
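A minimal sketch of this feature extraction; the `freqs` dictionary mapping (word, label) pairs to counts and the label convention 1 = positive / 0 = negative are assumptions for illustration, not the assignment's exact API:

```python
import numpy as np

def extract_features(tokens, freqs):
    """Map a tokenized tweet to [bias, positive-frequency sum, negative-frequency sum]."""
    x = np.zeros(3)
    x[0] = 1.0  # bias unit
    for word in set(tokens):  # dedupe words before summing
        x[1] += freqs.get((word, 1), 0)  # count of word across positive tweets (assumed key format)
        x[2] += freqs.get((word, 0), 0)  # count of word across negative tweets
    return x
```

These three-dimensional vectors are then fed to logistic regression as usual.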
Naive Bayes Frequency Ratio Representation
Features: For each word, derive its conditional positive and negative frequencies and take their ratio.
Algorithm: Naive Bayes
Laplacian Smoothing: The features above suffer from multiplying or dividing by 0 whenever a word appears in only one class. Laplacian smoothing replaces the conditional probability for word i with the following definition.
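The standard Laplacian (add-one) smoothed estimate, with $N_{\text{class}}$ the total word count in the class and $V$ the vocabulary size, is:

$$P(w_i \mid \text{class}) = \frac{\text{freq}(w_i, \text{class}) + 1}{N_{\text{class}} + V}$$

Adding 1 to the numerator guarantees no probability is exactly 0; adding $V$ to the denominator keeps the per-class probabilities summing to 1.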
Log-Sum Trick: The direct likelihood is generally a very small number, and multiplying many such values risks underflow. Taking the log transforms the product into a sum, which is numerically safer.
Notice that the decision threshold becomes 0 instead of 1: a product of ratios greater than 1 corresponds to a sum of logs greater than log(1) = 0.
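A minimal end-to-end sketch of training and prediction under these definitions; the `freqs` dictionary and the function names are illustrative, carried over from the sketch above:

```python
import math

def train_naive_bayes(freqs, n_pos_docs, n_neg_docs):
    """Compute the log prior and smoothed per-word log likelihood ratios."""
    vocab = {word for word, _ in freqs}
    n_pos = sum(c for (w, lbl), c in freqs.items() if lbl == 1)  # total positive word count
    n_neg = sum(c for (w, lbl), c in freqs.items() if lbl == 0)  # total negative word count
    logprior = math.log(n_pos_docs) - math.log(n_neg_docs)
    loglikelihood = {}
    for word in vocab:
        p_pos = (freqs.get((word, 1), 0) + 1) / (n_pos + len(vocab))  # smoothed P(w|pos)
        p_neg = (freqs.get((word, 0), 0) + 1) / (n_neg + len(vocab))  # smoothed P(w|neg)
        loglikelihood[word] = math.log(p_pos / p_neg)
    return logprior, loglikelihood

def predict(tokens, logprior, loglikelihood):
    """Positive iff the summed log score exceeds 0 (the log of the old threshold, 1)."""
    score = logprior + sum(loglikelihood.get(w, 0.0) for w in tokens)  # unseen words contribute 0
    return score > 0
```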
Confidence Ellipse
A confidence ellipse is a 2-D generalization of a confidence interval. It draws an ellipse around a set of points so as to capture a given percentage of them. It can be drawn following the matplotlib confidence-ellipse tutorial or with R packages (a sketch follows).
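A sketch adapted from the matplotlib confidence-ellipse tutorial; the synthetic data standing in for the per-tweet (positive, negative) log-likelihood features is purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse
import matplotlib.transforms as transforms

def confidence_ellipse(x, y, ax, n_std=2.0, **kwargs):
    """Draw an n_std-sigma covariance ellipse of x vs. y on ax."""
    cov = np.cov(x, y)
    pearson = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    # Unit ellipse whose axes reflect the Pearson correlation, rotated 45 degrees
    rx, ry = np.sqrt(1 + pearson), np.sqrt(1 - pearson)
    ellipse = Ellipse((0, 0), width=2 * rx, height=2 * ry, facecolor='none', **kwargs)
    # Scale by n_std standard deviations along each axis, then center on the data mean
    transf = (transforms.Affine2D()
              .rotate_deg(45)
              .scale(np.sqrt(cov[0, 0]) * n_std, np.sqrt(cov[1, 1]) * n_std)
              .translate(np.mean(x), np.mean(y)))
    ellipse.set_transform(transf + ax.transData)
    return ax.add_patch(ellipse)

rng = np.random.default_rng(0)  # illustrative correlated 2-D points
pts = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 2.0]], size=500)
fig, ax = plt.subplots()
ax.scatter(pts[:, 0], pts[:, 1], s=4)
confidence_ellipse(pts[:, 0], pts[:, 1], ax, n_std=2.0, edgecolor='red')
plt.show()
```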
Completed Notebook
Logistic Regression Sentiment Analysis Notebook
Stemming, Removing Stopwords
Manual Implementation of Logistic Regression and Loss Derivation
Naive Bayes Sentiment Analysis Notebook