NLP Street Fighting
Notes from Stanford Psych 290 taught by Johannes Eichstaedt
NLP Research Street Knowledge
Conferences:
Computational social science conferences:
DLATK
Lingo & Component
Type: an element of the vocabulary (e.g., "dog fighting dog" has two types)
Token: an instance of that type in text (e.g., "dog fighting dog" has three tokens)
Case folding: reduce all letters to lower case (though for some tasks, e.g. sentiment analysis, information extraction, and machine translation, preserving case can help; see the sketch after this list)
Lemmatization: Reduce inflections or variant forms to base form (e.g., am, are, is -> be)
Stemming: reduce terms to their root form (see the linked blog post about the difference from lemmatization). In reality, with enough data, the choice becomes a secondary concern
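A minimal sketch of the token/type distinction and case folding (plain Python; naive whitespace tokenization for illustration, not what a real pipeline like DLATK uses):
# Naive whitespace tokenization on the example from above
text = "Dog fighting dog"
tokens = text.lower().split()  # case folding, then split into tokens
types = set(tokens)            # unique vocabulary items

print(len(tokens), tokens)  # 3 tokens: ['dog', 'fighting', 'dog']
print(len(types), types)    # 2 types: {'dog', 'fighting'}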
Language Features
Dictionaries:
Top 5 most frequent words for GI, DICTION, and LIWC 2015
When using dictionaries, always inspect the most frequent words per category, check that they correlate as expected, and annotate a sample of matches for precision.
For LIWC, understand its typical correlation patterns and read the documentation.
Good Dictionaries:
Curated: LIWC
Annotation-based dictionaries: Warriner's extended ANEW (affective norms), labMT (these capture how words make people feel)
Sentiment / ML-based: Sentistrength, VADER, SwissChocolate, NRC (recommended)
Components of DLATK
Takes in a message table
Produce a feature table
Be careful how sparse encoding impacts calculations. For example, with 1-grams, to get the average relative frequency (the group norm) of a single word across users, we can't just take the mean over the rows that exist: that misses all the users who never used the word and should count as zeros (see the sketch after this list)
Uses lexicon tables to store dictionaries
Use outcome tables for correlations
Lexicon tables:
stored in the central dlatk_lexica database
contain term (word), category (dictionary category), and weight; sparsely encoded
All feature table names created with --add_lex_table contain "cat_"
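A minimal sketch of the sparse group-norm pitfall from above (plain Python with pandas; the toy columns mirror a DLATK feature table, but the data are made up):
import pandas as pd

# Sparse rows: user u3 never used "happy", so u3 has no row at all
feat = pd.DataFrame({
    "group_id": ["u1", "u2"],
    "feat": ["happy", "happy"],
    "group_norm": [0.02, 0.04],
})
all_users = ["u1", "u2", "u3"]

naive_mean = feat["group_norm"].mean()  # 0.03 -- wrong, ignores u3
dense = feat.set_index("group_id")["group_norm"].reindex(all_users, fill_value=0.0)
correct_mean = dense.mean()             # 0.02 -- zeros for non-users included
print(naive_mean, correct_mean)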
# 1-gram extraction
# --corpdb: database name; --corptable: message table; --correl_field: what to group by
# The result table breaks the 1-gram stats out at the correl_field level.
# group_norm is a feature's value divided by the total count of features for that group.
# !!! The result is sparsely coded: if a feature does not occur for a user, there is no row.
!dlatkInterface.py \
--corpdb eich \
--corptable msgs \
--correl_field user_id \
--add_ngrams -n 1  # use -n 1 2 3 to extract 1- to 3-grams
# Adding lexicon information to the 1-grams
# Produces a table counting each user's use of each dictionary category (e.g., positive emotion).
!dlatkInterface.py \
--corpdb eich \
--corptable msgs \
--correl_field user_id \
--add_lex_table -l LIWC2015  # add --weighted_lexicon to use the lexicon's weights
# Correlate LIWC against personality
# --rmatrix produces a correlation matrix in HTML format
# --csv produces the correlation matrix in CSV format
# --sort appends another correlation matrix, sorted by effect size
# --group_freq_thresh keeps only groups with at least this many words (only applies to correlations and predictions)
!dlatkInterface.py \
--corpdb eich \
--corptable msgs \
--correl_field user_id \
--group_freq_thresh 500 \
--correlate \
--rmatrix --csv --sort \
--feat_table 'feat$cat_mini_LIWC2015$msgs$user_id$1gra' \
--outcome_table blog_outcomes --outcomes age gender \
--controls control_var \
--output_name ~/mini_liwc_age_gender
# Correlation against categorical variables
!dlatkInterface.py \
--corpdb {corpdb} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate \
--rmatrix --csv --sort \
--feat_table {feat_miniliwc_table} \
--outcome_table {outcomes_table} \
--categories_to_binary occu \
--outcomes occu \
--output_name ~/mini_liwc_occu
# Creating 1-3 gram clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_1to3gram_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--categories_to_binary age_bins \
--tagcloud --make_wordclouds \
--output_name {OUTPUT_FOLDER}/1to3gram_ageBuckets
# Topic clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_topic_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--categories_to_binary age_bins \
--topic_tagcloud --make_topic_wordclouds \
--topic_lexicon topics_fb2k_freq \
--output_name {OUTPUT_FOLDER}/topic_ageBuckets
Quick NLP Stuff
Language is Weird:
Language follows Zipf's law: the probability that the word of rank r shows up in text is roughly proportional to 1/r, so the second most frequent word appears about half as often as the most frequent one.
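A minimal sketch for checking this on any corpus (plain Python; corpus.txt is a placeholder path):
from collections import Counter

with open("corpus.txt") as f:
    counts = Counter(f.read().lower().split())

# Under Zipf's law, rank * frequency stays roughly constant across ranks
for rank, (word, freq) in enumerate(counts.most_common(10), start=1):
    print(rank, word, freq, rank * freq)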
Minimum Data Intuition
Minimal: hundreds of words from hundreds of people
Medium: hundreds of words from thousands of people, or thousands of words from hundreds of people
Good: thousands of words from thousands of people
A fundamental difficulty of language
Many psychological processes map onto a single language outcome (e.g., use of first-person singular pronouns), but knowing the outcome, it is hard to trace it back to a specific process.
Scientific Storytelling with Language Analyses (for details check Lecture 8)
Testing a priori hypotheses in language
Testing specific language correlations/effects. Example paper: Narcissism and the use of personal pronouns revisited
Prediction
Predicts X from text, e.g., What Twitter Profile and Posted Images Reveal about Depression and Anxiety
Imputing estimates where there are none
Using the prediction error as an estimate of something else, e.g., Authentic self-expression on social media is associated with greater subjective well-being
Exploratory correlates
Papers that show "the language of X"
Show "dose response" / "XY, "IV-DV" patterns
Construct elaboration and refinement
Exploring the nomological network of a new construct
Construct differentiation through language
A conceivable avenue for item discovery in survey design
A complement to grounded theory and other idiographic approaches (e.g., Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence?)
Differentiating in the language space
Measuring Within-person change
e.g., The secret life of pronouns: flexibility in writing style and physical health
Exploiting semantic distances
Given a set of constructs, what are their semantic distances?
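A minimal sketch of computing pairwise semantic distances (plain Python with numpy; the embedding vectors are made-up stand-ins for real construct embeddings, e.g., averaged word vectors over each construct's items):
import numpy as np

constructs = {
    "gratitude":  np.array([0.8, 0.1, 0.3, 0.2]),   # made-up embeddings
    "optimism":   np.array([0.7, 0.2, 0.4, 0.1]),
    "loneliness": np.array([-0.5, 0.9, 0.1, 0.3]),
}

def cosine_distance(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for n1 in constructs:
    for n2 in constructs:
        if n1 < n2:  # each unordered pair once
            print(n1, n2, round(cosine_distance(constructs[n1], constructs[n2]), 3))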
NLP Project Intuition
Basic Power Analysis
Tips:
It's often better to get fewer words per observation (100+) and more observations.
Sample Size Intuition
Words per group intuition
General Rule of Thumbs:
If your datasets are limited, reduce your language dimensions
reduce features with occurrence filtering (down to 3k-10k 1-grams) or a pointwise mutual information threshold (down to <10k 2-3 grams); see the DLATK sketch after this list
Select language features based on the literature
project into a lower-dimensional space (sentiment, valence/arousal, LIWC)
model topics
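A sketch of the corresponding DLATK calls (database/table names follow the examples above and are illustrative; use the feature table name printed by the extraction step, and double-check flags against your DLATK version):
# Occurrence filter: keep only 1-grams used by more than 5% of groups
!dlatkInterface.py \
-d eich -t msgs -c user_id \
-f 'feat$1gram$msgs$user_id' \
--feat_occ_filter --set_p_occ 0.05

# Collocation filter: keep only 2-3 grams whose parts co-occur more than chance (PMI threshold)
!dlatkInterface.py \
-d eich -t msgs -c user_id \
-f 'feat$1to3gram$msgs$user_id' \
--feat_colloc_filter --set_pmi_threshold 3.0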
Ballpark
No discovery, but sentiment etc.: 30+ words from 100 groups
Minimal: 100s of words from 100s of groups
Medium: 100s of words from 1000s of groups, or 1000s of words from 100s of groups
Good: 1000s of words from 1000s of groups
Code Snippets
Running Database in Colab
# Connect SQLAlchemy to the SQLite database file
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db")
# Point the %sql extension at the engine
%sql tutorial_db_engine
# Reload the database
%reload_ext sql
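Once connected, cells can query the database directly, e.g. (assuming the msgs table from the DLATK examples above):
%sql SELECT * FROM msgs LIMIT 5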
R in Colab & DLATK Functions
# write from R to db
dbWriteTable(db_con, "table_name", df, overwrite = TRUE, row.names = FALSE)
# query from db to R
df <- dbGetQuery(db_con, "sql_query")
# functions to check df
checkDf2(feat_meta)
# Functions to convert feature table to wide format
feat_meta_wide <- importFeat(feat_meta)
# Graveyard
Working with the Dictionary
LIWC
Correlation pattern: review April 17th lecture notes
Annotation-based emotion dictionaries
Affective Norms for English Words (ANEW) by Bradley & Lang
Captures the ways in which words are perceived
The impact that emotional features have on the processing and memory of words
ML-based dictionaries
generally better than annotation-based
ML & AI Analysis Annotate
table 3, comparing lv2 and lv3 result
Language Analysis in Science
Testing a priori hypotheses in language
Prediction
Exploratory correlates
Measuring within-person change
Exploiting semantic distances
Embedding
distance to probed points vs. factor analysis
Open Vocabulary Analysis
--feat_occ_filter --set_p_occ 0.05: sets the occurrence threshold to 5%, so only features used by more than 5% of groups are kept (note that the feature table flag uses a single dash: -f); see the "option 1: filter down the table we have" slides
Extracting 2-grams
add 2-gram extraction method (see the sketch below)
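A sketch mirroring the 1-gram call from earlier (same database and table names):
# 2-gram extraction
!dlatkInterface.py \
--corpdb eich \
--corptable msgs \
--correl_field user_id \
--add_ngrams -n 2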