NLP Street Fighting

Notes from Stanford Psych 290 taught by Johannes Eichstaedt

NLP Research Street Knowledge

Conferences:

DLATK

Lingo & Components

  • Type: an element of the vocabulary (e.g., "dog fighting dog" has two types)

  • Token: an instance of that type in text (e.g., "dog fighting dog" has three tokens)

  • Case folding: reduce all letters to lowercase (for some tasks, e.g., sentiment analysis, information extraction, and machine translation, keeping case is useful)

  • Lemmatization: Reduce inflections or variant forms to base form (e.g., am, are, is -> be)

  • Stemming: reduce terms to their root form (see this blog post about the difference from lemmatization). In practice, with enough data, the choice becomes a secondary concern; see the sketch below.
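
A minimal Python sketch of these distinctions, using NLTK (the library choice and the wordnet download are assumptions for illustration, not from the course):

# Types vs. tokens, case folding, lemmatization, stemming.
import nltk
nltk.download("wordnet", quiet=True)  # required by WordNetLemmatizer (assumed setup)
from nltk.stem import WordNetLemmatizer, PorterStemmer

tokens = "dog fighting dog".split()   # 3 tokens
types = set(tokens)                   # 2 types: {"dog", "fighting"}

folded = "Dog Fighting DOG".lower()   # case folding

print(WordNetLemmatizer().lemmatize("is", pos="v"))  # lemmatization -> "be"
print(PorterStemmer().stem("fighting"))              # stemming -> "fight"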

Language Features

Dictionaries:

  • Top 5 most frequent words for GI, DICTION, and LIWC 2015

  • When using dictionaries, always inspect whether the most frequent words correlate as expected, and annotate them for precision.

  • For LIWC, understand its correlation patterns and read the documentation.

  • Good Dictionaries:

    • Curated: LIWC

    • Annotation-based dictionaries: Warriner's new ANEW (affective), LabMT (they capture how words make people feel)

    • Sentiment / ML-based: Sentistrength, VADER, SwissChocolate, NRC (recommended)

Components of DLATK

  • Takes in a message table

  • Produces a feature table

    • Be careful how sparse encoding affects calculations. For example, with 1-grams, if you want the average relative frequency of use (the group norm) of a single word, you can't just take the mean of the stored rows; that misses all the people who never used that word (see the sketch after this list).

  • Uses lexicon tables to store dictionaries

  • Uses outcome tables for correlations
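
A small pandas sketch of that pitfall (the column names mirror DLATK's feature-table layout, but the values are made up):

# Sparse feature rows bias a naive mean of group_norm.
import pandas as pd

# u3 never used "happy", so u3 has no row at all (sparse encoding).
feat = pd.DataFrame({
    "group_id":   ["u1", "u2"],
    "feat":       ["happy", "happy"],
    "group_norm": [0.02, 0.04],
})
all_users = ["u1", "u2", "u3"]

naive = feat["group_norm"].mean()                    # 0.03, ignores u3
correct = feat["group_norm"].sum() / len(all_users)  # 0.02, counts u3's zero
print(naive, correct)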

Lexicon tables:

  • stored in the central dlatk_lexica database

  • contain term (the word), category (the lexicon category), and weight columns, sparsely encoded

  • All feature table names created with --add_lex_table contain "cat_"
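
For orientation, a hypothetical peek at a lexicon table's shape (the SQLite path and table name are assumptions; in the course setup the lexica live in the central dlatk_lexica database):

# Inspect (term, category, weight) rows of a lexicon table.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///sqlite_data/dlatk_lexica.db")  # path assumed
print(pd.read_sql("SELECT term, category, weight FROM LIWC2015 LIMIT 5", engine))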


# 1-gram extraction
# The result table breaks the 1-gram stats down at the correl_field level.
# group_norm = the feature's value divided by the total number of features for this group.
# !!! The result is sparse-coded: if a feature does not occur for a user, there is no row.
#
# --corpdb:        database name
# --corptable:     message table
# --correl_field:  what to group by
# --add_ngrams -n: which n-grams to extract (use -n 1 2 3 for 1- to 3-grams)

!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --add_ngrams -n 1

# Adding lexicon information to the 1-grams.
# Produces a table counting each user's words per dictionary category (e.g., positive emotion).
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --add_lex_table -l LIWC2015  # add --weighted_lexicon to enable weights
    
# Correlate LIWC categories against outcomes (here: age and gender).
# --rmatrix produces a correlation matrix in HTML format.
# --csv produces the correlation matrix in CSV format.
# --sort appends a copy of the correlation matrix sorted by effect size.
# --group_freq_thresh drops groups below this word count (applies only to correlations and predictions).
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table 'feat$cat_mini_LIWC2015$msgs$user_id$1gra' \
    --outcome_table blog_outcomes --outcomes age gender \
    --controls control_var \
    --output_name ~/mini_liwc_age_gender
    
# Correlation against categorical variables 
!dlatkInterface.py \
    --corpdb {corpdb} \
    --corptable {msgs_table} \
    --correl_field user_id \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table {feat_miniliwc_table} \
    --outcome_table {outcomes_table} \
    --categories_to_binary occu \
    --outcomes occu \
    --output_name ~/mini_liwc_occu

# Creating 1-3 gram clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_1to3gram_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--categories_to_binary age_bins \
--tagcloud --make_wordclouds \
--output_name {OUTPUT_FOLDER}/1to3gram_ageBuckets

# Topic clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_topic_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--categories_to_binary age_bins \
--topic_tagcloud --make_topic_wordclouds \
--topic_lexicon topics_fb2k_freq \
--output_name {OUTPUT_FOLDER}/topics_ageBuckets


Quick NLP Stuff

Language is Weird:

  • Language follows Zipf's law. The probability of the word with rank $r$ showing up in text is

    $p(W_r) = \frac{0.1}{r}$
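
A toy check of this in Python (the mini-corpus is made up for illustration; real corpora need much more text for the fit to show):

# Compare empirical rank-frequency against Zipf's 0.1 / rank.
from collections import Counter

tokens = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(tokens)
total = sum(counts.values())

for rank, (word, n) in enumerate(counts.most_common(5), start=1):
    print(rank, word, round(n / total, 3), "vs. Zipf:", round(0.1 / rank, 3))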

Minimum Data Intuition

  • Minimal: hundreds of words from hundreds of people

  • Medium: hundreds of words from thousands of people, or thousands of words from hundreds of people

  • Good: thousands of words from thousands of people

A fundamental difficulty of language

  • Many processes map to a single language outcome (e.g., use of first-person singular pronouns), so observing the outcome makes it hard to infer which specific process produced it.

Scientific Storytelling with Language Analyses (for details check Lecture 8)

Testing a priori hypotheses in language

Testing specific language correlations/effects. Example paper: Narcissism and the use of personal pronouns revisited

Prediction

Predicts X from text, e.g., What Twitter Profile and Posted Images Reveal about Depression and Anxiety

Imputing estimates where there are none

Using the prediction error as an estimate of something else, e.g., Authentic self-expression on social media is associated with greater subjective well-being

Exploratory correlates

Papers that show "the language of X"

Show "dose response" / "XY, "IV-DV" patterns

Construct elaboration and refinement

  • Exploring the nomological network of a new construct

Construct differentiation through language

Measuring Within-person change

e.g., The secret life of pronouns: flexibility in writing style and physical health

Exploiting semantic distances

Given a set of constructs, what are their semantic distances

NLP Project Intuition

Basic Power Analysis

Tips:

  1. It's often better to get fewer words per observation (100+) and more observations; see the sketch below.
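
A rough power-analysis sketch for a correlation effect size, using the Fisher z normal approximation (the method and the r = .1 example are assumptions for illustration, not from the lecture):

# Approximate N needed to detect a correlation r at a given alpha and power.
import math
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_r = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z-transform of r
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided test
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

print(n_for_correlation(0.1))  # roughly 780 groups for a small effect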

Sample Size Intuition

[Figure: feature type vs. demographic outcome, sample sizes needed]
[Figure: feature type vs. personality outcome, sample sizes needed]

Words per group intuition

[Figure: words per user for samples of N = 1,000 and N = 5,000]
[Figure: words per user needed to discover significant correlations]

General Rules of Thumb:

  • If your datasets are limited, reduce your language dimensions

    • reduce features with occurrence filtering (down to 3k to 10k 1-grams) and a pointwise mutual information threshold (down to <10k 2- and 3-grams)

    • Select language features based on the literature

    • project into a lower-dimensional space (sentiment, valence/arousal, LIWC)

    • model topics

  • Ballpark

    • No discovery, but sentiment etc.: 30+ words from 100 groups

    • Minimal: 100s of words from 100s of groups

    • Medium: 100s of words from 1000s of groups, or 1000s of words from 100s of groups

    • Good: 1000s of words from 1000s of groups

Code Snippets

Running Database in Colab

# load the sql extension and create an engine for the SQLite database file
# (the sqlite URL needs three slashes; the MySQL-style charset parameter does not apply to SQLite)
%load_ext sql
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db")

# point the extension at the database
%sql tutorial_db_engine

# reload the extension if needed
%reload_ext sql

R in Colab & DLATK Functions

# write from R to db 
dbWriteTable(db_con, "table_name", df, overwrite = TRUE, row.names = FALSE)

# query from db to R
df <- dbGetQuery(db_con, "sql_query")

# functions to check df
checkDf2(feat_meta)

# Functions to convert feature table to wide format 
feat_meta_wide <- importFeat(feat_meta)

Graveyard

Working with the Dictionary

LIWC

Correlation pattern: review April 17th lecture notes

Annotation-based emotion dictionaries

  • Affective Norms for English Words (ANEW) by Bradley & Lang

    • Captures the ways in which words are perceived

    • The impact that emotional features have on the processing and memory of words


ML-based dictionaries

  • generally better than annotation-based

ML & AI Analysis Annotate

  • Table 3, comparing the lv2 and lv3 results

Language Analysis in Science

Testing a priori hypotheses in language

Prediction

Exploratory correlates

Measuring within-person change

Exploiting semantic distances

Embedding

distance to probed points vs. factor analysis

Open Vocabulary Analysis

--feat_occ_filter --set_p_occ 0.05: sets the occurrence threshold to 5%, so only features used by more than 5% of groups are kept (notice that the feature table here uses the one-dash flag: -f). This is the "option 1: filter down the table" step from the slides.
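
A hedged example invocation, following the pattern of the blocks above (the 1-gram feature table name is an assumption based on DLATK's naming convention):

!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    -f 'feat$1gram$msgs$user_id$16to16' \
    --feat_occ_filter --set_p_occ 0.05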

Extracting 2-grams

  • 2-gram extraction reuses the --add_ngrams flag from the 1-gram block above (see the sketch below)
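
A minimal sketch, reusing the eich database and msgs table from the 1-gram block; only -n changes:

!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --add_ngrams -n 2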
