NLP Street Fighting

Notes from Stanford Psych 290 taught by Johannes Eichstaedt

NLP Research Street Knowledge

Conferences:

DLATK

Lingo & Components

  • Type: an element of the vocabulary (e.g., "dog fighting dog" has two types)

  • Token: an instance of that type in text (e.g., "dog fighting dog" has three tokens)

  • Case folding: reduce all letters to lowercase (though for some tasks, such as sentiment analysis, information extraction, and machine translation, preserving case helps)

  • Lemmatization: Reduce inflections or variant forms to base form (e.g., am, are, is -> be)

  • Stemming: reduce terms to their root form (see this blog post about the difference). In practice, with enough data, the stemming-vs.-lemmatization choice becomes a secondary concern

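A minimal sketch of the terms above, using NLTK for stemming and lemmatization (an assumption; the course may use other tooling):

```python
# Types vs. tokens, case folding, stemming, and lemmatization in miniature.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)          # one-time download for the lemmatizer

tokens = "Dog fighting dog".lower().split()   # case folding + whitespace tokenization
print(len(tokens), len(set(tokens)))          # 3 tokens, 2 types

print(PorterStemmer().stem("fighting"))               # 'fight' (rule-based root chopping)
print(WordNetLemmatizer().lemmatize("am", pos="v"))   # 'be' (dictionary base form)
```
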
Language Features

Dictionaries:

  • Top 5 most frequent words for GI, DICTION, and LIWC 2015

  • When using dictionaries, we should always check that the most frequent words correlate as expected, and annotate a sample of matches for precision.

  • For LIWC, understand its correlation patterns and read the documentation.

  • Good Dictionaries:

    • Curated: LIWC

    • Annotation-based: Warriner et al.'s updated ANEW (affective norms), LabMT (these capture how words make people feel)

    • Sentiment / ML-based: SentiStrength, VADER, SwissChocolate, NRC (recommended)

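For instance, VADER (listed above) is pip-installable and returns per-text polarity scores; a quick sketch:

```python
# VADER sentiment scores for a single message (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love this class, but the homework is brutal.")
print(scores)   # dict with 'neg', 'neu', 'pos', and a normalized 'compound' score
```
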
Components of DLATK

  • Takes in a message table

  • Produces a feature table

    • Be careful how this impacts calculations. For example, with 1-grams, if I want the average percentage of use (the group norm) of a single word, I can't just take the mean of the stored rows, because that misses all the people who did not use that word (see the sketch after this list)

  • Uses lexicon tables to store dictionaries

  • Uses outcome tables for correlations

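A minimal pandas sketch of the group-norm pitfall above (column names mirror DLATK's conventions, but the data are made up):

```python
# Sparse feature tables only store rows for users who used a word, so a naive
# mean of group_norm overstates average usage; re-insert the missing zeros first.
import pandas as pd

feat = pd.DataFrame({
    "group_id":   ["u1", "u2"],          # u3 never used "happy": no row at all
    "feat":       ["happy", "happy"],
    "group_norm": [0.02, 0.04],          # the word's relative frequency per user
})
all_users = ["u1", "u2", "u3"]

naive = feat["group_norm"].mean()                       # 0.03: wrong
dense = (feat.set_index("group_id")["group_norm"]
             .reindex(all_users, fill_value=0.0))       # add u3's implicit zero
print(naive, dense.mean())                              # 0.03 vs. the correct 0.02
```
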
Lexicon tables:

  • stored in the central dlatk_lexica database

  • contains term (word), category (lexicon), and weight columns, sparsely encoded

  • All feature table names created with --add_lex_table contain "cat_"

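Illustratively, the sparse encoding stores one row per (term, category) pair with a weight; the rows below are hypothetical:

```python
# Hypothetical rows in the (term, category, weight) layout described above;
# only nonzero term-category weights are stored (sparse encoding).
lexicon = [
    ("happy",     "POSEMO",  1.0),
    ("miserable", "NEGEMO",  1.0),
    ("miserable", "SADNESS", 0.8),   # a term can belong to several categories
]
```
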
Quick NLP Stuff

Language is Weird:

  • Language follows Zipf's law: the probability that the word of rank $r$ shows up in text is approximately

    $$p(W_r) = \frac{0.1}{r}$$

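A quick numerical check of that approximation (it only holds roughly, over a finite vocabulary, since the harmonic series diverges):

```python
# Predicted word probabilities under the p(W_r) = 0.1 / r approximation.
for rank in (1, 2, 10, 100):
    print(f"rank {rank:>3}: p = {0.1 / rank:.4f}")   # 0.1000, 0.0500, 0.0100, 0.0010
```
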
Minimum Data Intuition

  • Minimal: hundreds of words from hundreds of people

  • Medium: hundreds of words from thousands of people, or thousands of words from hundreds of people

  • Good: thousands of words from thousands of people

A fundamental difficulty of language

  • Many processes map to a single surface outcome (e.g., use of a singular pronoun), but from the outcome alone it is hard to recover which specific process produced it.

Scientific Storytelling with Language Analyses (for details check Lecture 8)

Testing a priori hypotheses in language

Testing specific language correlations/effects. Example paper: Narcissism and the use of personal pronouns revisited

Prediction

Predicts X from text, e.g., What Twitter Profile and Posted Images Reveal about Depression and Anxiety

Imputing estimates where there are none

Using the prediction error as an estimate of something else, e.g., Authentic self-expression on social media is associated with greater subjective well-being

Exploratory correlates

Papers that show "the language of X"

Show "dose response" / "XY, "IV-DV" patterns

Construct elaboration and refinement

  • Exploring the nomological network of a new construct

Construct differentiation through language

Measuring Within-person change

e.g., The secret life of pronouns: flexibility in writing style and physical health

Exploiting semantic distances

Given a set of constructs, what are their semantic distances?

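One common way to operationalize this (an illustration, not the course's prescribed method) is cosine similarity between construct embeddings:

```python
# Cosine similarity between construct vectors; the 3-d vectors here are made up,
# in practice they would come from a word/sentence embedding model.
import numpy as np

emb = {
    "gratitude": np.array([0.9, 0.1, 0.2]),
    "optimism":  np.array([0.8, 0.3, 0.1]),
    "hostility": np.array([-0.7, 0.6, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["gratitude"], emb["optimism"]))   # high: close in meaning
print(cosine(emb["gratitude"], emb["hostility"]))  # negative: far apart
```
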
NLP Project Intuition

Basic Power Analysis

Tips:

  1. It's often better to get fewer words per observation (100+) and get more observations.

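As a starting point, the standard Fisher-z approximation gives the number of observations needed to detect a correlation r (my addition; these exact numbers are not from the lecture notes):

```python
# Sample size to detect a Pearson correlation r at a given alpha and power,
# via the Fisher z transformation.
from math import atanh, ceil
from scipy.stats import norm

def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    return ceil(((z_alpha + z_beta) / atanh(r)) ** 2 + 3)

print(n_for_correlation(0.1))   # ~783 observations for a small effect
print(n_for_correlation(0.3))   # ~85 for a medium one
```
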
Sample Size Intuition

(Figures: required sample size by feature type, for demographic and for personality outcomes)

Words per group intuition

(Figures: words per user for samples of N = 1,000 and N = 5,000, and words per user needed to discover significant correlations)

General Rules of Thumb:

  • If your datasets are limited, reduce your language dimensions

    • reduce features with occurrence filtering (down to 3k-10k 1-grams) and a pointwise mutual information threshold (down to <10k 2- and 3-grams); see the PMI sketch after this list

    • Select language features based on the literature

    • project into a lower-dimensional space (sentiment, valence/arousal, LIWC)

    • model topics

  • Ballpark

    • No open-vocabulary discovery, just sentiment etc.: 30+ words from 100 groups

    • Minimal: 100s of words from 100s of groups

    • Medium: 100s of words from 1000s of groups, or 1000s of words from 100s of groups

    • Good: 1000s of words from 1000s of groups

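The PMI sketch referenced above: NLTK's collocation tools can apply both an occurrence filter and a PMI cutoff to 2-grams (a toy illustration, not DLATK's pipeline):

```python
# Occurrence filtering plus a PMI threshold for bigrams, using NLTK.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

tokens = "the dog park was full so the dog park closed the park".split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                      # occurrence filter: drop one-off bigrams
scored = finder.score_ngrams(BigramAssocMeasures.pmi)   # [(bigram, pmi), ...]
keep = [bg for bg, pmi in scored if pmi > 1.0]          # PMI threshold (arbitrary cutoff)
print(keep)                                             # bigrams that pass both filters
```
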
Code Snippets

Running Database in Colab

R in Colab & DLATK Functions

# Graveyard

Working with the Dictionary

LIWC

Correlation pattern: review April 17th lecture notes

Annotation-based emotion dictionaries

  • Affective Norms for English Words (ANEW) by Bradley & Lang

    • Captures the ways in which words are perceived

      • The impact that emotional features have on the processing and memory of words

ML-based dictionaries

  • generally perform better than annotation-based dictionaries

ML & AI Analysis Annotate

  • Table 3, comparing level-2 and level-3 results

Language Analysis in Science

Testing a priori hypotheses in language

Prediction

Exploratory correlates

Measuring within-person change

Exploiting semantic distances

Embedding

distance to probed points vs. factor analysis

Open Vocabulary Analysis

--feat_occ_filter --set_p_occ 0.05: sets the occurrence threshold to 5%, so only features used by more than 5% of users are kept (notice that the feature table flag uses a single dash: -f); this is the "option 1" way of filtering down the table, per the slides

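Conceptually (a pandas sketch, not DLATK's actual SQL), the filter keeps features used by more than p_occ of the groups:

```python
# Keep only features used by > 5% of users, mirroring --set_p_occ 0.05.
import pandas as pd

feat = pd.DataFrame({                       # toy sparse (user, word) rows
    "group_id": ["u1", "u2", "u3", "u4", "u5", "u6", "u1"],
    "feat":     ["happy"] * 6 + ["zxq"],
})
total_users = 20                            # pretend the full table has 20 users

share = feat.groupby("feat")["group_id"].nunique() / total_users
print(list(share[share > 0.05].index))      # ['happy']; 'zxq' (1/20 = 5%) is dropped
```
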
Extracting 2-grams

  • add 2-gram extraction method
