Be careful how this affects calculations. For example, with 1-grams, if we want the average percentage of use (the group norm) of a single word across people, we can't just take the mean of the rows that exist, because sparse coding drops all the people who never used that word (their group norm should count as 0).
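A minimal pandas sketch of the correct averaging, assuming a hypothetical sparse long-format feature table with group_id, feat, and group_norm columns and a known total group count:
import pandas as pd
# hypothetical sparse table: a row exists only if the group actually used the word
feat = pd.DataFrame({
    "group_id":   [1, 2, 4],
    "feat":       ["happy", "happy", "happy"],
    "group_norm": [0.02, 0.01, 0.04],
})
n_groups = 5  # total number of groups, including those with no "happy" row
naive_mean = feat["group_norm"].mean()            # wrong: ignores the non-users
true_mean = feat["group_norm"].sum() / n_groups   # counts missing rows as 0
print(naive_mean, true_mean)                      # 0.0233... vs 0.014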
DLATK uses lexicon tables to store dictionaries
Use outcome tables for correlations
Lexicon tables:
stored in the central dlatk_lexica database
contain term (word), category (lexicon category), and weight columns, sparse encoded
All feature table names created with --add_lex_table contain "cat_"
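For illustration only, rows in a weighted lexicon table look roughly like this; the terms and LIWC-style category labels below are made up for the example:
import pandas as pd
# one row per (term, category) pair, with an optional weight column
lexicon = pd.DataFrame([
    {"term": "happy",    "category": "POSEMO", "weight": 1.0},
    {"term": "terrible", "category": "NEGEMO", "weight": 1.0},
    {"term": "worried",  "category": "ANX",    "weight": 1.0},
])
print(lexicon)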
Quick NLP Stuff
Language is Weird:
Language follows Zipf's law: the probability that the word of frequency rank r appears in text is roughly
p(w_r) ≈ 0.1 / r
(so the most frequent word makes up about 10% of tokens, the second about 5%, and so on)
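A quick way to sanity-check this on your own data (the corpus path is a placeholder; any large plain-text file works):
from collections import Counter
# count word frequencies and compare the top ranks against the 0.1 / r approximation
with open("sample_corpus.txt") as f:   # placeholder path
    words = f.read().lower().split()
counts = Counter(words)
total = sum(counts.values())
for rank, (word, count) in enumerate(counts.most_common(10), start=1):
    print(rank, word, round(count / total, 4), "vs", round(0.1 / rank, 4))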
Minimum Data Intuition
Minimal: hundreds of words from hundreds of people
Medium: hundreds of words from thousands of people, or
thousands of words from hundreds of people
Good: thousands of words from thousands of people
A fundamental difficulty of language
Many processes map to a single outcome (e.g., use of a singular pronoun), but from the outcome alone it is hard to tell which specific process produced it.
Scientific Storytelling with Language Analyses (for details check Lecture 8)
Given a set of constructs, what are their semantic distances
NLP Project Intuition
Basic Power Analysis
Tips:
It's often better to get fewer words per observation (100+) and get more observations.
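A minimal sketch of the underlying power math for correlations, using the Fisher z approximation (a generic formula, not something specific to the lecture; scipy is assumed):
import numpy as np
from scipy.stats import norm
# approximate N needed to detect a correlation r at alpha (two-sided) with the given power:
# n ~= ((z_{alpha/2} + z_{power}) / arctanh(r))^2 + 3
def n_for_correlation(r, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return int(np.ceil(((z_alpha + z_power) / np.arctanh(r)) ** 2 + 3))
print(n_for_correlation(0.1))   # ~783 observations for a small effect
print(n_for_correlation(0.3))   # ~85 for a medium effect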
Sample Size Intuition
Feature Type vs. Demographic for Sample Size Count
Feature Type vs. Personality Outcome for Sample Size Count
Words per group intuition
Words per user for samples of N = 1,000 and N = 5,000
Words per user to discover significant correlations
General Rules of Thumb:
If your datasets are limited, reduce your language dimensions
reduce features with occurrence filtering (down to 3k-10k 1-grams) or a pointwise mutual information threshold (down to <10k 2- and 3-grams)
Select language features based on the literature
project into a lower-dimensional space (sentiment, valence/arousal, LIWC)
model topics
Ballpark
No discovery, but sentiment, etc: 30+ words from 100 groups
Minimal: 100s of words from 100s of groups
Medium: 100s of words from 1000s of groups, or 1000s of words from 100s of groups
Good: 1000s of words from 1000s of groups
Code Snippets
Running Database in Colab
R in Colab & DLATK Functions
# Graveyard
Working with the Dictionary
LIWC
Correlation pattern: review April 17th lecture notes
Annotation-based emotion dictionaries
Affective Norms for English Words (ANEW) by Bradley & Lang
Captures the ways in which words are perceived
The impact that emotional features have on the processing and memory of words
ML-based dictionaries
generally better than annotation-based
ML & AI Analysis Annotate
table 3, comparing lv2 and lv3 result
Language Analysis in Science
Testing a priori hypotheses in language
Prediction
Exploratory correlates
Measuring within-person change
Exploiting semantic distances
Embedding
distance to probed points vs. factor analysis
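A rough sketch of the "distance to probed points" idea: embed a message and a few construct probe phrases, then compare with cosine similarity. The sentence-transformers model and the probe wording here are illustrative assumptions, not the course's setup:
import numpy as np
from sentence_transformers import SentenceTransformer  # assumes the package is installed
model = SentenceTransformer("all-MiniLM-L6-v2")         # illustrative model choice
probes = {"anxiety": "worried nervous afraid", "calm": "relaxed peaceful at ease"}
message = "I can't stop thinking about tomorrow's exam"
vecs = model.encode([message] + list(probes.values()))
msg_vec, probe_vecs = vecs[0], vecs[1:]
for construct, probe_vec in zip(probes, probe_vecs):
    # cosine similarity between the message and each construct's probe point
    sim = np.dot(msg_vec, probe_vec) / (np.linalg.norm(msg_vec) * np.linalg.norm(probe_vec))
    print(construct, round(float(sim), 3))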
Open Vocabulary Analysis
feat_occ_filter --set_p_occ 0.05: sets the occurrence threshold to 5%, so only features that appear for more than 5% of groups are kept (note that here the feature table is passed with the single-dash flag -f); this is "option 1" for filtering down the feature table in the slides
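A sketch of the full filtering call, in the same style as the extraction commands below; the 1-gram feature table name is a placeholder for whatever --add_ngrams produced:
# keep only 1-grams used by more than 5% of groups (writes a new, filtered feature table)
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --feat_occ_filter --set_p_occ 0.05 \
    -f '{feat_1gram_table}'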
# 1-gram extraction
# --corpdb: database name; --corptable: message table; --correl_field: what to group by;
# --add_ngrams -n 1: extract 1-grams (use -n 1 2 3 to get 1- to 3-grams)
# The result table breaks the 1-gram stats out at the correl_field level.
# group_norm is a feature's value divided by the group's total feature count.
# !!! The result is sparse coded, so if a feature does not occur for a user, there is no row.
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --add_ngrams -n 1
# Adding lexicon information on top of the 1-grams
# Produces a table counting each user's use of each lexicon category (e.g., positive emotion).
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --add_lex_table -l LIWC2015   # add --weighted_lexicon to use the lexicon weights
# Correlate LIWC categories against outcomes (here: age and gender)
# --rmatrix produces a correlation matrix in HTML format
# --csv produces the correlation matrix in CSV format
# --sort appends another correlation matrix sorted by effect size
# --group_freq_thresh restricts to groups with at least this many words (only applies to correlations and predictions)
!dlatkInterface.py \
    --corpdb eich \
    --corptable msgs \
    --correl_field user_id \
    --group_freq_thresh 500 \
    --correlate \
    --rmatrix --csv --sort \
    --feat_table 'feat$cat_mini_LIWC2015$msgs$user_id$1gra' \
    --outcome_table blog_outcomes --outcomes age gender \
    --controls control_var \
    --output_name ~/mini_liwc_age_gender
# Correlation against categorical variables
!dlatkInterface.py \
--corpdb {corpdb} \
--corptable {msgs_table} \
--correl_field user_id \
--correlate \
--rmatrix --csv --sort \
--feat_table {feat_miniliwc_table} \
--outcome_table {outcomes_table} \
--categories_to_binary occu \
--outcomes occu \
--output_name ~/mini_liwc_occu
# Creating 1-3 gram clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_1to3gram_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--category_to_binary age_bins \
--tagcloud --make_wordclouds \
--output_name {OUTPUT_FOLDER}/1to3gram_ageBuckets
# Topic clouds
!dlatkInterface.py \
-d {corpdb} -t {msgs_table} -c user_id \
--correlate --csv \
--feat_table '{feat_topic_table}' \
--outcome_table {outcomes_table} --outcomes age_bins \
--category_to_binary age_bins \
--topic_tagcloud --make_topic_wordclouds \
--topic_lexicon topics_fb2k_freq \
--output_name {OUTPUT_FOLDER}/topics_ageBuckets
# connect SQLAlchemy to the SQLite database file (note the three slashes for a relative file path)
from sqlalchemy import create_engine
tutorial_db_engine = create_engine(f"sqlite:///sqlite_data/{database}.db")
# point the sql magic at the engine
%sql tutorial_db_engine
# Reload the sql magic extension
%reload_ext sql
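Once the connection works, tables can be queried straight from a notebook cell; for example, assuming the blog_outcomes table used above has been loaded into this database:
# peek at a few rows of the outcomes table
%sql SELECT * FROM blog_outcomes LIMIT 5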
# write a data frame from R to the db
dbWriteTable(db_con, "table_name", df, overwrite = TRUE, row.names = FALSE)
# query from the db into an R data frame
df <- dbGetQuery(db_con, "sql_query")
# functions to check df
checkDf2(feat_meta)
# Functions to convert feature table to wide format
feat_meta_wide <- importFeat(feat_meta)
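If you want to see what the helper is doing, a rough pandas equivalent of the long-to-wide conversion (not the course function; the column names follow the standard DLATK feature table layout):
import pandas as pd
# toy sparse long-format feature table (group_id, feat, group_norm)
feat_long = pd.DataFrame({
    "group_id":   [1, 1, 2],
    "feat":       ["happy", "sad", "happy"],
    "group_norm": [0.02, 0.01, 0.04],
})
# one row per group, one column per feature; features a group never used become 0
feat_wide = feat_long.pivot(index="group_id", columns="feat", values="group_norm").fillna(0)
print(feat_wide)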