NLP-C2-W4: Word Embeddings with Neural Networks
https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/4
Common Embedding Methods
Basic word embeddings
word2vec (Google 2013)
Continuous bag-of-words (CBOW): Use context words around the center word to predict the center word
Continuous skip-gram / Skip-gram with negative sampling (SGNS): Use the center word to predict the surrounding context words
Global Vectors (GloVe) (Stanford, 2014)
fastText (Facebook, 2016)
Supports out-of-vocabulary words
Can average word embeddings to get an embedding for a sentence (see the sketch after this list)
Deep learning-based embeddings
BERT (Google, 2018)
ELMo (Allen Institute for AI, 2018)
GPT-2 (OpenAI, 2019)
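A minimal sketch of the sentence-averaging idea mentioned above, assuming a toy, hand-made embedding table (the vocabulary and vector values here are made up purely for illustration):

import numpy as np

# Toy embedding table: word -> 4-dimensional vector (values are made up).
embeddings = {
    "i":    np.array([0.1, 0.3, -0.2, 0.5]),
    "love": np.array([0.7, -0.1, 0.4, 0.2]),
    "nlp":  np.array([0.2, 0.6, 0.1, -0.3]),
}

def sentence_embedding(tokens, embeddings):
    """Average the embeddings of the in-vocabulary tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

print(sentence_embedding(["i", "love", "nlp"], embeddings))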
CBOW Architecture
Word representation: each word is initially represented as a one-hot vector of size $V$ (the vocabulary size); the dense embeddings are learned as the network's weights
Model Architecture:
Input layer:
Takes in the average of the one-hot vectors of all context words (size $V$, the vocabulary size)
Recall that a batch of inputs is stacked as columns of a matrix $X = [x^{(1)} \; x^{(2)} \; \cdots \; x^{(m)}] \in \mathbb{R}^{V \times m}$, where $m$ is the number of inputs (the batch size)
Hidden layer:
A single layer of size $N$, the embedding dimension, with ReLU activation: $h = \mathrm{ReLU}(W_1 x + b_1)$
Output layer:
Size $V$ with softmax activation, $\hat{y} = \mathrm{softmax}(W_2 h + b_2)$, giving a probability distribution over the vocabulary; pick the arg-max as the predicted center word
Cost function:
Cross-entropy loss between the softmax output $\hat{y}$ and the one-hot vector $y$ of the actual center word (sketched below)
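A minimal NumPy sketch of the forward pass and the cross-entropy cost $J = -\sum_{k=1}^{V} y_k \log \hat{y}_k$, assuming a tiny vocabulary, randomly initialized weights, and the $W_1 \in \mathbb{R}^{N \times V}$, $W_2 \in \mathbb{R}^{V \times N}$ shapes used below (all names and values here are illustrative, not the course notebook's code):

import numpy as np

rng = np.random.default_rng(0)

V, N = 5, 3                                          # vocabulary size, embedding dimension
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))   # input -> hidden
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))   # hidden -> output

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0))    # subtract max for numerical stability
    return e / e.sum(axis=0)

# Input: average of the one-hot vectors of the context words (indices 0, 1, 3, 4),
# used to predict the center word at index 2.
context_idx, center_idx = [0, 1, 3, 4], 2
x = np.zeros((V, 1))
x[context_idx] = 1.0 / len(context_idx)

y = np.zeros((V, 1))
y[center_idx] = 1.0                  # one-hot vector of the actual center word

# Forward pass
h = relu(W1 @ x + b1)                # hidden layer, size N
y_hat = softmax(W2 @ h + b2)         # probability distribution over the vocabulary

# Cross-entropy loss between y_hat and the one-hot target y
loss = -np.sum(y * np.log(y_hat))
predicted = int(np.argmax(y_hat))    # arg-max gives the predicted center word
print(loss, predicted)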
Options for obtaining word embeddings:
The weight matrix of the input layer is $W_1 \in \mathbb{R}^{N \times V}$, where $N$ is the embedding dimension, so each column of $W_1$ can be seen as the embedding of the corresponding word
The weight matrix of the hidden layer is $W_2 \in \mathbb{R}^{V \times N}$, so each row of $W_2$ can be used as the embedding of the corresponding word
Take the average of options 1 and 2, i.e., $\frac{1}{2}(W_1 + W_2^\top)$ (see the sketch after this list)
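A short sketch of the three options above, assuming the same $W_1$, $W_2$ shapes; untrained random weights are used here only to show the indexing:

import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3
W1 = rng.normal(size=(N, V))   # input-layer weights
W2 = rng.normal(size=(V, N))   # hidden-layer weights

word_index = 2                 # index of some word in the vocabulary

emb_option1 = W1[:, word_index]                  # option 1: a column of W1
emb_option2 = W2[word_index, :]                  # option 2: a row of W2
emb_option3 = 0.5 * (emb_option1 + emb_option2)  # option 3: average of options 1 and 2

# The full embedding matrix for option 3 (one row per word):
E = 0.5 * (W1.T + W2)          # shape V x N
print(E.shape)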
Model Evaluation
Intrinsic evaluation: test relationships between words
analogies (see the sketch after this list)
semantic: e.g., "France" is to "Paris" as "Italy" is to ?
syntactic: e.g., "Seen" is to "saw" as "been" is to ?
clustering word vectors and comparing the clusters to groupings in a thesaurus
visualization
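A minimal sketch of the analogy test above, using vector arithmetic and cosine similarity: $v_{Paris} - v_{France} + v_{Italy}$ should be closest to $v_{Rome}$. The toy vectors here are chosen by hand purely for illustration; real embeddings are learned:

import numpy as np

# Toy embeddings, hand-picked so the analogy works.
emb = {
    "france": np.array([1.0, 0.0, 0.2]),
    "paris":  np.array([1.0, 1.0, 0.2]),
    "italy":  np.array([0.0, 0.0, 0.9]),
    "rome":   np.array([0.0, 1.0, 0.9]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# "France" is to "Paris" as "Italy" is to ?
query = emb["paris"] - emb["france"] + emb["italy"]
candidates = {w: cosine(query, v) for w, v in emb.items()
              if w not in ("paris", "france", "italy")}
print(max(candidates, key=candidates.get))   # expected: "rome"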
Extrinsic evaluation: test word embeddings on an external task
named entity recognition
POS tagging
...
Completed Notebook
Manual coding of gradient descent (one update step is sketched after this list)
Text pre-processing
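Not the notebook's code, but a minimal sketch of what one manually coded gradient descent step for the CBOW weights can look like, assuming the shapes and forward pass from the sketches above; the gradients follow from combining softmax with cross-entropy, where $\partial J / \partial z_2 = \hat{y} - y$:

import numpy as np

rng = np.random.default_rng(0)
V, N, alpha = 5, 3, 0.03                 # vocab size, embedding dim, learning rate
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

# One (context, center) training example: averaged one-hot x, one-hot target y.
x = np.zeros((V, 1)); x[[0, 1, 3, 4]] = 0.25
y = np.zeros((V, 1)); y[2] = 1.0

# Forward pass
z1 = W1 @ x + b1
h = relu(z1)
y_hat = softmax(W2 @ h + b2)

# Backward pass (gradients of the cross-entropy loss)
dz2 = y_hat - y                          # gradient at the output pre-activation
dW2 = dz2 @ h.T
db2 = dz2
dz1 = (W2.T @ dz2) * (z1 > 0)            # backprop through ReLU
dW1 = dz1 @ x.T
db1 = dz1

# Gradient descent update
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2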