NLP-C2-W4: Word Embeddings with Neural Networks
https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/4
Common Embedding Methods
Basic word embeddings
word2vec (Google 2013)
Continuous bag-of-words (CBOW): Use context words around the center word to predict the center word
Continuous skip-gram / Skip-gram with negative sampling (SGNS): Use the center word to predict the surrounding context words
Global Vectors (GloVe) (Stanford, 2014)
fastText (Facebook, 2016)
Supports out-of-vocabulary words
Can average word embeddings to get an embedding for a sentence (see the sketch after this list)
Deep learning-based embeddings
BERT (Google, 2018)
ELMo (Allen Institute for AI, 2018)
GPT-2 (OpenAI, 2019)
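A minimal sketch of the sentence-averaging idea mentioned above, assuming a toy, hand-made embedding table (the vocabulary and vector values here are made up purely for illustration):

import numpy as np

# Toy embedding table: word -> 4-dimensional vector (values are made up).
embeddings = {
    "i":    np.array([0.1, 0.3, -0.2, 0.5]),
    "love": np.array([0.7, -0.1, 0.4, 0.2]),
    "nlp":  np.array([0.2, 0.6, 0.1, -0.3]),
}

def sentence_embedding(tokens, embeddings):
    """Average the embeddings of the in-vocabulary tokens."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vectors, axis=0)

print(sentence_embedding(["i", "love", "nlp"], embeddings))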
CBOW Architecture
Word representation: each word is initially represented as a one-hot vector of size $V$ (the vocabulary size); the dense embeddings are learned as the network's weights
Model Architecture:
Input layer:
Takes in the average of the one-hot vectors of all context words (size $V$, the vocabulary size)
Recall that a batch of inputs is stacked as columns of a matrix $X = [x^{(1)} \; x^{(2)} \; \cdots \; x^{(m)}] \in \mathbb{R}^{V \times m}$, where $m$ is the number of inputs (the batch size)
Hidden layer:
A single layer of size $N$, the embedding dimension, with ReLU activation: $h = \mathrm{ReLU}(W_1 x + b_1)$
Output layer:
Size $V$ with softmax activation, $\hat{y} = \mathrm{softmax}(W_2 h + b_2)$, giving a probability distribution over the vocabulary; pick the arg-max as the predicted center word
Cost function:
Cross-entropy loss between the softmax output $\hat{y}$ and the one-hot vector $y$ of the actual center word (sketched below)
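A minimal NumPy sketch of the forward pass and the cross-entropy cost $J = -\sum_{k=1}^{V} y_k \log \hat{y}_k$, assuming a tiny vocabulary, randomly initialized weights, and the $W_1 \in \mathbb{R}^{N \times V}$, $W_2 \in \mathbb{R}^{V \times N}$ shapes used below (all names and values here are illustrative, not the course notebook's code):

import numpy as np

rng = np.random.default_rng(0)

V, N = 5, 3                                          # vocabulary size, embedding dimension
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))   # input -> hidden
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))   # hidden -> output

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0))    # subtract max for numerical stability
    return e / e.sum(axis=0)

# Input: average of the one-hot vectors of the context words (indices 0, 1, 3, 4),
# used to predict the center word at index 2.
context_idx, center_idx = [0, 1, 3, 4], 2
x = np.zeros((V, 1))
x[context_idx] = 1.0 / len(context_idx)

y = np.zeros((V, 1))
y[center_idx] = 1.0                  # one-hot vector of the actual center word

# Forward pass
h = relu(W1 @ x + b1)                # hidden layer, size N
y_hat = softmax(W2 @ h + b2)         # probability distribution over the vocabulary

# Cross-entropy loss between y_hat and the one-hot target y
loss = -np.sum(y * np.log(y_hat))
predicted = int(np.argmax(y_hat))    # arg-max gives the predicted center word
print(loss, predicted)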
Options for obtaining word embeddings:
The weight matrix of the input layer is $W_1 \in \mathbb{R}^{N \times V}$, where $N$ is the embedding dimension, so each column of $W_1$ can be seen as the embedding of the corresponding word
The weight matrix of the hidden layer is $W_2 \in \mathbb{R}^{V \times N}$, so each row of $W_2$ can be used as the embedding of the corresponding word
Take the average of options 1 and 2, i.e., $\frac{1}{2}(W_1 + W_2^\top)$ (see the sketch after this list)
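A short sketch of the three options above, assuming the same $W_1$, $W_2$ shapes; untrained random weights are used here only to show the indexing:

import numpy as np

rng = np.random.default_rng(0)
V, N = 5, 3
W1 = rng.normal(size=(N, V))   # input-layer weights
W2 = rng.normal(size=(V, N))   # hidden-layer weights

word_index = 2                 # index of some word in the vocabulary

emb_option1 = W1[:, word_index]                  # option 1: a column of W1
emb_option2 = W2[word_index, :]                  # option 2: a row of W2
emb_option3 = 0.5 * (emb_option1 + emb_option2)  # option 3: average of options 1 and 2

# The full embedding matrix for option 3 (one row per word):
E = 0.5 * (W1.T + W2)          # shape V x N
print(E.shape)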
Model Evaluation
Intrinsic evaluation: test relationships between words
analogies (see the sketch after this list)
semantic: e.g., "France" is to "Paris" as "Italy" is to ?
syntactic: e.g., "Seen" is to "saw" as "been" is to ?
clustering word vectors and comparing the clusters to groupings in a thesaurus
visualization
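A minimal sketch of the analogy test above, using vector arithmetic and cosine similarity: $v_{Paris} - v_{France} + v_{Italy}$ should be closest to $v_{Rome}$. The toy vectors here are chosen by hand purely for illustration; real embeddings are learned:

import numpy as np

# Toy embeddings, hand-picked so the analogy works.
emb = {
    "france": np.array([1.0, 0.0, 0.2]),
    "paris":  np.array([1.0, 1.0, 0.2]),
    "italy":  np.array([0.0, 0.0, 0.9]),
    "rome":   np.array([0.0, 1.0, 0.9]),
}

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# "France" is to "Paris" as "Italy" is to ?
query = emb["paris"] - emb["france"] + emb["italy"]
candidates = {w: cosine(query, v) for w, v in emb.items()
              if w not in ("paris", "france", "italy")}
print(max(candidates, key=candidates.get))   # expected: "rome"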
Extrinsic evaluation: test word embeddings on an external task
named entity recognition
POS tagging
...
Completed Notebook
Manual coding of gradient descent (one update step is sketched after this list)
Text pre-processing
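Not the notebook's code, but a minimal sketch of what one manually coded gradient descent step for the CBOW weights can look like, assuming the shapes and forward pass from the sketches above; the gradients follow from combining softmax with cross-entropy, where $\partial J / \partial z_2 = \hat{y} - y$:

import numpy as np

rng = np.random.default_rng(0)
V, N, alpha = 5, 3, 0.03                 # vocab size, embedding dim, learning rate
W1, b1 = rng.normal(size=(N, V)), np.zeros((N, 1))
W2, b2 = rng.normal(size=(V, N)), np.zeros((V, 1))

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

# One (context, center) training example: averaged one-hot x, one-hot target y.
x = np.zeros((V, 1)); x[[0, 1, 3, 4]] = 0.25
y = np.zeros((V, 1)); y[2] = 1.0

# Forward pass
z1 = W1 @ x + b1
h = relu(z1)
y_hat = softmax(W2 @ h + b2)

# Backward pass (gradients of the cross-entropy loss)
dz2 = y_hat - y                          # gradient at the output pre-activation
dW2 = dz2 @ h.T
db2 = dz2
dz1 = (W2.T @ dz2) * (z1 > 0)            # backprop through ReLU
dW1 = dz1 @ x.T
db1 = dz1

# Gradient descent update
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2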