NLP-C2-W4: Word Embeddings with Neural Networks

https://www.coursera.org/learn/probabilistic-models-in-nlp/home/week/4

Common Embedding Methods

Basic word embeddings

word2vec (Google 2013)

  • Continuous bag-of-words (CBOW): Use context words around the center word to predict the center word

  • Continuous skip-gram / Skip-gram with negative sampling (SGNS): Use the center word to predict the surrounding context words

Global Vectors (GloVe) (Stanford, 2014)

fastText (Facebook, 2016)

  • Supports out-of-vocabulary words

  • Can average word embeddings to get embeddings for sentences (see the sketch below)
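A minimal sketch of that averaging trick, with a toy lookup table standing in for trained fastText vectors (the words and numbers are made up for illustration):

```python
import numpy as np

# Toy embedding table standing in for trained fastText vectors (dim = 4).
emb = {
    "i":    np.array([0.1, 0.3, -0.2, 0.5]),
    "like": np.array([0.4, -0.1, 0.2, 0.0]),
    "nlp":  np.array([-0.3, 0.2, 0.6, 0.1]),
}

def sentence_embedding(tokens, emb):
    """Average the word vectors of the tokens into one sentence vector."""
    vectors = [emb[t] for t in tokens if t in emb]
    return np.mean(vectors, axis=0)

print(sentence_embedding(["i", "like", "nlp"], emb))  # one 4-dim vector
```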

Deep learning-based embeddings

BERT (Google, 2018)

ELMo (Allen Institute for AI, 2018)

GPT-2 (OpenAI, 2019)

CBOW Architecture

Initialization: each word is initially represented as a one-hot vector (size V); the actual embeddings are learned as the network's weight matrices

Model Architecture (see the NumPy sketch after this list):

  • Input layer:

    • Takes in the average of the one-hot vectors of all context words (size V)

    • Recall that a batch of inputs is stacked column-wise as a V × n matrix, where n is the number of examples

  • Hidden layer:

    • 1 layer whose size is the embedding dimension d

    • ReLU activation

  • Output layer:

    • Size V with softmax activation, giving a probability distribution over the vocabulary. Pick the arg-max as the predicted word.

  • Cost function:

    • Cross-entropy loss between the softmax output and the one-hot vector of the actual center word
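A minimal NumPy sketch of this forward pass for a single example. The vocabulary size, embedding dimension, context indices, and random weights are toy assumptions, not values from the course:

```python
import numpy as np

V, d = 5, 3                                         # toy vocabulary/embedding sizes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, V)), np.zeros((d, 1))  # input -> hidden
W2, b2 = rng.normal(size=(V, d)), np.zeros((V, 1))  # hidden -> output

def softmax(z):
    e = np.exp(z - z.max(axis=0))                   # stabilized softmax
    return e / e.sum(axis=0)

# Input: average of the one-hot vectors of the context words (size V).
x = np.mean(np.eye(V)[[0, 2, 3, 4]], axis=0).reshape(V, 1)

h = np.maximum(0, W1 @ x + b1)                      # hidden layer, ReLU, size d
y_hat = softmax(W2 @ h + b2)                        # distribution over vocabulary

center = 1                                          # index of the true center word
loss = -np.log(y_hat[center, 0])                    # cross-entropy, one-hot target
predicted = int(np.argmax(y_hat))                   # arg-max = predicted word
```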

Options for obtaining word embeddings (see the sketch after this list):

  1. The weight matrix of the input layer is d × V, where d is the embedding dimension. So, each column can be seen as the embedding of the corresponding word

  2. The weight matrix of the hidden layer is V × d, so we can use each row as the corresponding word vector

  3. Take the average of options 1 and 2
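A short sketch of the three options, assuming W1 (d × V) and W2 (V × d) come from a trained CBOW model; random matrices stand in for them here:

```python
import numpy as np

V, d = 5, 3
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d, V))   # stand-in for trained input-layer weights
W2 = rng.normal(size=(V, d))   # stand-in for trained hidden-layer weights

emb_from_W1 = W1.T                          # option 1: column i of W1 -> word i
emb_from_W2 = W2                            # option 2: row i of W2 -> word i
emb_avg = (emb_from_W1 + emb_from_W2) / 2   # option 3: average, shape V × d
```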

Model Evaluation

  • Intrinsic evaluation: test relationships between words

    • analogy (see the sketch after this list)

      • semantic: e.g., "France" is to "Paris" as "Italy" is to ?

      • syntactic: e.g., "Seen" is to "saw" as "been" is to ?

    • clustering word vectors and comparing the clusters to a thesaurus

    • visualization

  • Extrinsic evaluation: test word embeddings on an external task

    • named entity recognition

    • POS tagging

    • ...
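A sketch of the analogy test via vector arithmetic: answer "a is to b as c is to ?" with the nearest neighbor of b − a + c under cosine similarity. The two-dimensional toy vectors below are hand-picked so the offsets line up; real evaluation uses trained embeddings and a large analogy set:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, emb):
    """Solve 'a is to b as c is to ?' by nearest neighbor of b - a + c."""
    target = emb[b] - emb[a] + emb[c]
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Toy vectors with parallel country -> capital offsets.
emb = {
    "france": np.array([1.0, 0.0]), "paris": np.array([1.0, 1.0]),
    "italy":  np.array([0.0, 0.0]), "rome":  np.array([0.0, 1.0]),
}
print(analogy("france", "paris", "italy", emb))  # -> rome
```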

Completed Notebook

CBOW Embedding Training

  • Manual coding of gradient descent (see the sketch after this list)

  • Text pre-processing
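A minimal sketch of one manual gradient-descent step for the CBOW model above. Shapes, indices, and the learning rate are assumed toy values; the key step is that with softmax plus cross-entropy the output-layer error simplifies to ŷ − y:

```python
import numpy as np

V, d, alpha = 5, 3, 0.05                       # toy sizes, assumed learning rate
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d, V)), np.zeros((d, 1))
W2, b2 = rng.normal(size=(V, d)), np.zeros((V, 1))

def softmax(z):
    e = np.exp(z - z.max(axis=0))
    return e / e.sum(axis=0)

x = np.mean(np.eye(V)[[0, 2, 3, 4]], axis=0).reshape(V, 1)  # averaged context
y = np.eye(V)[[1]].T                                        # one-hot center word

# Forward pass (same as the architecture sketch).
h = np.maximum(0, W1 @ x + b1)
y_hat = softmax(W2 @ h + b2)

# Backward pass: output error y_hat - y, then backprop through ReLU.
dz2 = y_hat - y
dW2, db2 = dz2 @ h.T, dz2
dz1 = (W2.T @ dz2) * (h > 0)
dW1, db1 = dz1 @ x.T, dz1

# One gradient-descent update.
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2
```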
