Implementation of Pre-Trained (GloVe) Word Embeddings on Dataset


Image source: https://venturebeat.com/2019/02/14/openai-let-us-generate-text-with-an-ai-model-that-achieves-state-of-the-art-performance-in-several-nlp-tasks/

In this article, you will learn about GloVe, a powerful word-vector learning technique, and walk through a step-by-step implementation of training a Language Model (LM) using a Long Short-Term Memory (LSTM) network and pre-trained GloVe word embeddings.

Language modeling is an important task in many Natural Language Processing (NLP) applications, including clustering, information retrieval, machine translation, and spelling and grammar correction. In general, a language model is defined as a function that puts a probability measure over strings drawn from some vocabulary.
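To make that concrete, a language model assigns a probability to a word sequence by factorising it with the chain rule; this is the quantity an LSTM language model is trained to estimate (a standard formulation, not specific to this dataset):

```latex
% Probability of a string w_1 ... w_T drawn from the vocabulary,
% factorised into next-word probabilities
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```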

The GloVe Model

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning.

GloVe observes that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning. Take the example from the Stanford NLP GloVe project (Global Vectors for Word Representation), which considers the co-occurrence probabilities of the target words ice and steam with various probe words from the vocabulary:

· As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid.

· Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently.

Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.
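For reference, this intuition can be written down compactly. The notation below follows the original GloVe paper (Pennington, Socher, and Manning) rather than anything specific to our dataset:

```latex
% X_{ik}: number of times word k occurs in the context of word i
% P_{ik}: probability that word k appears in the context of word i
P_{ik} = \frac{X_{ik}}{X_i}, \qquad X_i = \sum_{k} X_{ik}

% GloVe models log co-occurrence counts with word vectors w and context
% vectors \tilde{w}, giving the weighted least-squares training objective
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```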

Word2Vec vs GloVe

Word vectors map words into a vector space in which similar words cluster together and dissimilar words sit far apart. The advantage of GloVe is that, unlike Word2vec, it does not rely only on local statistics (the local context window of each word) but also incorporates global statistics (corpus-wide word co-occurrence counts) to obtain word vectors.

Imports

We’re going to need NumPy, SciPy, Matplotlib, and scikit-learn for this project.
If you need to install any of these, you can run the following:
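The exact command from the original notebook is not reproduced here, but a typical install line (assuming pip is available) would look something like this; tqdm, NLTK, and TensorFlow/Keras are also used in later steps:

```
pip install numpy scipy matplotlib scikit-learn tqdm nltk tensorflow
```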

Data Cleaning

The following steps were performed:

1) Removing URLs

2) Removing HTML tags

3) Removing punctuation
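A minimal sketch of these three steps is below; it assumes the tweets live in a pandas DataFrame called df with a text column (names are illustrative, the original notebook may differ):

```python
import re
import string

def remove_urls(text):
    # Drop anything that looks like a URL
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def remove_html(text):
    # Drop HTML tags such as <br> or <a href="...">
    return re.sub(r'<.*?>', '', text)

def remove_punct(text):
    # Drop all punctuation characters
    return text.translate(str.maketrans('', '', string.punctuation))

# df is an assumed pandas DataFrame holding the tweets in a 'text' column
df['text'] = (df['text'].apply(remove_urls)
                        .apply(remove_html)
                        .apply(remove_punct))
```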

Creating Corpus

To create the corpus, we loop over the text column with tqdm (which displays a progress bar), lower-case each tweet, and tokenize it into words.

The loop continues until the last tweet in the column has been processed.
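A sketch of that loop, again assuming the same df['text'] column and NLTK's word tokenizer (the original may tokenize differently):

```python
from tqdm import tqdm
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt') once

def create_corpus(df):
    corpus = []
    # tqdm wraps the iterable and displays a progress bar
    for tweet in tqdm(df['text']):
        # lower-case the tweet and split it into word tokens
        words = [word.lower() for word in word_tokenize(tweet)]
        corpus.append(words)
    return corpus

corpus = create_corpus(df)
```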

Load the Vectors

To load the pre-trained vectors, we must first create a dictionary that will hold the mappings between words and their embedding vectors:

I am using glove.6B.100d.txt, which was trained on 6B tokens with a 400K vocabulary; you can find it here: https://nlp.stanford.edu/projects/glove/

Inside the with-statement, we loop through each line in the file and split the line on spaces into its components. After splitting the line, we assume the word itself contains no spaces and set it equal to the first (zeroth) element of the split line. We then take the rest of the line and convert it into a NumPy array; this is the vector of the word’s position.

At the end, we can update our dictionary with the new word and its corresponding vector:
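Putting those steps together, a loading loop along these lines is typical (embedding_dict is my name for the mapping; 100 is the dimensionality of glove.6B.100d):

```python
import numpy as np

embedding_dict = {}
# Each line of glove.6B.100d.txt is: <word> <100 space-separated floats>
with open('glove.6B.100d.txt', 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]                                   # the word itself (no spaces)
        vectors = np.asarray(values[1:], dtype='float32')  # the 100-d position vector
        embedding_dict[word] = vectors                     # update the dictionary
```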

Padding Sentences

We allocate a length of 50 words (MAX_LEN) for every sentence. Every word of the corpus is tokenized, and each sentence is then padded to the allocated MAX_LEN of 50 words.

Truncating removes any words beyond MAX_LEN from a sentence, while padding fills shorter sentences up to MAX_LEN.

The tokenizer's word index then holds the unique words remaining in the corpus:
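A sketch of this step with the Keras preprocessing utilities (variable names are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 50

tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(corpus)                 # build the word index from the corpus
sequences = tokenizer_obj.texts_to_sequences(corpus)

# Pad shorter tweets with zeros and truncate longer ones to exactly MAX_LEN tokens
tweet_pad = pad_sequences(sequences, maxlen=MAX_LEN, truncating='post', padding='post')

word_index = tokenizer_obj.word_index              # the unique words left in the corpus
print('Number of unique words:', len(word_index))
```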

Embedding GloVe on Dataset

Every word present in the dataset is looked up in the downloaded GloVe text vectors, and an embedding matrix is created that maps each word to its corresponding vector:
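A common way to build that matrix from the pieces above (a sketch, not necessarily the author's exact code):

```python
num_words = len(word_index) + 1                 # +1 because index 0 is reserved for padding
embedding_matrix = np.zeros((num_words, 100))   # 100 matches the glove.6B.100d dimension

for word, i in word_index.items():
    emb_vec = embedding_dict.get(word)          # look up the GloVe vector, if present
    if emb_vec is not None:
        embedding_matrix[i] = emb_vec           # words missing from GloVe stay all-zero
```

This matrix can then be used to initialise a Keras Embedding layer (typically with trainable=False) in front of the LSTM language model.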

The full code is available in Jupyter Notebook and Python file format at my GitHub Link.

“Want me to write a tech blog? Contact me here.”

Happy Reading!

Cheers!
