Word Embeddings and RNNs

One of the simplest ways to convert words from a natural language into mathematical tensors is to represent them as one-hot vectors, where the length of each vector equals the size of the vocabulary the words are drawn from.

For example, if we have a vocabulary of size 8 containing the words:

"a", "apple", "has", "matrix", "pineapple", "python", "the", "you"

the word "matrix" can be represented as: [0,0,0,1,0,0,0,0]

Obviously, this approach becomes painful when we have a huge vocabulary (say, millions of words) and have to train models with these representations as inputs. Another problem is that one-hot vectors have no built-in mechanism to convey semantic similarity between words: in the above example, "apple" and "pineapple" can be considered similar (both are fruits), but their vector representations don't convey that.
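
To make this concrete, here's a minimal sketch (not part of the notebook's pipeline) that builds one-hot vectors for the toy vocabulary above with NumPy and shows that two distinct words always have a dot product of zero, i.e. the representation carries no similarity information.

import numpy as np

vocab = ["a", "apple", "has", "matrix", "pineapple", "python", "the", "you"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("matrix"))                        # [0 0 0 1 0 0 0 0]
print(one_hot("apple") @ one_hot("pineapple"))  # 0 -> no notion of similarity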

Word Embeddings let us represent words or phrases as vectors of real numbers, where these vectors actually retain the semantic relationships between the original words. Instead of representing words as one-hot vectors, word embeddings map words to a continuous vector space with a much lower dimension.

This mathematical embedding is achieved by a matrix called the "embedding matrix". It has shape (max_number_of_words_to_be_represented, embed_size), where embed_size is the length of the embedding vectors.

In Keras, the embedding matrix is represented as an Embedding layer, which maps positive integers (indices corresponding to words) to dense vectors of fixed size (the embedding vectors).
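
Conceptually, looking up a word in the embedding matrix just means selecting the row at that word's index. Here's a tiny NumPy sketch with made-up numbers (an 8-word vocabulary and an embed_size of 4, neither of which is what we use later):

import numpy as np

vocab_size, embed_size = 8, 4                  # toy sizes for illustration only
embedding_matrix = np.random.randn(vocab_size, embed_size)

word_index = {"apple": 1, "matrix": 3}         # hypothetical word-to-index mapping
print(embedding_matrix[word_index["matrix"]])  # a dense 4-dimensional vector for "matrix"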

Let's see how word embeddings work.

In [0]:
# colab stuff

# from google.colab import drive
# drive.mount('/content/gdrive')

# DRIVE_BASE_PATH = "/content/gdrive/My\ Drive/Colab\ Notebooks/"
# !pip install -q kaggle

# !mkdir -p ~/.kaggle
# !cp {DRIVE_BASE_PATH}kaggle.json ~/.kaggle/kaggle.json
# !chmod 600 /root/.kaggle/kaggle.json

I'll make use of the data from the Jigsaw Toxic Comment Classification Challenge on Kaggle, where the goal is to classify comments based on how toxic they are.

In [0]:
DATA_PATH = 'data/toxic_comments/'
In [0]:
!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p {DATA_PATH}
Downloading sample_submission.csv.zip to data/toxic_comments
  0% 0.00/1.39M [00:00<?, ?B/s]
100% 1.39M/1.39M [00:00<00:00, 102MB/s]
Downloading test.csv.zip to data/toxic_comments
 21% 5.00M/23.8M [00:00<00:00, 38.3MB/s]
100% 23.8M/23.8M [00:00<00:00, 116MB/s] 
Downloading train.csv.zip to data/toxic_comments
 93% 25.0M/26.7M [00:00<00:00, 43.0MB/s]
100% 26.7M/26.7M [00:00<00:00, 92.4MB/s]
Downloading test_labels.csv.zip to data/toxic_comments
  0% 0.00/1.46M [00:00<?, ?B/s]
100% 1.46M/1.46M [00:00<00:00, 239MB/s]
In [0]:
import numpy as np
import pandas as pd
In [0]:
df_train = pd.read_csv(DATA_PATH+'train.csv.zip', compression='zip')
In [2]:
# df_train.head()

Dataframe

Let's check out some of the comments.

In [37]:
for comment in df_train.head(n=5).comment_text:
  print(f'{comment}\n-----\n')
Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
-----

D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)
-----

Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.
-----

"
More
I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of ""types of accidents""  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.

There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport  "
-----

You, sir, are my hero. Any chance you remember what page that's on?
-----

If you go through this data, you'll find that it contains a lot of toxic comments and that the training data is highly imbalanced. Anyway, that's not my focus for now. I just want to use this text data to understand (and explain) how word embeddings can help with training models on natural language data.

The next step is to create an embedding matrix. We could start with a randomly initialized matrix and have the model train it, but we don't have the volume of data needed for that here. In a lot of cases it makes sense to use pre-trained word vectors, so I'll use the GloVe pre-trained word vectors for this task. GloVe offers word vectors trained on different kinds of corpora, like Wikipedia, Twitter, etc. After doing some comparisons I've chosen the vectors trained on Twitter data (which sort of makes sense, because the comments in this dataset use informal language similar to tweets).

In [0]:
GLOVE_DIR = 'data/glove/'
In [0]:
!kaggle datasets download -d jdpaletto/glove-global-vectors-for-word-representation -f glove.twitter.27B.50d.txt -p {GLOVE_DIR}
Downloading glove.twitter.27B.50d.txt.zip to data/glove
 91% 185M/204M [00:01<00:00, 95.9MB/s]
100% 204M/204M [00:01<00:00, 112MB/s] 
In [0]:
!unzip -q {GLOVE_DIR}glove.twitter.27B.50d.txt.zip -d {GLOVE_DIR}
In [13]:
!ls -R data
data:
glove  toxic_comments

data/glove:
glove.twitter.27B.50d.txt  glove.twitter.27B.50d.txt.zip

data/toxic_comments:
sample_submission.csv.zip  test.csv.zip  test_labels.csv.zip  train.csv.zip
In [0]:
WORD_VECTORS_TWITTER_FILE=f'{GLOVE_DIR}glove.twitter.27B.50d.txt'

So the word vector file is made of rows where the first element is the word and the next embed_size numbers are its vector representation. I'm using the 50-dimensional version of GloVe in this case.

The next step is to generate a dictionary that maps every word contained in the GloVe file to its vector representation.

In [0]:
# Each line of the GloVe file is "<word> <v1> ... <v50>"; build a dict mapping word -> vector.
def get_coefs_twitter(word, *arr): return word, np.asarray(arr, dtype='float32')
embeddings_index_twitter = dict(get_coefs_twitter(*o.strip().split()) for o in open(WORD_VECTORS_TWITTER_FILE))
In [56]:
embeddings_index_twitter['matrix']
Out[56]:
array([-0.76652  ,  0.057405 , -0.69997  , -0.39679  ,  0.15084  ,
        0.10149  ,  0.24257  , -0.50817  ,  0.27054  ,  0.0037956,
        0.12802  ,  0.39282  , -1.9335   ,  0.65609  ,  0.36521  ,
       -1.0025   ,  0.69579  ,  0.94691  ,  0.21036  ,  0.83848  ,
       -0.39195  , -0.12665  ,  0.30457  , -0.45966  , -0.56864  ,
       -0.46552  , -0.29693  ,  0.088017 , -0.14209  ,  0.20631  ,
        0.44746  , -0.41997  ,  0.38827  , -0.62171  , -0.39465  ,
       -0.018662 ,  0.15455  , -0.25936  ,  0.19837  , -0.67338  ,
        0.025445 ,  0.28613  ,  0.054073 , -0.25733  , -0.21719  ,
        0.28868  , -0.31912  ,  0.40194  , -0.091508 , -0.16081  ],
      dtype=float32)

One small snag here is that one row yields only 50 tokens when split, so its first number gets treated as the "word" and the corresponding vector ends up with only 49 dimensions.

In [58]:
s = set(arr.shape for arr in embeddings_index_twitter.values())
print(s)
{(50,), (49,)}

The next few steps are just to remove that row from the embeddings index dictionary.

In [61]:
for i,el in enumerate(embeddings_index_twitter.values()):
    if el.shape[0]==49:
        print(i)
38522
In [62]:
!sed '38523q;d' {WORD_VECTORS_TWITTER_FILE}
… 0.45973 -0.16703 -1.2028 0.41675 0.14643 -0.39861 -0.35118 -0.46944 0.63799 0.49569 -0.038122 -0.37854 -1.2221 -1.0439 -1.2604 0.01232 -0.5159 0.1357 -0.093283 0.12307 0.48072 -0.66419 0.50046 -0.58255 0.81583 0.72197 -0.101 -0.17283 0.51572 0.3296 -0.0024615 0.19475 2.1163 0.20636 -1.2026 -0.0767 -0.1058 -0.82518 -0.31287 -0.19303 0.061489 -0.30422 0.75731 -0.53688 -0.22277 -0.22173 -0.37943 0.17821 -0.34743 0.3064
In [84]:
embeddings_index_twitter['0.45973'].shape
Out[84]:
(49,)
In [0]:
del(embeddings_index_twitter['0.45973'])
In [87]:
set(arr.shape for arr in embeddings_index_twitter.values())
Out[87]:
{(50,)}

All set. The next step is to tokenize our data. Tokenizing converts a sentence into an array of word indices, as described below.

Tokenization
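
As a quick illustration of what the tokenizer does (a toy example, separate from the real training data below):

from keras.preprocessing.text import Tokenizer

# Fit on a tiny corpus, then convert a new sentence into a sequence of word indices.
toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(["the apple has a matrix", "you have a pineapple"])
print(toy_tokenizer.texts_to_sequences(["a pineapple has a matrix"]))
# e.g. [[1, 8, 4, 1, 5]] -- each word replaced by its integer index (exact values depend on the fitted word index)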

In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

I'll work with a subset of the data to make the training a little faster.

In [0]:
df_subset = df_train.sample(frac=1)[:50000]
In [0]:
list_sentences_train = df_subset["comment_text"].fillna("_na_").values
list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y = df_subset[list_classes].values

We need to set the values of a few constants. maxlen is the maximum number of words in a sentence; shorter sequences are padded and longer ones truncated to this length. max_features tells Keras's Tokenizer how many of the most frequent unique words to keep.
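
For instance, pad_sequences left-pads short sequences with zeros and (by default) truncates long ones from the front, so every row ends up with exactly maxlen entries. A toy sketch with maxlen=5:

from keras.preprocessing.sequence import pad_sequences

sequences = [[4, 10, 2], [7, 1, 9, 3, 8, 6, 5]]
print(pad_sequences(sequences, maxlen=5))
# [[ 0  0  4 10  2]
#  [ 9  3  8  6  5]]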

In [0]:
embed_size = 50       #EMBEDDING_DIMENSION
max_features = 20000 
maxlen = 100 
In [0]:
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(list_sentences_train))
list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train)
X_t = pad_sequences(list_tokenized_train, maxlen=maxlen)
In [114]:
print(X_t.shape, y.shape)
(50000, 100) (50000, 6)

Now that the training data is tokenized, we can create an embedding matrix of shape (nb_words, embed_size), where nb_words is the number of unique words we tokenized, capped at max_features.

We'll use random initialization for words that aren't in GloVe, drawing the random values from a normal distribution with the same mean and standard deviation as the GloVe embeddings.

In [115]:
all_embs = np.stack(embeddings_index_twitter.values())
emb_mean,emb_std = all_embs.mean(), all_embs.std()
emb_mean,emb_std
Out[115]:
(0.04399518, 0.73192465)
In [116]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index_twitter.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
print(embedding_matrix.shape)
(20000, 50)

Simply put, each row of this matrix is the vector representation (from the GloVe file) of a word in the word index.

This embedding matrix can then be used as weights for an Embedding layer in a model. We can choose to update these weights during training or not.

So when an input of shape (None, maxlen) is given to this Embedding layer, it will convert it to (None, maxlen, embed_size) as seen in the diagram below.

Working of embedding layer
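
Here's a minimal sketch of just that shape transformation, reusing the embedding matrix built above as frozen weights (pass trainable=True instead if you want the embeddings fine-tuned during training):

from keras.layers import Input, Embedding
from keras.models import Model

# Feed sequences of word indices through an Embedding layer initialized with our GloVe-based matrix.
inp = Input(shape=(maxlen,))
emb = Embedding(nb_words, embed_size, weights=[embedding_matrix], trainable=False)(inp)
print(Model(inputs=inp, outputs=emb).output_shape)  # (None, 100, 50), i.e. (None, maxlen, embed_size)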

In [0]:
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
In [3]:
def model():
    
    inp = Input(shape=(maxlen,))
    x = Embedding(nb_words, embed_size, weights=[embedding_matrix], trainable=False)(inp)
    x = Bidirectional(LSTM(50, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(x)
    x = GlobalMaxPool1D()(x)
    x = Dense(50, activation="relu")(x)
    x = Dropout(0.1)(x)
    x = Dense(6, activation="sigmoid")(x)  #sigmoid since this is a multi-label problem
    model = Model(inputs=inp, outputs=x)
    
    return model
In [120]:
model = model()
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 50)           1000000   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 100)          40400     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 50)                5050      
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 306       
=================================================================
Total params: 1,045,756
Trainable params: 45,756
Non-trainable params: 1,000,000
_________________________________________________________________
In [121]:
model.fit(X_t, y, batch_size=32, epochs=4, validation_split=0.1)
Train on 45000 samples, validate on 5000 samples
Epoch 1/4
45000/45000 [==============================] - 229s 5ms/step - loss: 0.0868 - acc: 0.9726 - val_loss: 0.0566 - val_acc: 0.9804
Epoch 2/4
45000/45000 [==============================] - 227s 5ms/step - loss: 0.0585 - acc: 0.9795 - val_loss: 0.0518 - val_acc: 0.9812
Epoch 3/4
45000/45000 [==============================] - 228s 5ms/step - loss: 0.0547 - acc: 0.9804 - val_loss: 0.0484 - val_acc: 0.9822
Epoch 4/4
45000/45000 [==============================] - 227s 5ms/step - loss: 0.0523 - acc: 0.9813 - val_loss: 0.0489 - val_acc: 0.9821
Out[121]:
<keras.callbacks.History at 0x7fe3167d0c88>

The model seems to have started overfitting after the 3rd epoch. A bunch of improvements could be made beyond this baseline, but my main motive for this exercise was to understand how word embeddings and LSTMs work together.
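
For example, one easy improvement would be to stop training automatically once the validation loss stops improving. A hypothetical sketch (not run above):

from keras.callbacks import EarlyStopping

# Stop after the first epoch in which val_loss fails to improve.
early_stop = EarlyStopping(monitor='val_loss', patience=1)
model.fit(X_t, y, batch_size=32, epochs=10, validation_split=0.1, callbacks=[early_stop])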