Deep Learning Techniques for Text Representation — Part 1

Tharuka KasthuriArachchi
Published in Analytics Vidhya · Feb 23, 2022

Image Source : https://eewc.com/what-do-you-call-it/dictionarym/

Globally, the amount of digital text data available is continually expanding. As a result, being able to quickly extract the necessary information from large text corpora is becoming increasingly critical. Traditional NLP applications most often rely on handcrafted features derived from the data. Handcrafting features takes a long time and may need to be repeated for each task or domain-specific challenge. Because of this, traditional NLP systems suffer from scalability and generalization issues.

On the other hand, deep learning extracts information and representations from data at several levels without any handcrafted features, so the model can be generalised to new tasks by fine-tuning the learned representations. This is the most important advantage of deep learning. The learned information can be viewed as a level-by-level composition, and the lowest levels of representation are frequently shared across tasks. In traditional NLP, words are treated as atomic symbols, and atomic symbol representations capture no semantic links between words. Deep learning algorithms capture these relationships and allow downstream NLP systems to derive more complicated reasoning and knowledge. Furthermore, the recursive structure of human language is handled naturally in deep learning: human speech is made up of words and phrases with a particular structure, and recurrent neural models in particular are significantly better at capturing such sequence information.

As I mentioned earlier, traditional NLP systems treat words as atomic symbols, so one-hot encoding is used to represent words in a fixed vocabulary, while a Bag of Words (BoW) is used to represent documents. One-hot encoding generates representations like [ 0 0 0 1 0 0 0 … ], where the size of the vector equals the size of the vocabulary. This representation frequently produces large, sparse vectors.
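
To make these two representations concrete, here is a minimal Python sketch (the toy corpus and code are illustrative only, not from the original article):

    # A minimal sketch of one-hot and bag-of-words representations
    # for a toy corpus (illustrative only).
    import numpy as np

    corpus = ["the king rules the land", "the queen rules the land"]
    vocab = sorted({w for doc in corpus for w in doc.split()})
    word_to_idx = {w: i for i, w in enumerate(vocab)}

    def one_hot(word):
        # Vector length equals the vocabulary size; a single 1 marks the word.
        vec = np.zeros(len(vocab), dtype=int)
        vec[word_to_idx[word]] = 1
        return vec

    def bag_of_words(doc):
        # Document vector: counts of each vocabulary word in the document.
        vec = np.zeros(len(vocab), dtype=int)
        for w in doc.split():
            vec[word_to_idx[w]] += 1
        return vec

    print(vocab)                    # ['king', 'land', 'queen', 'rules', 'the']
    print(one_hot("queen"))         # [0 0 1 0 0]
    print(bag_of_words(corpus[0]))  # [1 1 0 1 2]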

Distributed Representations

When a word’s representation is not independent or mutually exclusive of another word’s, it is called a distributed text representation. Its configuration can represent a variety of metrics and concepts in the data. As a result, the information about a word is dispersed across the vector in which it is represented. This differs from discrete (atomic) representation, in which each word is considered distinct and unrelated to the others. Below I summarize several state-of-the-art deep learning algorithms that produce distributed representations of text.

Word2Vec

The word2vec algorithm proposes a method of representing words based on their neighboring words (Mikolov et al., 2013). When the word2vec algorithm is trained with enough data, it captures both syntactic and semantic relationships between words. For example, “King” and “Queen” are related to one another, and with algebraic operations on the generated vectors of these words we can find a close approximation of the similarities between them.

        [King]    -    [Man]    +    [Woman]    =    [Queen]

In summary, this model takes a large corpus of words as input and generates a vector space with hundreds of dimensions, with each unique word in the corpus allocated to a matching vector in the space. Word vectors are placed in the vector space in such a way that words with similar contexts in the corpus are close to one another.
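
As a sketch of how this looks in practice, the snippet below queries pretrained vectors through gensim’s downloader API; the library choice and the pretrained model name are assumptions, not part of the original article:

    # Sketch: the king - man + woman analogy with pretrained vectors,
    # assuming gensim and its downloader API are available.
    import gensim.downloader as api

    vectors = api.load("word2vec-google-news-300")  # pretrained Google News vectors

    # most_similar performs the vector arithmetic [king] - [man] + [woman]
    # and returns the nearest words by cosine similarity.
    print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' is typically the top result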

Word2vec Architecture

The word2vec model is a three-layer neural network architecture with an input layer, a hidden layer and an output layer. Since we cannot feed a string to a neural network, we feed word2vec a one-hot encoded vector as input. The length of this vector equals the size of the vocabulary (the number of unique words in the corpus), and it is filled with zeros except for the index of the word we wish to represent, which is set to 1. The word embeddings are the weights of the hidden layer, which is a typical fully-connected (Dense) layer, while the output layer generates probabilities for the vocabulary’s target terms.

Image Source : https://israelg99.github.io/2017-03-23-Word2Vec-Explained/

Word embeddings are given by the rows of the hidden-layer weight matrix, so the hidden layer works as a lookup table. All of the training is aimed at learning this hidden-layer weight matrix; the output layer is simply discarded once training is completed.
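
A small numpy sketch (illustrative only) shows why the hidden layer acts as a lookup table: multiplying a one-hot input by the weight matrix simply selects one of its rows:

    # Sketch: the hidden layer of word2vec as a lookup table.
    import numpy as np

    vocab_size, embedding_dim = 5, 3
    rng = np.random.default_rng(0)
    W = rng.normal(size=(vocab_size, embedding_dim))  # hidden-layer weight matrix

    word_index = 2
    one_hot = np.zeros(vocab_size)
    one_hot[word_index] = 1

    # Multiplying the one-hot input by W returns row `word_index` of W,
    # i.e. the embedding of that word -- no full matrix multiplication is needed.
    assert np.allclose(one_hot @ W, W[word_index])
    print(W[word_index])  # the learned word embedding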

The word2vec model achieves considerable improvements in accuracy at a much lower computational cost. There are two flavors of the word2vec algorithm: the Skip-Gram model and the CBOW (Continuous Bag of Words) model.

Image Source: https://arxiv.org/pdf/1301.3781.pdf

CBOW Model:

This architecture is quite similar to the Feedforward Neural Net Language Model. It produces word embeddings by using n future and n past words: the word in the middle of the window is predicted using distributed representations of its context. The order of the words does not influence the projection, which is why it is referred to as a bag-of-words model; however, unlike the standard bag-of-words model, it uses a continuous distributed representation of the context. Hence this model is called the Continuous Bag of Words model.
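
To make the windowing concrete, here is a small sketch (not from the article) of how context/target pairs are formed for CBOW with n = 2:

    # Sketch: building (context, target) pairs for CBOW with window size 2.
    sentence = "the quick brown fox jumps".split()
    window = 2

    for i, target in enumerate(sentence):
        # n past and n future words around the middle word
        context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
        print(context, "->", target)
    # ['quick', 'brown'] -> the
    # ['the', 'brown', 'fox'] -> quick
    # ['the', 'quick', 'fox', 'jumps'] -> brown
    # ...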

Skip-Gram Model:

While the CBOW model predicts the middle word given a set of surrounding words, the skip-gram model does the reverse. More precisely, it feeds each current word into a log-linear classifier with a continuous projection layer and predicts the n past and n future words. It has been shown that expanding this range improves the quality of the resulting word vectors while also increasing the computational complexity.

The CBOW model has a faster training time than Skip-Gram, although the Skip-Gram model performs better on larger datasets at the cost of more training time. Both flavors provide state-of-the-art performance on test sets for determining syntactic and semantic word similarity.
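
As a usage sketch, gensim’s Word2Vec class (an assumed library choice, not mentioned in the article) exposes both flavors through the sg flag:

    # Sketch: training CBOW vs. skip-gram word2vec with gensim (assumed library).
    from gensim.models import Word2Vec

    sentences = [["the", "king", "rules", "the", "land"],
                 ["the", "queen", "rules", "the", "land"]]

    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
    skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

    print(cbow.wv["king"].shape)                      # (50,)
    print(skipgram.wv.most_similar("king", topn=2))   # nearest neighbors in the toy corpus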

FastText

As an extension of the vanilla word2vec model, Facebook introduced the FastText model in 2016 and revised it in 2017 (Bojanowski et al., 2017). The major shortcoming of the word2vec model is its inability to deal with words that are not in the training corpus. The FastText model overcomes this issue: it can be trained efficiently on big corpora and can calculate word representations for words that were not in the training data.

This model is derived from the continuous skip-gram model proposed by the authors of word2vec. The skip-gram approach ignores the internal structure of words by generating one distinct vector representation per word. To account for this information, FastText offers an alternative strategy: each word is represented by a bag of character n-grams, with special boundary symbols < and > added at the beginning and end of the word, and the word w itself is also included in the bag. This adds subword information to the model and helps it understand suffixes and prefixes via the embeddings. Because of this, it can also compute valid representations for words that do not appear in the training data, by taking the sum of their n-gram vectors.

As an example, take the word “learning”. When n = 3, the word is represented by the character trigrams <le, lea, ear, arn, rni, nin, ing, ng>, plus the special sequence <learning>. It is worth noting that the sequence <ear>, which corresponds to the word ear, differs from the trigram ear extracted from learning. After the word has been represented using character n-grams, a skip-gram model is trained to learn the embeddings.
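
The n-gram decomposition above can be reproduced in a few lines of Python (a rough sketch, not the actual fastText implementation):

    # Sketch: character n-grams with boundary symbols, as used by FastText.
    def char_ngrams(word, n=3):
        token = f"<{word}>"                                  # add boundary symbols
        grams = [token[i:i + n] for i in range(len(token) - n + 1)]
        return grams + [token]                               # include the word itself

    print(char_ngrams("learning"))
    # ['<le', 'lea', 'ear', 'arn', 'rni', 'nin', 'ing', 'ng>', '<learning>']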

Bibliography

Word2vec: Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781

FastText: Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. https://arxiv.org/abs/1607.04606
