NLP aims to create computer programmes that understand, generate and work with human languages. We know languages through words and sentences, but computers understand only numbers.
We therefore map words or sentences to vectors, which are simply lists of numbers. This is called text vectorization. It does not mean we assign an arbitrary number to each word; the idea is to build a vector of numbers that represents a word and the information it carries. In text vectorization, text is converted into numeric vectors, and ML algorithms take such numeric feature vectors as input.
In ML, the words in the training data, called the vocabulary, are considered. Say there are 2,000 words; we give them an order from 0 to 1,999. We then define the vector of the ith word as all zeros except for a 1 in position i. This is one-hot encoding.
If the vocabulary consists of three words, monkey, ape and banana, the vector for monkey will be (1,0,0), the vector for ape will be (0,1,0), and the one for banana will be (0,0,1).
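A minimal sketch of this idea in Python, assuming the same toy three-word vocabulary (the function name is just for illustration):

```python
# Minimal sketch: one-hot vectors for a toy three-word vocabulary.
vocabulary = ["monkey", "ape", "banana"]  # the order fixes each word's index

def one_hot(word, vocabulary):
    """Return a vector of zeros with a 1 at the word's position in the vocabulary."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("monkey", vocabulary))  # [1, 0, 0]
print(one_hot("ape", vocabulary))     # [0, 1, 0]
print(one_hot("banana", vocabulary))  # [0, 0, 1]
```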
The internet tells us that monkeys eat bananas, and that apes eat fruits. Intuitively we feel these two pieces of information are similar, and an algorithm should be able to recognise that similarity.
When the one-hot vectors are compared, however, they show hardly any overlap between the phrases, so our programme will treat them as two completely different pieces of information. Human beings, in contrast, can guess the meaning of a word from its context. Words with similar meanings can be interchanged in similar contexts; this is called the distributional hypothesis.
This is the foundation of how word vectors are created. As Firth put it, a word is characterised by the company it keeps.
In a vector representation, similar words should ideally end up with similar vectors. To achieve this we create dense vectors, whose values are not only 0s and 1s but real numbers. Monkey's one-hot vector (1,0,0) could become something like (0.96, 0.55, 0.32), with a dimensionality (the number of components) that we choose.
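A rough sketch of why this matters, using cosine similarity. The dense vectors below are made-up values purely for illustration, not outputs of any real model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = very similar, 0 = unrelated."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words scores exactly 0.
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # 0.0 (monkey vs ape)

# Dense vectors (invented values): related words can score high.
monkey = [0.96, 0.55, 0.32]
ape    = [0.90, 0.60, 0.30]
banana = [0.10, 0.80, 0.95]
print(cosine_similarity(monkey, ape))     # close to 1.0
print(cosine_similarity(monkey, banana))  # noticeably lower
```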
Similar representations are also needed when words share common properties, such as plural or singular, verb or adjective, or gender. All of these can be encoded in vectors. This line of thinking led to Word2vec (2013) and changed the field of text vectorization.
We also need ways to check how well our vectors represent words. There are several methods to do so; some are explicit.
Human evaluation measures the distance between vectors and asks a linguist whether it matches their intuition. This is, however, time consuming and not scalable; besides, a large number of words would need to be judged every time new vectors are created.
Syntactic analogies test whether vectors can be added and subtracted meaningfully. Play is to playing as eat is to eating: we compute playing - play + eat and find the vector most similar to the result. It should be the one corresponding to eating.
Semantic analogies work much like syntactic ones. An example is 'donkey is to animal as pineapple is to __', with the expected answer fruit; or 'monkey is to animal as banana is to fruit'. A more complex analogy could be monkey + food = banana.
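A sketch of the analogy arithmetic. The embeddings here are hypothetical hand-written vectors, only to show the mechanics; real vectors would come from a trained model such as Word2vec:

```python
import numpy as np

# Hypothetical embeddings, invented for illustration.
embeddings = {
    "play":    np.array([0.10, 0.90, 0.10]),
    "playing": np.array([0.15, 0.92, 0.80]),
    "eat":     np.array([0.85, 0.20, 0.12]),
    "eating":  np.array([0.88, 0.22, 0.82]),
}

def most_similar(query, embeddings):
    """Return the word whose vector has the highest cosine similarity to the query."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(embeddings, key=lambda w: cosine(embeddings[w], query))

# Syntactic analogy: playing - play + eat should land near eating.
query = embeddings["playing"] - embeddings["play"] + embeddings["eat"]
print(most_similar(query, embeddings))  # expected: eating
```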
The vectors we create must pass such tests; only then can we surmise that they capture the meaning of words.
Those were the explicit methods; let us now consider the implicit ones. Here the word vectors are created and their impact on a downstream task is measured. Take a sentiment analysis classifier: instead of one-hot encodings, we can feed it word embedding vectors. If the results improve, the vectors suit our problem.
Such testing, however, is expensive: it is time consuming and uses considerable computational resources too.
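As a rough illustration of the implicit route, the sketch below averages word vectors into phrase features and trains a tiny scikit-learn classifier. The embeddings and labelled phrases are invented for illustration; a real evaluation would use pretrained vectors and a proper labelled dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented 2-dimensional embeddings, only to show the plumbing.
embeddings = {
    "good": np.array([0.9, 0.1]), "great": np.array([0.8, 0.2]),
    "bad":  np.array([0.1, 0.9]), "awful": np.array([0.2, 0.8]),
    "movie": np.array([0.5, 0.5]),
}

def phrase_vector(phrase):
    """Average the word vectors of a phrase to get one feature vector."""
    vectors = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vectors, axis=0)

phrases = ["good movie", "great movie", "bad movie", "awful movie"]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

X = np.stack([phrase_vector(p) for p in phrases])
clf = LogisticRegression().fit(X, labels)
print(clf.predict([phrase_vector("great movie")]))  # expected: [1]
```

If swapping these features in for one-hot counts improves accuracy on a held-out set, the embeddings are judged useful for the task.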
Word Vectors Creation
Creating word vectors requires a lot of text, on the order of billions of words. There are two big families of methods: statistical and predictive.
In statistical methods, a co-occurrence matrix is created and a matrix dimensionality reduction method is then applied. Because some words occur very often, the raw counts are reweighted with PMI (pointwise mutual information).
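A compact sketch of that pipeline, using a toy corpus and sentence-level co-occurrence (real systems use sliding windows over billions of words):

```python
import numpy as np
from itertools import combinations

# Toy corpus; each sentence contributes co-occurrence counts.
corpus = [
    "monkeys eat bananas", "apes eat fruits",
    "monkeys like bananas", "apes like fruits",
]

vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts: every pair of words appearing in the same sentence.
counts = np.zeros((len(vocab), len(vocab)))
for sentence in corpus:
    for a, b in combinations(sentence.split(), 2):
        counts[index[a], index[b]] += 1
        counts[index[b], index[a]] += 1

# Positive pointwise mutual information (PPMI) downweights very frequent words.
total = counts.sum()
row = counts.sum(axis=1, keepdims=True)
col = counts.sum(axis=0, keepdims=True)
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log((counts * total) / (row * col))
ppmi = np.nan_to_num(np.maximum(pmi, 0))

# Truncated SVD gives dense, low-dimensional word vectors.
U, S, _ = np.linalg.svd(ppmi)
word_vectors = U[:, :2] * S[:2]
print(dict(zip(vocab, word_vectors.round(2))))
```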
In predictive methods, ML algorithms make predictions based on words and their context. The algorithm learns a set of weights that come to represent each word; these methods are built on neural networks.
The networks are trained to predict the next word in a text given the previous N words.
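A tiny sketch of the supervision signal such a network learns from, namely (previous N words, next word) pairs extracted with a sliding window; the function name is just for illustration:

```python
# Sketch: turning raw text into (previous N words -> next word) training pairs.
def next_word_pairs(text, n=2):
    words = text.split()
    return [(tuple(words[i:i + n]), words[i + n]) for i in range(len(words) - n)]

print(next_word_pairs("monkeys eat bananas every single day", n=2))
# [(('monkeys', 'eat'), 'bananas'), (('eat', 'bananas'), 'every'),
#  (('bananas', 'every'), 'single'), (('every', 'single'), 'day')]
```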
Bengio's 2003 paper used this approach, with a hidden layer in the network.
The breakthrough work, however, is Word2vec.
Continuous Bag of Words (CBOW) sets up a sliding window of size N over a huge text; the network is trained to predict the word in the middle from the N words on each side of it. Skip-gram is similar to CBOW, except that instead of predicting the middle word from all the others, all the others are predicted from the middle word. At times the methods are combined to get good results.
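A sketch of training both variants, assuming the gensim library (parameter names as in gensim 4.x); a real run needs a far larger corpus than this toy one:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised sentences.
sentences = [
    ["monkeys", "eat", "bananas"],
    ["apes", "eat", "fruits"],
    ["monkeys", "like", "bananas"],
]

# sg=0 selects CBOW (predict the middle word from its context),
# sg=1 selects skip-gram (predict the context from the middle word).
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["monkeys"][:5])               # first 5 components of a word vector
print(skipgram.wv.most_similar("monkeys"))  # nearest neighbours in vector space
```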
Other popular vectorization methods are Binary Term Frequency, L1-normalized Term Frequency, L2-normalized Term Frequency, TF-IDF and Word2vec.
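For completeness, a sketch of the count-based variants using scikit-learn (an assumed choice of library; the toy corpus is just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["monkeys eat bananas", "apes eat fruits", "monkeys like bananas"]

binary_tf = CountVectorizer(binary=True).fit_transform(corpus)              # Binary Term Frequency
tf_l1 = TfidfVectorizer(use_idf=False, norm="l1").fit_transform(corpus)     # L1-normalized Term Frequency
tf_l2 = TfidfVectorizer(use_idf=False, norm="l2").fit_transform(corpus)     # L2-normalized Term Frequency
tfidf = TfidfVectorizer(norm="l2").fit_transform(corpus)                    # TF-IDF

print(binary_tf.toarray())
print(tfidf.toarray().round(2))
```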