In natural language processing, we parse a sentence into n-grams, i.e., sequences of n consecutive words. To illustrate, ‘I am a good boy’ can be parsed as ‘I’, ‘am’, ‘a’, ‘good’, ‘boy’. Here ‘I’ is a unigram (n=1). The same sentence can be parsed as ‘I am’, ‘am a’, ‘a good’, ‘good boy’. Here ‘I am’ is a bigram (n=2). Parsed into trigrams, the sentence yields ‘I am a’, ‘am a good’, ‘a good boy’. Here ‘I am a’ is a trigram (n=3).
When each word of a sentence is considered independently, we have a unigram model. Bigram models estimate the probability of each word in a phrase conditioned on the preceding word. Thus, in NLP, the two most frequently used models are the unigram and the bigram.
A bigram is a pair of consecutive words in a sentence. A trigram is the special case of the n-gram where n=3: a sequence of three consecutive words. Similarly, a 4-gram is a sequence of four consecutive words in a sentence.
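A minimal sketch in Python of this sliding-window parsing (the ngrams helper below is our own, not from any library):

```python
def ngrams(sentence, n):
    """Return the list of n-grams (as strings) in a sentence."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I am a good boy"
print(ngrams(sentence, 1))  # ['I', 'am', 'a', 'good', 'boy']
print(ngrams(sentence, 2))  # ['I am', 'am a', 'a good', 'good boy']
print(ngrams(sentence, 3))  # ['I am a', 'am a good', 'a good boy']
```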
N-grams are used to create language models: statistical models which predict the probability of a sequence of words. N-grams can also be used in spelling correction, in tagging words with their part-of-speech (POS) tags, and in classifying text into different categories.
N-grams help in summarizing text, where the most important n-grams are identified. They are used in machine translation, where text is translated from one language to another by considering the probability of a sequence of words in the target language given the sequence of words in the source language. N-grams are also used to train chatbots to respond to queries.
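To make the language-model idea concrete, here is a toy bigram model, a minimal sketch assuming a tiny hand-picked corpus (the corpus and the bigram_prob helper are illustrative, not from any library):

```python
from collections import Counter

# Illustrative corpus; any collection of sentences would do.
corpus = ["I am a good boy", "I am a student", "he is a good man"]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Estimate P(word | prev) = count(prev word) / count(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("am", "a"))    # 1.0  ('am' is always followed by 'a' here)
print(bigram_prob("a", "good"))  # 2/3  ('a' is followed by 'good' in 2 of 3 cases)
```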
N-grams are useful in converting words into numerical formats. This helps capture the context of words and facilitates text classification, sentiment analysis and machine translation.
Ways to Convert N-grams into Vectors
One-hot encoding : Here each n-gram is assigned a unique integer. Say we have the bigram ‘fox ate’; it is assigned the integer 1, and the n-gram ‘I love’ can be assigned the integer 2. Each n-gram is then represented as a vector of zeroes, with a single one in the position corresponding to its unique integer.
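A minimal sketch of this scheme, assuming the two-bigram vocabulary above (the integer assignments are illustrative):

```python
# Each bigram gets a unique integer, as in the text: illustrative assumption.
vocab = {"fox ate": 1, "I love": 2}

def one_hot(ngram, vocab):
    """Vector of zeroes with a single one at the n-gram's position."""
    vec = [0] * len(vocab)
    vec[vocab[ngram] - 1] = 1   # integer k -> position k (1-based)
    return vec

print(one_hot("fox ate", vocab))  # [1, 0]
print(one_hot("I love", vocab))   # [0, 1]
```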
Count encoding : Here each n-gram is assigned the number of times it appears in the text. If ‘I love’ appears twice in the text, it is assigned the count 2. The text is thus represented as a vector of non-negative integers, where the value of each entry is the number of times the corresponding n-gram appears in the text.
The method used depends on the task at hand. One-hot encoding is often used for text classification, while count encoding is common in sentiment analysis and machine translation.
Example
Consider the sentence ‘I love birds’, which contains the bigrams ‘I love’ and ‘love birds’. If we use one-hot encoding, each bigram is assigned a unique integer: ‘I love’ can be assigned the integer 1 and ‘love birds’ the integer 2. As a sequence of these integer indices, ‘I love birds’ is represented by the vector (1, 2); the corresponding one-hot vectors are (1, 0) and (0, 1).
If we use count encoding, each of these bigrams appears once in the text, so both are assigned the count 1 and the vector representation is (1, 1).
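A quick sketch reproducing this count encoding in Python (the bigrams helper is our own illustrative function):

```python
from collections import Counter

def bigrams(sentence):
    """Return the list of bigrams in a sentence."""
    words = sentence.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]

text = "I love birds"
counts = Counter(bigrams(text))   # Counter({'I love': 1, 'love birds': 1})
vocab = sorted(counts)            # fixed ordering: ['I love', 'love birds']
vector = [counts[b] for b in vocab]
print(vector)                     # [1, 1], the (1, 1) vector above
```

In practice, libraries such as scikit-learn's CountVectorizer (with ngram_range=(2, 2)) implement this encoding over a whole corpus.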