You know, generative AI models are trained on tokens, say 1 trillion tokens. What does this mean? Tokens are the smallest units of text that can be processed by a machine learning algorithm. In natural language processing (NLP), tokenization is the process of splitting a piece of text into small units called tokens. Thus a token may be a word, a part of a word, or just a character like a percentage sign or a punctuation mark.
Tokenization is a foundational task. It is difficult because every language has its own grammatical constructs, which cannot always be expressed as simple rules.
The more tokens an AI model is trained on, the better it understands the language and the more fluently it generates human-like text.
In short, tokenization splits longer strings of text into smaller pieces or tokens, based on some criteria. These criteria could be punctuation marks, whitespace or other delimiters.
Thus ‘I love you’ can be broken down into the individual words ‘I’, ‘love’ and ‘you’.
Tokenization is used for search engine indexing (web pages are broken down into individual words or phrases), text classification (documents are broken down into individual words or phrases) and sentiment analysis (text is broken down into individual words or phrases that can be analysed for sentiment).
Common tokenization approaches are whitespace tokenization, word tokenization and character tokenization, as the sketch below illustrates.
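Here is a minimal sketch of the three approaches using only the Python standard library (the sample sentence is illustrative):

```python
# Three simple tokenization strategies on the same sentence.
import re

text = "I love you, really!"

# Whitespace tokenization: split on spaces only.
whitespace_tokens = text.split()
# ['I', 'love', 'you,', 'really!']

# Word tokenization: treat punctuation as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)
# ['I', 'love', 'you', ',', 'really', '!']

# Character tokenization: every character becomes a token.
char_tokens = list(text)
# ['I', ' ', 'l', 'o', 'v', 'e', ...]

print(whitespace_tokens)
print(word_tokens)
print(char_tokens)
```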
There is a difference between tokenization and lemmatization. Lemmatization reduces a word to its base form, e.g. ‘playing’, when lemmatized, becomes ‘play’.
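As a small illustration, NLTK's WordNet lemmatizer can perform this reduction (the WordNet corpus may need to be downloaded first with nltk.download('wordnet')):

```python
# Lemmatization with NLTK: reduce a word to its base (dictionary) form.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("playing", pos="v"))  # 'play'
```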
There are tools for tokenization. NLTK is a popular Python library that includes tokenizers. spaCy is another Python library with a tokenizer. TextBlob, a Python library for processing textual data, also includes a tokenizer, as does Gensim, a Python library for topic modelling.
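A quick sketch of tokenizing the same sentence with two of these libraries (both must be installed; NLTK's 'punkt' tokenizer data and spaCy's 'en_core_web_sm' model may need to be downloaded separately, and the sentence is illustrative):

```python
# Tokenizing one sentence with NLTK and with spaCy.
import nltk
import spacy

sentence = "Generative AI models are trained on tokens."

# NLTK word tokenizer
nltk_tokens = nltk.word_tokenize(sentence)

# spaCy tokenizer
nlp = spacy.load("en_core_web_sm")
spacy_tokens = [token.text for token in nlp(sentence)]

print(nltk_tokens)
print(spacy_tokens)
```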
Training the AI Model
After the text has been tokenized, it can be used to train the generative AI model. The model learns to associate each token with a probability distribution over the possible next tokens, and it is this distribution that is then used to generate new text similar to the input text, text that is grammatically correct as well as semantically meaningful. Tokenization improves the accuracy of the model, increases its efficiency and makes the model scalable.
Language has statistical properties: the model counts the frequency of different tokens and learns which tokens are likely to appear together, as the toy sketch below shows.
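The sketch below is only an illustration of this statistical idea, not the training procedure of any real model: it counts which token follows which and turns the counts into a probability distribution over next tokens.

```python
# Toy next-token statistics: count bigrams, then normalise to probabilities.
from collections import Counter, defaultdict

corpus = "i love you and i love music and i love you".split()

# next_counts[w] maps each token w to a Counter of the tokens that follow it.
next_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    next_counts[current][nxt] += 1

# Convert the counts for the token 'love' into a probability distribution.
counts = next_counts["love"]
total = sum(counts.values())
distribution = {token: count / total for token, count in counts.items()}
print(distribution)  # e.g. {'you': 0.67, 'music': 0.33}
```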
After the text is tokenized, each token is converted into a numerical representation such as a vector. This representation is typically created using a technique called word embedding, which is a way of representing words as vectors that capture their meaning.
There are two types of embeddings. In static embeddings, a large corpus of text is used: a vocabulary of all words in the corpus is created and each word in the vocabulary is assigned a vector representation. It is a statistical method of finding patterns in the way a word is used in the corpus. In dynamic embeddings, a neural network predicts the context of a word: the network is trained on a corpus of text, a word is given as input, and the network is asked to predict the words likely to come after it. The vector representation of the word is then created by taking the output of the neural network.
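As a sketch of the static case, the snippet below trains a tiny Word2Vec model with Gensim; the toy corpus and hyperparameters are illustrative only, since real embeddings are trained on much larger text.

```python
# Static word embeddings with Gensim's Word2Vec on a toy corpus.
from gensim.models import Word2Vec

sentences = [
    ["i", "love", "natural", "language", "processing"],
    ["tokenization", "splits", "text", "into", "tokens"],
    ["embeddings", "represent", "tokens", "as", "vectors"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)

# Each word in the vocabulary now has a dense vector representation.
vector = model.wv["tokens"]
print(vector.shape)  # (50,)

# Words used in similar contexts end up with similar vectors.
print(model.wv.most_similar("tokens", topn=3))
```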