Tokenization breaks the input text into individual units called tokens: words, sub-words, characters, or punctuation marks. You can choose whichever tokenization method suits your task. Each token is assigned a unique numerical ID.
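For illustration, here is a minimal word-level tokenizer sketch in Python; the regex splitting rule and the toy corpus are assumptions made for the example, not any particular library's method.

```python
import re

def tokenize(text):
    # Split into word tokens and keep punctuation as separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(corpus):
    # Assign each unique token a numerical ID in order of first appearance.
    vocab = {}
    for sentence in corpus:
        for token in tokenize(sentence):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

corpus = ["The cat sat on the mat.", "The dog sat too."]
vocab = build_vocab(corpus)
print(tokenize("The cat sat."))                      # ['the', 'cat', 'sat', '.']
print([vocab[t] for t in tokenize("The cat sat.")])  # [0, 1, 2, 5]
```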
Then vectorization begins. An embedding matrix is created: each row corresponds to a unique token ID and contains a vector of numerical values, typically floating-point numbers.
When a token appears in the text, its ID is used to retrieve the corresponding vector from the embedding matrix. This vector is the token's machine-readable representation.
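A lookup against the embedding matrix can be sketched as follows; the vocabulary size, embedding dimension, and random initialization below are placeholder values for the example.

```python
import numpy as np

vocab_size = 8       # number of unique token IDs from the tokenizer above
embedding_dim = 4    # length of each token's vector

# Each row of the matrix is the vector for one token ID, initialized randomly.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

token_ids = [0, 1, 2, 5]                 # e.g. "the cat sat ."
vectors = embedding_matrix[token_ids]    # row lookup by token ID
print(vectors.shape)                     # (4, 4): one vector per token
```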
Vectors in the embedding matrix are initialized with random values. As the model is trained on a corpus of data, it learns a meaningful representation of each token: during training, the vectors are adjusted to capture semantic relationships, similarities, and patterns.
The aim is for tokens with similar meanings to have similar vector representations. This enables the model to generalize and make predictions on new text inputs based on learned patterns.
The number of values in each vector is its dimensionality. It is a hyperparameter that can be adjusted depending on the task and model architecture.
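As a sketch of how this looks in code, assuming PyTorch is available, the embedding matrix can be created as a trainable layer whose dimensionality is chosen up front; the token IDs and sizes below are placeholders, and since the weights here are untrained the similarity score only demonstrates the API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size = 8
embedding_dim = 4   # the dimensionality hyperparameter

# Randomly initialized, trainable embedding matrix of shape (vocab_size, embedding_dim).
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([1, 6])    # e.g. the IDs for "cat" and "dog"
vectors = embedding(token_ids)      # shape: (2, embedding_dim)

# After training, semantically similar tokens should score closer to 1.0.
similarity = F.cosine_similarity(vectors[0], vectors[1], dim=0)
print(similarity.item())
```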
Instead of training these embeddings from scratch, you can use pre-trained embeddings, which saves both time and cost.
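For example, GloVe vectors are distributed as plain-text files with one token per line followed by its vector components, so they can be read directly; the file name below is a placeholder for whichever pre-trained file you download.

```python
import numpy as np

def load_pretrained(path):
    # Parse a GloVe-style text file: "word v1 v2 ... vd" on each line.
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# embeddings = load_pretrained("glove.6B.100d.txt")  # hypothetical local file
# print(embeddings["cat"].shape)                     # (100,)
```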
Sequence models also need information about token order. RNNs capture it implicitly by processing tokens one at a time, while Transformers use positional encoding to incorporate the position of each token within a sequence. This positional information lets the model learn word order and sentence structure.
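A common choice in Transformers is the fixed sinusoidal encoding from the original Transformer paper: even dimensions use sine, odd dimensions use cosine, with wavelengths that grow across the embedding dimensions. The sketch below follows that formulation, with the sequence length and model dimension as example values.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # (1, d_model)
    # Each pair of dimensions shares a frequency: 1 / 10000^(2i / d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])   # sine on even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])   # cosine on odd dimensions
    return pe

pe = positional_encoding(seq_len=10, d_model=4)
print(pe.shape)   # (10, 4): one position vector added to each token embedding
```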