Consider three words — cat, dog and bird. Each word can be represented by a numerical vector. Real embeddings live in a high-dimensional space, but for this example each vector has just three dimensions — x, y and z.
Cat could be represented by [0.8, 0.2, 0.5]
Dog could be represented by [0.7, 0.3, 0.6]
Bird could be represented by [0.3, 0.9, 0.2]
x, y and z could represent size, animal type and habitat respectively, with each dimension capturing a different aspect of the word's meaning or usage.
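Here is a minimal Python sketch of this toy example. The cosine_similarity helper and the use of NumPy are illustrative choices, not part of any particular library's API; the point is simply that similar words end up with similar vectors.

```python
import numpy as np

vectors = {
    "cat":  np.array([0.8, 0.2, 0.5]),
    "dog":  np.array([0.7, 0.3, 0.6]),
    "bird": np.array([0.3, 0.9, 0.2]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["cat"], vectors["dog"]))    # roughly 0.98: cat and dog are close
print(cosine_similarity(vectors["cat"], vectors["bird"]))   # roughly 0.56: cat and bird are further apart
```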
Embedding algorithms construct these word vectors by analyzing large amounts of text data; the resulting embeddings encode semantic and syntactic information about words.
These representations are standardized to a certain extent, but there is no single standard. Word2Vec, GloVe and FastText are popular algorithms for generating word embeddings. The vectors they produce have a fixed length, which keeps them consistent across the words of a vocabulary and within a given model. What varies are the specific dimensions and values within these vectors, which depend on the algorithm and the training data used.
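For example, assuming the gensim library and its bundled downloader, pretrained GloVe vectors can be loaded and inspected like this (the model name and its 50-dimensional vectors are specific to that pretrained file, and the first call downloads it):

```python
import gensim.downloader as api

# Load a small pretrained GloVe model (downloaded on first use)
glove = api.load("glove-wiki-gigaword-50")

print(glove["cat"][:5])                    # first 5 of the 50 dimensions for "cat"
print(glove.most_similar("cat", topn=3))   # nearest words in the embedding space
print(glove.similarity("cat", "dog"))      # cosine similarity between two words
```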
Words can also be converted into vectors without these algorithms, using an approach called one-hot encoding. Here, each word in the vocabulary is represented as a vector in which every element is 0 except the one at that word's index in the vocabulary, which is 1. Let us consider a small vocabulary with three words: cat, dog and bird.
Cat could be represented as [1, 0, 0]
Dog could be represented as [0, 1, 0]
Bird could be represented as [0, 0, 1]
The vectors created this way are sparse vectors, where most elements are zero. However, one-hot encodings do not capture semantic relationships between words (as embeddings do), and they result in very high-dimensional representations for large vocabularies.
The index here refers to a word's position in the pre-defined vocabulary. Each word has a unique index; the element corresponding to the word's index is set to 1, and all other elements are set to 0.
In the three-word vocabulary of cat, dog and bird, the indices could be assigned as follows.
Cat → index 0
Dog → index 1
Bird → index 2
Cat's vector has three elements. The element at index 0 (corresponding to cat) would be set to 1, and the other elements would be set to 0.
For dog and bird, the 1 moves to their respective indices. Each one-hot encoding is thus a binary vector with a single 1, whose position records the word's place in the vocabulary.
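The whole scheme fits in a few lines of Python; the one_hot helper below is a name chosen purely for illustration:

```python
vocabulary = ["cat", "dog", "bird"]

def one_hot(word, vocab):
    """Return a binary vector with a single 1 at the word's index in the vocabulary."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

for word in vocabulary:
    print(word, one_hot(word, vocabulary))
# cat  [1, 0, 0]
# dog  [0, 1, 0]
# bird [0, 0, 1]
```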
We now know what a sparse vector is. Let us now consider a dense vector, where most of the elements are non-zero. In practice, dense vectors are typically used rather than sparse ones.
Dense vectors are what word embeddings use: each word is represented by a vector of real numbers (floats) in a continuous vector space, and those real numbers capture nuanced relationships between words.
Each dimension of the vector might represent a different aspect of the word's meaning or context. Dense embeddings are generally much lower-dimensional than one-hot encodings, which makes them computationally more efficient while still capturing subtle semantic relationships between words.
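To make the contrast concrete, here is a small NumPy sketch comparing a one-hot vector over a 50,000-word vocabulary with a 300-dimensional dense vector. The index 123 and the random values (standing in for learned embedding weights) are purely illustrative:

```python
import numpy as np

vocab_size = 50_000

# One-hot: 50,000 dimensions, exactly one of which is non-zero
one_hot_cat = np.zeros(vocab_size)
one_hot_cat[123] = 1                       # hypothetical index of "cat" in the vocabulary

# Dense embedding: 300 real-valued dimensions, essentially all non-zero
rng = np.random.default_rng(0)
dense_cat = rng.normal(size=300)           # random stand-in for a learned embedding

print(one_hot_cat.size, np.count_nonzero(one_hot_cat))   # 50000 1
print(dense_cat.size, np.count_nonzero(dense_cat))       # 300 300
```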
Though the model works internally with vectors (word embeddings), the answers to our prompts arrive as natural text.
Producing them involves decoding these representations back into natural language.
The input prompt is processed and converted into the corresponding embeddings. These embeddings go into the model, where they are processed by its layers (RNNs, transformers or other architectures). The model learns to generate text based on the input embeddings and the context provided.
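As a rough sketch of this flow, the PyTorch snippet below pushes a toy prompt of token indices through an embedding layer, a single transformer layer (a simplified stand-in for a real model's full stack, without the causal masking a language model would use) and a linear head that scores every word in the vocabulary. All sizes and token indices are arbitrary:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
embedding = nn.Embedding(vocab_size, d_model)                # token index -> dense vector
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)                     # hidden state -> vocabulary scores

token_ids = torch.tensor([[5, 42, 17]])                      # a toy prompt as token indices
hidden = encoder_layer(embedding(token_ids))                 # embeddings flow through the model layer
logits = lm_head(hidden)                                     # a score for every word at each position
print(logits.shape)                                          # torch.Size([1, 3, 1000])
```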
The output is a sequence of tokens or word embeddings, which are decoded back into text. Decoding involves selecting the most probable word for each position in the sequence (based on the probabilities the model learned during training), and it can also use techniques such as beam search or sampling.
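Assuming the Hugging Face transformers library and the publicly available gpt2 checkpoint, the decode step might look like this, with greedy decoding and beam search shown side by side:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")

# Greedy decoding: pick the single most probable token at each step
greedy_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)

# Beam search: keep several candidate sequences and return the best overall
beam_ids = model.generate(**inputs, max_new_tokens=10, num_beams=4, do_sample=False)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```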
In post-processing, the coherence and readability of the output are checked: duplicate phrases are removed, grammatical mistakes are corrected, and the style is adjusted to match the input prompt or context.