While we human beings think in words and text, computers think in numbers, or vectors of numbers. An early way to make text machine-readable was ASCII, but that encoding carries no notion of the meaning of words. Search engines relied on keyword search: specific words, or N-grams, were matched against the documents.
Later, embeddings emerged. Embeddings are vectors of numbers that can represent words, sentences, or even images, and they do capture meaning. This makes them suitable for semantic search, even across documents in different languages.
Text representation starts with converting text into vectors. The most fundamental approach is the bag of words: the text is split into words or tokens, which are then reduced to their base forms (say, ‘running’ becomes ‘run’). A list of all base forms is built, and the frequency of each is counted to create a vector.
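A minimal sketch of this pipeline in plain Python, with a toy dictionary standing in for a real lemmatizer, might look like this:

```python
from collections import Counter

# Toy stand-in for a real lemmatizer (spaCy or NLTK would be used in practice)
LEMMAS = {"running": "run", "studying": "study"}

def bag_of_words(text, vocabulary):
    # Tokenize, lowercase, and map each token to its base form
    tokens = [LEMMAS.get(tok, tok) for tok in text.lower().split()]
    counts = Counter(tokens)
    # The vector has one position for every word in the whole vocabulary
    return [counts.get(word, 0) for word in vocabulary]

corpus = ["The girl is studying physics", "The woman is studying AI"]
vocabulary = sorted({LEMMAS.get(t, t) for doc in corpus for t in doc.lower().split()})

for doc in corpus:
    print(doc, "->", bag_of_words(doc, vocabulary))
```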
The resulting vector takes into account not only the words that appear in a given text but the whole vocabulary, including words such as ‘I’, ‘you’ and ‘study’. Because only exact word overlap counts, ‘The girl is studying physics’ and ‘The woman is studying AI’ do not end up close to each other despite their similar meaning. Bag of words is improved by TF-IDF (Term Frequency-Inverse Document Frequency), which is the product of two metrics.
TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
Term Frequency shows how frequently the word appears in a given document.
TF(t, d) = number of times t appears in document d / number of terms in document d
Inverse Document Frequency denotes how much information the word provides: articles and certain pronouns add nothing about the topic, while words such as AI or LLM define it. It is calculated as the logarithm of the ratio of the total number of documents to the number of documents containing the word.
IDF(t, D) = log(total number of documents in corpus D / number of documents containing term t)
The closer the IDF is to 0, the more common the word is and the less information it provides.
The resulting vectors give common words lower weights and rare words higher weights, which improves search results. Still, they cannot capture semantic meaning.
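As a rough sketch, the two formulas above can be implemented directly in Python; note how a word that occurs in every document (‘the’) ends up with a weight of 0:

```python
import math

def tf(term, doc_tokens):
    # TF(t, d) = number of times t appears in d / number of terms in d
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    # IDF(t, D) = log(total documents in D / documents containing t)
    containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / containing)

def tfidf_vector(doc_tokens, corpus_tokens, vocabulary):
    return [tf(t, doc_tokens) * idf(t, corpus_tokens) for t in vocabulary]

corpus = ["the girl is studying physics", "the woman is studying ai", "the cat sat on the mat"]
corpus_tokens = [doc.split() for doc in corpus]
vocabulary = sorted({t for doc in corpus_tokens for t in doc})

for doc, tokens in zip(corpus, corpus_tokens):
    print(doc, "->", [round(w, 2) for w in tfidf_vector(tokens, corpus_tokens, vocabulary)])
```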
Both bag of words and TF-IDF produce sparse vectors whose length equals the vocabulary size. English has roughly 470k unique words, so the vectors are huge, while a single sentence rarely contains more than 50 unique words. As a result, over 99.99 per cent of the values in a vector are 0 and encode no information.
To overcome these limitations, researchers started looking at dense vector representation.
Word2Vec
Google’s 2013 paper Efficient Estimation of Word Representations in Vector Space by Mikolov et al. introduced one of the most well-known approaches to dense representation.
The paper proposes two architectures: Continuous Bag of Words (CBOW), where a word is predicted from its surrounding words, and Skip-gram, where the surrounding context is predicted from the word.
Word2Vec trains two components, an encoder and a decoder. In the skip-gram model, if we pass the word ‘Diwali’ from the phrase ‘Happy Diwali to you’ to the encoder, it produces an embedding from which the decoder predicts the context words ‘happy’, ‘to’ and ‘you’.
Input → Encoder → Embedding → Decoder → Output
It captures meaning, since it is trained on the context of words. However, it ignores morphology, the information carried by word parts such as a suffix like ‘-less’ indicating the lack of something; this limitation was later addressed by subword-aware models such as FastText, while GloVe, another successor, improved on Word2Vec by using global co-occurrence statistics.
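As a sketch, a skip-gram model can be trained with the gensim library (assuming it is installed; the toy corpus below is far too small for useful vectors, but it shows the API):

```python
from gensim.models import Word2Vec

# Toy corpus; real Word2Vec models are trained on millions of sentences
sentences = [
    ["happy", "diwali", "to", "you"],
    ["happy", "new", "year", "to", "you"],
    ["the", "girl", "is", "studying", "physics"],
    ["the", "woman", "is", "studying", "ai"],
]

# sg=1 selects the skip-gram architecture (sg=0 would give CBOW)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["diwali"][:5])             # first few dimensions of the dense vector
print(model.wv.most_similar("studying"))  # nearest words in the embedding space
```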
Word2Vec is suited to working with individual words, but we often want to encode whole sentences. For that, we turn to transformers.
Transformers and Sentence Embeddings
The 2017 paper ‘Attention Is All You Need’ by Vaswani et al. introduced transformers. With them came information-rich dense vectors, and transformers became the principal technology behind modern Large Language Models (LLMs).
Transformers are pre-trained, and the same core model is then fine-tuned for specific purposes.
BERT, or Bidirectional Encoder Representations from Transformers, from Google AI is one such early model. Like Word2Vec, it initially operated at the token level; a sentence embedding was obtained by averaging all token embeddings, which did not perform well.
In 2019, Sentence-BERT was released. It performed well on semantic textual similarity and enabled the calculation of sentence embeddings directly.
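A minimal sketch with the sentence-transformers library (the model name below is just one commonly used example):

```python
from sentence_transformers import SentenceTransformer, util

# Any pre-trained SBERT-style model can be used; this one is small and fast
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The girl is studying physics", "The woman is studying AI"]
embeddings = model.encode(sentences)   # one dense vector per sentence
print(embeddings.shape)                # e.g. (2, 384)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```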
OpenAI’s embedding models are text-embedding-3-small and text-embedding-3-large, which are among the best-performing embedding models available.
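Calling these models through the OpenAI Python SDK might look like this (a sketch, assuming the OPENAI_API_KEY environment variable is set):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["The girl is studying physics", "The woman is studying AI"],
)

vectors = [item.embedding for item in response.data]
print(len(vectors), len(vectors[0]))  # 2 vectors of 1536 dimensions each for this model
```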
Distance Between Vectors
Since embeddings are vectors, we can measure how close two sentences are by calculating the distance between their vectors. A smaller distance indicates closer semantic meaning.
The metrics used to measure distance are Euclidean (L2), Manhattan (L1), dot product and cosine distance. For NLP tasks, the best practice is to use cosine similarity.
Since OpenAI embeddings are already normalized, dot product and cosine similarity are equal for them.
Cosine distance is also less affected by the curse of dimensionality: the higher the dimension, the narrower the distribution of distances between vectors becomes.
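These metrics can be computed directly with NumPy; a small sketch with made-up vectors:

```python
import numpy as np

a = np.array([0.2, 0.9, 0.4])
b = np.array([0.1, 0.8, 0.5])

euclidean = np.linalg.norm(a - b)        # L2 distance
manhattan = np.abs(a - b).sum()          # L1 distance
dot = np.dot(a, b)                       # dot product (a similarity, not a distance)
cosine_sim = dot / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1 - cosine_sim

print(euclidean, manhattan, dot, cosine_sim, cosine_dist)

# For unit-normalized vectors (as with OpenAI embeddings), dot product equals cosine similarity
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(np.isclose(np.dot(a_n, b_n), cosine_sim))  # True
```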
To visualize high-dimensional embeddings, we reduce their dimensionality. The most basic dimensionality reduction technique is PCA (Principal Component Analysis). Since PCA is a linear algorithm, t-SNE is used when non-linearity is needed to separate clusters.
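A sketch with scikit-learn, using random vectors in place of real embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Pretend these are 100 sentence embeddings with 384 dimensions
embeddings = np.random.rand(100, 384)

# Linear reduction to 2D with PCA
pca_2d = PCA(n_components=2).fit_transform(embeddings)

# Non-linear reduction to 2D with t-SNE (perplexity must be smaller than the sample count)
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

print(pca_2d.shape, tsne_2d.shape)  # (100, 2) (100, 2)
```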
Embedding vectors are used for clustering, classification, anomaly detection and Retrieval-Augmented Generation (RAG).