When vectors are compared for similarity, the term high-dimensional space often comes up. In mathematics and computer science, we deal with a large number of dimensions, where each dimension refers to a different variable or feature. In 3-D space, the three dimensions are length, width and height. In high-dimensional space, there are many more dimensions. Imagine a dataset where each data point represents a person, with features such as age, height, weight, income and education. Each of these features constitutes a dimension in the dataset, so a dataset with five features exists in a 5-D space. By high-dimensional space, we mean datasets with a large number of features or dimensions: perhaps dozens, hundreds or even thousands of dimensions.
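As a minimal illustration (using NumPy, with made-up feature values), such a person can be represented as a single point in 5-D space:

    import numpy as np

    # Each person is a point in 5-D space:
    # [age, height_cm, weight_kg, income, education_years]
    # The values below are invented purely for illustration.
    alice = np.array([34, 165.0, 61.0, 72000.0, 16])
    bob   = np.array([41, 180.0, 85.0, 56000.0, 12])

    print(alice.shape)  # (5,) -> one point with five dimensions (features)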
It is difficult to visualize such spaces, since we experience only three dimensions in the physical world. Certain phenomena occur in high-dimensional space that do not occur in lower dimensions, such as the curse of dimensionality, where distances between points become less meaningful as the number of dimensions increases.
When comparing vectors in high-dimensional spaces, techniques such as cosine similarity or Euclidean distance are commonly used to measure how similar or dissimilar they are. These measures help in tasks such as clustering, classification and information retrieval.
Cosine similarity measures the cosine of the angle between two vectors in the space and ranges from -1 to 1. A value of 1 means the vectors point in the same direction (perfect similarity), 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions (complete dissimilarity).
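A minimal sketch of both measures, using NumPy (the vectors are arbitrary examples chosen to show the 1, 0 and -1 cases):

    import numpy as np

    def euclidean_distance(a, b):
        # Straight-line distance between two points; 0 means identical,
        # larger values mean more dissimilar.
        return np.linalg.norm(a - b)

    def cosine_similarity(a, b):
        # Cosine of the angle between the vectors:
        # 1 = same direction, 0 = orthogonal, -1 = opposite directions.
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 2.0, 3.0])

    print(cosine_similarity(a, 2 * a))    # 1.0 (same direction)
    print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
    print(cosine_similarity(a, -a))       # -1.0 (opposite directions)
    print(euclidean_distance(a, 2 * a))   # non-zero, even though cosine similarity is 1

The last two lines highlight a design difference: cosine similarity ignores vector magnitude and looks only at direction, whereas Euclidean distance is sensitive to magnitude as well.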
Sine similarity measures the sine of the angle between two vectors. Since the angle between two vectors lies between 0° and 180°, the sine ranges from 0 to 1. Its interpretation is related to cosine similarity in terms of direction, but it is calculated differently.
Whereas cosine similarity measures alignment, sine similarity measures misalignment. Sine similarity is less commonly used than cosine similarity, but it is useful in certain contexts, such as image processing or signal analysis.
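The term sine similarity is not standardized; assuming it is defined literally as the sine of the angle between the two vectors, a minimal sketch could be:

    import numpy as np

    def sine_similarity(a, b):
        # sin(theta) derived from cos(theta); since the angle between two vectors
        # lies in [0, 180 degrees], the result lies in [0, 1]:
        # ~0 for aligned (or opposite) vectors, 1 for orthogonal ones.
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        return np.sqrt(max(0.0, 1.0 - cos * cos))  # clamp guards against floating-point overshoot

    print(sine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0 (maximally misaligned)
    print(sine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # ~0.0 (aligned)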
A large language model such as GPT uses vector comparison both during training and during text generation. During training, the model learns to associate words, phrases and sentences with corresponding vectors in a high-dimensional space. These vectors capture semantic and syntactic information about the language and are used to predict the next word in a sequence, given some input context. While generating text, the model uses these learned representations to determine the likelihood of different words or sequences of words. It can compare the vectors representing different words or sequences to determine which are more similar or relevant in the current context, allowing it to produce coherent and contextually appropriate responses.
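A toy sketch of the generation-time idea: compare a context vector against candidate word vectors and turn the similarity scores into probabilities. Real models use learned embeddings and full transformer layers; the vocabulary, vectors and context sentence here are invented for illustration only.

    import numpy as np

    # Made-up 4-D "embeddings" for a tiny vocabulary
    # (real models learn hundreds or thousands of dimensions).
    vocab = ["cat", "dog", "car"]
    embeddings = np.array([
        [0.9, 0.1, 0.0, 0.2],   # cat
        [0.8, 0.2, 0.1, 0.1],   # dog
        [0.0, 0.9, 0.8, 0.0],   # car
    ])

    # Hypothetical vector summarizing the context "The furry pet sat on the ..."
    context = np.array([0.85, 0.15, 0.05, 0.15])

    # Compare the context vector with each word vector (dot-product similarity),
    # then softmax the scores into a probability distribution over the vocabulary.
    scores = embeddings @ context
    probs = np.exp(scores) / np.exp(scores).sum()

    for word, p in zip(vocab, probs):
        print(f"{word}: {p:.2f}")   # "cat" and "dog" score higher than "car"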
Vector comparison is thus an integral part of both the training and generation processes for LLMs. These techniques capture semantic relationships between words and texts in a continuous vector space.
The idea that vector comparison could facilitate NLP tasks evolved over a period of time. It originated in the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. The hypothesis goes back to the work of linguists and cognitive scientists such as Zellig Harris in the 1950s and was later formalized in the field of computational linguistics.
A milestone event was the Word2Vec model, proposed by Tomas Mikolov and his colleagues at Google in 2013.
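As an illustration only, a sketch of training and comparing Word2Vec-style word vectors, assuming the gensim library and a toy corpus (a realistic model needs millions of sentences):

    from gensim.models import Word2Vec

    # Toy corpus: each sentence is a list of tokens.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["cats", "and", "dogs", "are", "pets"],
    ]

    # Train small vectors purely for demonstration.
    model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    # Compare the learned word vectors by cosine similarity.
    print(model.wv.similarity("cat", "dog"))
    print(model.wv.most_similar("cat", topn=3))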
Since then, more sophisticated methods for representing and comparing vectors have evolved, leading to the emergence of large language models (LLMs) such as GPT and BERT. These models leverage vector comparison as a fundamental component of their architecture.
Can NLP continue along this vector-based line, or will it require deviations and non-linear breakthroughs? Achieving AGI may require new paradigms, new architectures and innovative techniques. There will need to be advances in symbolic reasoning, commonsense understanding and context-aware processing, and these cannot rely solely on vector-based representations.
In the future, we may require a hybrid approach that combines the strengths of vector-based methods with other AI techniques, such as symbolic reasoning, probabilistic modelling and neurosymbolic approaches.
The future, then, lies in sustaining vector-based advances while searching for breakthroughs from new paradigms and approaches.