As we know, generative AI models are trained on massive amounts of data. Since computers do not understand raw text, the models do not consume text as it is; they work with a numerical representation of it. These numerical representations of data are called embeddings. All inputs to and outputs from LLMs (large language models) pass through embeddings. Computing these embeddings on demand is time-consuming, so they are stored in vector databases, from which they can be retrieved efficiently.
Thus, embeddings (or vector embeddings) represent data of any kind: text, images, audio, video and so on. The data is encoded as a numerical vector in an n-dimensional space. Word2Vec, developed by Google, is a model that converts words to vectors, and every LLM family has its own embedding model for creating embeddings.
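To make this concrete, here is a minimal sketch using the open-source gensim library to train a tiny Word2Vec model. The toy corpus and the 50-dimension vector size are arbitrary choices for illustration, not recommended settings:

```python
# A minimal sketch: turning words into numerical vectors with gensim's Word2Vec.
from gensim.models import Word2Vec

# A toy corpus of tokenized sentences (real training uses massive corpora).
corpus = [
    ["the", "batsman", "hit", "the", "ball"],
    ["the", "bowler", "hit", "the", "wickets"],
    ["the", "ball", "bounced", "on", "the", "pitch"],
]

# Train a small model: each word becomes a 50-dimensional numerical vector.
model = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1)

vector = model.wv["ball"]   # a numpy array of 50 floats
print(vector.shape)         # (50,)
```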
Represented this way, the vectors can be compared with each other. A computer cannot directly compare two words, but it can compare two vectors, so words with similar embeddings form clusters; e.g. ball, bat, wickets and pitch will appear in one cluster because they are all related to cricket.
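The standard way to compare two such vectors is cosine similarity, which measures the angle between them. Here is a small self-contained sketch; the 3-dimensional vectors are made up for illustration, whereas real embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for word embeddings.
ball = np.array([0.9, 0.1, 0.0])
bat  = np.array([0.8, 0.2, 0.1])
tax  = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(ball, bat))  # high score: related words
print(cosine_similarity(ball, tax))  # low score: unrelated words
```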
Embeddings make it easy to find words similar to a given word, and the same idea extends to sentences: a sentence can be embedded and used as a query to retrieve related sentences from the stored data. This is the basis of semantic search, sentence similarity, anomaly detection and chatbots.
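As a sketch of sentence similarity, the snippet below assumes the open-source sentence-transformers library and the commonly used all-MiniLM-L6-v2 model; the corpus and query are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stored sentences and an incoming query (illustrative data).
corpus = [
    "The batsman hit the ball to the boundary.",
    "The invoice is due at the end of the month.",
    "Rain stopped play on the second day of the match.",
]
query = "Who scored runs in the cricket game?"

corpus_emb = model.encode(corpus)
query_emb = model.encode(query)

# Rank the stored sentences by cosine similarity to the query.
scores = util.cos_sim(query_emb, corpus_emb)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```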
Chatbots use this concept of embeddings to answer questions about a given PDF or Word document.
LLM-based applications use this approach to retrieve content that is semantically related to the queries given to them.
Consider a chatbot built over a PDF. When a user asks a query, the document's contents are already represented as vector embeddings. The query is embedded as well, and the vector store runs a similarity search over the stored embeddings to detect which parts of the data are closest to the query. It fetches all the relevant passages, and these are passed to the chatbot, which generates the final answer for the user.
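A minimal sketch of that retrieval step is shown below. The embed() function is a hypothetical placeholder for a real embedding model, and the brute-force dot-product search stands in for a real vector store; both are assumptions made for illustration:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)   # unit vector, so dot product = cosine

# 1. Split the PDF's extracted text into chunks and embed each chunk.
chunks = ["chunk one of the PDF ...", "chunk two ...", "chunk three ..."]
index = np.stack([embed(c) for c in chunks])    # the "vector store"

# 2. Embed the user's query and find the most similar chunks.
query_vec = embed("What does the document say about payments?")
scores = index @ query_vec                      # cosine similarity per chunk
top_k = np.argsort(scores)[::-1][:2]            # indices of best 2 chunks

# 3. Pass the retrieved chunks to the chat model as context.
context = "\n".join(chunks[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(prompt)
```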
Chatbots create vector embeddings using ML algorithms that are trained on massive amounts of data to learn how to represent words or phrases as vectors of numbers. One of the best-known such algorithms is Google's Word2Vec, introduced in 2013. Word2Vec takes a word and outputs an n-dimensional coordinate (a vector) so that when these word vectors are plotted in space, synonyms cluster together.
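Continuing the gensim sketch above, most_similar() surfaces this clustering effect by returning a word's nearest neighbours in the vector space. The output shown is illustrative only; a toy corpus gives noisy results:

```python
# Nearest neighbours of "ball" in the trained vector space.
print(model.wv.most_similar("ball", topn=3))
# e.g. [('wickets', 0.21), ('pitch', 0.18), ('bowler', 0.05)]  (illustrative)
```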