Gecko: Text Embedding Model

Google has revealed Gecko, a text embedding model trained on FRet, a synthetic dataset generated by an LLM. What are text embedding models? They represent natural language as dense vectors, placing semantically similar text close to each other within the embedding space.

In other words, text embedding models act as translators for computers: they convert text into numbers that a computer can work with.

As we now know, embeddings are numerical representations that capture the semantic information in words and sentences. This lets computers process natural language and enables a wide range of tasks such as document retrieval, sentence similarity, classification, and clustering. Rather than building a separate model for each of these tasks, a single general-purpose model is pushed to handle all of them.
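To make this concrete, here is a minimal sketch of how one set of embeddings can serve retrieval, sentence similarity, and nearest-neighbor classification at once. The `embed()` function is a hypothetical stand-in for any text embedding model (Gecko or otherwise, it returns random vectors here purely for illustration); everything else is plain cosine similarity over the resulting vectors.

```python
import numpy as np

def embed(texts):
    """Hypothetical stand-in for a text embedding model such as Gecko.
    Returns one dense vector per input string (random here, for illustration)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 768))

def cosine_similarity(a, b):
    """Cosine similarity between two sets of row vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# One model, several tasks:
docs = ["How to bake bread", "Python list comprehensions", "Sourdough starter tips"]
doc_vecs = embed(docs)

# 1. Retrieval: rank documents against a query by similarity.
query_vec = embed(["bread recipes"])
scores = cosine_similarity(query_vec, doc_vecs)[0]
print("Best match:", docs[int(np.argmax(scores))])

# 2. Sentence similarity: compare two sentences directly.
pair = embed(["a cat sat on the mat", "a kitten rested on the rug"])
print("Similarity:", cosine_similarity(pair[:1], pair[1:])[0, 0])

# 3. Classification: assign the label whose description embedding is closest.
labels = {"cooking": "recipes and kitchen techniques", "coding": "programming and software"}
label_vecs = embed(list(labels.values()))
pred = list(labels)[int(np.argmax(cosine_similarity(embed(["knead the dough"]), label_vecs)))]
print("Predicted label:", pred)
```

The same vectors back every task; only the downstream comparison changes, which is what makes a general-purpose embedding model so useful.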

Being a general-purpose model, it requires a huge amount of training data, and this is where LLMs come in handy. That is what Google has done: leverage an LLM to train Gecko through a two-step process. First, synthetic data is generated using the LLM. Second, that data is refined by retrieving a set of candidate passages for each generated query and relabeling the positive and negative passages with the same LLM, which re-ranks the candidates based on its scores. Using this approach, Gecko achieves strong retrieval performance as a zero-shot embedding model on the Massive Text Embedding Benchmark (MTEB).
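Here is a rough sketch of what this two-step recipe could look like for a single training example. Note that `llm_generate_task_and_query`, `retrieve_candidates`, and `llm_score` are hypothetical helpers standing in for the LLM prompts and the pre-trained retriever; the exact prompts and scoring used by Google are not reproduced here.

```python
def build_fret_example(passage, llm_generate_task_and_query, retrieve_candidates,
                       llm_score, num_candidates=20):
    """Simplified sketch of producing one FRet-style training example.

    Step 1: an LLM reads a sampled web passage and generates a task
            description plus a query relevant to that passage.
    Step 2: candidate passages are retrieved for the query, the same LLM
            scores each candidate, and the re-ranked list yields the
            relabeled positive and a hard negative.
    """
    # Step 1: LLM-generated synthetic (task, query) pair from the seed passage.
    task, query = llm_generate_task_and_query(passage)

    # Step 2a: retrieve candidate passages for the generated query
    # using an existing embedding model.
    candidates = retrieve_candidates(query, k=num_candidates)

    # Step 2b: re-rank the candidates by LLM score.
    ranked = sorted(candidates, key=lambda p: llm_score(task, query, p), reverse=True)

    # The best-scoring candidate is relabeled as the positive (it may differ
    # from the original seed passage); a low-ranked candidate serves as a
    # hard negative.
    positive = ranked[0]
    hard_negative = ranked[-1]

    return {"task": task, "query": query, "positive": positive, "negative": hard_negative}
```

The key idea is that the LLM acts as both data generator and judge: it writes the query and then decides which retrieved passage truly answers it, so the final training pair can be better than the original (passage, query) guess.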

In Gecko, this LLM-generated and LLM-ranked data is combined with human-annotated data. The resulting model achieved the best performance on the MTEB benchmark (average score of 66.31).
