Blog

  • Happy Makar Sankranti: Vector Databases

    Chatbots of previous years were fluent but forgetful. Since then, we have been using vector embeddings. An embedding represents a word as a vector, a sequence of numbers that encodes information.

    Mathematics also gives us a notion of proximity or closeness: geometry can be used to encode properties as distances.

    In natural language processing, the idea is to encode semantic similarity as distance between embeddings in the representation space.

    Vector databases store embeddings of words and phrases. These enable LLMs to quickly fetch contextually relevant information. When an LLM comes across a term, it can retrieve similar embeddings from the database, maintaining context and coherence.
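The retrieval step above can be sketched with a minimal in-memory store. This is a toy illustration, not a real vector database: the two-dimensional vectors and the `ToyVectorStore` class are invented for the example, and real systems use learned embeddings with hundreds of dimensions and approximate nearest-neighbour indexes.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

class ToyVectorStore:
    """A minimal in-memory vector store mapping text to embedding vectors."""

    def __init__(self):
        self.items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.items.append((text, np.asarray(vector, dtype=float)))

    def search(self, query_vector, k=2):
        # Rank stored items by cosine similarity to the query vector.
        query = np.asarray(query_vector, dtype=float)
        scored = [(text, cosine_similarity(query, vec)) for text, vec in self.items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]

store = ToyVectorStore()
store.add("cat", [0.9, 0.1])
store.add("dog", [0.8, 0.2])
store.add("car", [0.1, 0.9])
results = store.search([0.85, 0.15], k=2)  # a query vector near "cat"/"dog"
```

The query's nearest neighbours are the semantically close entries ("cat" and "dog"), while the distant "car" is excluded, which is exactly the property an LLM relies on when fetching context.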

    Vector databases can scale to accommodate vast amounts of embeddings. Scalability is vital for chatbots, content generation and question-answering.

    Safety checks are also necessary: the system must handle ethical and cultural nuances, industry-specific jargon, and ambiguity resolution.

  • Types of Vector Embeddings

    Vector embeddings are a way to convert words, sentences and other data into numbers that capture their meaning and relationships.

    Vectors represent different types of data as points in a multidimensional space. In this space, similar data points cluster closer together, so retrieval becomes a search for similarity.

    This process enables the machine to understand and process data more effectively. Vector embeddings help ML algorithms find patterns in data and perform tasks such as language translation, sentiment analysis, recommendation systems and so on.

    Word embeddings represent individual words as vectors, e.g. Word2vec, GloVe and fastText. These capture semantic relationships and contextual information from large text corpora.
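The "semantic relationships" that word embeddings capture can be demonstrated with the classic analogy king − man + woman ≈ queen. The vectors below are hand-crafted toys along two hypothetical axes (royalty, gender); real Word2vec or GloVe vectors are learned from corpora and have hundreds of dimensions, but the arithmetic is the same.

```python
import numpy as np

# Hand-crafted toy vectors along two hypothetical axes: (royalty, gender).
vectors = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, -0.8]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, -0.8]),
}

def nearest(target, exclude):
    # Find the vocabulary word whose vector is closest (Euclidean) to target.
    best, best_dist = None, float("inf")
    for word, vec in vectors.items():
        if word in exclude:
            continue
        dist = float(np.linalg.norm(vec - target))
        if dist < best_dist:
            best, best_dist = word, dist
    return best

# The classic analogy: king - man + woman ~= queen
result = nearest(vectors["king"] - vectors["man"] + vectors["woman"],
                 exclude={"king", "man", "woman"})
```

Subtracting "man" removes the gender component of "king", and adding "woman" puts it back with the opposite sign, landing near "queen".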

    Document embeddings represent entire documents as vectors, e.g. newspaper articles, research papers, books. They capture the semantic information and context of the whole document. Techniques such as Doc2Vec and Paragraph Vectors are designed for document embeddings.

    Sentence embeddings represent entire sentences as vectors. Models like the Universal Sentence Encoder (USE) and Skip-Thought do this. These capture the meaning and context of sentences.

    Image embeddings represent images as vectors. Convolutional neural networks (CNNs), such as VGG, generate image embeddings.

    User embeddings represent users in a system or platform as vectors. They capture preferences, behaviours and characteristics. They are used in recommendation systems and personalized marketing.

    Product embeddings represent products as vectors. They capture a product's attributes, features and other semantic information.

  • Copyright Violation Suits

    Sarah Silverman and George R.R. Martin took legal action against tech firms for alleged misuse of their work in training AI programmes. Other authors, such as Basbanes and Gage, accuse OpenAI and Microsoft of unauthorized use of their work, especially for training the algorithms. It is surprising that tech firms build billion-dollar businesses without compensating the creators. There is also a lawsuit against Google filed by the law firm Clarkson, alleging that the company scraped data from millions of users to train its LLMs. Google has publicly acknowledged that it uses publicly available information to train AI models (such as Bard). Jill Leovy, a best-selling NYT author, advocates for the millions affected by alleged copyright infringements.

  • LLMs and Copyrighted Data

    OpenAI justifies its data collection for training LLMs by holding that such models are not feasible without copyrighted material to train them. This submission was made to the UK House of Lords Communications and Digital Select Committee.

    Copyright covers virtually every sort of human expression, e.g. blog posts, photos, software code and government documents. It covers both fiction and non-fiction.

    Generative AI tools such as ChatGPT and image generators such as Stable Diffusion are trained on vast amounts of data, most of it covered by copyright. This agitates the original publishers and creators of that data, whose contention is that their work is being used without consent or compensation.

    The NYT's legal suit against OpenAI and Microsoft is before the judiciary. OpenAI argues that copyright does not deprive it of the right to train its models. Some authors have also approached the courts.

    The generative AI companies act as if copyright laws do not exist. The line dividing mere adaptation from creation of something new is blurry, and AI exposes how poorly adaptation and new content are defined.

    Data mining, on this view, should be treated as a safe harbour: the use of data to develop AI technology is not fundamentally an act of infringement. Mere lifting of material is not enough for infringement; there must be unauthorized use of the work for its expressive purpose. Use of material for technical, non-communicative purposes is not use for an expressive purpose, and is therefore not a copyright violation.

    Models do not redistribute or recommunicate the same data. They use data to recognize patterns and associations; machines are trained to learn, reason and act as humans do. The models generate new text by learning from the data ingested.

  • AI Winter

    2023 has been an eventful year for AI. ChatGPT took the world by storm after its advent in late November 2022. What followed was a spate of similar large language models from other organizations, some of which are big names of Silicon Valley. AI was on the lips of everyone in 2023.

    Can AI stagnate in 2024? Rodney Brooks, former director of the Computer Science and AI Lab at MIT, who has been making technology predictions since 2018, believes that 2024 is not the best year for AI. He warns that an AI winter is likely to set in and advises us to get our thick coats out, as it is going to be cold.

    He is cynical despite ChatGPT, Bing and Bard. He believes these lack the capability to become a potent artificial general intelligence (AGI) system; they lack imagination and genuine substance. He proclaims that there is much more to life than LLMs. These models still hallucinate and make mistakes while coding. Though they answer with confidence, half the time their answers are wrong. Intelligence and interaction are two different things, and being a clever wordsmith is not enough. They still have a long way to go; they remain good only at correlating language.

  • Make AI Competitive

    Most organizations would love to integrate AI into their systems to acquire a competitive advantage. AI has revolutionized output generation and performs tasks in the blink of an eye. As these models have been trained on mammoth datasets, including the ever-expanding internet (some 157 trillion gigabytes), they are excellent at distinguishing signal from noise while navigating information.

    APIs democratise access to generative AI. They serve as user-friendly toolkits that enable the development of context-specific and organisation-specific applications on top of a foundation model. These APIs give even small organisations access.

    Generative AI is becoming standard for data analysis and reporting. The results are homogeneous; there is no distinctiveness, and the landscape tends toward uniformity. It is therefore necessary to mobilize context-specific datasets. Proprietary datasets can provide a strategic edge: pharma companies, for example, can use anonymized patient data from various hospitals to enhance the drug development process.

    One can elicit creative results from a model by crafting skillful prompts and insightful queries. Prompt engineering skills must be acquired.

    There should be adequate compute power. AI needs advanced Nvidia processors. More compute power means faster iteration cycles and larger models with more parameters.

    There should be AI research and development. We are still far from AI that equals human capabilities; over time, this gap will narrow.

    The excitement for AI today resembles the excitement of the early days of IT adoption. Later, IT became standardised. The issue then was, 'Does IT matter?' IT had become commoditised and could no longer confer a competitive advantage. Later the question was reframed: IT does matter. Here, adoption of AI alone is not enough. To make AI competitive, we must consider the issues discussed above.

    In the fast-changing world of technology, innovation advances faster than the regulatory response. In fact, the first driving licences were issued two decades after cars had become a common sight on American streets. The gap between the pace of innovation and the regulatory response is ever widening.

    Some unscrupulous businessmen build whole business models around 'regulatory arbitrage'. They choose profitable areas of business to which nobody has yet turned any attention. By the time regulators notice, the businesses are a fait accompli and hard to wish away.

    In the case of generative AI, the challenge is unprecedented: no one knows why a model generates a particular answer. As an innovation, AI could dwarf anything since the invention of the transistor. Politicians are already concerned about AI-generated fake news, which has degraded trust even in real things.

  • The New York Times — OpenAI Case

    As we have already observed, the New York Times has filed a copyright violation suit against OpenAI and Microsoft. Here we discuss certain additional points with respect to this suit.

    It is felt that the models memorize copyrighted material. That brings another term to centre stage: approximate retrieval. The model does not repeat exactly the same information that went into it. The crux of approximate retrieval is that LLMs do not fit the mould of traditional databases, where precision and exact matches are paramount. LLMs operate like n-gram models and inject an element of unpredictability into the retrieval process.

    Prompts are not keys to a structured database; they serve as cues for the model to generate the next token based on context. LLMs do not promise exact retrieval.
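The cue-not-key idea can be illustrated with a toy bigram model, the simplest kind of n-gram model. Real LLMs are vastly richer than bigram counts, and the tiny corpus below is invented, but the mechanism is the same: given a cue token, the model predicts a likely continuation rather than looking up a stored record.

```python
from collections import Counter, defaultdict

# A toy bigram model: count which token follows which in a tiny corpus.
corpus = "the cat sat on the mat the cat ate the fish".split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def next_token(token):
    # Greedy prediction: the most frequent continuation of the cue token.
    counts = following.get(token)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

prediction = next_token("the")  # "the" is followed by cat (2), mat (1), fish (1)
```

The cue "the" yields the statistically most likely continuation, "cat"; nothing in the model stores or returns a complete original document, which is why retrieval from such a model is approximate by nature.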

    The lawsuit revolves around the issue of memorization. LLMs do not, in general, reproduce text verbatim. However, the extensive context window and network capacity leave space for potential memorization, which becomes unintended plagiarism. If prompted again and again, an LLM could generate exact sentences. Fine-tuning, moreover, can turn the LLM's task into memory-based retrieval; there is no autonomous planning on the part of the LLM. The expanded context window makes memorization even worse.

    In legal discussions, the focus should be on LLMs' inability to achieve exact retrieval. It is a defense against copyright violation.

    LLMs may behave both as memorization devices and as generation devices in their own right. In news generation, if an LLM is too creative, it can produce fake or inaccurate news; if it offers exact news, it violates copyright. It is a dilemma. Another concept has emerged: retrieval-augmented generation (RAG), which gives LLMs a structured approach to information retrieval. It strikes a balance between the spontaneity of LLMs and disciplined traditional search-engine methods (to reduce hallucinations).

    The material from the NYT is converted into vector databases, which facilitates RAG, or retrieval-augmented generation.
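A minimal sketch of the RAG pattern: retrieve the most relevant archived passage, then build a prompt that grounds the model in it. The "embedding" here is a crude bag-of-words count and the archive snippets are invented; a production system would use learned embeddings, a real vector database, and an actual LLM call where the prompt string is used.

```python
import re
from collections import Counter

# Invented example snippets standing in for an indexed news archive.
archive = [
    "The city council approved the new transit budget on Monday.",
    "Researchers reported progress on a malaria vaccine trial.",
    "The stock market closed higher after the rate announcement.",
]

def embed(text):
    # Toy embedding: a bag-of-words count vector (as a Counter).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def score(query_vec, doc_vec):
    # Overlap score: sum of matching word counts (a crude dot product).
    return sum(query_vec[w] * doc_vec[w] for w in query_vec)

def retrieve(question, k=1):
    # Rank archive passages by similarity to the question.
    q = embed(question)
    ranked = sorted(archive, key=lambda doc: score(q, embed(doc)), reverse=True)
    return ranked[:k]

question = "What happened with the transit budget?"
context = retrieve(question)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
```

Constraining the model to the retrieved context is what gives RAG its balance: the generator stays spontaneous, but the retrieval step disciplines it, reducing hallucinations.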

    Given the way n-gram-style models work, the possibility of an exact NYT article being reproduced unaltered is remote. The case sustains only if there is actual lifting of NYT articles, and if such lifting dents the NYT's revenue.

  • AI to Detect Cancer

    Tata Memorial Hospital, a leading cancer hospital in Mumbai, and the Indian Institute of Technology, Bombay (IIT-B) are collaborating to develop radiomics, a technique for extracting essential information from medical scans (information not easily discernible by the human eye).

    Advanced algorithms and deep learning AI can facilitate the diagnosis of cancer early on by analyzing medical data. The detection tool will help avoid unnecessary chemotherapy for predicted non-responders.

    To train the model, data is being collected from AIIMS, Delhi; Rajiv Gandhi Cancer Research Centre, Delhi; and PG Institute of Medical Education and Research, Chandigarh. These institutes provide medical scans of cancer cases, from which a bioimaging bank is being built to support deep learning efforts.

    Mostly the data is in the form of slides from medical tests, used to help diagnose the disease and develop treatments. The human eye cannot always detect tumors. There are identification factors, e.g. texture analysis, elastography to check the stiffness of the organ, and tumor hardness. The biobank will indicate prognosis directly from images using specialized algorithms. These algorithms, called prognostication or prediction algorithms, anticipate a tumor's aggressiveness, the speed of the immune system's response and the patient's chances of survival (from CT or MRI scans). The final diagnosis and treatment decision will be made by an experienced oncologist.

    AI models process the information obtained from radiological and pathological images. They employ ML to recognize unique features associated with different types of cancer and to assess changes in tissues and potential malignancies. This results in early detection of cancer.
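The pattern-recognition step can be sketched with a toy nearest-centroid classifier over invented "radiomic" feature vectors (say, a texture score and a stiffness measure). Everything here is illustrative: the feature values and labels are made up, and real prognostication models are deep networks trained on large annotated bioimaging banks and validated clinically.

```python
import numpy as np

# Invented training features per class: (texture score, stiffness).
train = {
    "benign":    np.array([[0.2, 0.1], [0.3, 0.2], [0.25, 0.15]]),
    "malignant": np.array([[0.8, 0.9], [0.7, 0.85], [0.9, 0.8]]),
}

# Each class is summarized by the mean of its training feature vectors.
centroids = {label: feats.mean(axis=0) for label, feats in train.items()}

def classify(features):
    # Assign the label whose centroid is nearest in feature space.
    features = np.asarray(features, dtype=float)
    return min(centroids,
               key=lambda label: np.linalg.norm(features - centroids[label]))

label = classify([0.75, 0.8])  # features resembling the malignant cluster
```

The classifier only flags which learned cluster a scan's features resemble; as the section notes, the final diagnosis rests with an experienced oncologist.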

    The images are segmented and annotated. Segmentation outlines tumors and identifies areas with different features. These are then annotated as malignant, inflammatory or edematous (swollen with fluid). Biopsy results, histopathology and immunochemistry data are correlated. Diverse algorithms are developed by aligning genomic sequences with images and clinical data.

    IIT, Bombay contributes the graphics processing units for algorithm testing.

    Tata Memorial has used image analysis to avoid unnecessary radiation exposure. Chest-related conditions of ICU patients are diagnosed from images with 98 per cent accuracy. Chest X-rays are subjected to AI algorithms to identify pathologies such as nodules and pneumothorax.

    The AI tool will enable oncologists to diagnose cancer with a simple click. Even GPs will be able to diagnose complex cancers.

    AI keeps learning continuously, which improves accuracy.

  • LeCun and Google

    We have mentioned Facebook's chief AI scientist Yann LeCun in previous blogs. He won the 2018 Turing Award for his research and is a deep learning pioneer. He once received a job offer from Google as Director of Research. He turned down the offer for several reasons, one of which was the compensation package, although the stock option package was attractive. He was concerned about the size of the company, which had 600 employees and no revenue at the time; it meant the job would go beyond research into corporate strategy and operational management, while LeCun wanted to refocus on ML. Had he taken the Google assignment, he would have made the organization a bit more open and a bit more ambitious a bit earlier. Critics considered Google a bit slow and cautious in its AI development. Google responded to OpenAI's ChatGPT by releasing competing products.

  • AI: A Fast Change Agent

    AI has made us optimistic. Who could have imagined the tremendous transformative power of AI? Experts now hope that AI will surpass humans in most fields, whether five or ten years hence or, as some say, right now. The advent of artificial general intelligence (AGI) is a matter of time.

    AI could be considered one of the most important advances in human history. It stares us in the face. We must realize that even major innovations can appear out of nowhere, though a long period of stagnation may precede them. We saw this when a Covid vaccine using mRNA technology was developed within a year. These are sudden leaps: though expertise and execution accrete slowly, such leaps burst onto the scene. This is an important takeaway of 2023.

    AI’s rise means change will happen even faster.