AI vs. Regulators

Silicon Valley is not known for co-operating with regulators. Its leaders hold veiled contempt for those who ask them to explain their technology, and they firmly believe that any regulation of a technology will cause it to fail. They overlook the fact that China effectively controls its emerging technologies.

Ultimately, there could be over-regulation by those who have no stake in the technology, with the judiciary force-fitting the technology into existing regulatory structures.

The European Union has adopted a sweeping approach to regulating AI; French president Macron is among its harshest critics. Instead of promoting the technology, the approach protects its consumers. The EU may lag behind in AI and could remain a laggard for a long time.

The New York Times has sued OpenAI and Microsoft in the US for copyright violation, alleging that its content was used as training data without permission, payment or acknowledgement.

AI companies will have to scale back their profit sources and enter into agreements with publishers, which suggests their case lacks strong legal grounds. Judges may write restrictions into AI use and training.

Innovation in AI is happening in the USA, and the US itself may adopt a comprehensive regulatory approach. This is a classic but defining clash.

Embeddings and Vectors

Vector embeddings are numerical representations of data: each data point is represented by a vector in a high-dimensional space. In this context, embeddings and vectors refer to the same thing.

A vector is an array of numbers with a specific dimensionality, while embedding refers to the technique of representing data as vectors. These vectors capture the underlying structure or properties of the data.

Vector embeddings are created through an ML process: a model is trained to convert pieces of data into numerical vectors.

A dataset is selected and preprocessed, and a neural network model that suits the data and the goal is chosen. Data is fed into the model, which learns patterns and relationships within it by adjusting its internal parameters. For instance, it learns which words often appear together. After learning, the model generates numerical vectors, so that each data point (say a word or an image) is represented by a unique vector. At this point, the model's effectiveness can be assessed by its performance on specific tasks, or by asking humans to evaluate it. If the embeddings are functioning well, the model can be put to work.
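The steps above can be sketched with a toy example. This is a minimal illustration, not a production pipeline: instead of training a neural network, it uses simple co-occurrence counts (how often words appear near each other) to produce a vector per word, and the tiny corpus and window size are invented for the demonstration.

```python
from collections import defaultdict

# Toy corpus, already "preprocessed" (lowercased and tokenized)
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat chased the dog".split(),
]

# Build a vocabulary: each word gets one dimension of the vector space
vocab = sorted({w for sent in corpus for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# "Training": count co-occurrences within a +/-2 word window.
# Each word's row of counts serves as its embedding vector.
counts = defaultdict(lambda: [0.0] * len(vocab))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if i != j:
                counts[w][index[sent[j]]] += 1.0

embedding = dict(counts)
print(len(embedding["cat"]))  # dimensionality equals vocabulary size here
```

Real word-embedding models (Word2vec, GloVe) replace the raw counts with learned dense vectors of a few hundred dimensions, but the principle of representing each word by a numeric vector derived from its contexts is the same.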

Word embeddings can have dimensions ranging from a few hundred to a few thousand; humans cannot visualize such high-dimensional spaces. Sentence and document embeddings may have even more dimensions.

Vector embeddings are represented as a sequence of numbers. Each number in the sequence corresponds to a specific feature or dimension and contributes to the overall representation of the data point.

The actual numbers within a vector are not meaningful on their own; the values and the relationships between them are relative.

Applications of Vector Embeddings

They are used in NLP, search engines, personalized recommendation systems, visual content, anomaly detection, graph analysis, and audio and music.

Happy Makar Sankranti: Vector Databases

Chatbots of previous years were fluent but forgetful. Since then, we have been using vector embeddings. Embeddings, as we know, represent words as vectors, and vectors are sequences of numbers that encode information.

In math, we also have the notion of proximity or closeness; geometry can be used to encode properties such as similarity.

In natural language processing, the idea is to encode semantic similarity through the distance of embeddings in the representation space.
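The idea of distance-as-similarity can be sketched with cosine similarity, the measure most commonly used for embeddings. The three-dimensional vectors below are invented purely for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented embeddings: "king" and "queen" point in similar directions,
# while "banana" points elsewhere in the space.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))   # close to 1.0
print(cosine_similarity(king, banana))  # much smaller
```

Semantically related words thus end up with high cosine similarity, which is exactly the property the representation space is trained to have.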

Vector databases store embeddings of words and phrases. These enable LLMs to quickly fetch contextually relevant information. When LLMs come across a term, they can retrieve similar embeddings from the database, maintaining context and coherence.

Vector databases can scale to accommodate vast amounts of embeddings. Scalability is vital for chatbots, content generation and question-answering.

It is necessary to run safety checks, handle ethical and cultural nuances, deal with industry-specific jargon, and resolve ambiguity.

Types of Vector Embeddings

Vector embeddings are a way to convert words, sentences and other data into numbers that capture their meaning and relationships.

Vectors represent different types of data as points in a multidimensional space, where similar data points cluster together. Searching this space is a search for similarity.

This process enables the machine to understand and process data more effectively. Vector embeddings help ML algorithms find patterns in data and perform tasks such as language translation, sentiment analysis, recommendation systems and so on.

Word embeddings represent individual words as vectors, e.g. Word2vec, GloVe and fastText. These capture semantic relationships and contextual information from large text corpora.

Document embeddings represent documents as vectors, e.g. newspaper articles, research papers, books. They capture the semantic information and context of the whole document. Techniques such as Doc2Vec and Paragraph Vectors are designed for document embeddings.

Sentence embeddings represent entire sentences as vectors. Models like the Universal Sentence Encoder (USE) and Skip-Thought do this. These capture the meaning and context of sentences.

Image embeddings represent images as vectors. Convolutional neural networks (CNNs) such as VGG generate image embeddings.

User embeddings represent the users of a system or platform as vectors. They capture preferences, behaviours and characteristics, and are used in recommendation systems and personalized marketing.

Product embeddings represent products as vectors. They capture a product's attributes, features and other semantic information.

Copyright Violation Suits

Sarah Silverman and George R.R. Martin have taken legal action against tech firms for alleged misuse of their work in training AI programmes. Other authors, such as Basbanes and Gage, accuse OpenAI and Microsoft of unauthorized use of their work, especially for training the algorithms. It is striking that the tech firms create billion-dollar businesses without compensating the creators. There is also a lawsuit against Google filed by Clarkson, a law firm, alleging that the company scraped data from millions of users to train its LLMs. Google has publicly acknowledged that it uses publicly available information to train AI models (such as Bard). Jill Leovy, a best-selling NYT author, advocates for the millions affected by alleged copyright infringements.

LLMs and Copyrighted Data

OpenAI justifies its data collection for training LLMs by arguing that such models are not feasible without copyrighted material to train them. This submission was made to the UK House of Lords Communications and Digital Select Committee.

Copyright covers virtually every sort of human expression, such as blog posts, photos, software code and government documents. It covers both fiction and non-fiction.

Generative AI tools such as ChatGPT and image generators such as Stable Diffusion are trained on vast amounts of data, most of it covered by copyright. This agitates both the original publishers and the creators of the data, whose contention is that their work is being used without consent and/or compensation.

The NYT's legal suit against OpenAI and Microsoft is before the judiciary. OpenAI argues that copyright law does not deprive it of the right to train its models. Some authors have also approached the courts.

The generative AI companies act as if copyright laws do not exist. The line that divides mere adaptation from the creation of something new is blurry, and AI allows us to see how poorly adaptation and new content are defined.

Data mining should be treated as a safe harbour: the use of data to develop AI technology is fundamentally not an act of infringement. Mere lifting of material is not enough for infringement; there must be unauthorized use of the work for its expressive purpose. Use of material for technical, non-communicative purposes is not use for an expressive purpose, and therefore does not violate copyright.

Models do not redistribute the same data or recommunicate it. Models use data to recognize patterns and associations. Machines are trained to learn, reason and act as humans do. The models generate new text, learning from the data ingested.

AI Winter

2023 has been an eventful year for AI. ChatGPT took the world by storm after its advent in late November 2022. What followed was a spate of similar large language models from other organizations, some of which are big names of Silicon Valley. AI was on the lips of everyone in 2023.

Can AI stagnate in 2024? Rodney Brooks, former director of MIT's Computer Science and Artificial Intelligence Laboratory, who has been making technology predictions since 2018, believes 2024 will not be the best year for AI. He alerts us that an AI winter is likely to set in, and advises us to get our thick coats out as it is going to be cold.

He is sceptical despite ChatGPT, Bing and Bard. He believes these do not have the capability to become potent artificial general intelligence (AGI) systems; they lack imagination and genuine substance. He proclaims that there is much more to life than LLMs. These models still hallucinate. They make mistakes while coding. Though they answer with confidence, half the time their answers are wrong. Intelligence and interaction are two different things, and being clever wordsmiths is not enough. They still have a long way to go; they are still good only at correlations of language.

Make AI Competitive

Most organizations would love to integrate AI into their systems to acquire a competitive advantage. AI has revolutionized output generation and performs tasks in the blink of an eye. As these models have been trained on mammoth datasets, including the ever-expanding internet (some 157 trillion gigabytes), they are excellent at distinguishing signal from noise while navigating information.

APIs democratise access to generative AI. They serve as user-friendly toolkits and enable the development of context-specific and organisation-specific applications on top of the foundation model. These APIs give even small organisations access.

As generative AI becomes the standard for data analysis and reporting, the results grow homogeneous and lose distinctiveness; the landscape tends toward uniformity. It is therefore necessary to mobilize context-specific datasets. Proprietary datasets can provide a strategic edge: pharma companies, for example, can use anonymized patient data from various hospitals to enhance the drug development process.

One can elicit creative results from the model by crafting skillful prompts and insightful queries. Prompt engineering skills must be acquired.

There should be adequate compute power; AI needs advanced Nvidia processors. More compute power means faster iteration cycles and larger training runs with more parameters.

There should be AI research and development. We are still far from AI that equals human capabilities, though over time this gap will narrow.

The excitement for AI today resembles the excitement of the early days of IT adoption. Later, IT became standardised. The issue then was, 'Does IT matter?' IT had become commoditised and could no longer give a competitive advantage. Later the issue was reframed: IT does matter. Here, mere adoption of AI is not enough; to make AI competitive, we will have to consider the issues discussed above.

In the fast-changing world of technology, the technology advances faster than the regulatory response. In fact, the first driving licences were issued two decades after cars navigating American streets had become a common sight. There is a gap between the pace of innovation and the regulatory response, and it is ever widening.

Some unscrupulous businessmen develop whole business models around 'regulatory arbitrage': profitable areas of business to which nobody had turned any attention are chosen, the businesses become a fait accompli, and regulators find it hard to wish them away.

In the case of generative AI, there is an unprecedented challenge: no one knows why the models generate a particular answer. AI as an innovation could dwarf any other innovation since the transistor was invented. Politicians are already concerned about AI-generated fake news, which has degraded even trust in real things.

The New York Times — OpenAI Case

As we have already observed, the New York Times has filed a copyright violation suit against OpenAI and Microsoft. Here we shall discuss certain additional points with respect to this suit.

It is felt that the models indulge in memorization of copyrighted material. That brings another term to centre stage: approximate retrieval. The model does not repeat exactly the same information that has gone into it. The crux of approximate retrieval lies in the fact that LLMs do not fit the mould of traditional databases, where precision and exact matches are paramount. LLMs operate like n-gram models and inject an element of unpredictability into the retrieval process.

Prompts are not keys to a structured database; they serve merely as cues for the model to generate the next token based on the context. LLMs do not promise exact retrieval.

The lawsuit revolves around the issue of memorization. LLMs do not, as a rule, reproduce text verbatim; however, the extensive context window and network capacity leave room for potential memorization, which becomes unintended plagiarism. If prompted again and again, an LLM could generate exact sentences. Fine-tuning, moreover, can turn the LLM's task into memory-based retrieval, with no autonomous planning on its part. The expanded context window makes memorization even worse.
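The "approximate retrieval" behaviour can be illustrated with a toy bigram model, the simplest kind of n-gram model (real LLMs are vastly larger, and the training sentences here are invented). A prompt acts only as a cue for the next token, yet a phrase seen often enough in training is effectively memorized:

```python
from collections import Counter, defaultdict

# Toy training text: "the cat sat" appears repeatedly, so it gets memorized.
text = "the cat sat . the cat sat . the dog ran .".split()

# Count bigram transitions: word -> Counter of the words that followed it.
bigrams = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    bigrams[prev][nxt] += 1

def next_token(prompt_word):
    """Greedy generation: return the most frequent continuation (a cue, not a key)."""
    return bigrams[prompt_word].most_common(1)[0][0]

print(next_token("cat"))  # "sat", the memorized continuation
print(next_token("the"))  # whichever word followed "the" most often in training
```

The model never stores the training sentences as records to be looked up, yet repeated material re-emerges through the learned statistics, which is the tension at the heart of the memorization argument.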

In legal discussions, the focus should be on LLMs' inability to achieve exact retrieval; it is a defence against copyright violation.

LLMs may behave both as memorization devices and as generation devices in their own right. In news generation, if an LLM is too creative, it can produce fake or inaccurate news; if it offers exact news, it violates copyright. It is a dilemma. Another concept has emerged: retrieval-augmented generation (RAG), which gives LLMs a structured approach to information retrieval. It strikes a balance between the spontaneity of LLMs and disciplined traditional search-engine methods (to reduce hallucinations).

The material from the NYT is converted into embeddings and stored in vector databases, which facilitates RAG, or retrieval-augmented generation.
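Retrieval-augmented generation can be sketched in outline: embed the query, retrieve the closest stored passages, and prepend them to the prompt before generation. Everything below is invented for illustration; the word-overlap "embedding" stands in for a learned embedding model, and a real system would pass the augmented prompt to an actual LLM.

```python
# Stand-in "embedding": a bag of words instead of a learned dense vector.
def embed(text):
    return set(text.lower().replace("?", "").replace(".", "").split())

def similarity(a, b):
    # Jaccard overlap as a stand-in for cosine distance between embeddings
    return len(a & b) / max(1, len(a | b))

passages = [
    "The refund window is 30 days from delivery.",
    "Standard shipping takes 5 to 7 business days.",
]

def rag_prompt(question, k=1):
    """Retrieve the k most relevant passages and build an augmented prompt."""
    q = embed(question)
    ranked = sorted(passages, key=lambda p: similarity(q, embed(p)), reverse=True)
    context = "\n".join(ranked[:k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("How long is the refund window?"))
```

Because the model is asked to answer from the retrieved context rather than from its own parameters alone, RAG curbs both failure modes described above: invention of facts and verbatim regurgitation of training data.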

Given the way n-gram-style models work, the possibility of an exact NYT article being reproduced unaltered is remote. The case is sustained only if there is actual lifting of NYT articles, and if that lifting dents the NYT's revenue.

AI to Detect Cancer

Tata Memorial Hospital, a leading cancer hospital in Mumbai, and the Indian Institute of Technology, Bombay (IIT-B) are collaborating to develop radiomics, a technique for extracting essential information from medical scans that is not easily discernible by the human eye.

Advanced algorithms and deep learning can facilitate early diagnosis of cancer by analyzing medical data. The detection tool will help avoid unnecessary chemotherapy for predicted non-responders.

To train the model, data is being collected from AIIMS Delhi, the Rajiv Gandhi Cancer Research Centre, Delhi, and the PG Institute of Medical Education and Research, Chandigarh, which provide medical scans of cancer cases. A bioimaging bank is being built to facilitate the deep learning efforts.

Mostly the data is in the form of slides from medical tests, used to help diagnose the disease and develop treatments. The human eye cannot always detect tumors. There are identification factors, e.g. texture analysis, elastography to check organ stiffness, and tumor hardness. The biobank will indicate prognosis directly from images using specialized algorithms, called prognostication or prediction algorithms. These anticipate a tumor's aggressiveness, the speed of the immune system's response and the patient's chances of survival (from CT or MRI scans). The final diagnosis and treatment decisions will still be made by an experienced oncologist.

AI models process the information obtained from radiological and pathological images. They employ ML to recognize unique features associated with different types of cancer, and assess changes in tissues and potential malignancies. This results in early detection of cancer.

The images are segmented and annotated. Segmentation outlines tumors and identifies areas with different features, which are then annotated as malignant, inflammatory or edematous (swollen with fluid). Biopsy results, histopathology and immunochemistry data are correlated. Diverse algorithms are developed by aligning genomic sequences with images and clinical data.

IIT Bombay contributes the graphics processing units (GPUs) for algorithm testing.

Tata Memorial has avoided radiation exposure on the basis of image analysis. Chest-related conditions of ICU patients are diagnosed from images with 98 per cent accuracy; chest X-rays are run through AI algorithms to identify pathologies such as nodules and pneumothorax.

The AI tool will enable oncologists to diagnose cancer with a simple click. Even GPs will be able to diagnose complex cancers.

AI keeps learning continuously, which improves accuracy.