Blog

  • Data Run Out for LLMs

    Epoch is a non-profit research institute hosted by San Francisco-based Rethink Priorities and funded by proponents of effective altruism. It researches the supply of training data for LLMs. Its new study, released in June 2024, predicts that the supply of data available for training will be exhausted roughly by the turn of the decade, sometime between 2026 and 2032. Data has, in effect, become a finite resource, and there is going to be a 'gold rush' for it.

    As LLMs are trained on ever more data, they get smarter. The amount of text fed into LLMs has been growing about 2.5 times every year. LLMs require two key ingredients: vast stores of internet data and computing power. Computing power has grown about 4 times per year. Meta's (Facebook's) latest model, Llama 3, was trained on up to 15 trillion tokens.
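
    To see why the window is so near, consider a back-of-the-envelope projection, sketched below in Python. Only the 2.5x annual growth rate comes from the text above; the stock of usable public text (about 300 trillion tokens) and the starting demand (15 trillion tokens, the Llama 3 figure) are illustrative assumptions, not the study's exact numbers.

      # Back-of-the-envelope projection of when training demand outgrows
      # the stock of public text. The 2.5x growth rate is from the article;
      # the ~300T-token stock and the 15T-token starting demand are
      # illustrative assumptions.
      stock = 300e12           # assumed usable public text, in tokens
      demand = 15e12           # tokens consumed by a frontier model in 2024
      year = 2024
      while demand < stock:
          year += 1
          demand *= 2.5        # text fed into LLMs grows ~2.5x per year
      print(year)              # prints 2028, inside the 2026-2032 window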

    AI companies keep signing deals with publishers to meet their data requirements, securing a steady flow of sentences from Reddit forums and news media outlets. In the long term, though, there will not be enough blogs, news articles and social media comments to sustain the current trajectory.

    AI companies may then try to tap private data such as emails and text messages, or rely on less reliable synthetic data generated by chatbots themselves. OpenAI is experimenting with generating large amounts of synthetic data for training. However, LLMs require high-quality data, and synthetic data tends to be of lower quality.

    At present, models are scaled up to expand their capabilities and improve the quality of their output. Researchers had once predicted 2020 as the cut-off year for high-quality text data; new techniques were then employed to make better use of existing data, and models are sometimes trained on the same sources multiple times ('overtraining').

    It is debatable how concerned we should be about this data paucity. Is it necessary to keep training larger and larger models? Smaller models can instead be trained for specialized areas and specific tasks.

  • Generative AI and Predictive AI

    Generative AI is of recent genesis, while predictive AI has been around for quite some time. Generative AI models create new content based on the patterns and data they have been trained on. Predictive AI, on the other hand, forecasts outcomes based on historical data.

    Generative AI models hallucinate: they generate plausible but factually incorrect outputs. These models also exhibit bias, which they derive from their training data.

    Predictive AI is used in finance and medicine, for tasks such as stock price prediction, patient diagnosis and customer behaviour analysis. There are challenges here too. Such models often fail to generalize to unseen data, leading to inaccurate predictions, and the quality of the historical data limits their performance. Mispredictions in the field of medicine can be especially costly. There is also confusion between correlation and causation: two quantities may move in tandem without one causing the other, and treating such a correlation as causal leads to errors.
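
    The correlation trap can be shown with a minimal sketch: two series that merely share an upward trend correlate strongly even though neither drives the other. The data here is purely synthetic and illustrative.

      # Two unrelated series that share a time trend correlate strongly,
      # even though neither causes the other. Illustrative synthetic data.
      import numpy as np

      rng = np.random.default_rng(0)
      t = np.arange(100)
      ice_cream_sales = 10.0 * t + rng.normal(0, 50, size=100)  # trends up
      drownings = 0.3 * t + rng.normal(0, 2, size=100)          # also trends up

      r = np.corrcoef(ice_cream_sales, drownings)[0, 1]
      print(f"correlation: {r:.2f}")  # high, yet there is no causal link;
      # both series are driven by a third factor (the shared time trend)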

    Both types of models require extensive training effort and continuous improvement, but the nature of this training differs. Generative AI requires diverse and well-curated datasets, often with human feedback loops. Predictive AI needs clean, relevant and comprehensive historical datasets. It is necessary to identify and engineer the right features that the model will use to make predictions, which requires domain expertise.
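
    What that feature engineering looks like in practice is sketched below with scikit-learn. The churn-prediction task, column names and data are all invented for illustration.

      # Hand-crafted features for a predictive model. The churn task and
      # every column name/value are invented for illustration.
      import pandas as pd
      from sklearn.linear_model import LogisticRegression

      df = pd.DataFrame({
          "last_purchase_days": [3, 40, 7, 90, 15, 60],
          "total_orders":       [12, 2, 8, 1, 5, 3],
          "churned":            [0, 1, 0, 1, 0, 1],
      })
      # Domain expertise enters here: a derived 'days per order' feature
      # that captures how slowly a customer buys.
      df["days_per_order"] = df["last_purchase_days"] / df["total_orders"]

      X = df[["last_purchase_days", "total_orders", "days_per_order"]]
      model = LogisticRegression().fit(X, df["churned"])
      print(model.predict(X))  # in-sample predictions for the toy data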

    Google DeepMind has used techniques such as generative adversarial networks (GANs). It has also explored self-supervised learning.
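
    For a flavour of what a GAN involves, here is a minimal PyTorch sketch of the general technique (an illustration, not DeepMind's code): a generator learns to mimic a simple Gaussian distribution while a discriminator learns to tell real samples from generated ones.

      # Minimal GAN: generator G maps noise to samples; discriminator D
      # estimates the probability that a sample is real. All sizes and
      # hyperparameters are arbitrary choices for the demo.
      import torch
      import torch.nn as nn

      G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
      D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(),
                        nn.Linear(16, 1), nn.Sigmoid())
      bce = nn.BCELoss()
      opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
      opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

      for step in range(2000):
          real = torch.randn(64, 1) * 2 + 3      # "real" data: N(3, 2)
          fake = G(torch.randn(64, 8))           # generated samples

          # Train D: label real as 1, fake as 0 (detach so G is untouched).
          d_loss = bce(D(real), torch.ones(64, 1)) + \
                   bce(D(fake.detach()), torch.zeros(64, 1))
          opt_d.zero_grad(); d_loss.backward(); opt_d.step()

          # Train G: try to make D label its samples as real.
          g_loss = bce(D(fake), torch.ones(64, 1))
          opt_g.zero_grad(); g_loss.backward(); opt_g.step()

      # After training, generated samples should drift toward mean 3.
      print(G(torch.randn(1000, 8)).mean().item())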

  • 2900th Write-up. Competition between Nvidia and AMD

    Computex, the world's largest computing conference, was held in Taipei in the first week of June 2024. Jensen Huang, Nvidia's CEO, and Lisa Su of Advanced Micro Devices (AMD) both participated in the conference.

    Both Microsoft and OpenAI rely on Nvidia's accelerators to build generative AI services. Nvidia proposes to introduce in 2026 a chip called Rubin, named after Vera Rubin, the American astronomer whose work helped establish the existence of dark matter. Rubin will succeed the Blackwell family.

    Nvidia sells its clients a fully proprietary system consisting of chips, networking gear and other paraphernalia to run advanced AI development in data centers.

    AMD's Lisa Su introduced the Ryzen AI 300 processors, announced new AI chips for 2025 and 2026, and showed off Copilot+ computers.

    The presentations were back-to-back, and there is a personal element to the competition too: both Huang and Su are Taiwanese and are distant relations. That, however, does not make either cede ground to the other.

    Nvidia considers AI the new industrial revolution and expects to play a vital role as the technology shifts to PCs. Nvidia dominates the field, and in the short run it is difficult for rivals to challenge that dominance.

    The upcoming Rubin platform will use HBM4, the next iteration of the essential high-bandwidth memory whose supply has been a bottleneck in accelerator production. Rubin and Rubin Ultra indicate Nvidia's cadence of innovation: it relentlessly pushes the technology forward and strengthens its market position.

    Su, too, presented a strong supporting lineup.

  • Data Protection Board

    Under the Digital Personal Data Protection (DPDP) Act, 2023, it is proposed to form the Data Protection Board (DPB), an adjudicatory body to govern data-related matters. It will work in co-operation with other regulatory bodies such as the RBI and TRAI. Under the Act, Rules will be framed to spell out the procedure for appointing the Board's members and its chairperson.

    The selection committee should have members from the judiciary, the executive and the legislature. It can have representatives from the concerned ministries and people drawn from civil society, business and industry.

    The Board can have full-time and part-time members, and there should be diversity in its membership. The part-time members could be drawn from retired parliamentarians, business, industry, researchers and civil society.

    It should have a tiered structural design, with a bottom-up approach in which tasks and responsibilities are mapped and suitably calibrated across the tiers.

    It can have an advisory expert council as part of its tiered structure, along with a research wing that aids the council.

    The Board must have a supporting office and staff, and it can hold competitive public examinations to recruit them.

    There should also be a consumer dispute redressal wing to allow individuals to raise their complaints.

  • AI and Coding

    The estimated number of coders in India is 26 million, which is likely to reach 30-35 million in five years. As AI technologies evolve, they will automate routine coding tasks, and some entry-level coding roles may change or diminish as a result.

    Advanced AI tools (such as the recent arrival Devin) can take on entire coding projects and handle them from start to finish. However, there will always be a need for human intervention, oversight and creativity in software development.

    At present, AI helps coders enhance their job performance: it frees them up to focus on designing architectures, solving complex problems and improving the user experience.

    Tools like GitHub Copilot, Microsoft 365 Copilot and Replit help coders by suggesting the next line of code, correcting syntax, fixing bugs and running debugging on top of the code. They accelerate the coding process.

    AI is also used to examine code for possible errors and make corrective suggestions (for instance, DeepCode). Debugging can be like finding a needle in a haystack; debuggers infused with AI can examine the code, find problems and suggest solutions.
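
    The kind of fix such a tool proposes can be illustrated with a small sketch. The buggy function and the suggested correction below are invented for illustration.

      # The sort of defect an AI-assisted reviewer flags: an off-by-one
      # bug. The function and the fix are invented for illustration.
      def average(numbers):
          total = 0
          for i in range(len(numbers) - 1):  # BUG: skips the last element
              total += numbers[i]
          return total / len(numbers)

      # Suggested fix: sum every element.
      def average_fixed(numbers):
          return sum(numbers) / len(numbers)

      print(average([2, 4, 6]))        # 2.0 (wrong)
      print(average_fixed([2, 4, 6]))  # 4.0 (correct)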

    AI also generates docstrings, adds comments, formats code and creates unit tests. It can prepare a blueprint of an initial version, which the coder can then edit. A simple UI/UX design can also be introduced in the code.
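
    As an illustration of AI-generated documentation, a tool given a bare function might propose something like the docstring below. The function itself is invented for the example.

      # A small function plus the kind of docstring an AI assistant would
      # draft for it. The function is invented for illustration.
      import re

      def slugify(title: str) -> str:
          """Convert a title into a URL-friendly slug.

          Lowercases the input, replaces runs of non-alphanumeric
          characters with single hyphens, and strips any leading or
          trailing hyphens.

          >>> slugify("Hello, World!")
          'hello-world'
          """
          return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")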

    AI cannot substitute for the nuanced understanding and inventiveness of human programmers. It does not understand emotions, cultural context or values, all of which matter and have to be built into human-centric applications.

    AI can also facilitate code testing, a time-consuming task, by creating tests automatically; these tests can then be run to search for flaws.
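
    Automatically generated tests for the slugify function above might look like the following pytest sketch; the chosen cases are, again, illustrative.

      # Auto-generated-style tests (runnable with pytest). The function
      # under test and the cases are illustrative.
      import re

      def slugify(title: str) -> str:
          return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

      def test_basic_title():
          assert slugify("Hello, World!") == "hello-world"

      def test_collapses_runs_of_punctuation():
          assert slugify("a -- b") == "a-b"

      def test_empty_input():
          assert slugify("") == ""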

    AI can automate laborious processes. Its use could be compared to robotic process automation (RPA), which revolutionized the BPO business by handling call-related work, enhancing employee productivity and making the market competitive.

    In a large application, small pieces can be automated. However, hard design problems and complex issues do crop up where human intervention is necessary.

  • Working of GPTs: A Mystery

    In a recent interview in Geneva, Sam Altman, OpenAI's CEO, spoke about AI safety. He was, however, reticent about how the GPT models work. According to him, OpenAI has not solved the interpretability or explainability of the model, that is, how AI and ML systems make their decisions.

    If the working of LLMs is not understood, is it right to release new and more powerful models? Altman dodged the question at first, finally answering that even in the absence of full understanding, these AI systems are generally considered safe and robust. He elaborated with the example of the human brain: we do not fully understand its processes at the neuron-to-neuron level, yet we follow certain rules and can ask others what lies behind their thinking.

    Altman referred to a black-box quality, a sense of mystery behind the functionality. Generative AI models (like human brains) create new content based on existing datasets, and they supposedly learn over time. GPT may not have emotional intelligence or human consciousness, and it is difficult to understand how either algorithms or the human brain arrive at the conclusions they draw.

    Early in May 2024, OpenAI released GPT-4o and announced that it is working on the next model, which is anticipated to take us closer to AGI.

    While pursuing this iterative development, OpenAI faces questions of safety, especially after its safety team was disbanded. Altman has indicated the formation of a new safety and security committee.

  • Women in Tech

    Too few women are involved in AI research and in building AI models. That results in software that reinforces stereotypes about women. Just 18 per cent of AI staff at OpenAI are women: though the company's Chief Technology Officer, Mira Murati, is a woman, only 122 of its 686 staff whose job it is to build AI systems are women.

    Women and ethnic minorities do not play a major role in building critical technology, so many biases about them sneak into the systems. Amazon's recruitment algorithm was biased against women candidates because it was trained on the CVs of male candidates. Algorithms scanning chest X-rays underdiagnose female patients because they are trained on data that skews male.

    Image generators make women appear more sexualized. Stable Diffusion produces more images of men than of women; men are shown in high-end jobs, whereas women are shown in low-end jobs. There is a legacy of objectifying women. In academia too, women are underrepresented, making up just 12 per cent of ML researchers.

    There should be a level playing field. Many more women should come into STEM industries. One way to fix the problem is to be careful about the data that is used to train the models. Another step is to recruit more female researchers.

  • AGI: A New Definition

    Can machines surpass human beings? This query has produced a lot of science fiction. However, the rapid rise of AI systems over the past decade or so has led experts to conjecture that science fiction may soon become fact. Nvidia, the third most valuable US company and a manufacturer of GPUs, is headed by Jensen Huang, who believes that artificial general intelligence (AGI) could be reached in five years.

    What precisely is AGI and how can we be sure that it has arrived?

    Huang's conjecture should be taken with a pinch of salt: his chips provide the compute power to AI companies, so he has a vested interest in promoting AI. Huang has, however, offered a definition of AGI. According to him, AGI is a program that can do 8 per cent better than most people at certain tests (such as bar examinations for lawyers or logic quizzes).

  • OpenAI Starts Training New AI Model

    In May 2024, OpenAI began training a new flagship AI model to succeed the GPT-4 model that powers ChatGPT. It would like the new model to possess 'the next level of capabilities' as it strives to build AGI, a machine that can match the human brain. The new model would be an engine for AI products (chatbots, digital assistants like Siri, search engines and image generators).

    OpenAI is creating a new Safety and Security Committee to explore how it could handle the risks posed by the new model and future technologies.

    The company would like to have a robust debate on capabilities and safety. OpenAI wants to accelerate the pace of research to move ahead of its rivals, but it is also conscious of the criticism that such a move could pose dangers for humanity.

    Experts differ on when tech companies will reach AGI, but the companies have been increasing the power of AI technologies for more than a decade, with a remarkable leap every two to three years. GPT-4, released in March 2023, enables chatbots to answer questions, write emails, generate term papers and analyze data.

    An updated version, called GPT-4o, was released in May 2024 but is not yet widely available. It can generate images and respond to questions and commands in a highly conversational voice. It can learn skills by analyzing vast amounts of digital data, including sounds, photos, videos, Wikipedia articles, books and news stories.

  • Autonomous Cars: UK, the New Testing Ground

    In one of his tweets, Musk predicted that Tesla cars would self-drive as well as humans by 2021. Ford had predicted that it would be selling cars without steering wheels by the same year. Both proved wrong. Many companies (Apple, Ford and Uber among them) tried to make AI-driven vehicles and failed. The industry slowed down, but all hope has not been lost: Google's Waymo, GM's Cruise and several Chinese firms still pursue autonomous car projects.

    Wayve, a UK-based startup, has raised finance for its self-driving software. Alex Kendall, Wayve's chief executive, was a Cambridge student doing research in deep learning. Cambridge has a legacy of AI breakthroughs (for instance, Alan Turing), but its spin-offs have struggled to commercialize cutting-edge research the way Silicon Valley does. Oxa, a driverless-car spinout from Oxford University, likewise sells self-driving software to enterprise customers. Wayve focuses on building software rather than manufacturing cars. To improve the technology, Wayve takes footage collected from cameras on its test-driven cars, and it plans to collect more footage through licensing deals with car manufacturers.

    The UK has passed a new law, the Automated Vehicles Act, which will allow driverless cars on British roads by 2026. The law addresses the problem of exaggeration by carmakers: it includes a section on 'communication likely to confuse as to autonomous capability', banning companies from creating such confusion and from claiming what they cannot deliver. The regulatory environment is now conducive to developing self-driving capability.

    The UK has lagged behind in the automobile industry. Its marquee brands Jaguar, Rolls-Royce and Bentley have been acquired by foreign companies, and car production has fallen by 50 per cent since 2016. If Wayve succeeds in powering many car companies to put autonomous vehicles on the roads, there could be a revival of the UK's car industry, and the regulatory environment supports this. Chinese companies too are in the fray, but the UK has AI expertise from some of the finest universities together with a friendly regulatory environment, and so can withstand the Chinese competition.