Checking the Fact Checkers

There are growing concerns about misinformation campaigns and fake news. Such communication affects not only individuals and groups but also governments and big corporates. It can also affect elections, company valuations, supply chains and individual reputations.

It is therefore necessary to check facts: a piece of information can be true or false, and it is branded so on a question of fact. This need has given rise to fact-checking individuals and organisations. Fact checkers aim to get closer to the truth, but they can have biases of their own, which shroud the very truth they seek to check.

Even wrong videos can be ‘verified’ as true, while there are ‘unverified’ videos showing Tesla cars catching fire. There could be wrong financial evaluations of companies by vested interests, say short sellers, bringing a run on the company. Fact checking is adjudication, and public interest is involved. Can it be outsourced to private parties, or must only the government regulate it? It is an issue of morality also. Since public interest is involved, legally only a judicial or quasi-judicial authority must decide it. The only requirement is that the regulation governing the fact-checking authority must be fair, and the findings of the fact-checking authority are subject to challenge in a court of law.

Decisions in this area have to respect freedom of speech and personal liberty.

Private players also do fact checking. Here the issue is: who will check the checkers? Some of them build a reputation over a period of time and become trustworthy.

This is an era in which it is both very easy and very difficult to verify. There are predatory attacks on individual and corporate reputations. This has not changed since the era of the Mahabharat, where Yudhishthir whispers ‘naro va kunjaro va’ (a man, or an elephant?) as the news of Ashwatthama’s death is spread.

Validation of responsible private checkers is required.

Neurotechnology on Steroids

Neurotechnology, the technology that reads and manipulates brain signals, is taking rapid strides. As it affects human rights, it requires global regulation (UNESCO).

These days computers are interfaced with human brains, with artificial intelligence (AI) used to analyse the neural activity.

AI-assisted neurotechnology means putting neurotechnology on steroids (Mariagrazia, Report on Innovation in Neurotechnology).

AI-assisted neurotechnology improves the lives of people living with disabilities. It is used to treat cerebral ailments and to diagnose brain-related disorders, at times through implants.

If the technology is abused, it affects human rights and freedom. It can affect our identity, autonomy, privacy, sentiments, behaviours and overall well-being.

We have reached a point where the very essence of being a human can be changed.

The field is attracting substantial private investment, including Elon Musk’s Neuralink. Scientific papers and patents in this area have multiplied of late.

Non-invasive devices have already been used to decode information from the brain. There is a need to protect mental privacy.

Corporates take ownership of the data collected during such studies. In one experiment, implants in the cortex of mice made them see things that were not really there, in effect hallucinations. What may become possible in future should be discussed now, for here part of the cerebral activity happens outside the brain.

No one objects to neurotechnology as such; it has the potential to reduce death and disability. At the same time, a globally coordinated effort should be made to regulate it.

How LLMs work

‘The quick brown fox’ is the input sequence. It is vectorized, and the vectors are used to calculate the attention weights. The attention weights are used to create a weighted sum of the encoder’s hidden states, which is passed to the decoder. The attention mechanism focuses on the words ‘quick’ and ‘brown’. The output vocabulary is used to generate a probability distribution over the possible next words, and the word ‘jumps’ is predicted as the next word. While deciding the probability of each word, the output of the attention mechanism is used. The decoder repeats this process: the predicted word is appended to the input sequence and used for the next iteration. The process continues until the decoder predicts the end of the sequence.
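A minimal NumPy sketch of this flow, with toy vectors standing in for real learned embeddings (the matrices and the tiny vocabulary below are illustrative assumptions, not actual model weights):

```python
# Dot-product attention over encoder states, then a distribution over
# a toy vocabulary. All values here are random stand-ins.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["the", "quick", "brown", "fox", "jumps", "<eos>"]
d = 4                                      # embedding size
rng = np.random.default_rng(0)
H = rng.normal(size=(4, d))                # encoder hidden states, one per input word
query = rng.normal(size=d)                 # decoder state querying the encoder

scores = H @ query                         # dot-product score per input word
weights = softmax(scores)                  # attention weights, summing to 1
context = weights @ H                      # weighted sum of the hidden states

W_out = rng.normal(size=(d, len(vocab)))   # projection onto the vocabulary
probs = softmax(context @ W_out)           # distribution over possible next words
print(vocab[int(probs.argmax())])          # greedy choice of the next word
```

In a real model the predicted word would be appended to the input and the loop repeated until the end-of-sequence token is produced.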

While training, the model can be subjected to masked language modelling: some words in the input sequence are masked out, and the model learns to predict the masked words. This helps it focus on the context surrounding the current word while predicting the next word.
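To see masked-word prediction in action, one can use the fill-mask pipeline of the Hugging Face transformers library with BERT, an encoder model trained on exactly this objective (the model choice is an illustrative assumption):

```python
# Masked language modelling in action.
# Requires: pip install transformers torch
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("The quick brown fox [MASK] over the lazy dog."):
    print(candidate["token_str"], round(candidate["score"], 3))
```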

After Vaswani et al.’s paper on the attention mechanism (‘Attention Is All You Need’, 2017), the transformer model came into use. The field has since shifted to the decoder-only transformer (2019). This transformer is less accurate and is used where accuracy is not as important. A decoder-only transformer takes the previous words in the sequence as input and produces a sequence of hidden states, which are used to predict the next word in the sequence, depending on the previous words. First a score for each token in the input sequence is computed; the score is based on how well the token matches the current state of the decoder. The tokens with the highest scores are then used to generate the next token in the output sequence. The most common scoring function is the dot product.

The attention weights are calculated for the hidden states. They indicate how much attention is to be paid to individual words in the sequence. The attention weights are used to combine the hidden states into a single representation, which is used to predict the next word in the sequence.

LLMs use deep learning techniques, analyzing and learning from vast amounts of text data to learn the relationships between words and phrases. Adapting such a pre-trained model to a specific task is called ‘transfer learning’. In training, LLMs process vast amounts of text, learning its structure and meaning. They are trained to identify meanings and relationships between words, and they deal with large swaths of text to understand context better.

Vectorised models can use a distributed representation, where different words with similar meanings have similar representations and are close in vector space.
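A toy sketch of what ‘close in vector space’ means, using hand-made vectors and cosine similarity (real embeddings are learned from data, so the numbers below are purely illustrative):

```python
# Cosine similarity between toy word vectors: similar meanings should
# score higher than unrelated ones.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}
print(cosine(vec["king"], vec["queen"]))  # high: similar meanings
print(cosine(vec["king"], vec["apple"]))  # lower: unrelated meanings
```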

A unigram model treats each word in a sentence independently. A bigram model estimates the probability of each word in a phrase conditioned on the previous word. A trigram model considers the two previous words, and an n-gram model considers the previous n-1 words of context.
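A bigram model can be sketched in a few lines by counting word pairs in a toy corpus (the corpus, naturally, is illustrative):

```python
# A minimal bigram model: P(word | previous word) estimated by counting
# adjacent pairs.
from collections import Counter, defaultdict

corpus = "the quick brown fox jumps over the lazy dog".split()
pair_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    pair_counts[prev][word] += 1

def bigram_prob(prev, word):
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "quick"))  # 0.5: 'the' is followed by 'quick' or 'lazy'
```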

Expensive LLMs

First of all, the total cost of building an LLM consists of the man-hours of highly talented manpower, the expensive chips used to train the model, and various operational costs. We leave aside the fixed overheads here.

LLMs, and even smaller models, are expensive to train and deploy. First there is the hardware cost. The training prerequisite is GPUs (graphics cards). Nvidia’s A100, which is commonly used, costs about $10,000, and the computation requires tens of thousands of such GPUs. A GPT-3 model has 175 billion parameters; on a single GPU, training it would take some 285-plus years of computation. To make it manageable, OpenAI used thousands of GPUs in parallel to reach its computation goals. According to one estimate, OpenAI may have used more than 30,000 GPUs to commercialise ChatGPT, which would have cost $30 million. While integrating it into Bing, Microsoft could have spent over $4 billion on hardware. By similar estimates, a Bard-powered Google could cost Alphabet $100 billion.

The running costs of such a model are also very high. An exchange with an LLM costs several times more than a search on a search engine, and a ChatGPT-like model receives millions or billions of queries daily. Basic running costs are thus steep even for organisations with deep pockets. According to one estimate, OpenAI spends about $700,000 per day to run ChatGPT.

The salaries of talented manpower (with compensation above a million dollars per annum) are another cost component. Skilled talent comes at a premium.

There is an environmental cost on account of carbon emissions.

There is constant research on LLMs. There are data collection costs, electricity costs and a host of administrative costs.

All these costs take the best models out of the reach of the public. All these factors are antithetical to mass adoption. Even Big Tech will have to ration the services.

Not-for-profit models are not sustainable.

Organisations must give serious thought to how these models can be monetised. OpenAI has converted itself into a for-profit entity.

Research will have to focus on reducing the training cost and the hardware cost. Already, organisations produce their own chips: Google’s TPU (Tensor Processing Unit), Amazon’s Inferentia and Trainium, Meta’s MTIA, and so on. There should also be research on computer memory.

Paucity of Training Data for LLMs

LLMs are fed on massive data. Stuart Russell of the University of California, Berkeley, feels that soon there will be no data left to ingest and that bots like ChatGPT may hit a brick wall. In the near future, the whole field of generative AI may be adversely affected by a paucity of data. It is this anxiety that compels companies to resort to data harvesting. The data collection processes are, as it is, on the radar of those whose copyrighted material is being used, and much of the collection is done without consent. The most worrying factor is the shortage of data: by one estimate, all high-quality data could be exhausted by 2026. Such high-quality data is sourced from books, news articles, scientific papers, encyclopedias and web content.

OpenAI has bought datasets from private sources, from which we can infer that there is an acute shortage of high-quality data.

GPT-4 has been created using public as well as private data. OpenAI has, however, not revealed the sourcing of data for GPT-4.

Sam Altman, CEO of OpenAI, has no plans to offer an IPO, as there could be conflicts with investors in view of the company’s unorthodox structure and decision-making.

Fine Tuning an LLM

Pre-trained LLMs are retrained on specific datasets; this is called fine tuning the model. It readies the model for specific content and present needs: say, you train an LLM on medical datasets to help diagnose a specific disease. An LLM can be made more specific to a particular domain by fine tuning, i.e. training the model on a smaller but targeted dataset relevant to the desired task or subject matter.

A fine-tuning API can be used to fine tune a model online. If the weights of the model are open source, fine tuning can be done on our own premises. Hugging Face offers an easy AutoTrain feature: you can select the parameters on your own (manually), or use automatic parameter selection, which picks the best parameters for the task.
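As a hedged sketch of on-premises fine tuning, here is the Hugging Face Trainer API (rather than AutoTrain) applied to an open-weights model; the model name, the data file and the hyperparameters are illustrative assumptions:

```python
# Minimal causal-LM fine tuning with Hugging Face transformers.
# Requires: pip install transformers datasets torch
# 'medical_notes.txt' is a hypothetical domain dataset.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"                        # any open-weights causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "medical_notes.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-model", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```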

A fine-tuned model is evaluated on three criteria: perplexity, accuracy and F1 score. Perplexity refers to how well the model predicts the next word in a sequence; the lower the perplexity score, the better the model predicts the next word. Accuracy refers to how well the model performs on a given task, given by the number of correct predictions divided by the total number of predictions. The F1 score refers to how well the model performs on binary classification tasks; it is the harmonic mean of precision and recall.
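A small worked example of the three measures, with toy numbers standing in for a real evaluation run:

```python
# Perplexity, accuracy and F1 on toy data.
import math

# Perplexity: exp of the average negative log-likelihood the model
# assigns to the correct next tokens.
token_probs = [0.4, 0.25, 0.6, 0.1]
perplexity = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Accuracy and F1 for a toy binary classification run.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
precision = tp / sum(y_pred)
recall = tp / sum(y_true)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(round(perplexity, 2), round(accuracy, 2), round(f1, 2))
```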

If fine tuning is to be done for a different application, one has to repurpose the model with a small change in its architecture. Here the embeddings produced by the transformer part of the model are used (embeddings are numerical vectors).

In repurposing, the model’s embedding layer is connected to a classifier model, e.g. a set of fully connected layers. The LLM’s attention layers are frozen, i.e. not updated, which saves compute costs. The classifier is trained on a supervised learning dataset.
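A PyTorch sketch of this repurposing, assuming an open encoder model (the model name and the two-class head are illustrative):

```python
# Freeze a pre-trained transformer and train only a small classifier
# head on its embeddings. Requires: pip install transformers torch
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
backbone = AutoModel.from_pretrained("bert-base-uncased")
for param in backbone.parameters():
    param.requires_grad = False            # attention layers stay frozen

classifier = nn.Sequential(                # small fully connected head
    nn.Linear(backbone.config.hidden_size, 128),
    nn.ReLU(),
    nn.Linear(128, 2),                     # e.g. two classes
)

inputs = tokenizer("A sample sentence.", return_tensors="pt")
with torch.no_grad():
    embedding = backbone(**inputs).last_hidden_state[:, 0]  # [CLS] vector
logits = classifier(embedding)             # only the head receives gradients
```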

In some instances, the parameter weights of the transformer are updated; the attention layers are not frozen and fine tuning covers the entire model. This is computationally expensive.

To update a model knowledge-wise, say with medical literature, an unstructured dataset is used and the model is trained through unsupervised or self-supervised learning. Foundation models are trained this way.

At times, more than knowledge upgradation, an LLM’s behaviour is to be modified. Here a supervised fine tuning (SFT) dataset is used: a collection of prompts and the responses elicited. This is also called instruction fine tuning.

Some organisations use reinforcement learning from human feedback (RLHF), taking SFT to the next level. It is an expensive process: human reviewers and auxiliary models are needed, so only well-equipped AI labs can afford it. RLHF brings humans into the loop.

Research is also directed at parameter-efficient fine tuning (PEFT), e.g. low-rank adaptation (LoRA).
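A brief sketch of LoRA using the Hugging Face peft library; the rank, scaling and target modules below are illustrative choices, not tuned values:

```python
# LoRA: inject small low-rank update matrices and train only those.
# Requires: pip install peft transformers torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of all weights
```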

Some models cannot be fine tuned, especially models available only through an API. At times there is not sufficient data, or the data changes frequently, or the application is dynamic or context-sensitive. Here one can use in-context learning or retrieval augmentation.

From MCP to Dr. Licata on Neural Networks

In 1943, Warren McCulloch, an American neurophysiologist and cybernetician at the University of Illinois, Chicago, and the psychologist Walter Pitts published the paper ‘A Logical Calculus of the Ideas Immanent in Nervous Activity’, describing the ‘McCulloch-Pitts’ (MCP) neuron. It was the first mathematical model of a neural network.

They described brain functions in abstract terms, and showed that simple elements connected in a neural network can have immense computational power.

The paper received little attention at first, but its ideas were applied by John von Neumann, Norbert Wiener and others.

The MCP paper was a pioneering work in artificial intelligence (AI) and cognitive science, and a core event in computer science. In its view, the brain is a neural network and the mind is a product of its functional properties.

A biological neuron takes an input signal (dendrite), processes it like a CPU (soma), and passes the result through a cable-like structure to other connected neurons (axon to synapse to another neuron’s dendrite). There is a lot more to the functioning of a biological neuron, but broadly what happens in our brain is that there is an input, there is processing, and there is an output. The sensory organs send the input that activates a neuron, and decision making is actually done by a number of neurons acting together.

The human brain consists of an interconnected network of 10^11 (100 billion) neurons. The connections are complex.

The output of these processes is passed on to the next layers in a hierarchical manner. There is division of work: a neuron may perform a certain role in response to a certain stimulus, and each layer has its own role and responsibility. Some functions, e.g. face recognition, may involve many layers.

MCP designed a network of nodes, each with a part that takes an input and a part that makes a decision. The neuron computes Boolean functions: the inputs are Boolean and the output is also Boolean.

The MCP neuron is a binary neuron, with either an active or an inactive state. Its activation is determined by the sum of the inputs it receives: if the sum is greater than a certain threshold, the neuron fires and becomes active; if the sum is less than or equal to the threshold, it remains inactive. MCP neurons can represent any logical expression; logical functions such as AND, OR and NOT can be implemented with MCP neurons.
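The MCP neuron is simple enough to write out directly; the threshold choices below show how AND, OR and NOT fall out of the same unit:

```python
# The MCP neuron: binary inputs, a fixed threshold, binary output.
def mcp_neuron(inputs, threshold):
    return 1 if sum(inputs) > threshold else 0

def AND(a, b):   # fires only when both inputs are on
    return mcp_neuron([a, b], threshold=1)

def OR(a, b):    # fires when at least one input is on
    return mcp_neuron([a, b], threshold=0)

def NOT(a):      # an inhibitory input: fires only when the input is off
    return mcp_neuron([-a], threshold=-1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), NOT(a))
```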

Brain simulation is problematic because of the complexity of the brain’s structure: 100 billion neurons and 1,000 trillion synaptic interconnections. Besides, communication in the brain is not digital but electrochemical, with interrelated timing and analogue components. Simulation of the brain is beyond today’s technological reach.

Neural networks roughly resemble the structure of the brain. The architecture is arranged into layers, and each layer has processing units called nodes, which are in turn connected to nodes in the layers above or below. Data fed into the lowest layer is passed on to the next layers. Artificial neural networks are fed huge amounts of data and are designed to function like biological neural networks, though the brain’s functioning is much more complex.

Real neurons do not compute their output by summing up weighted inputs, and they do not remain on until the inputs change. Their output might encode information using pulse arrangements.

Dr. Licata published a paper in the Journal of Computer Science and Biology questioning whether artificial neural networks are good models of the human mind.

According to Dr. Licata, they are not. However, this does not make them useless, since they do computation in parallel.

Modern science has yet to distinguish between the human mind and the brain. Research is needed on the concept of consciousness, and it is necessary to understand how thought emanates. Artificial feedback is unstable. The neurons in the brain that do thinking and planning have tree-like structures. It is not clear how the brain solves the credit assignment problem.

It is necessary to integrate research in neuroscience and AI; since the MCP paper of 1943, there has been very little such integration.

Meme Coins

Meme coins are a type of crypto inspired by an internet meme or viral image. The underlying technology is blockchain. As they are cryptos, they can be mined, and they can be bought and sold at crypto exchanges.

Meme coins are generally associated with culture and community; some are created simply for fun or recreation. Cryptos, as we know, are distinguished by their technical features, whereas meme coins are known for their association with a meme.

Some popular meme coins are Dogecoin (DOGE), Shiba Memu (SHMU), Pepe Coin (PEPE) and Wojak Token (WOJAK).

The key difference between meme coins and cryptos such as Bitcoin and Ethereum is utility. Meme coins are meant for investors to make fast money. Bitcoin has a limited supply of 21 million; once it reaches the limit, people can no longer mine Bitcoin. This drives demand and price, making it an expensive buy. This is not the case with meme coins. Dogecoin has an unlimited supply, and Shiba Inu coin has a supply of one quadrillion. Dogecoin was priced on July 4, 2023 at about $0.068; for $100, you could buy 1,463 Dogecoins. The proposition is attractive for the young generation: an investment of a few dollars buys thousands of meme coins.

Meme coins have a significant presence on social media, and their prices are driven by their popularity.

As meme coins do not have any fundamental economic or business use, their prices are volatile and vulnerable to sentiment. At times, the creators or promoters of meme coins disappear with investors’ money; this is called a rug pull.

Governments have initiated efforts to rein in some meme coins.

Despite inherent risks, some meme coins have a strong following. Meme coins have an uncertain future.

Time and Cost Reduction in Pretraining LLMs

Training an LLM is very costly, ranging from $10 million to tens or hundreds of times that. Cost-wise, LLMs are thus out of reach for smaller organisations and research or academic groups, and it is necessary to revisit the current optimization methods for LLMs. Stanford researchers started working on this, aiming to cut the training time of these models in half. An LLM has millions or billions of parameters, and these parameters have curvature: the maximum achievable speed at which they progress towards the final goal of LLM pretraining. Curvature, in short, is the workload of the parameters in an LLM. Because curvature is expensive to estimate accurately, the curvature estimation step is often foregone while optimizing LLM pretraining.

The researchers noticed a possible inefficiency in previous methods that used parametric curvature estimation: the curvature estimates were updated at every step of optimization. They considered whether the process could be improved by decreasing the number of updates, and tested the idea by designing Sophia to estimate the parameters’ curvature only every ten steps. That proved a winning proposition.

Another trick tried was clipping. Inaccurate estimation of curvature increases the workload; clipping prevents that by setting a threshold, i.e. a maximum curvature estimate.
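A loose pseudocode sketch of these two ideas, modelled on the published Sophia update rule (the constants and the clipping range are illustrative, not the paper’s tuned values):

```python
# A Sophia-style step: an EMA of gradients, a diagonal curvature
# estimate refreshed only every k steps, and element-wise clipping of
# the preconditioned update.
import numpy as np

def sophia_step(theta, grad, m, h, t, curvature_fn,
                lr=1e-3, beta1=0.9, beta2=0.99, gamma=0.01, eps=1e-12, k=10):
    m = beta1 * m + (1 - beta1) * grad                     # smoothed gradient
    if t % k == 0:                                         # curvature only
        h = beta2 * h + (1 - beta2) * curvature_fn(theta)  # every k steps
    update = m / np.maximum(gamma * h, eps)                # precondition
    update = np.clip(update, -1.0, 1.0)                    # cap bad estimates
    return theta - lr * update, m, h
```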

Sophia was used to pretrain a small LLM on par with GPT-2. It reached optimization in half the number of steps and half the time, which means a substantial improvement in pretraining and a massive cost reduction.

In future, the researchers would like to experiment with a larger LLM using Sophia, and with other domains such as computer vision (CV) and multi-modal models.

As Sophia is open source, the experiment can be carried forward by others. Sophia is a new approach developed by Stanford researchers to train LLMs; the optimization algorithms previously used include Stochastic Gradient Descent (SGD), RMSProp and Adam.

Use of Calculus in Neural Networks

Calculus helps us understand the internal workings of different ML algorithms. One application of calculus in ML is the gradient descent algorithm, used along with backpropagation to train a neural network.

Backpropagation takes the error rate of forward propagation and feeds this loss backward through the layers of the neural network, with the aim of fine tuning the weights. This is the essence of neural net training.

The method calculates the gradient of the error function with respect to the weights of the neural network, and adjusts the randomly initialised weights and biases so as to reduce the error and produce the desired output.

The gradient of the loss function for a single weight is computed by the chain rule, one layer at a time.
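A tiny worked example of backpropagation via the chain rule, with one hidden layer and a squared-error loss (the shapes and learning rate are toy choices):

```python
# Manual backpropagation for a one-hidden-layer network in NumPy.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))        # one input sample, 3 features
y = np.array([[1.0]])              # target output

W1 = rng.normal(size=(3, 4))       # input -> hidden weights
W2 = rng.normal(size=(4, 1))       # hidden -> output weights
lr = 0.1

for step in range(100):
    # Forward pass.
    h = np.tanh(x @ W1)            # hidden activations
    y_hat = h @ W2                 # network output
    loss = 0.5 * ((y_hat - y) ** 2).sum()

    # Backward pass: chain rule, one layer at a time.
    d_yhat = y_hat - y             # dL/dy_hat
    dW2 = h.T @ d_yhat             # gradient for the output layer
    d_h = d_yhat @ W2.T            # error propagated to the hidden layer
    dW1 = x.T @ (d_h * (1 - h**2)) # tanh'(z) = 1 - tanh(z)^2

    W1 -= lr * dW1                 # gradient descent updates
    W2 -= lr * dW2

print(round(float(loss), 6))       # loss shrinks towards zero
```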