Natural Language Processing Techniques

As we know by now, LLMs are useful in natural language processing (NLP). Text is linguistic data, and it is always pre-processed with a number of techniques before it is analyzed.

Tokenization

A token is a unit of text: a word, a part of a word, or a punctuation mark. Dividing the text into tokens is a vital first step, in which lengthy strings of text are dissected into more manageable and meaningful units. Tokens are the building blocks of NLP, and tokenization gives the text a structured framework for the steps that follow.
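As a minimal sketch of word-level tokenization, the snippet below uses only Python's standard `re` module. The function name `tokenize` and the regular expression are illustrative assumptions; production tokenizers, especially the subword tokenizers used by LLMs, are considerably more involved.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens; punctuation is dropped."""
    # \w+ matches a run of letters, digits, or underscores.
    # The optional group keeps contractions such as "don't" in one token.
    return re.findall(r"\w+(?:'\w+)?", text.lower())

print(tokenize("Tokens are the building blocks of NLP."))
```

Lowercasing here is a design choice that merges ‘Tokens’ and ‘tokens’ into one unit; it can be dropped where case carries meaning.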

Stemming and lemmatization

After tokenization come stemming and lemmatization. These processes distill the root form of words from their morphological variations. To illustrate, ‘stick’ can appear in various forms: stuck, sticking, sticks, unstuck. Stemming strips prefixes and suffixes by rule, so the result need not be a real word. Lemmatization leads us to the dictionary form of a word (commonly called the lemma). It overcomes the limitations of stemming by using vocabulary and grammar to identify the root word. Consider stemming versus lemmatization for ‘change’. The various forms could be changing, changes, changed and changer. Stemming gives us ‘chang’. Lemmatization, however, leads us to ‘change’.
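The contrast can be sketched with a toy example. The suffix list and the lemma dictionary below are hypothetical and hand-made for this illustration; real stemmers (such as the Porter stemmer) and lemmatizers rely on far larger rule sets and vocabularies.

```python
# Toy stemmer: strip the first matching suffix, checked in this order.
SUFFIXES = ("ing", "es", "ed", "er", "e", "s")

def toy_stem(word):
    """Chop off one suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a hand-made lookup table (an assumption for this sketch).
LEMMAS = {
    "changing": "change", "changes": "change", "changed": "change",
    "stuck": "stick", "sticking": "stick", "sticks": "stick",
}

def toy_lemmatize(word):
    """Return the dictionary form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

for w in ("change", "changing", "changes", "changed", "stuck"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer reduces every regular form of ‘change’ to ‘chang’, but leaves the irregular form ‘stuck’ untouched; the lookup table maps it to ‘stick’.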

Morphological Segmentation

Some words are monomorphemic, such as ‘table’ and ‘lamp’, which consist of a single morpheme. Other words have more than one morpheme: ‘sunrise’ has two, ‘sun’ and ‘rise’. Combining the meanings of these two morphemes gives the meaning of the whole word.

‘Unachievability’ has four morphemes: ‘un’, ‘achieve’, ‘able’ and ‘ity’, which appear in the spelling of the word as ‘un’, ‘achiev’, ‘abil’ and ‘ity’.

Morphological segmentation prepares the text for subsequent analysis.

Stop Words Removal

This is a crucial pre-processing step. Here we eliminate extraneous linguistic elements which do not contribute much to the meaning of the text. Such words are ‘and’, ‘because’, ‘under’ and ‘in’. These are filler words.

Take the phrase ‘Marketingganga: a marketing portal for market savvy’. This contains stop words. Without them, it would read: Marketingganga, marketing, portal, market, savvy.

Likewise, ‘I like reading, so I read’ contains stop words. Remove them and it would read: like, reading, read.
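The two examples above can be reproduced with a short sketch. The stop list here is a tiny hypothetical one chosen for this illustration; NLP libraries ship much longer lists.

```python
import re

# A deliberately small stop list (an assumption for this sketch).
STOP_WORDS = {"a", "and", "because", "for", "i", "in", "so", "the", "under"}

def remove_stop_words(text):
    """Tokenize on word characters, lowercase, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("I like reading, so I read."))
print(remove_stop_words("Marketingganga: a marketing portal for market savvy"))
```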

Text Classification

Text classification covers a number of techniques that sort vast quantities of unprocessed textual data into categories.

The ultimate aim is to convert unstructured data into structured format.

Sentiment Analysis

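Sentiment analysis is also called emotion AI or opinion mining. It examines user-generated content, such as reviews and social media posts, to judge whether the opinion expressed is positive, negative or neutral. It can be leveraged to address evolving needs and enhance consumer experience.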

Topic Modelling

Here the underlying themes and topics of the text are identified. Topic modelling operates as an unsupervised ML process: the topics within the corpus are identified and categorized without labelled examples. The essential keywords can be extracted while sifting through the documents. In this way it identifies the subjects of interest within a textual dataset.

Text Summarization

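Here the text is condensed into a cohesive summary. It is either extraction-based or abstraction-based. Extraction-based summarization picks the most important sentences from the source as they are, whereas abstraction-based summarization generates new sentences that paraphrase the source.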
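An extraction-based summarizer can be sketched in a few lines. The scoring rule below (sum of word frequencies per sentence) is one simple heuristic assumed for this illustration, not a standard algorithm, and the function name `extractive_summary` is hypothetical.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count how often each word occurs in the whole text.
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # A sentence scores the sum of the corpus frequencies of its words.
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]

text = "Cats sleep. Cats chase mice and cats climb trees. Dogs bark."
print(extractive_summary(text))
```

This heuristic favours long sentences; dividing the score by sentence length is a common adjustment.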

Parsing

Parsing unravels the grammatical framework of a sentence. In parsing we come across Named Entity Recognition (NER). It extracts information that identifies ‘named entities’. Here it uses pre-defined keywords and categories, such as person, organization and location.

Then there is TF-IDF, an acronym for term frequency-inverse document frequency. It is a statistical measure that assesses the significance of a word within a document relative to a collection of documents. A word that is pervasive across all documents attracts a lower score, even though it occurs frequently.
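A minimal sketch of the calculation follows, using one common variant: term frequency is the share of a document's tokens that are the term, and inverse document frequency is the natural logarithm of the number of documents divided by the number that contain the term. Libraries often use smoothed variants, so their scores will differ. The three-document corpus is made up for the illustration.

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term in one tokenized document against a tokenized corpus."""
    tf = doc.count(term) / len(doc)              # how frequent in this document
    df = sum(1 for d in corpus if term in d)     # how many documents contain it
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "cat", "and", "the", "dog"],
]

print(tf_idf("the", corpus[2], corpus))     # appears in every document
print(tf_idf("barked", corpus[1], corpus))  # appears in one document only
```

The word ‘the’ occurs twice in the third document yet scores zero, because it appears in every document; ‘barked’ scores highest because it is confined to one.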

Verses Claims Breakthrough in AI

Verses is a US-based company developing AI systems patterned after ‘wisdom and genius of nature’. It has set up a billboard outside OpenAI's headquarters in the hope of a tie-up, and has published an open letter in the New York Times dated 19th December, 2023, claiming a breakthrough that could lead to a more advanced form of AI.

The efforts of researchers are now directed towards artificial general intelligence (AGI), which could match or exceed human capability. Reaching that goal is sometimes called the singularity, which could later lead to superintelligence: AI that outperforms human beings. Big Tech is in the race to develop AGI, which could benefit humanity. OpenAI’s founding mission, writ large on its website, is to create AGI that benefits the whole of humanity.

Verses’ billboard and open letter do not disclose technical details but speak of a technical breakthrough in Active Inference. The idea is that sentience in living organisms can be comprehended through a mathematical framework, and it is this framework that could unlock AGI. (Active Inference is also the title of a 2022 book co-authored by Karl Friston, an acclaimed neuroscientist.) The book outlines the ‘free energy principle’.

Verses’ breakthrough could make the current models more reliable, efficient and aligned with human goals. Sam Altman of OpenAI acknowledges that LLMs alone are not capable of pushing AI models to AGI. He believes another breakthrough is required.

Verses’ letter also talks about safety concerns. The company considers itself worthy of attention, and volunteers to develop general intelligence and superintelligence in collaboration, with safety and beneficial effects in mind.

Satellite Internet: Spectrum Licensing

In the new Telecom Bill, the issue of whether to auction satellite broadband spectrum or offer it at administrative prices has been resolved in favour of administrative prices. Amongst the telecom operators, Bharti’s OneWeb pushed for administrative prices, whereas Reliance’s Jio demanded an auction. The Department of Telecommunications (DoT) preferred an auction, and the Telecom Regulatory Authority of India (TRAI) sought a larger consultation. The Indian Space Association (ISpA), which represents key satellite players, supported by the Tatas’ Nelco, Amazon’s Kuiper and Musk’s Starlink, is of the opinion that administrative allocation is a global trend and that India should not be an outlier.

In terrestrial operations, the operators require earmarked frequencies with no interference, so the frequency is sold by auction. In satellite services, the spectrum is shared globally by satellites, which work mostly within a particular band. Spectrum use is coordinated by satellite companies through a dynamic automated system on a good-faith basis. Hence, the argument goes, spectrum should be allocated administratively. Countries such as Thailand, Mexico and Brazil tried auctions but shifted to administrative allocation. Sharing makes spectrum use efficient. Startups will also get access to the spectrum, and they could provide connectivity to fishermen at sea. An expensive auction route would stifle these startups.

Jio opposes the administrative pricing route. Telecom operators who pay for spectrum at auction offer broadband services. Satellite players who pay administrative prices would offer the same broadband services, but without bearing the cost of an auction. It is thus not a level playing field. There is another concern. The early entrants in satellite communications will be given preferred orbital slots by the International Telecommunication Union (on a first come, first served basis). The newer ones will be at a disadvantage.

The DoT sent a reference to TRAI in September 2021, and TRAI initiated a consultation process. TRAI has not yet stated its final views. However, the government has decided to offer spectrum through the administrative route.

Krutrim: India’s Chatbot

Ola’s cofounder and CEO Bhavish Aggarwal launched Krutrim, an LLM and generative AI platform on the lines of ChatGPT and Bard. It is an indigenous model. It has been trained on 2 trillion tokens, or pieces of textual information, which represent Indian data.

The name Krutrim is derived from Sanskrit, and means artificial. There are two models: a base model and a Pro model. The current model answers queries and prompts from people. It can understand 22 Indian languages and can generate text in 10 Indian languages including Gujarati, Marathi, Hindi, Bengali, Tamil, Kannada, Telugu, Malayalam and Odia. Krutrim Pro will be launched in the first quarter of 2024.

After sign-up, Krutrim will be made available in batches, and it is expected to be open to all users by January 2024.

Developers will access Krutrim APIs.

India is a multi-cultural and multi-lingual country, and the currently available LLMs are not able to capture its unique nature. Krutrim is positioned as India’s own AI.

Krutrim is trained on India-specific data sets. The company is also working on creating AI cloud infrastructure, and it wants to work on AI compute by developing GPU chips. Krutrim’s architecture has multiple chipsets to power different AI infrastructure, models and applications. Krutrim is already being used by the Ola group of companies. It is said to be faster on Indian languages, generating responses in less time and using less compute. In English, it is said to outperform Llama 2 from Meta (Facebook). Krutrim’s design was launched in 2023.

Indian startup Sarvam has also launched OpenHathi, the first Hindi LLM. Krutrim’s launch comes on the heels of that release.

Generative AI’s Contribution to GDP

Generative AI could contribute $1.2 to $1.5 trillion to India’s GDP over the next seven years (EY report). Generative AI has the potential to speed up India’s digital transformation.

The most promising sectors that would adopt Gen AI are the services (IT, legal, consulting, outsourcing, rental machinery and equipment and others), financial services, education, retail and healthcare.

Such adoption would result in enhanced productivity, operational efficiency and personalized engagement with customers.

In 2029-30 alone, generative AI could add $359-438 billion to India’s GDP. It indicates an increase of 5.9-7.2 per cent above the baseline GDP.
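The two ranges can be cross-checked against each other. The sketch below derives the baseline GDP they imply; this baseline is inferred here from the quoted figures and is not stated in the report, and the function name `implied_baseline` is hypothetical.

```python
def implied_baseline(added_gdp_bn, uplift_pct):
    """Baseline GDP in $ billion implied by an addition and its percentage uplift."""
    return added_gdp_bn / (uplift_pct / 100)

low = implied_baseline(359, 5.9)    # lower end of both ranges
high = implied_baseline(438, 7.2)   # upper end of both ranges

print(f"Implied baseline: ${low:,.0f} billion to ${high:,.0f} billion")
```

Both ends of the range point to a baseline of roughly $6.1 trillion for 2029-30, which suggests the dollar and percentage figures are consistent with each other.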

These are early days. There are challenges in adopting AI: skill gaps, a lack of clear use cases, and data privacy risks.

An AI-first approach is becoming acceptable. It leads to digital transformation.

AI regulation in the initial stages should be light touch. It should be responsive. There has to be a balance between innovation and risk management. There should be regulatory sandboxes. AI-generated content could be watermarked. There should be standards for accountability to build trust in the AI systems.

AI systems could be offered as public goods. A conducive environment must be provided: 5G, data centers, access to chips, AI-specific compute, access to talent, and public funding of R&D.

New Automobile Industry

The entire automobile industry will undergo a metamorphosis on account of gigacasting, also called megacasting. Tesla started the trend of using large-scale die-casting in the automobile industry. In 2020, dozens of chassis pieces were combined into one single section for its Model Y. The company used Giga Press equipment from an Italian supplier. The process cuts the number of welds and bolts and reduces weight. Megacasting machines are massive and apply 9,000 tons of force to molten aluminium alloy within a casting mould. The resulting panels are larger and weigh 200 kg apiece. The whole assembly process has to be rethought and reconfigured, and it requires upfront investment. While Tesla pioneered gigacasting with its Model Y, other makers are catching on, recognizing its potential to revolutionize car manufacturing.

The change in the production process changes the economics of the automobile industry. Cars built from traditionally stamped and joined body parts could decline by 20 per cent by 2030.

The current car-making process is modular, and it allows easy repairs. In a collision, the brunt is borne by the front and rear bumpers sticking out from the chassis. In low-speed accidents, there is little structural destruction, and the rest of the car escapes damage. Only a few parts need to be replaced, which can be done in a few hours.

In a gigacast car, repairs are costly and complicated. Large sections of the car are affected.

Modern cars are computers on wheels. Various levels of autonomy are available, and sensors and software have been deployed on cars. There are rear cameras. In an accident, a car’s electronics get affected, since the electronics are built into panels, doors, bumpers, fenders and the trunk. After an accident, a car is examined to check which sensors and controllers are damaged. The damaged components are replaced and recalibrated. It is an expensive and time-consuming process.

Gigacasting does not make a car flawless either. Cracks can appear in the aluminium castings of some models (images of such cracks can be seen online).

Repairing large sections is costly, and it is complex too. Industry groups and insurers are concerned. Gigacast cars may attract higher insurance premiums.

A new industry can come up: one that refurbishes cars and then sells them. Because large sections are replaced rather than repaired, the whole car becomes practically new. Automotive recycling can become a big industry.

Superhuman AI: Some Observations

When Sam Altman was unceremoniously dismissed from OpenAI, there were rumours that the organization was close to developing superhuman AI. OpenAI has built a Super-alignment team to control AI that surpasses human beings. The members of this team include Collin Burns, Pavel Izmailov and Leopold Aschenbrenner. They see to it that AI systems behave as intended. The team was formed in July 2023 to steer, regulate and govern superintelligent AI systems.

These days we tend to align models that are dumber than us. The idea is to find ways to align models that are smarter than us.

We know that Ilya Sutskever, OpenAI’s co-founder and chief scientist, played an active role in Altman’s ouster. After Altman’s comeback, Sutskever is in a state of limbo. Still, he heads the Super-alignment team.

To the AI community, super-alignment is a sensitive subject. To some, it is a red herring. To others, it is a premature subfield.

Surprisingly, Altman compares OpenAI to the Manhattan Project. Both are treated as projects which require protection against catastrophic risks. Many scientists are skeptical about AI acquiring world-ending capacity anytime soon, or for that matter ever.

Instead, they argue, attention should be focused on AI bias and toxicity. Sutskever believes that AI, whether from OpenAI or from some other organization, can threaten humanity. At OpenAI, 20 per cent of the computer chips are available for the Super-alignment team’s research.

The team is currently developing the framework for AI’s governance and control.

How to define superintelligence, and whether a particular AI system has reached that level, is a moot point. The present approach is to use less sophisticated models such as GPT-2 to guide the more sophisticated models in the desired direction.

The research will also focus on a model’s egregious behaviour. In this setup, the weak model stands in for human supervisors and the sophisticated model for a superintelligent AI. But can a student in a lower class direct a college student? The weak-strong model approach may lead to some breakthroughs as far as hallucinations are concerned.

Internally, a model recognizes its hallucinations: whether what it says is fact or fiction. However, models are trained on feedback, either a thumbs up or a thumbs down, and at times they are rewarded even for false statements. Research should enable us to summon a model’s knowledge and let it use that knowledge to discriminate between fact and fiction. This would reduce hallucinations.

As AI is reshaping our culture and society, it is necessary to align it with human values. The most important thing is the readiness to share such research publicly.

Regulation of AI

The European Union has passed the Artificial Intelligence Act for oversight and regulation. India too is holding a Global Partnership on AI (GPAI) Summit to reach a global consensus on AI regulation.

AI has affected the manufacturing sector. AI facilitates drug discovery and materials science research. It transforms healthcare and diagnostics, autonomous transport, and small but efficient and smart power grids. It assists financial systems and telecom networks. It enhances the provision of a host of public and private services.

AI has its demerits. It can promote criminal activities. It can consolidate power in authoritarian regimes through face recognition, surveillance and discriminatory systems. At present, human beings are in charge of ‘pulling the trigger’ of dangerous military weapons. This power could get transferred to AI. Then there is the prospect of self-aware AI that possesses traits such as inquisitiveness and an instinct for self-preservation. These issues must be tackled holistically. As AI spreads across economies, there should be consensus on regulation. The ideal oversight exercises control and mitigates the possibility of harm without crippling research and the rollout of useful AI.

The European regulation attempts a technology-neutral, uniform definition of AI applicable to all future systems. AI systems are classified according to the risks they pose. The higher the risk, the greater the oversight and the more obligations imposed on providers and users.

According to the AI Act, limited-risk systems should comply with transparency requirements. Users should be made aware that they are interacting with AI. To illustrate, systems that generate images should warn against deepfakes and image manipulation. There should be disclosure that the content is AI-generated. This puts curbs on the generation of illegal content. There should also be a public summary of the copyrighted data used in training.

High-risk AI systems affect safety and fundamental rights. There are two categories. The first is AI in products such as toys, aviation, cars, medical devices and lifts. The second is AI used in specific areas such as biometric identification, critical infrastructure, education and vocational training, and AI-managed access to essential private and public services. Systems in the second category are registered in EU databases. Both categories of high-risk AI systems must be assessed before roll-out and must be reviewed throughout their life cycles.

Some systems are deemed to pose unacceptable risks under the AI Act. One example is the behavioural manipulation of people or vulnerable groups. Then there are biometric identification systems. These may be used only with court approval, to identify and apprehend criminals after a serious crime has been committed.

This basic framework can be tweaked as per the needs of each country. But it is a reasonable framework for global regulation. The framework excludes military research and development.

The GPAI summit in India could adopt some version of this Act.

Walled Gardens

A San Francisco jury decided that Google was running an illegal monopoly to extract huge fees from app developers. It marked a victory for Epic Games. Though the verdict is a huge relief for developers, Google’s and Apple’s walled gardens still stand.

Epic was ousted from the Google Play Store and Apple’s App Store for attempting to bypass their payment systems (and so avoid the commission of up to 30 per cent on transactions that go through those systems). Epic lost its case against Apple but won the case against Google. The win is heralded as a turning point in the mobile app economy.

The walled gardens of iOS and Android are built on strong foundations. Of course, there is now a dent in the wall. The fees charged for in-app purchases have been a bone of contention for a long time. At present, the two companies’ stores together collect almost $200 billion a year. The companies treat the fees as fair compensation for the security these stores provide. However, developers take a different view.

Google discourages developers from launching mobile distribution channels of their own. It launched Project Hug to lure top game developers with financial incentives. It actively steers them away from agitating for better terms. Sweetheart deals are offered, under which a smaller percentage of commission is charged on in-app transactions. Google also ties up with handset makers such as Samsung so as to prioritize Google’s store over any other.

Apple, by contrast, treats every developer equally on its store. It has no need to exert pressure on handset makers because it manufactures its own handsets. Thus, Google’s behaviour looks particularly egregious. In the Apple case, as the judge put it, ‘success is not illegal’, although Apple was asked to abandon its ‘anti-steering’ rules.

Apple and Google are called walled gardens as they are pleasant and well-maintained. However, more choice is good for consumers.

Epic Wins against Google

Epic Games is the maker of Fortnite. As we know, on cell phones there is a duopoly of Google and Apple, whose app stores generate close to $200 billion a year.

Google was facing a case from Epic. After a three-year legal battle, the jury in San Francisco ruled that Google has turned its Play app store and billing service into an illegal monopoly.

American video game-maker Epic Games sued Google in 2020. Google is the search engine company headquartered in Mountain View.

The jury found that the company hurt competition by tying its Google Play Store to its billing services.

Epic had secretly installed its own payment system to bypass the revenue share of up to 30 per cent that the two tech giants, Google and Apple, take from in-app purchases and subscriptions on their platforms. If this cut is removed from the ecosystem, consumer prices could come down.

There is a fortune at stake for both Apple and Google. The Digital Markets Act in the European Union will bring about changes.

Both companies are making adjustments. Apple allows the so-called reader apps (say, software for cloud storage, watching video and reading books) to link to outside websites to let users pay. That bypasses Apple’s revenue cut. Both Apple and Google have also changed their policies on the commission they take on subscription apps.

Epic’s win against Google has the potential to bring major changes in the USA, the home country of both companies. That could take internet software back to a more open environment, whereas app stores are a closed ecosystem.