Blog

N-Grams in Natural Language Processing

In natural language processing, we can parse a sentence into a sequence of n-grams, that is, runs of n consecutive words. To illustrate, ‘I am a good boy’ can be parsed as ‘I’, ‘am’, ‘a’, ‘good’, ‘boy’. Here ‘I’ is a unigram (n=1). The same sentence can be parsed as ‘I am’, ‘am a’, ‘a good’, ‘good boy’. Here ‘I am’ is a bigram (n=2). The sentence can also be parsed into trigrams: ‘I am a’, ‘am a good’, ‘a good boy’. Here ‘I am a’ is a trigram, where n=3.
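The parsing can be sketched in a few lines of Python. This is only a minimal illustration that splits on whitespace; the function name `ngrams` is our own choice, and a real tokenizer would also handle punctuation and case.

```python
def ngrams(sentence, n):
    """Return every run of n consecutive words in the sentence."""
    words = sentence.split()  # naive whitespace tokenization (an assumption)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "I am a good boy"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A sentence of five words gives five unigrams, four bigrams and three trigrams, because the window slides one word at a time.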

When each word of a sentence is considered independently, it is called a unigram model. A bigram model estimates the probability of each word in a phrase conditional on the word that precedes it. In NLP, the two most frequently used models are the unigram and the bigram.
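As a rough sketch of a bigram model, the probability of a word given the preceding word can be estimated as count(previous word, word) divided by count(previous word). The three-sentence corpus below is made up for illustration, and the estimate is unsmoothed, so unseen pairs get probability zero.

```python
from collections import Counter

# Toy corpus, invented for this example
corpus = ["I am a good boy", "I am happy", "I love birds"]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words)
    bigram_counts.update(zip(words, words[1:]))  # pairs of consecutive words


def bigram_probability(prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


p_am_given_i = bigram_probability("I", "am")      # 'I am' occurs 2 times, 'I' 3 times
p_love_given_i = bigram_probability("I", "love")  # 'I love' occurs once
print(p_am_given_i, p_love_given_i)
```

A unigram model would instead use only how often each word occurs, ignoring the word before it.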

A bigram is a pair of consecutive words in a sentence. A trigram is the special case of the n-gram where n=3: a sequence of three consecutive words in a sentence. Similarly, a 4-gram is a sequence of four consecutive words in a sentence.

N-grams are used to create language models, which are statistical models that predict the probability of a sequence of words. N-grams can also be used in spelling correction, in tagging words with their part-of-speech (POS) tags, and in classifying text into different categories.

N-grams help in summarizing text, where the most important n-grams are identified. They are used in machine translation, where text is translated from one language to another by considering the probability of a sequence of words appearing in the target language given the sequence of words in the source language. N-grams are also used to train chatbots to respond to queries.

N-grams are useful in converting words into numerical formats. They help in capturing the context of the words, which facilitates text classification, sentiment analysis and machine translation.

Ways to Convert N-grams into Vectors

One-hot encoding: Here each distinct n-gram is assigned a unique integer. Say the bigram ‘fox ate’ is assigned the integer 1 and the bigram ‘I love’ the integer 2. Each n-gram is then represented as a vector of zeroes with a single one in the position corresponding to its integer.

Count encoding: Here each n-gram is assigned the number of times it appears in the text. If ‘I love’ appears twice in the text, it is assigned a count of 2. The text is thus represented as a vector of non-negative integers, one position per n-gram, where each value is the number of times that n-gram appears in the text.

The method used depends on the task at hand. One-hot encoding is often used for text classification, while count encoding is often used for sentiment analysis and machine translation.

Example

Take the bigrams ‘I love’ and ‘love birds’. If we use one-hot encoding, each bigram is assigned a unique integer: ‘I love’ gets 1 and ‘love birds’ gets 2. Thus ‘I love birds’ is represented by the index sequence (1,2), and with a vocabulary of these two bigrams the corresponding one-hot vectors are (1,0) and (0,1).

If we use count encoding, each of these bigrams appears once, so both are assigned a count of 1 and the vector representation is (1,1).
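The two encodings can be tried out on this example with a short Python sketch. The helper names (`bigrams_of`, `one_hot`) are our own, and the vocabulary is built from this one sentence only, which is an assumption made to keep the vectors small.

```python
from collections import Counter


def bigrams_of(text):
    """Split on whitespace and pair up consecutive words."""
    words = text.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


text = "I love birds"
grams = bigrams_of(text)  # the bigrams of the sentence, in order

# Assign each distinct bigram an integer, starting at 1
vocab = {}
for gram in grams:
    vocab.setdefault(gram, len(vocab) + 1)

ids = [vocab[gram] for gram in grams]


def one_hot(index, size):
    """Vector of zeroes with a single one at the (1-based) index."""
    vec = [0] * size
    vec[index - 1] = 1
    return vec


one_hot_vectors = [one_hot(i, len(vocab)) for i in ids]

# Count encoding: one position per bigram in the vocabulary, holding its count
counts = Counter(grams)
count_vector = [counts[gram] for gram in sorted(vocab, key=vocab.get)]

print(ids, one_hot_vectors, count_vector)
```

With a larger text the vocabulary grows, so one-hot vectors get longer while staying almost entirely zero, and the count vector starts to show values above 1.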

2nd August 2023
GPT-5

OpenAI has already filed a trademark registration application for GPT-5. This does not mean that the model is going to be released soon. It means the company wants to protect the trademark and trade name and prevent their unauthorized use.

GPT-5 is the anticipated successor to GPT-4. It remains to be seen when the company will release it, given that Sam Altman and others want further development of AI to be paused for some time. In June 2023, Altman declared that GPT-5 was not yet being trained.

There is no confirmation about the features of the new version. The present trademark application would take 4-5 months for approval. Does that mean GPT-5 could be released by the end of the year? By inference, do we have GPT-4.5 right now?

In a previous article, we discussed GPT-4 with the Code Interpreter feature. Is that, then, GPT-4.5? Maybe the company is avoiding the nomenclature because of the calls for a pause on AI.

In a blog post about the alignment of AI, OpenAI had claimed that AI with superintelligence could arrive in the next four years. The trademark registration materialising by the year end and the four years expected for superintelligent AI are thus two different propositions with different timelines.

As the capabilities of competitors rise, OpenAI may drop the idea of halting AI development and actually build GPT-5 as soon as possible, say by the year end.

1st August 2023
Twitter : New Logo

Elon Musk declared on Sunday, 23 July 2023, that he intends to change Twitter’s logo by uncaging the bird. He proposes to bid adieu to the Twitter brand and, gradually, all the birds.

The company has already changed its name to X Corp, reflecting Musk’s vision to create a super app like China’s WeChat.

The new logo, X, will go live. The blue bird is Twitter’s most recognisable asset. However, Musk does not buy this argument. He has already altered the sign on the company’s San Francisco headquarters.

The new X logo has been officially announced, and the bio reads ‘what’s happening?’

1st August 2023
How a Chat Bot Answers a Query

As we know, generative AI models have been trained on massive amounts of data. Since computers do not understand text, the models do not take text as it is, but in a numerical format. These numerical representations of data are called embeddings. All inputs to LLMs (large language models) and outputs from them pass through embeddings. Accessing these embeddings can be time consuming, so they are stored in vector databases, from which they can be retrieved efficiently.

Thus we know that embeddings, or vector embeddings, represent data such as text, images, audio and video. The data is expressed in numerical form as a point in an n-dimensional space, called a numerical vector. Word2Vec, developed by Google, is a model that converts words to vectors. All LLMs have their respective embedding models to create embeddings.

This way the vectors can be compared to each other. A computer cannot compare two words directly, but it can compare two vectors. We can therefore create clusters of words with similar embeddings, e.g. ball, bat, wickets and pitch will appear in one cluster as they are all related to cricket.
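One common way to compare two vectors is cosine similarity, sketched below in Python. The three-dimensional vectors are invented for illustration and do not come from Word2Vec or any real embedding model, whose vectors usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example
embeddings = {
    "bat": [0.9, 0.1, 0.0],
    "ball": [0.8, 0.2, 0.1],
    "piano": [0.0, 0.1, 0.9],
}

bat_ball = cosine_similarity(embeddings["bat"], embeddings["ball"])
bat_piano = cosine_similarity(embeddings["bat"], embeddings["piano"])
print(round(bat_ball, 3), round(bat_piano, 3))
```

With these made-up numbers, ‘bat’ comes out much closer to ‘ball’ than to ‘piano’, which is the kind of signal a clustering step would rely on.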

The embeddings facilitate finding words similar to a given word. These can be made into sentences. A sentence can be used as an input to obtain related sentences from the data stored. It is the basis of semantic search, sentence similarity, anomaly detection and chatbot.

Chat bots answer questions from a given PDF or Doc file by making use of embeddings.

All LLMs use this approach to fetch content related to the queries provided to them.

A chat bot based on a PDF is asked a query. As we know, the data is represented as vector embeddings. The query is embedded too, and similarities are detected between it and different parts of the data. The vector store performs this similarity search through search algorithms and fetches all the relevant data. These are passed to the chat bot, which generates a final answer for the user.
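The retrieval step can be sketched roughly as follows. This is a toy illustration, not how any particular vector database works: `embed` is a stand-in that simply counts words instead of calling a real embedding model, the "store" is a plain Python list, and the text chunks are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Chunks of a hypothetical document, embedded once and kept in the store
chunks = [
    "the warranty covers repairs for two years",
    "the device charges over usb",
    "returns are accepted within thirty days",
]
store = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query, k=1):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


top = retrieve("how long is the warranty")
print(top)
```

In a real system the retrieved chunks would be placed in the prompt of the language model, which then writes the final answer.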

Chat bots create vector embeddings by using ML algorithms which are trained on massive amounts of data to learn how to represent words or phrases as vectors of numbers. One of the best-known algorithms is Google’s Word2Vec, introduced in 2013. Word2Vec takes a word and outputs an n-dimensional coordinate (or vector) so that when these word vectors are plotted in space, synonyms cluster.

31st July 2023
Industry 4.0

Industry 4.0 is a convergence of several technologies, both physical and digital: the Internet of Things (IoT), artificial intelligence (AI), drones, robots, autonomous vehicles and other interconnected technologies that have the potential to communicate, analyze and act.

Industrial IoT (IIoT) refers to the industrial subset of IoT in manufacturing industries. It means the use of smart sensors and actuators to improve manufacturing and industrial processes.

It leverages the power of smart machines and real-time analytics to capitalise on data generated by these machines.

Digital twinning is the process of recreating a physical object on a virtual interface to improve the overall business process. Digital twins can be used in various ways.

Digital twins traditionally focused on anomaly detection and remote maintenance. As technologies such as IoT, AI and ML have emerged, the whole organisation can be connected, instead of just one asset. Digital twin software creates a holistic digital experience.

Merely buying the technologies is not enough. People are the backbone of a company’s success. People should be ready to adapt to new technologies.

30th July 2023
Akashvani

Prasar Bharati is India’s public broadcaster. Its radio services will no longer be referred to as All India Radio (AIR), but as ‘Akashvani’. This decision, taken by the Government long ago, was not implemented earlier but is being operationalised now. The Prasar Bharati Act, or Broadcasting Corporation of India Act, 1990, refers to Akashvani. The Act came into force in 1997. The name change is to be complied with immediately. Rabindranath Tagore, while inaugurating the Calcutta shortwave service in 1939, wrote a poem in which AIR was referred to as Akashvani. Akashvani Mysore, a private radio station, was set up in 1935.

29th July 2023
AI as Transformative Tool

Organisations are undergoing digital transformation. It makes them faster and more agile. AI, especially generative AI, accelerates growth. It allows new concepts to be tested, processes to be optimised and new solutions to be discovered. However, AI alone cannot drive digital transformation. AI must be backed up by strategic management, good product management, excellent engineering and data management. Businesses must take this holistic approach to gain maximum growth from digital transformation.

There is resistance from legacy systems. The implementation cost of new technologies is significantly high. There is also a change to manage, from traditional systems to new technologies. Despite these obstacles, the benefits of AI outweigh them all.

When AI is adopted as a transformative tool, it should be fair and free of bias. Patterns and algorithms must be examined for unfairness. There should be preprocessing of data and diverse perspectives. There should be ethical guidelines and oversight of AI.

28th July 2023
New Crypto – Worldcoin

Sam Altman of OpenAI is a co-founder of Worldcoin, a cryptocurrency project. Trading in this currency commenced on Monday, 24 July 2023. The currency recorded an initial price of $1.70 and was trading at $2.52 at noon in London. By that point in the day's trading in London, $145 million worth of tokens had been traded. On the world’s largest exchange, Binance, the crypto hit a peak of $5.29 and saw a trading volume of $25.1 million.

The core offering of this project is its World ID. It is described as a ‘digital passport’ and proves that the holder is a real human, and not an AI bot. To get a World ID, a person undergoes an iris scan using the Orb, a silver ball approximately the size of a bowling ball. Once the Orb’s iris scan verifies that the person is a real human, a World ID is created.

The initial supply of the crypto is capped at 10 billion tokens. The initial launch consisted of 143 million Worldcoins, out of which 100 million were loaned to market makers. The remaining 43 million were allocated to investors who were verified by the Orb.

Since the launch, the orbing operation is being scaled up to 35 cities in 20 countries.

Blockchain can store the World IDs in a way that preserves privacy and cannot be controlled or shut down by any single entity.

Sam Altman believes in the concept of UBI, or universal basic income, to remove inequalities. Since World IDs are held by real people, these could be used while implementing UBI. Worldcoin lays the groundwork for the UBI concept to become a reality.

People around the world are getting their eyeballs scanned in exchange for a digital ID and the promise of free cryptocurrency. Each verified user will receive 25 free Worldcoin tokens.

During a trial period covering the last two years, the company has issued IDs to more than two million people in 120 countries.

In London, the Worldcoin representatives showed a stream of people how to download the app and get scanned, handing out free t-shirts and stickers saying ‘verified human’.

Worldcoin tokens were trading around $2.30 on Binance on Tuesday, July 25, 2023.

27th July 2023
AI in Assisted Fertility

AI assists doctors in selecting the ideal embryo in IVF and is being used more widely in fertility treatment. AIVF, a reproductive technology company based in Tel Aviv (Israel), has developed AI-assisted software (called EMA) that processes vast amounts of data to facilitate the embryo selection process.

An Orissa-based startup, Santaan, offers services using AI to select embryos for transfer to the womb. This selection is crucial for the success of an IVF cycle. AI algorithms analyse the images of embryos to predict which have the highest probability of leading to a successful pregnancy. AI prevents unnecessary transfer of embryos to the uterus. It also minimises the risk of multiple pregnancies, that is, the birth of twins, triplets, quadruplets and so on.

The Indira IVF clinic in Delhi also uses AI to improve embryo selection.

Machine learning (ML) is used to select oocytes, the female germ cells involved in reproduction, and to monitor their behaviour during intracytoplasmic sperm injection (ICSI), which is an assisted reproductive technology (ART).

AI is used in analysing embryo development. It is also used in semen analysis and in assessing DNA integrity. Embryologists can identify sperm cells of males who suffer from infertility.

Cryopreservation, the technique used to maintain and preserve cells, tissues and other biological samples, is assisted by AI. AI analyzes datasets of frozen embryo outcomes to identify the patterns and factors that influence the viability of thawed embryos. AI can be used to develop protocols for cryopreservation.

AI can be used to assess the suitability of the reproductive organs, the uterus and ovaries, and to identify anomalies.

AI in ART is useful, but it requires precise data. AI makes predictions, and these have to be validated by comparing them with clinical outcomes. In this way the algorithms can be refined.

There are ethical issues. ART processes involve highly sensitive data (personal information). There should be stringent data protection, and there should not be unauthorized access.

AI-driven ART processes are expensive. IIT Hyderabad is working on an indigenous and affordable solution in this field.

26th July 2023
Chip Ecosystem

A fab plant manufactures integrated circuits (ICs) by working on raw wafers through a complex process. The manufacturing involves 500 machines. There are some 700-1500 process steps, some of which require heating to 1100 degrees centigrade, at times as often as 27 times. The process requires more than 300 gases and chemicals, such as acetone. Many of the inputs depend on imports as these are not made in India. A fab plant requires a minimum investment of $3-4 billion.

The raw material most commonly used is silicon. Some materials are combinations of elements, known as compound semiconductors. Examples are gallium nitride and silicon carbide. These are more heat resistant and compact. Plants for these materials can be set up with an investment below $500 million.

There are different kinds of chip makers. Some are called IDMs, or integrated device manufacturers: Samsung and Intel both design and manufacture chips. Some manufacturers are foundries that make chips under contract for others, e.g. TSMC. Some organisations are fabless firms that only design the chips, e.g. Qualcomm, MediaTek and Nvidia. These designers then get them manufactured by foundries.

Stages of Making a Chip

To begin with, wafers are sliced from a salami-shaped bar of 99.99 per cent pure silicon. These are polished to ensure a smooth texture. Films of conducting materials are then deposited on the wafer.

The second step is to cover the wafers with a light-sensitive coating (photo resist). Those areas which are exposed to UV light change their structure. Thus they become ready for etching.

As a third step, the wafers are put through a lithography machine. This machine determines just how small the transistors on a chip can be.

As a fourth step, the wafer is etched. It is then baked to reveal the 3D pattern of open channels. This creates cavities of exact depth in the wafer. Lastly, the wafer is bombarded with ions to help control the flow of electricity.

There are two models after this. Foundries or IDMs might carry out the few remaining processes themselves, or they can outsource them to third parties to reduce the investment.

The players involved are outsourced semiconductor assembly and test (OSAT) companies, which do the assembly, packaging and testing of ICs for others. There are also companies such as Micron which can set up an assembly, testing, marking and packaging (ATMP) operation. Micron can use wafers produced in its own fabs abroad. The chips are then taken out of the wafer, which is sliced and diced with a diamond saw. These become individual chips.

Wafers can contain a few chips or thousands of chips.

IC packaging is also done — wire bonding and laser marking. The wafers and ICs are tested.

Types of Chips

First, there are logic chips. They process information to complete a task. These are used to optimize visual display. They also handle the processing for ML and deep learning apps, and act as the processors in CPUs.

Secondly, there are memory chips, used to store information and save data both when machines are switched on and when they are off.

Thirdly, there are application specific ICs. They are single purpose chips used for repetitive, routine processing.

Lastly, there are chips which carry a whole system on them (systems on a chip). These integrate many chips and circuits into one chip, e.g. for a camera, video or Wi-Fi.

Chip Designers

Companies such as Qualcomm and MediaTek design the chip and get it made by foundries.

Some Comments

Initially, there will be limited buyers for the wafers of any fab plant in India. There are hardly any companies in India which can place orders, as India lacks fabless design companies such as Qualcomm. India must have a foreign partner who can enter into a buy-back and offtake agreement with the fab company.

A fab plant requires complex technology. However, it is not enough to pay for the technology and drawings to build a fab plant. There should also be desirable yields. Yield is the ratio of the number of good chips produced to the maximum chip count on one wafer. Ideally it should be 90-95 per cent.
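The yield figure is a simple ratio, sketched below in Python. The chip counts are hypothetical numbers chosen for illustration, and the function name `wafer_yield` is our own.

```python
def wafer_yield(good_chips, max_chips_per_wafer):
    """Yield as a percentage: good chips divided by the maximum chip count on one wafer."""
    if max_chips_per_wafer <= 0:
        raise ValueError("max_chips_per_wafer must be positive")
    return 100.0 * good_chips / max_chips_per_wafer


# Hypothetical wafer: room for 500 chips, of which 460 pass testing
yield_pct = wafer_yield(460, 500)
meets_target = yield_pct >= 90.0  # lower end of the 90-95 per cent range

print(yield_pct, meets_target)
```

Since a processed wafer costs roughly the same whether its chips work or not, a lower yield spreads that cost over fewer saleable chips.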

Fabless design companies are ready to buy provided the prices are competitive and the quality is on par with other foreign firms.

Some IDM players can make their own design and fabricate the chips in their own fab plants. However, big IDMs have not shown much interest in India.

Micron is ready to shift its ATMP functions to India to test wafers. The wafers will be imported from its global plants at an agreed transfer price.

Some experts suggest India should have compound semiconductor plants (based on gallium nitride or silicon carbide, rather than pure silicon). These require investments of $100 million-$500 million. But these can be built quickly and have a growing market in automobiles, telecom and power electronics.

25th July 2023