Chandamama Kathalu: Telugu SLM

Chandamama was a popular magazine that told stories to children. Swecha, a non-profit organization, in collaboration with Ozonetel, decided to retell those stories by developing a small language model (SLM) in Telugu. The SLM is slated for launch in January 2024.

Let us first understand the concept of an SLM. The genesis of SLMs lies in a paper authored by Microsoft research scientists titled TinyStories. An SLM is built on the same methodology as any LLM, but its neural network is smaller, it has fewer parameters, and it is trained on a smaller corpus of data.
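To get a rough sense of the scale difference, here is a minimal sketch using the Hugging Face transformers library; the library choice and every size below are illustrative assumptions, not the project's actual settings.

    # Sketch: an SLM-sized GPT-style configuration next to GPT-2 small.
    # All sizes here are illustrative assumptions, not the project's settings.
    from transformers import GPT2Config, GPT2LMHeadModel

    # A small-language-model configuration: fewer layers, a narrower
    # hidden dimension and fewer attention heads than a typical LLM.
    slm_config = GPT2Config(
        vocab_size=16_000,   # assumed small Telugu vocabulary
        n_positions=512,     # shorter context window
        n_embd=256,          # narrow hidden dimension
        n_layer=4,           # shallow network
        n_head=4,
    )

    slm = GPT2LMHeadModel(slm_config)
    print(f"SLM parameters: {slm.num_parameters():,}")   # on the order of 10M

    # The same architecture at GPT-2-small scale, for comparison.
    llm = GPT2LMHeadModel(GPT2Config())                   # 12 layers, n_embd=768
    print(f"GPT-2 small parameters: {llm.num_parameters():,}")  # ~124M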

Ozonetel, collaborating with Swecha and assisted by IIIT Hyderabad, set out to develop the Telugu SLM. There was a dataset of Telugu stories, some 40,000 pages of them, preprocessed by some 8,000 students. The idea was to give children access to the kind of stories that used to appear in the Chandamama Kathalu magazine. Chandamama was a fixture in Indian homes from the 1940s until it went out of print in 2012, publishing long-running mythological and magical Indian stories.

After building the dataset, the next step was to assess how the data should be tokenized. Tokens, as we know, are the basic units of text or code that a language model uses to process and generate language. A token could be a word, a part of a word, a character, or some other segment of text or code.
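As a hedged sketch of what that assessment involves, a subword (byte-pair encoding) tokenizer can be trained on the story corpus with the open-source tokenizers library; the file name, vocabulary size and special tokens below are assumptions.

    # Sketch: training a subword (BPE) tokenizer on a Telugu story corpus.
    # "telugu_stories.txt" and the vocabulary size are placeholders.
    from tokenizers import Tokenizer, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

    trainer = trainers.BpeTrainer(
        vocab_size=16_000,
        special_tokens=["[UNK]", "[PAD]", "[BOS]", "[EOS]"],
    )
    tokenizer.train(files=["telugu_stories.txt"], trainer=trainer)

    # A word may come out whole, as subword pieces, or as characters.
    encoding = tokenizer.encode("చందమామ కథలు")
    print(encoding.tokens)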

Soon after the release of the TinyStories paper, Microsoft developed an SLM using 21 million stories. This SLM was capable of generating coherent text, which gave Swecha a lot of hope.
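The TinyStories authors released their checkpoints publicly. A quick sketch of sampling from one of them, assuming the roneneldan/TinyStories-33M model ID on Hugging Face (verify it there before relying on it):

    # Sketch: sampling from a publicly released TinyStories checkpoint.
    from transformers import pipeline

    generator = pipeline("text-generation", model="roneneldan/TinyStories-33M")
    story = generator(
        "Once upon a time, a little girl found a shiny stone.",
        max_new_tokens=80,
        do_sample=True,
    )
    print(story[0]["generated_text"])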

Swecha also worked on optical character recognition (OCR), the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text. They used an open-source OCR tool to convert 70 per cent of the text; the remaining 30 per cent was typed out by students. They ended up with a corpus of 45,000 stories, with bigger stories added too, generating half a million lines of text.
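The article does not name the OCR tool. One widely used open-source option is Tesseract, which ships a Telugu ('tel') language pack; here is a minimal sketch with its Python wrapper, where the file path is a placeholder.

    # Sketch: OCR on a scanned magazine page with Tesseract's Telugu model.
    # Requires the tesseract binary and its 'tel' language data installed;
    # "page_scan.png" is a placeholder path.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open("page_scan.png"), lang="tel")
    print(text)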

The corpus was uploaded to Hugging Face so that other organizations could use it; the team wanted to open up the dataset.
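As a sketch of how such a corpus is typically opened up on the Hub (the repository name below is a placeholder, not the project's actual dataset):

    # Sketch: loading a local text corpus and publishing it to the
    # Hugging Face Hub. Requires prior authentication via `huggingface-cli login`;
    # the repository id is a placeholder.
    from datasets import load_dataset

    corpus = load_dataset("text", data_files={"train": "telugu_stories.txt"})
    corpus.push_to_hub("example-org/chandamama-kathalu")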

They are now researching what kind of tokenization would be needed if an LLM were to be built, in consultation with IIIT. They expect it will take four or five months before they have their LLM.
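One common measure in such tokenization research is "fertility", the average number of tokens produced per word: English-centric tokenizers shatter Telugu script into many byte-level pieces, while Indic-aware vocabularies keep words more intact. A hedged sketch of that comparison, where the model IDs are examples rather than the team's actual candidates:

    # Sketch: comparing tokenizer "fertility" (tokens per word) on Telugu text.
    # Lower fertility usually means cheaper training and a longer effective context.
    from transformers import AutoTokenizer

    sample = "చందమామ కథలు పిల్లలకు ఎంతో ఇష్టం"
    for model_id in ["gpt2", "google/muril-base-cased"]:  # example tokenizers
        tok = AutoTokenizer.from_pretrained(model_id)
        fertility = len(tok.tokenize(sample)) / len(sample.split())
        print(f"{model_id}: {fertility:.1f} tokens per word")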
