Training Data for AI Models

LLMs are trained on vast amounts of data and rely heavily on human-written programming. Human programmers build these AI systems, so human errors and innate biases are transferred to the machines and manifest themselves in AI actions and behaviour.

Data is AI’s lifeblood: it is the driving force behind the technology’s development and learning. ChatGPT was reportedly trained on about 570 GB of text, or around 300 billion words. Image-generating apps such as DALL-E and Midjourney rely on diffusion models; Stable Diffusion, a comparable open model, was trained on roughly 5.8 billion image-text pairs. We do not know what training data Google used for Gemini; it could include trillions of pieces of text, images, videos and audio clips.

AI developers source training data for LLMs from high-quality material: academic papers, books, news articles, Wikipedia and filtered internet content. But the available high-quality data is not enough, so low-quality, user-generated text (blog posts, social media posts, online comments) is also used. This low-quality data is likely more biased than the high-quality data, and it may include illegal content as well. Apart from this, AI systems use simulated or synthetic data created by other AI models; by one industry forecast, some 60 per cent of data used for AI was expected to be synthetic by 2024, up from about 1 per cent in 2021. The underlying programming for these simulations may itself introduce bias into the data.
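To make the idea of “filtered internet content” concrete, here is a minimal sketch of the kind of heuristic quality filter that can be applied to raw web text before training. The heuristics and thresholds are illustrative assumptions, not any lab’s actual pipeline:

```python
# Hypothetical quality filter for web text (illustrative heuristics only,
# not the filtering used by any actual LLM developer).

def looks_high_quality(text: str) -> bool:
    """Crude checks: reject very short snippets, text that is mostly
    non-alphabetic (markup debris, spam symbols), or text shouting in caps."""
    words = text.split()
    if len(words) < 5:
        return False
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    if alpha_ratio < 0.6:
        return False
    caps_words = sum(w.isupper() for w in words)
    if caps_words / len(words) > 0.3:
        return False
    return True

documents = [
    "The committee published its findings after a two-year study.",
    "CLICK HERE!!! >>> $$$ FREE $$$ <<<",
    "ok",
]
kept = [d for d in documents if looks_high_quality(d)]
# Only the first, prose-like document survives the filter.
```

Real pipelines layer many more signals (deduplication, language identification, toxicity classifiers), but the principle is the same: cheap rules discard the bulk of low-quality web text before it ever reaches the model.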

If the data is not regularly updated, it becomes dated. Initially, ChatGPT’s training data had a cut-off in 2021.
