LLMs and Biology

LLMs are likely to find uses in biology by learning the language of biology. The previous century saw considerable research in molecular biology, biochemistry and genetics, and it showed that biology is both programmable and decipherable. DNA is written in four basic components: adenine (A), cytosine (C), guanine (G) and thymine (T). Computers depend on the binary system of 0s and 1s; biology depends on the quaternary system of A, C, G and T. Here there is conceptual overlap. Proteins are made of amino acids, anywhere from a few dozen to several thousand of them, drawn from an alphabet of 20. Thus proteins too are amenable to computation.
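
The parallel above can be made concrete. This is a minimal sketch in plain Python (the base-to-bits mapping and the tiny codon table are illustrative, not a full genetic code): each of the four DNA bases fits in exactly two binary digits, and triplets of bases (codons) map to amino acids.

```python
# Each of the four DNA bases fits in two bits, mirroring
# how two binary digits encode exactly four states.
BASE_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_BITS[base]
    return value

# A small, deliberately incomplete slice of the standard genetic code:
# three-base codons map to one-letter amino acid codes.
CODON_TABLE = {"ATG": "M", "TGG": "W", "GGC": "G", "AAA": "K"}

def translate(dna: str) -> str:
    """Translate a DNA sequence codon by codon (known codons only)."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))
```

For example, `encode("ACGT")` packs the four bases into the single byte `0b00011011`, and `translate("ATGAAATGG")` yields the amino acid string `"MKW"`.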

Demis Hassabis of DeepMind treats biology as an information-processing system. Physics depends on mathematics as its primary language; biology could likewise come to depend on AI as its primary language.

LLMs work optimally in the presence of massive, signal-rich data. They infer patterns and structures, and they generate novel output by comprehending the topic.

By ingesting the whole Internet, ChatGPT has become conversational.

If LLMs are trained on biological data, they could learn the language of life.

In early applications, LLMs could be used to design proteins, the building blocks of life. A protein's shape determines its function. Antibody proteins target foreign bodies, the antigens, just as a key fits a lock. Enzymes, which are proteins that bind to specific molecules, accelerate biological reactions. Understanding this binding tells us how life functions at the molecular level.

A protein's one-dimensional sequence can be converted into its 3D structure. This was done by the AlphaFold AI system. AlphaFold was not, of course, built on LLMs; it used MSA, multiple sequence alignment, a technique from bioinformatics. But MSA has limitations. It is slow and compute-intensive, and it cannot be used for 'orphan' proteins with no known analogues, which constitute some 20 per cent of all proteins. Protein structure can instead be deduced/predicted using LLMs.

LLMs can be trained on protein sequences instead of the English language, and used to predict protein structure efficiently. This line of work began in 2019. In 2022, Meta (then Facebook) put forward ESM-2 and ESMFold, two powerful protein models with up to 15 billion parameters. The prediction can also be run in reverse, paving the way to generating novel protein structures.
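
What "training on protein sequences instead of English" means in practice is that each amino acid becomes a token. This is a hedged sketch, not the actual ESM tokenizer: a toy vocabulary built from the 20 standard amino acids plus a few special tokens of the kind transformer models typically use.

```python
# Treating a protein sequence as "text": each amino acid gets its own
# token ID, plus special tokens, before being fed to a transformer.
# (Illustrative only; real protein LLMs define their own vocabularies.)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard amino acids
SPECIALS = ["<cls>", "<pad>", "<eos>", "<mask>"]
VOCAB = {tok: i for i, tok in enumerate(SPECIALS + list(AMINO_ACIDS))}

def tokenize(sequence: str) -> list[int]:
    """Wrap a protein sequence in <cls>...<eos> and map residues to IDs."""
    ids = [VOCAB["<cls>"]]
    ids += [VOCAB[aa] for aa in sequence]
    ids.append(VOCAB["<eos>"])
    return ids
```

Once sequences are integer token lists like this, the same masked-token training used for English text applies unchanged; the model learns the "grammar" of proteins from the data alone.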

AI can be used to invent new proteins, exploring the vast uncharted protein space. It is a nascent field.

LLMs can be used to generate biomolecules such as nucleic acids.

The ultimate aim is to go beyond modelling proteins alone. We have to study their interactions with other molecules, cells, tissues and organs, so as to cover the whole living organism.

The 20th century was dominated by physics. The 21st century is expected to be dominated by biology.
