Beyond generative natural-language tasks, LLMs can also be applied to tasks involving vision. One approach is to train a visual encoder that represents images as a series of continuous embeddings the LLM consumes directly. Another is to pair the LLM with a contrastively trained, frozen vision encoder and align the two with a lightweight transformer.
Both approaches carry substantial pretraining costs, because the visual and textual datasets must be aligned with an existing LLM.
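To make the second approach concrete, here is a minimal PyTorch sketch of a lightweight alignment module: a small trainable transformer whose learned queries attend over a frozen vision encoder's patch features and project them into the LLM's embedding space. The dimensions, layer counts, and class name are illustrative assumptions, not the configuration of any specific published model.

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Lightweight trainable bridge: frozen vision features -> LLM token embeddings.

    Illustrative sketch only; sizes are assumptions, not a published configuration.
    """

    def __init__(self, vision_dim=1024, llm_dim=4096, num_query_tokens=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Learned query tokens that summarize the image into a fixed number
        # of "visual tokens" the LLM can consume.
        self.query_tokens = nn.Parameter(torch.randn(num_query_tokens, vision_dim) * 0.02)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=vision_dim, nhead=num_heads, batch_first=True)
        # The queries cross-attend to the frozen encoder's patch features.
        self.transformer = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Linear projection into the LLM's embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder.
        batch = patch_features.size(0)
        queries = self.query_tokens.unsqueeze(0).expand(batch, -1, -1)
        visual_tokens = self.transformer(tgt=queries, memory=patch_features)
        return self.proj(visual_tokens)  # (batch, num_query_tokens, llm_dim)


# Usage with random stand-in features in place of a real frozen encoder:
adapter = VisionToLLMAdapter()
fake_patches = torch.randn(2, 257, 1024)   # e.g. CLIP ViT-L/14-sized patch features
visual_embeds = adapter(fake_patches)
print(visual_embeds.shape)                  # torch.Size([2, 32, 4096])
```

Only the adapter is trained; the vision encoder and the LLM stay frozen, which is precisely where the alignment pretraining cost comes from.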
Flamingo, for example, inserts new cross-attention layers into a pretrained LLM to incorporate visual features, and then pretrains the combined model on multimodal data: roughly 2 billion image-text pairs and 43 million webpages.
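Flamingo's inserted layers are gated cross-attention blocks whose tanh gates start at zero, so the frozen LLM initially behaves exactly as it did before multimodal training. Below is a minimal PyTorch sketch of that idea; the hidden size and head count are chosen for illustration only.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention inserted between frozen LM layers.

    The tanh gates are initialized to zero, so at the start of training the block
    is an identity function and the frozen LLM's behaviour is preserved.
    Sizes are illustrative.
    """

    def __init__(self, dim=2048, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        # Learnable gates, initialized to zero (tanh(0) = 0).
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden:   (batch, text_len, dim) from the frozen LLM
        # visual_tokens: (batch, vis_len, dim) from the vision side
        attn_out, _ = self.cross_attn(self.norm_attn(text_hidden),
                                      visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(self.norm_ffn(x))
        return x


block = GatedCrossAttentionBlock()
text = torch.randn(1, 16, 2048)
vision = torch.randn(1, 64, 2048)
out = block(text, vision)
print(torch.allclose(out, text))  # True at initialization: the gates are zero
```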
Researchers from Contextual AI and Stanford University have developed LENS (Large Language Models ENhanced to See), a strategy in which an LLM acts as the reasoning module over a set of independent vision modules.
First, pretrained vision modules extract rich textual information from the image. That text is then passed to the LLM, which carries out the task at hand, such as object recognition or other vision-and-language reasoning. LENS thus bridges the two modalities at no additional cost, since the multimodal pretraining stages are eliminated entirely. The integration also lets the approach benefit from the most recent developments in both computer vision and NLP.
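The sketch below illustrates this pipeline with off-the-shelf Hugging Face models: CLIP produces zero-shot tags, BLIP produces a caption, and an instruction-tuned LLM answers a question from the text alone. The specific checkpoints, tag vocabulary, and prompt format are illustrative stand-ins, not the exact modules or templates used by LENS.

```python
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          BlipProcessor, BlipForConditionalGeneration,
                          AutoTokenizer, AutoModelForSeq2SeqLM)

image = Image.open("photo.jpg").convert("RGB")  # placeholder path to any local image

# 1) Zero-shot tags from a contrastively trained vision encoder (CLIP).
candidate_tags = ["a dog", "a cat", "a car", "a bicycle", "a person"]
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_inputs = clip_proc(text=candidate_tags, images=image,
                        return_tensors="pt", padding=True)
with torch.no_grad():
    probs = clip(**clip_inputs).logits_per_image.softmax(dim=-1)[0]
top_tags = [candidate_tags[i] for i in probs.topk(2).indices.tolist()]

# 2) A caption from an image-captioning model (BLIP).
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
with torch.no_grad():
    caption_ids = blip.generate(**blip_proc(image, return_tensors="pt"), max_new_tokens=30)
caption = blip_proc.decode(caption_ids[0], skip_special_tokens=True)

# 3) A frozen LLM reasons over the textual description only.
question = "What animal is in the picture?"
prompt = (f"Image tags: {', '.join(top_tags)}\n"
          f"Image caption: {caption}\n"
          f"Question: {question}\nAnswer:")
tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
with torch.no_grad():
    answer_ids = llm.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=10)
print(tok.decode(answer_ids[0], skip_special_tokens=True))
```

No component in this pipeline is fine-tuned; the only "interface" between vision and language is the text in the prompt.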
LENS lets the language model apply its few-shot, in-context learning capabilities through natural-language descriptions of the visual inputs. In effect, it gives an off-the-shelf LLM the ability to see: frozen LLMs can handle object recognition and visual reasoning tasks with no need to align multimodal data.
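To make the few-shot aspect concrete, a prompt can interleave textual descriptions of a few labeled examples before the description of the query image, and the frozen LLM completes the answer in context. The template below is one plausible layout, not the exact format used in the LENS work.

```python
# Hypothetical few-shot prompt: each support example is an image already
# converted to text (tags + caption) plus its label; the template is illustrative.
support_examples = [
    ("tags: golden retriever, grass; caption: a dog running on a lawn", "dog"),
    ("tags: tabby cat, sofa; caption: a cat curled up on a couch", "cat"),
]
query = "tags: parrot, branch; caption: a colorful bird on a tree branch"

prompt = "".join(f"Image: {desc}\nLabel: {label}\n\n" for desc, label in support_examples)
prompt += f"Image: {query}\nLabel:"
print(prompt)  # fed to the frozen LLM, which predicts the final label in context
```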