o3 from OpenAI has achieved human-level results on a test designed to measure general intelligence. On this evaluation, called ARC-AGI, previous AI systems could not cross 55 per cent, while the threshold for human-level performance was set at 85 per cent. The test is a tough test of abstract reasoning. OpenAI thus appears to have taken a significant step towards the goal.
Let us understand the ARC-AGI test. It measures an AI system’s sample efficiency in adapting to something new: the system is given only a few examples of a novel situation, which it has to figure out.
ChatGPT is not very sample efficient. It is pretrained on millions of examples of human text, from which it constructs probabilistic rules about which combinations of words are most likely to follow one another. As a result, it is good at common tasks but poor at uncommon ones, because it has been exposed to fewer samples of them.
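The idea of word-combination probabilities can be sketched in miniature. The toy example below is my own construction, and real models are vastly larger and use neural networks rather than raw counts, but it shows the principle: count which word follows which in training text, then predict the most frequent follower.

```python
# Toy illustration (not how ChatGPT actually works): learn which word
# most often follows another from a tiny training corpus.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# For each word, count the words observed to follow it.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def most_likely_next(word):
    """Predict the most frequent follower seen in training."""
    return follow_counts[word].most_common(1)[0][0]

print(most_likely_next("the"))  # 'cat' ('cat' followed 'the' twice, 'mat' once)
```

A word seen only once in training yields an unreliable prediction, which is the sample-efficiency problem in miniature: rare situations get rare data.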
The ability to generalize, in essence, means solving previously unknown or novel problems from limited samples of data. It is an important component of general intelligence.
The ARC-AGI benchmark tests for sample-efficient adaptation using little grid-square problems. The AI needs to figure out the pattern that turns the grid on the left into the grid on the right. Each question provides three worked examples to learn from. The AI system then has to figure out rules that generalize from these three examples to a fourth.
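The shape of such a task can be sketched abstractly. The toy puzzle below is a made-up rule of my own, far simpler than real ARC-AGI grids, but it shows the structure: three example input-output pairs, a candidate rule verified against all of them, and a fourth grid to transform.

```python
# Toy grid task (not the actual ARC-AGI format). Hypothetical hidden
# rule for this puzzle: every cell's value is doubled.
examples = [
    ([[0, 1], [2, 0]], [[0, 2], [4, 0]]),
    ([[3, 0], [0, 1]], [[6, 0], [0, 2]]),
    ([[1, 1], [2, 2]], [[2, 2], [4, 4]]),
]
test_input = [[0, 2], [1, 3]]

def candidate_rule(grid):
    """A candidate transformation: double every cell value."""
    return [[2 * cell for cell in row] for row in grid]

# A solver must check its candidate against all three worked examples...
assert all(candidate_rule(inp) == out for inp, out in examples)

# ...before applying it to the fourth, unseen grid.
print(candidate_rule(test_input))  # [[0, 4], [2, 6]]
```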
It is akin to the IQ-style puzzles found in some school tests and competitive examinations.
OpenAI’s o3 model has proved highly adaptable at this. In detecting a pattern, one should not make unnecessary assumptions, nor be any more specific than needed. The weakest rules, in theory, are the ones that do what you want and nothing more. If you can identify these, you maximize your ability to adapt to new situations.
Weaker rules are the ones that can be phrased in simpler statements.
OpenAI may not have explicitly optimized the o3 model to find weak rules. To succeed at ARC-AGI, however, it must be finding them.
OpenAI began with a general-purpose version of the o3 model and then trained it specifically for the ARC-AGI test.
Francois Chollet, who designed the benchmark, believes o3 searches through different ‘chains of thought’ to solve the task. It then chooses the ‘best’ one according to some loosely defined rule, or heuristic.
Each chain of thought could be considered a programme that fits the examples.
Several different programmes are generated, and the heuristic could be ‘choose the weakest’ or ‘choose the simplest’.
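That search-and-select idea can be illustrated in a few lines. The sketch below is my own toy construction, not OpenAI’s actual method: candidate ‘programmes’ are filtered for consistency with the worked examples, and a crude ‘choose the simplest’ heuristic picks among the survivors.

```python
# Toy sketch of generate-then-select (not OpenAI's actual method).
examples = [(1, 3), (2, 5), (4, 9)]  # hypothetical input -> output pairs

# Candidate programmes, each paired with a crude complexity score
# (here, simply the number of operations each uses).
candidates = [
    (lambda x: 2 * x + 1, 2),   # fits the examples; complexity 2
    (lambda x: x + x + 1, 3),   # also fits; complexity 3
    (lambda x: x ** 2, 1),      # simplest, but does not fit
]

# Keep only programmes consistent with every worked example.
consistent = [(f, c) for f, c in candidates
              if all(f(x) == y for x, y in examples)]

# Heuristic: among consistent programmes, choose the simplest.
best, _ = min(consistent, key=lambda pair: pair[1])
print(best(10))  # 21
```

Note that the heuristic only applies among programmes that already fit the data: the genuinely simplest candidate here is rejected because it fails the examples.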
Is this closer to AGI? If the model’s modus operandi is as described above, it may be no better than the previous models. Its generalization would still rest on language, which is not well suited to this. What looks like a more generalizable ‘chain of thought’ may simply emerge from the extra steps of training a heuristic specialized to this test.
Much of what has gone into the making of o3 remains unknown, and its exposure is still limited to a select audience. Understanding it will require extensive work, including assessing how often it succeeds and how often it fails. Once it hits the market, we will know whether it is as adaptable as an average human being. If it is, it can be improved further towards accelerated intelligence, and it will also have to be governed by an appropriate framework. If it is not, we should still call the test result impressive, but life goes on as usual.