Expectations for OpenAI's O3 were high, especially after the promising results of the preview version. But how does the released model actually perform compared to the preview and to other AI models? An analysis by the ARC Prize Foundation, a non-profit organization specializing in the evaluation of AI systems, offers some revealing answers.
The ARC Prize Foundation uses open benchmarks such as ARC-AGI to highlight the gap between human reasoning and the capabilities of current AI systems. The ARC-AGI benchmark tests abilities such as symbolic thinking, multi-step inference, and context-dependent rule application, areas where humans often act intuitively but AI models still struggle. The tests are run at three reasoning-effort levels ("low", "medium", "high") that vary how deeply the model reasons before answering.
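For readers unfamiliar with the format: an ARC-AGI task presents a few input/output grid pairs and asks the solver to infer the underlying transformation rule and apply it to a new input. The following Python sketch is a made-up toy task in that spirit (not an actual benchmark item) to illustrate the kind of rule induction being tested.

```python
# Made-up toy task in the spirit of ARC-AGI (not a real benchmark item).
# Each task provides a few input/output grid pairs; the solver must infer
# the transformation rule and apply it to a previously unseen input.
demo_pairs = [
    # Hidden rule in this toy example: every 1 becomes a 2.
    ([[0, 1], [1, 0]], [[0, 2], [2, 0]]),
    ([[1, 1], [0, 0]], [[2, 2], [0, 0]]),
]

test_input = [[0, 0], [1, 1]]

def apply_inferred_rule(grid):
    # A human reads the rule off the examples almost instantly; the
    # benchmark measures whether a model can do the same reliably.
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

print(apply_inferred_rule(test_input))  # -> [[0, 0], [2, 2]]
```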
The ARC Prize Foundation's analysis covered 740 tasks from ARC-AGI-1 and ARC-AGI-2, tested with O3 and O4-mini at all three reasoning levels. On ARC-AGI-1, O3 achieved an accuracy of 41 percent at low reasoning effort and 53 percent at medium effort. The smaller O4-mini reached 21 percent (low) and 42 percent (medium). On the more challenging ARC-AGI-2 benchmark, both models performed far worse, staying below three percent accuracy. The comparison with the predecessor O1 is telling: O3 surpasses O1 on ARC-AGI-1 by about 20 percentage points, but remains well behind the results of the O3 preview from December 2024.
At the "high" reasoning level, both models failed to complete many tasks. The analysis also showed that the models tended to answer the easier tasks and leave the more difficult ones unanswered. Scoring only the answered tasks would overstate actual performance, so these partial results were excluded from the official rankings.
The data shows that spending more compute does not necessarily lead to better results, often only to higher costs. In particular, O3-high consumes significantly more computing power without a corresponding gain in accuracy on simpler tasks. The ARC Prize Foundation therefore recommends O3-medium as the default setting for cost-sensitive applications. The high-reasoning mode is recommended only when maximum accuracy is required and cost is a secondary concern.
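For developers calling the models through the API, this choice comes down to a single request parameter. The sketch below uses OpenAI's Python SDK and the reasoning_effort parameter documented for the o-series reasoning models; treat the exact model names and parameter support as assumptions to verify against the current API documentation.

```python
# Minimal sketch: selecting the reasoning-effort level for an o-series
# model via OpenAI's Python SDK (verify parameter support against the
# current API docs; model name and prompt are placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3",                 # or "o4-mini" for the cheaper model
    reasoning_effort="medium",  # "low" | "medium" | "high"; "medium" is the
                                # cost/accuracy compromise suggested above
    messages=[
        {"role": "user", "content": "Infer the rule from these grid examples ..."},
    ],
)

print(response.choices[0].message.content)
```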
In terms of efficiency, that is, how quickly, cheaply, and with how little compute a model can solve problems, O4-mini stands out: it achieves 21 percent accuracy on ARC-AGI-1 at a cost of about five cents per task, while older models such as O1-pro require about eleven dollars per task for comparable results.
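A rough back-of-the-envelope calculation, using only the per-task figures cited above and assuming roughly comparable accuracy for both models, shows how large the efficiency gap is in terms of cost per correctly solved task:

```python
# Back-of-the-envelope cost comparison based on the figures cited above
# (illustrative only; real per-task costs vary with task length and pricing).
o4_mini_cost, o4_mini_acc = 0.05, 0.21   # ~5 cents/task, 21 % accuracy (low effort)
o1_pro_cost, o1_pro_acc = 11.00, 0.21    # ~11 dollars/task, assumed comparable accuracy

def cost_per_solved_task(cost, accuracy):
    return cost / accuracy

print(f"O4-mini: ~${cost_per_solved_task(o4_mini_cost, o4_mini_acc):.2f} per solved task")
print(f"O1-pro:  ~${cost_per_solved_task(o1_pro_cost, o1_pro_acc):.2f} per solved task")
print(f"Cost ratio: ~{o1_pro_cost / o4_mini_cost:.0f}x")
```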
The released version of O3 differs significantly from the O3 preview. OpenAI confirmed to ARC that the released model has a different architecture, is smaller overall, is multimodal (it processes both text and image inputs), and uses fewer computational resources than the preview version. There are also differences in the training data: the O3 preview was trained on 75 percent of the ARC-AGI-1 dataset, whereas the released O3 model was not trained directly on ARC-AGI data. The model may, however, have been exposed to it indirectly through the benchmark's public availability.
Although O3 is the best-performing publicly tested model on ARC-AGI-1, the newly introduced ARC-AGI-2 benchmark remains largely unsolved by both new models. While humans solve an average of 60 percent of ARC-AGI-2 tasks even without special training, OpenAI's strongest model currently achieves only about three percent. This highlights the continuing gap between human and machine problem-solving abilities.
The ARC Prize Foundation's results underline that benchmark scores, especially those reported for unreleased AI models, should be treated with caution. AI development is a moving target, and the released version of a model can differ substantially from the preview that produced the headline numbers.
Sources:
https://ground.news/article/openais-o3-is-less-agi-than-originally-measured
https://techcrunch.com/2025/04/20/openais-o3-ai-model-scores-lower-on-a-benchmark-than-the-company-initially-implied/
https://arcprize.org/blog/oai-o3-pub-breakthrough
https://www.reddit.com/r/OpenAI/comments/1hiskbt/o3_is_not_agi/
https://openai.com/index/introducing-o3-and-o4-mini/
https://www.newscientist.com/article/2462000-openais-o3-model-aced-a-test-of-ai-reasoning-but-its-still-not-agi/
https://arxiv.org/html/2501.07458v1
https://www.datacamp.com/blog/o3-openai
https://arcprize.org/blog/analyzing-o3-with-arc-agi