Leading AI models were recently put through an unusual test: arguing cases in the courtroom of the video game "Phoenix Wright: Ace Attorney". Researchers at the Hao AI Lab at the University of California San Diego used the popular game to evaluate the models' ability to reason logically and construct arguments. "Ace Attorney" is particularly well suited to this purpose because players must gather evidence, uncover contradictions in witness testimony, and expose the truth behind lies, complex tasks that also challenge artificial intelligence.
The inspiration for the experiment came in part from Ilya Sutskever, co-founder of OpenAI. He compared predicting the next word in a text to understanding a detective story. The models used in the "Ace Attorney" test had to process long dialogues, identify inconsistencies in cross-examinations, and present the appropriate evidence to challenge witness statements.
Various multimodal and reasoning-focused AI models were tested, including OpenAI o1, Gemini 2.5 Pro, Claude 3.7 Sonnet (thinking), and Llama 4 Maverick. Both o1 (version o1-2024-12-17) and Gemini 2.5 Pro made it to level 4 of the game, and with scores of 26 and 20, respectively, they achieved the best results in the "Ace Attorney" performance test, with o1 coming out on top in the most difficult cases.
The test goes beyond mere text or image analysis. The models had to search through long passages of text, identify contradictions within them, accurately understand visual information, and make strategic decisions throughout the game. "The game design forces the AI to translate its understanding into context-dependent actions, which goes beyond pure text and image tasks. It's harder to overfit the model because success here depends on thinking about the context-dependent action space – not just memorization," the researchers explain. Overfitting occurs when a language model memorizes its training data – including all randomness and errors – and therefore performs poorly on new, unknown examples.
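To make the setup more concrete, here is a minimal sketch in Python of what such an evaluation loop might look like: the model receives the accumulated testimony plus the current screenshot and must answer with a single in-game action. The Hao AI Lab has not published its harness here, so every class, function, and parameter name in the sketch is a hypothetical stand-in rather than the lab's actual code.

```python
# Hypothetical sketch of a game-based evaluation loop of this kind.
# The actual Hao AI Lab harness is not reproduced here; all names below
# are assumptions made purely for illustration.
from dataclasses import dataclass, field


@dataclass
class CourtState:
    """Minimal game state: dialogue so far, latest screenshot, collected evidence."""
    transcript: list[str] = field(default_factory=list)
    screenshot_png: bytes = b""
    court_record: list[str] = field(default_factory=list)


def build_prompt(state: CourtState) -> str:
    """Pack the long dialogue history and the court record into a single prompt."""
    return (
        "You are playing Phoenix Wright: Ace Attorney.\n"
        "Testimony so far:\n" + "\n".join(state.transcript) + "\n"
        "Evidence in the court record: " + ", ".join(state.court_record) + "\n"
        "Answer with exactly one action: PRESS <statement#> "
        "or PRESENT <evidence> <statement#>."
    )


def run_case(model, game) -> int:
    """Play one case to the end and count how many model actions the game accepts."""
    score = 0
    state = game.reset()
    while not game.finished():
        # The model sees both the text prompt and the current screenshot (multimodal input).
        action = model.generate(build_prompt(state), image=state.screenshot_png)
        state, accepted = game.step(action)  # the game decides whether the move advances the case
        score += int(accepted)
    return score
```

The key point the researchers make is visible in the loop: the model's output is only scored if it translates into a valid, context-appropriate in-game action, not merely a plausible-sounding piece of text.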
Gemini 2.5 Pro proved to be significantly more cost-effective than the other models tested. The Hao AI Lab reports that it is six to fifteen times cheaper than o1, depending on the case. In a particularly long scenario in level 2, o1 incurred costs of $45.75, while Gemini 2.5 Pro completed the task for $7.89. Gemini 2.5 Pro is also cheaper than GPT-4.1, a model not specifically optimized for logical reasoning, at $1.25 per million input tokens versus $2 for GPT-4.1. However, the researchers point out that the actual costs could be somewhat higher because the game also requires image processing.
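As a rough sanity check on these figures, a few lines of arithmetic using only the numbers quoted above show that the level-2 scenario works out to roughly a factor of six, at the lower end of the six-to-fifteen-times range:

```python
# Back-of-the-envelope check of the cost figures quoted above.
o1_case_cost = 45.75       # USD for the long level-2 scenario with o1
gemini_case_cost = 7.89    # USD for the same scenario with Gemini 2.5 Pro
print(f"o1 vs. Gemini 2.5 Pro per-case ratio: {o1_case_cost / gemini_case_cost:.1f}x")  # ~5.8x

gemini_input_price = 1.25  # USD per million input tokens (Gemini 2.5 Pro)
gpt41_input_price = 2.00   # USD per million input tokens (GPT-4.1)
print(f"GPT-4.1 vs. Gemini input-price ratio: {gpt41_input_price / gemini_input_price:.2f}x")  # 1.60x
```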
The Hao AI Lab has been comparing language models on various games such as Candy Crush, 2048, Sokoban, Tetris, and Super Mario since February. Of all the titles tested so far, "Ace Attorney" likely places the heaviest demands on logical reasoning. The results show that while AI models are making progress in logical reasoning, there is still room for improvement in cost efficiency.