o3 underwhelms: OpenAI’s latest AI scores lower than advertised

A gap between first-party and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and its model-testing practices.

In December, OpenAI introduced o3, claiming the model could answer slightly more than 25% of the questions in FrontierMath, a notoriously difficult set of mathematical problems. That score far outpaced the competition: the next-best model answered only about 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “Internally, we are observing that with o3 in aggressive test-time compute settings, we are able to achieve over 25%.”

As it turns out, that figure appears to be an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI released publicly last week.

On Friday, Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed figure.

Strictly speaking, this does not mean OpenAI was deceitful. The benchmark results the company published in December include a lower-bound score that matches the one Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

Epoch noted that the gap between its results and OpenAI’s could stem from several factors: OpenAI may have evaluated with a more powerful internal scaffold, used more test-time compute, or run the benchmark on a different subset of FrontierMath problems (the 180 problems in frontiermath-2024-11-26 versus the 290 in frontiermath-2025-02-28-private).
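
To make the last factor concrete, here is a small illustration, using hypothetical solve counts (neither OpenAI nor Epoch has published per-problem results), of how the choice of problem set alone can move the headline number:

```python
# Hypothetical illustration: the solve count below is made up.
# It shows how the same set of correct answers yields different headline
# scores depending on which FrontierMath snapshot is used for grading.

def accuracy(solved: int, total: int) -> float:
    """Fraction of benchmark problems answered correctly."""
    return solved / total

# Suppose a model solves 45 problems, all of them in the older snapshot.
old_set = accuracy(solved=45, total=180)  # frontiermath-2024-11-26
# If the newer, larger snapshot adds problems the model cannot solve,
# the same 45 correct answers produce a much lower percentage.
new_set = accuracy(solved=45, total=290)  # frontiermath-2025-02-28-private

print(f"older snapshot: {old_set:.1%}")  # 25.0%
print(f"newer snapshot: {new_set:.1%}")  # 15.5%
```

Scaffolding and compute differences would compound this effect, so a single headline percentage says little without the accompanying evaluation details.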

Epoch’s findings are supported by a recent post on X from the ARC Prize Foundation, which tested a prerelease version of o3 and says the public o3 model “is a different model […] tuned for chat/product use.”

ARC Prize reported that “all released o3 compute tiers are smaller than the version we [benchmarked].” In general, larger compute tiers tend to deliver better benchmark scores.
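
OpenAI has not said exactly what its “aggressive test-time compute settings” involve, but one common pattern is to sample many candidate solutions per problem and count the problem as solved if any attempt is correct. A toy best-of-n model, under the simplifying (and generous) assumption that attempts are independent, shows why more compute per problem pushes scores upward:

```python
# Toy model of test-time compute scaling via best-of-n sampling.
# Assumption (ours, not OpenAI's): attempts are independent, each with
# the same per-sample success probability. Real-world gains are smaller,
# but the direction is the same: more samples, higher benchmark score.

def solve_rate(p_single: float, n_samples: int) -> float:
    """Probability that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** n_samples

for n in (1, 4, 16, 64):
    print(f"{n:>2} samples/problem -> {solve_rate(0.10, n):.1%} solved")
# 1 -> 10.0%, 4 -> 34.4%, 16 -> 81.5%, 64 -> 99.9%
```

A smaller compute tier means fewer samples (or less reasoning) per problem, which is consistent with ARC Prize’s observation that the released tiers score below the version it benchmarked.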

During a livestream last week, Wenda Zhou, a member of OpenAI’s technical staff, said that the o3 now in production is “more optimised for real-world use cases” and speed than the version demonstrated in December. As a result, he noted, it may show “disparities” on benchmarks.

Zhou said the team made optimisations to improve the model’s cost-efficiency and overall usefulness. “Our optimism remains intact as we believe that this represents a significantly improved model […] You can expect quicker responses when seeking answers, a notable advantage of these models.”

The fact that the public release of o3 falls short of OpenAI’s testing claims is arguably beside the point. The company’s o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI is set to introduce a more powerful variant, o3-pro, in the near future.

Still, this serves as yet another reminder that AI benchmarks should be approached with caution, particularly when they come from companies with a vested interest in promoting their own services.

Benchmarking controversies are becoming increasingly common in the rapidly evolving AI industry, as vendors race to dominate headlines and capture attention with their latest models.

In January, Epoch drew criticism for waiting to disclose funding from OpenAI until after the company announced o3. Many of the academics who contributed to FrontierMath were unaware of OpenAI’s involvement until it was made public.
