o3 underwhelms: OpenAI’s latest AI scores lower than advertised

April 22, 2025

by The AI Legends

A discrepancy between first-party and third-party benchmark results for OpenAI’s o3 AI model is prompting scrutiny of the company’s transparency and its model-testing practices.

In December, OpenAI introduced o3, asserting that the model could answer slightly more than 25% of the questions in FrontierMath, a notably difficult collection of mathematical problems. That score far outperformed the competition: the next-best model answered only about 2% of FrontierMath problems correctly.

Mark Chen, chief research officer at OpenAI, stated during a livestream, “Today, all offerings out there have less than 2% [on FrontierMath]. Internally, we are observing that with o3 in aggressive test-time compute settings, we are able to achieve over 25%.”

That figure appears to have been an upper bound, reached by a variant of o3 that used more computational resources than the model OpenAI released to the public last week.

On Friday, Epoch AI, the research institute responsible for FrontierMath, unveiled the findings of its independent benchmark tests of o3. Epoch reported that o3 achieved a score of approximately 10%, significantly lower than the highest score claimed by OpenAI.

This does not mean OpenAI lied, strictly speaking. The benchmark results the company published in December include a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup probably differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

Epoch noted that the disparity between its findings and OpenAI’s could stem from several factors: OpenAI may have used a more advanced internal framework, used more computational resources during testing, or run its evaluations on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 versus the 290 problems in frontiermath-2025-02-28-private).

A recent post on X by the ARC Prize Foundation, which evaluated a prerelease version of o3, indicates that the public o3 model “is a different model […] tuned for chat/product use,” supporting the findings reported by Epoch.

ARC Prize reported that “all released o3 compute tiers are smaller than the version we [benchmarked].” Larger compute tiers generally achieve better benchmark scores.

During a livestream last week, Wenda Zhou, a member of the technical staff at OpenAI, stated that the o3 currently in production is “more optimised for real-world use cases” and speed compared to the version of o3 showcased in December. Consequently, it could display notable “disparities” in benchmarks, he noted.

Zhou stated that they have implemented optimisations aimed at enhancing the model’s cost-efficiency and overall utility. “Our optimism remains intact as we believe that this represents a significantly improved model […] You can expect quicker responses when seeking answers, a notable advantage of these models.”

Granted, the fact that the public release of o3 falls short of the scores OpenAI advertised is somewhat moot. The company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI is set to introduce a more advanced version of o3, known as o3-pro, in the near future.

This serves as yet another reminder that AI benchmarks should be approached with caution, especially when they originate from companies with vested interests in promoting their services.

In the rapidly evolving AI industry, benchmarking controversies are increasingly frequent as vendors strive to dominate headlines and capture consumer attention with their latest models.

In January, Epoch faced criticism for waiting to disclose its funding from OpenAI until after the company had announced o3. Many academics who contributed to FrontierMath were unaware of OpenAI’s involvement until it was made public.
