AI Capabilities May Be Overhyped on Bogus Benchmarks, Study Finds

You know all of those reports about artificial intelligence models successfully passing the bar or achieving Ph.D.-level intelligence? Looks like we should start taking those degrees back. A new study from researchers at the Oxford Internet Institute suggests that the popular benchmarking tools used to test AI performance are often unreliable and misleading.

Researchers looked at 445 different benchmark tests used by the industry and other academic outfits to test everything from reasoning capabilities to performance on coding tasks. Experts reviewed each benchmarking approach and found indications that the results produced by these tests may not be as accurate as they have been presented, due in part to vague definitions for what a benchmark is attempting to test and a lack of disclosure of statistical methods that would allow different models to be easily compared.

A big problem the researchers found is that “many benchmarks are not valid measurements of their intended targets.” That is to say, while a benchmark may claim to measure a specific skill, it may define or test that skill in a way that doesn’t actually capture a model’s capability.

For example, the researchers point to the Grade School Math 8K (GSM8K) benchmarking test, which measures a model’s performance on grade school-level word-based math problems designed to push the model into “multi-step mathematical reasoning.” The GSM8K is advertised as being “useful for probing the informal reasoning ability of large language models.”

But the researchers argue that the test doesn’t necessarily tell you if a model is engaging in reasoning. “When you ask a first grader what two plus five equals and they say seven, yes, that’s the correct answer. But can you conclude from this that a fifth grader has mastered mathematical reasoning or arithmetic reasoning from just being able to add numbers? Perhaps, but I think the answer is very likely no,” Adam Mahdi, a senior research fellow at the Oxford Internet Institute and a lead author of the study, told NBC News.

In the study, the researchers pointed out that GSM8K scores have increased over time, which may point to models getting better at this kind of reasoning. But it may also point to contamination, which happens when benchmark test questions make it into the model’s training dataset and the model starts “memorizing” answers or information rather than reasoning its way to a solution. When researchers tested the same models on a new set of benchmark questions, they noticed that the models experienced “significant performance drops.”

While this study is among the largest reviews of AI benchmarking, it’s not the first to suggest this system of measurement may not be all that it’s sold to be. Last year, researchers at Stanford analyzed several popular AI model benchmark tests and found “large quality differences between them, including those widely relied on by developers and policymakers,” and noted that most benchmarks “are highest quality at the design stage and lowest quality at the implementation stage.”

If nothing else, the research is a good reminder that these performance measures, while often well-intended and meant to provide an accurate analysis of a model, can be turned into little more than marketing speak for companies.
