Meta's benchmarks for its new AI models are a bit misleading
- Meta released two new Llama 4 AI models, Maverick and Scout, over the weekend.
- Maverick achieved an Elo score of 1417, ranking it second on LMArena.
- Critics noted that the version of Maverick tested on LMArena differed from the publicly released model, creating confusion.
- Meta acknowledged that the Maverick submitted for benchmarking was a customized, experimental variant; LMArena said this approach did not align with its policy expectations.
14 Articles
Meta got caught gaming AI benchmarks
Over the weekend, Meta dropped two new Llama 4 models: a smaller model named Scout, and Maverick, a mid-size model that the company claims can beat GPT-4o and Gemini 2.0 Flash "across a broad range of widely reported benchmarks." Maverick quickly secured the number-two spot on LMArena, the AI benchmark site where humans compare outputs from different systems and vote on the best one. In Meta's press release, the company highlighted Maveric…
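For context on what a score like 1417 means, leaderboards of this kind typically convert head-to-head human votes into Elo-style ratings, where a model's score rises when raters prefer its output over an opponent's. The sketch below is a minimal illustration of that update rule, not LMArena's actual methodology; the starting rating and K-factor are illustrative assumptions.

```python
# Illustrative Elo-style update from pairwise human votes.
# Constants (starting rating, K-factor) are assumptions, not LMArena's parameters.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one human vote."""
    ea = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b

# Example: two models start at 1000; model A wins three straight votes.
a, b = 1000.0, 1000.0
for _ in range(3):
    a, b = update_elo(a, b, a_won=True)
print(round(a), round(b))  # A's rating climbs; B's falls by the same amount
```

Because ratings shift with every vote, a variant tuned to be more appealing in head-to-head comparisons (for example, one optimized for conversational style) can climb the board even if the publicly released model behaves differently.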
What misleading Meta Llama 4 benchmark scores show enterprise leaders about evaluating AI performance claims
Benchmarks are critical when evaluating AI — they reveal how well models work, as well as their strengths and weaknesses, based on factors like reliability, accuracy, and versatility. But the revelation that Meta misled users about the performance of its new Llama 4 model has raised red flags about the accuracy and relevancy of benchmarking, particularly when model builders tweak their products to get better results. “Organizations need to perfo…
Coverage Details
Bias Distribution
- 67% of the sources lean Left