The methodology of judging AI needs a rework

When Anthropic released Claude 4 a week ago, the artificial intelligence (AI) company stated that these models set “new standards for coding, advanced reasoning and AI agents”. It cited the leading score on SWE-Bench Verified, a benchmark for performance on real-world software engineering tasks. OpenAI similarly claims that its o3 and o4-mini models return the best scores on certain benchmarks. As does Mistral, for its open-source Devstral coding model.

Representative image. (Getty Images/iStockphoto)

AI companies flexing comparative test scores is a common theme.

The technology world has long been obsessed with synthetic benchmark test scores. Processor performance, memory bandwidth, storage speed, graphics rendering: such numbers have often been used to judge whether a PC or smartphone was worth your time and money.

Nevertheless, experts believe this may be the time to evolve the methodology for AI testing, rather than make a wholesale change.

American venture capitalist Mary Meeker, in her latest AI Trends report, notes that AI is now doing better than humans in terms of accuracy and realism. She points to the MMLU (Massive Multitask Language Understanding) benchmark, where AI models average 92.30% accuracy compared with a human baseline of 89.8%.

MMLU is a benchmark that judges a model's general knowledge across 57 tasks covering professional and academic subjects, including mathematics, law, medicine and history.

Benchmarks serve as standardized yardsticks to measure, compare and understand the progress of various AI models. They are structured assessments that provide comparable scores across models, and usually consist of datasets with thousands of curated questions, problems or tasks that test specific aspects of intelligence.

Understanding a benchmark score requires context for both the scale and the meaning behind the numbers. Most benchmarks report accuracy as a percentage, but the significance of these percentages varies dramatically across tests. On MMLU, random guessing would yield about 25% accuracy, since most questions are multiple choice with four options. Human performance typically ranges from 85–95%, depending on the subject area.

Headline numbers often mask important nuances. A model may excel in some subjects more than others, and an aggregated score can hide weak performance on tasks that require multi-step reasoning or creative problem-solving behind strong performance on more routine tasks.
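To make this concrete, here is a minimal sketch (with made-up questions, subjects and model answers, not how any particular leaderboard computes its scores) of how a multiple-choice benchmark tallies an overall accuracy alongside a per-subject breakdown:

```python
import random
from collections import defaultdict

# Hypothetical benchmark items: each has a subject and a correct option out of A-D.
ITEMS = [
    {"subject": "history", "answer": "B"},
    {"subject": "history", "answer": "D"},
    {"subject": "math", "answer": "A"},
    {"subject": "math", "answer": "C"},
]

def score(predictions, items):
    """Return overall accuracy plus a per-subject breakdown."""
    correct = 0
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for pred, item in zip(predictions, items):
        hit = pred == item["answer"]
        correct += hit
        per_subject[item["subject"]][0] += hit
        per_subject[item["subject"]][1] += 1
    overall = correct / len(items)
    breakdown = {s: c / t for s, (c, t) in per_subject.items()}
    return overall, breakdown

# A random guesser lands near 25% on four-option questions (over many items).
random_overall, _ = score([random.choice("ABCD") for _ in ITEMS], ITEMS)

# A model that is strong on history but weaker on math: the headline average hides it.
overall, breakdown = score(["B", "D", "A", "B"], ITEMS)
print(f"overall: {overall:.0%}, by subject: {breakdown}")  # 75% overall, but math only 50%
```

In this toy run the headline number looks healthy while one subject lags, which is exactly the kind of nuance an aggregated score can obscure.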

AI engineer and commentator Rohan Paul noted on X that “most benchmarks do not reward long-term memory, but focus on short-term tasks.”

Increasingly, AI companies are looking closely at the ‘memory’ aspect. A new paper by Google researchers details an attention technique dubbed ‘Infini-attention’, outlining how AI models can expand their “context window”.

Mathematical benchmarks often show wide performance gaps. While most of the latest AI models score more than 90% accuracy on the GSM8K benchmark (Claude 3.5 Sonnet leads with 97.72%, while GPT-4o scores 94.8%), more challenging mathematics benchmarks see comparatively much lower scores, even for leading models such as Google's Gemini 2.0 Flash and Claude Sonnet.

A rework

For AI testing, a rework is needed. In the words of Microsoft Chairman and CEO Satya Nadella, “All the evals are saturated. It is a little wasted.”

The tech veteran has announced that it is collaborating with institutions including Penn State University, Carnegie Mellon University and Duke University to develop an approach to evaluating AI models that predicts how they will perform on unfamiliar tasks and explains why, something current benchmarks struggle to do.

The efforts include creating benchmarking agents for dynamic model evaluation, contextual prediction, human-centric comparisons and the cultural aspects of general AI.

“The framework uses ADeLe (annotated-demand-levels), a technique that assesses how demanding a task is for AI models by applying measurement scales to 18 types of cognitive and knowledge-based capabilities,” explains Lexin Zhou, research assistant at Microsoft.
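As a rough illustration of the idea Zhou describes, here is a toy sketch (not Microsoft's ADeLe implementation; the capability names, demand levels and pass rule below are invented for the example) of annotating tasks with demand levels and comparing them against a model's ability profile:

```python
# Toy illustration of demand-level annotation; not the actual ADeLe framework.
# Each hypothetical task is rated 0-5 on a few capability dimensions, and a model
# has an ability profile on the same scale. A task is predicted solvable only if
# the model meets every demand level.

TASK_DEMANDS = {
    "summarize a news article": {"reading": 2, "reasoning": 1, "domain_knowledge": 1},
    "prove a geometry theorem": {"reading": 2, "reasoning": 5, "domain_knowledge": 4},
}

MODEL_ABILITIES = {"reading": 4, "reasoning": 3, "domain_knowledge": 3}  # hypothetical profile

def predict_success(demands: dict, abilities: dict) -> bool:
    """Predict success if the model's ability meets or exceeds every demand."""
    return all(abilities.get(dim, 0) >= level for dim, level in demands.items())

for task, demands in TASK_DEMANDS.items():
    verdict = "likely pass" if predict_success(demands, MODEL_ABILITIES) else "likely fail"
    print(f"{task}: {verdict}")
```

The appeal of such an approach, as described, is that the prediction comes with an explanation: the specific capability where a task's demand exceeds the model's ability.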

At the moment, popular benchmarks include SWE-Bench (Software Engineering Benchmark), which measures AI coding skills; ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence), which measures generalization and reasoning; and LiveBench AI, which measures agentic coding tasks and evaluates reasoning and coding work.

Among the limitations affecting interpretation: many benchmarks can be “gamed” with techniques that improve the score without necessarily improving intelligence or capability. Case in point, Meta's new Llama models.

In April, the company announced an array of models, including Llama 4 Scout, Llama 4 Maverick, and the still-in-training Llama 4 Behemoth. Meta CEO Mark Zuckerberg claimed that Behemoth would be “the highest performing base model in the world”. Maverick debuted ranking above OpenAI's GPT-4o on the LMArena benchmark, and just below Gemini 2.5 Pro.

This is where things went pear-shaped for Meta, as AI researchers started digging through these scores. It turned out Meta had shared a Llama 4 Maverick model that was optimized for this test, and not the version customers would actually get.

Meta denied any customization. “We've also heard claims that we trained on test sets. That's simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations,” Meta's VP of generative AI, Ahmad Al-Dahle, said in a statement.

There are other challenges. Models can memorize patterns specific to benchmark formats rather than developing real understanding. The selection and design of benchmarks also introduces bias.

There is also the question of localization. Yi Tay, an AI researcher formerly at Google AI and DeepMind, has developed a region-specific benchmark called SG-Eval, which focuses on helping train AI models for broader contexts. India too is building a sovereign large language model (LLM), with Sarvam, a Bengaluru-based AI startup, selected under the IndiaAI Mission.

As AI capabilities continue to advance, researchers are developing evaluation methods that test for real understanding, context and real-world ability, instead of plain pattern matching. In the case of AI, the numbers tell an important part of the story, but not the whole story.
