Scientists have devised a new way to measure how capable artificial intelligence (AI) systems are: by how long the tasks they can reliably complete would take a human, and how quickly that ability is catching up with ours on challenging work.
While AIs can generally outperform humans at text prediction and knowledge tests, they are far less effective when given more substantive projects to carry out, such as remote executive assistance.
To quantify these performance gains, a new study proposes measuring AI models by the length of the tasks they can complete, as judged by how long those tasks take humans. The researchers published their findings March 30 on the preprint database arXiv, meaning they have not yet been peer-reviewed.
“We find that measuring the length of tasks that models can complete is a helpful lens for understanding current AI capabilities. This makes sense: AI agents often seem to struggle with stringing together longer sequences of actions more than they lack skills or knowledge needed to solve single steps,” the researchers from AI organization Model Evaluation & Threat Research (METR) explained in a blog post accompanying the study.
The researchers found that AI models had a near-100% success rate on tasks that would take humans less than four minutes to complete. However, this dropped to 10% for tasks that take humans more than four hours. Older AI models performed worse at longer tasks than the latest systems.
This was to be expected, with the study highlighting that the length of tasks generalist AIs can complete with 50% reliability has been doubling roughly every seven months for the last six years.
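That "50% reliability" figure is the study's central measurement: the task length, in human time, at which a model's success rate falls to 50%. As a rough illustration of the idea, using invented task results rather than METR's data, such a horizon can be read off a curve fitted to success versus task length:

import numpy as np
from scipy.optimize import curve_fit

# Invented example data: human time per task (minutes) and whether the AI succeeded (1/0).
durations = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
successes = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], dtype=float)

def success_curve(log_minutes, midpoint, slope):
    # Success probability falls off as tasks get longer (a logistic curve in log task length).
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - midpoint)))

(midpoint, slope), _ = curve_fit(success_curve, np.log(durations), successes, p0=[np.log(60), 1.0])

# The 50% time horizon is the task length where the fitted success probability is 0.5.
print(f"Estimated 50% time horizon: {np.exp(midpoint):.0f} minutes")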
To conduct their study, the researchers took a variety of AI models, from Sonnet 3.7 and GPT-4 to Claude 3 Opus and older GPT models, and pitted them against a suite of tasks. These ranged from easy assignments that typically take humans a couple of minutes (like looking up a basic factual question on Wikipedia) to ones that take human experts multiple hours, such as complex programming tasks like writing CUDA kernels or fixing a subtle bug in PyTorch.
Testing tools including HCAST and RE-Bench were used; the former comprises 189 autonomy software tasks set up to assess AI agents' capabilities in machine learning, cybersecurity and software engineering, while the latter uses seven challenging, open-ended machine-learning research engineering tasks, such as optimizing a GPU kernel, benchmarked against human experts.
The researchers then rated these tasks for “messiness”, assessing whether they involved factors such as coordinating multiple streams of work in real time, which make a task messier to complete and more representative of real-world work.
The researchers also developed a set of software atomic actions (SWAA) to establish how quickly real people complete such work: single-step tasks taking between one and 30 seconds, baselined by METR employees.
Effectively, the study found that the “attention span” of AI is advancing at speed. By extrapolating this trend, the researchers projected that, if their results generalize to real-world tasks, AI could automate a month’s worth of human software development by 2032.
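The arithmetic behind that kind of projection is a simple back-of-the-envelope extrapolation. The starting horizon and the length of a working month below are placeholder assumptions for illustration, not figures taken from the paper:

import math

doubling_months = 7          # growth rate reported in the study
current_horizon_hours = 1.0  # hypothetical present-day 50% time horizon
target_hours = 167.0         # roughly one month of full-time human work

doublings_needed = math.log2(target_hours / current_horizon_hours)
months_needed = doublings_needed * doubling_months
print(f"About {doublings_needed:.1f} doublings, or roughly {months_needed / 12:.1f} years, at the current trend")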
The scientists said the study could form the basis of a new benchmark tied to real-world outcomes, enabling “a meaningful interpretation of absolute performance, not just relative performance” and a clearer picture of AI’s advancing capabilities and its potential impact and risks to society.
A new frontier for assessing AI?
A potential new benchmark could enable us to better understand the actual intelligence and capabilities of AI systems.
“The metric itself isn’t likely to change the course of AI development, but it will track how quickly progress is being made on certain types of tasks in which AI systems will ideally be used,” Sohrob Kazerounian, a distinguished AI researcher at Vectra AI, told Live Science.
“Measuring AI against the length of time it takes a human to accomplish a given task is an interesting proxy metric for intelligence and general capabilities,” Kazerounian said. “First, because there is no singular metric that captures what we mean when we say ‘intelligence.’ Second, because the likelihood of carrying out a prolonged task without drift or error becomes vanishingly small. Third, because it is a direct measure against the types of tasks we hope to make use of AI for; namely solving complex human problems. While it might not capture all the relevant factors or nuances about AI capabilities, it is certainly a useful datapoint,” he added.
Eleanor Watson, IEEE member and an AI ethics engineer at Singularity University, agreed that the research is useful.
Measuring AIs on the length of tasks is “valuable and intuitive” and “directly reflects real-world complexity, capturing AI’s proficiency at maintaining coherent goal-directed behaviour over time,” compared to traditional tests that assess AI performance on short, isolated problems, she told Live Science.
Generalist AI is coming
Arguably, besides providing a new benchmark metric, the paper’s biggest impact is in highlighting how quickly AI systems are advancing and the upward trend in their ability to handle lengthy tasks. With this in mind, Watson predicts that generalist AI agents capable of handling a wide variety of tasks will emerge imminently.
“By 2026, we’ll see AI becoming increasingly general, handling varied tasks across an entire day or week rather than short, narrowly defined assignments,” said Watson.
For businesses, Watson noted, this could yield AIs that can take on substantial portions of professional workloads — which could not only reduce costs and improve efficiency but also let people focus on more creative, strategic and interpersonal tasks.
“For consumers, AI will evolve from a simple assistant into a dependable personal manager, capable of handling complex life tasks — such as travel planning, health monitoring, or managing financial portfolios — over days or weeks, with minimal oversight,” Watson added.
In effect, the ability of AIs to handle a broad range of lengthy tasks could have a significant impact on how society interacts with and uses AI in the next few years.
“While specialized AI tools will persist in niche applications for efficiency reasons, powerful generalist AI agents — capable of flexibly switching among diverse tasks — will emerge prominently,” Watson concluded. “These systems will integrate specialized skills into broader, goal-directed workflows, reshaping daily life and professional practices in fundamental ways.”