How can we prevent AI models from cannibalizing themselves when human-generated data runs out? Scientists say they've found the answer.

While the evolution of artificial intelligence (AI) systems has shown no sign of slowing, there’s a growing concern that large language models (LLMs) will soon run out of human-made data to ingest and learn from.

Once this happens, scientists say, AI models will increasingly rely on synthetic, AI-made information, which will lead to an effect called “model collapse.” In this scenario, LLMs spout gibberish, and the AI systems they underpin deliver inaccurate answers and hallucinate information far more often than they do today.

“That’s especially worrying considering some experts think that we will run out of high-quality human-generated data by the end of the year — so if you’re relying on this synthetic data, but there’s an almost existential threat it will sink your AI, you’re in trouble,” Yasser Roudi, a professor of disordered systems in the Department of Mathematics at King’s College London (KCL), told Live Science. “If, for example, you had LLMs that were used in hospitals to analyze brain scans and find cancers, if while training another model they experienced model collapse, these machines could misdiagnose people.”

However, Roudi recently found that model collapse can be bypassed by adding a single human-made data point to an AI’s training data, even if all the other data is AI-generated.

The study was published May 14 in the journal Physical Review Letters. It involved researchers from KCL, the Norwegian University of Science and Technology, and the Abdus Salam International Centre for Theoretical Physics in Italy.

While AI model collapse hasn’t happened in a real-world scenario with an actively deployed AI system, anyone who uses tools like ChatGPT or Gemini to generate answers or text has very likely experienced errors or hallucinations. However, Roudi hopes the new findings might outline a method to sidestep this potential emergent threat.

Countering collapse

Beyond widely known hallucinations in primitive generative AI products, we may not yet have seen any dramatic examples of model collapse in the form of sophisticated AIs seemingly “going mad” and outputting complete nonsense. But signs of minor collapse could be observed when AI delivers increasingly inaccurate or bland answers to queries, or completely fabricates information while trying to generate the kind of output it assumes a user desires.

When LLMs are repeatedly trained on data generated by other LLMs, the core truth and source of information, along with the spikes of variance between generations of models, get “smoothed out,” producing homogenized answers and outputs. For example, text that might read well enough at first glance could lack any real detail or nuance. Model collapse can be split into ‘early’ and ‘late’ stages: in the former, an AI loses the ability to serve up edge-case (rare or less common) information and produces bland, synthetic-feeling responses; in the latter, LLMs deliver gibberish.

The huge scale of LLMs and the data they process can make it hard to establish how and why they hallucinate information, and how certain choices lead to model collapse.

To tackle this, the researchers used smaller models that belong to exponential families, a broad class of probability distributions used to describe the likely outcomes of random events. The bell curve is one such example, as is the distribution that gives the chance that a coin flip will land on heads.

“By looking at analytically tractable models such as the exponential families, you can answer those ‘why’ and ‘how’ questions,” Roudi said. “By that same logic, you can come up with ways to mitigate its dangerous effects, how those ways work, and ultimately apply them to real-life examples.”

The researchers discovered that they could avoid model collapse by introducing a single external, human-made data point into the pool of synthetic data used in closed-loop training, in which each new model is trained on data generated by a previous model.
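
The effect can be illustrated with a toy simulation. The Python sketch below is not the study's code. It assumes a one-dimensional bell curve (one member of the exponential families) that is refit by maximum likelihood in every generation, a hypothetical "human" ground truth with mean 0 and standard deviation 1, and one freshly drawn real data point per generation. The function names, sample sizes and seed are illustrative choices, and the paper's actual setup may differ.

```python
import random


def fit_gaussian(samples):
    """Maximum-likelihood fit of a Gaussian: sample mean and (biased) variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var


def closed_loop(generations, n_synthetic, n_real, seed=0):
    """Repeatedly refit a Gaussian on data sampled from its own previous fit.

    n_real is the number of fresh ground-truth points mixed into each
    generation's training set (0 means a purely synthetic loop).
    Returns the fitted variance after each generation.
    """
    rng = random.Random(seed)
    true_mean, true_std = 0.0, 1.0  # assumed ground truth (hypothetical)
    mean, var = true_mean, true_std ** 2  # generation 0 matches the truth
    history = []
    for _ in range(generations):
        std = var ** 0.5
        # Synthetic data: drawn from the previous generation's model.
        data = [rng.gauss(mean, std) for _ in range(n_synthetic)]
        # Real data: drawn from the ground truth, not from the model.
        data += [rng.gauss(true_mean, true_std) for _ in range(n_real)]
        mean, var = fit_gaussian(data)
        history.append(var)
    return history


synthetic_only = closed_loop(generations=500, n_synthetic=10, n_real=0)
with_one_real = closed_loop(generations=500, n_synthetic=10, n_real=1)
print("final variance, synthetic only:", synthetic_only[-1])
print("final variance, one real point:", with_one_real[-1])
```

Under these assumptions, the purely synthetic loop tends to drive the fitted variance toward zero, because each refit on a small sample slightly underestimates the spread and the errors compound. The loop that sees one real point per generation keeps being pulled back toward the ground truth, so its variance stays at a nonzero level.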

Roudi said one example could be an AI-based image or video classifier, in which a model is trained on data that includes a real image correctly classified by a human, rather than AI-generated media or media classified by an AI.

“In other words, this data point would be linked to a ‘ground truth,’ something we know undeniably to be true and independently verifiable,” Roudi said.

The next step for Roudi and the researchers is to apply this approach to larger and more complex models to see if this principle still holds true. This method could mitigate potentially “disastrous” scenarios of model collapse, especially within the AI models we use in everyday life, the team said.

“This research is the first step in setting out some ground rules for preventing this [from] happening in the future,” Roudi concluded. “While more work should be done, AI engineers making things like the next ChatGPT can use what we’ve found to develop models that don’t collapse.”

Jangjoo, F., Di Sarra, G., Marsili, M., & Roudi, Y. (2026). Lost in Retraining: Closed-Loop learning and model collapse in exponential families. Physical Review Letters, 136(19). https://doi.org/10.1103/156q-3ngc
