AI chatbots still can’t accurately answer high-level history questions: study

While artificial intelligence excels at tasks like coding and podcast generation, it struggles to accurately answer high-level history questions, according to a study.

Researchers tested OpenAI’s GPT-4, Meta’s Llama and Google’s Gemini using a newly developed benchmark called Hist-LLM.

The benchmark relies on the Seshat Global History Databank, a comprehensive database of historical knowledge.

The study, which was presented at the NeurIPS AI conference last month, found disappointing results, according to TechCrunch.

GPT-4 Turbo performed best but achieved only about 46% accuracy, barely above random guessing.

“LLMs, while impressive, still lack the depth required for advanced history,” said Maria del Rio-Chanona, a co-author of the paper and associate professor at University College London.

“They’re great for basic facts, but they fail at nuanced, PhD-level historical inquiries.”

Researchers found that LLMs often extrapolate from prominent historical data but struggle with more obscure details.

For instance, GPT-4 incorrectly stated that scale armor was present in ancient Egypt during a specific time period, when in fact the technology appeared there only 1,500 years later.

Similarly, the model falsely claimed ancient Egypt had a professional standing army during a particular period, likely due to the prevalence of information on standing armies in other ancient empires, such as Persia.

“If you get told A and B 100 times, and C only once, you’re more likely to recall A and B,” del Rio-Chanona explained.

Another concern was potential bias.

OpenAI’s GPT-4 and Meta’s Llama models performed worse when answering questions about regions such as sub-Saharan Africa, indicating training data limitations.

“These biases suggest LLMs reflect gaps in historical documentation rather than an unbiased representation of history,” said Peter Turchin, the study’s lead researcher.

Despite these limitations, researchers remain hopeful that AI can assist historians in the future.

They plan to refine the Hist-LLM benchmark by incorporating more diverse data sources and increasing the complexity of the questions.

“Our findings highlight areas where LLMs need improvement, but they also showcase their potential to support historical research,” the paper concluded.

As AI continues to evolve, experts say it is clear that human historians remain irreplaceable in interpreting complex historical narratives and ensuring accuracy in academic inquiry.
