AI Struggles with Advanced History Questions, New Study Finds

Illustration of an AI model struggling to answer a complex history question with books and ancient history symbols in the background.

A recent study reveals that while artificial intelligence (AI) excels in areas like programming and podcast generation, it faces significant challenges in answering high-level history questions.

Researchers have developed a new evaluation framework to assess three leading large language models (LLMs)—OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini—on their ability to answer complex historical questions. The framework, called Hist-LLM, compares LLM responses against the Seshat Global History Databank, an extensive repository of historical knowledge named after the ancient Egyptian goddess of wisdom.

The findings, presented at the prestigious NeurIPS conference, were underwhelming. According to researchers from the Complexity Science Hub (CSH), based in Austria, GPT-4 Turbo was the highest scorer among the LLMs, yet it only achieved an accuracy rate of about 46%, which is not far above random guessing.
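Conceptually, an evaluation like Hist-LLM reduces to comparing a model's answers against ground-truth entries in a databank and computing an accuracy score. The sketch below illustrates that idea only; the question keys, answers, and `evaluate` helper are invented for illustration and are not drawn from the actual benchmark.

```python
# Hypothetical sketch of a databank-based evaluation loop: compare a model's
# answers against ground-truth facts and report the fraction it got right.
# All question keys and answers here are invented for illustration.

def evaluate(model_answers, ground_truth):
    """Return the fraction of questions the model answered correctly."""
    correct = sum(
        1 for question, answer in model_answers.items()
        if ground_truth.get(question) == answer
    )
    return correct / len(model_answers)

ground_truth = {
    "scale_armor_in_egypt_period_x": "no",   # invented example facts
    "standing_army_in_egypt_period_x": "no",
    "irrigation_in_egypt_period_x": "yes",
}
model_answers = {
    "scale_armor_in_egypt_period_x": "yes",  # model over-generalizes
    "standing_army_in_egypt_period_x": "yes",
    "irrigation_in_egypt_period_x": "yes",
}

print(f"accuracy: {evaluate(model_answers, ground_truth):.0%}")
```

With two of three invented answers wrong, this toy run reports 33% accuracy; the real benchmark scores thousands of such databank-derived questions the same way in aggregate.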

“The key takeaway is that LLMs, while impressive in many areas, lack the depth of historical understanding needed for advanced inquiries,” said Maria del Rio-Chanona, co-author of the study and associate professor of computer science at University College London. “While they’re good with basic information, they struggle with more complex, graduate-level history.”

The researchers shared examples of incorrect answers from the models. For instance, GPT-4 Turbo was asked whether scale armor was used in ancient Egypt during a specific period. The model erroneously answered "yes," even though scale armor did not appear in Egypt until roughly 1,500 years after the period in question.

So why do LLMs struggle with historical questions when they are highly proficient at tasks like coding? According to del Rio-Chanona, this is likely because LLMs extrapolate from prominent historical data, which makes it hard for them to retrieve and correctly answer questions about more obscure historical facts.

One example was when the researchers asked GPT-4 about the presence of a professional standing army in ancient Egypt during a particular time. The correct response was no, but the model incorrectly stated “yes,” likely influenced by the widely available historical data on other ancient civilizations, such as Persia, which did have standing armies.

"The model may overemphasize the more widely known empires and misapply that knowledge to regions with less prominent records," del Rio-Chanona explained.

The study also noted that certain regions, such as sub-Saharan Africa, were particularly poorly represented in the LLM responses, hinting at potential biases in the training data behind the OpenAI and Llama models.

Despite these challenges, the researchers remain optimistic about the future of LLMs in historical research. They are working to refine their benchmark by incorporating data from less-represented regions and designing more complex historical queries.

“While our results reveal shortcomings, they also highlight the potential of LLMs to assist in historical research,” the study concluded.
