“Humanity’s Last Exam”: The Super-Benchmark AI Is Currently Failing

Summary: Standard AI benchmarks have become too easy. To push the limits of artificial intelligence, a global consortium of nearly 1,000 researchers has created “Humanity’s Last Exam” (HLE). This 2,500-question assessment covers highly specialized fields, from ancient Palmyrene inscriptions to microanatomical structures in birds.

The exam was specifically engineered to be unsolvable by current AI; if an AI could answer a question correctly during the testing phase, that question was removed. Early results show that while human experts excel within their own specialties, top AI models like GPT-4o and Claude 3.5 struggle significantly, highlighting the vast gap that remains between machine pattern recognition and true human expertise.

Key Facts

  • A New Bar: Humanity’s Last Exam is designed to be the ultimate benchmark for “expert-level” knowledge, sitting just beyond the current capabilities of the world’s most advanced AI.
  • Global Expert Effort: Nearly 1,000 experts from across the sciences, humanities, and arts contributed questions to ensure the exam spans the full breadth of human knowledge.
  • Low AI Scores: Early performance is remarkably low: GPT-4o scored 2.7%, Claude 3.5 Sonnet scored 4.1%, and OpenAI’s o1 model reached 8%. Even the most advanced current models (like Gemini 3.1 Pro) struggle to exceed 50%.
  • Un-searchable Questions: Each question was designed to have a single, verifiable answer that cannot be found instantly via a simple internet search.
  • Preserving Human Relevance: Despite its ominous name, HLE is intended as a tool to measure progress and identify risks, proving that human depth, context, and specialized expertise still reign supreme.

Source: Texas A&M

When artificial intelligence systems began acing long‑standing academic assessments, researchers realized they had a problem: the tests were too easy. Popular evaluations, such as the Massive Multitask Language Understanding (MMLU) exam, once considered formidable, are no longer challenging enough to meaningfully test advanced AI systems.

To address this gap, a global consortium of nearly 1,000 researchers, including a Texas A&M University professor, created something different — an exam so broad, so challenging and so deeply rooted in expert human knowledge that current AI systems consistently fail it.

Humanity’s Last Exam (HLE) highlights the significant difference between an AI’s ability to process data and a human’s ability to possess deep, specialized knowledge. Credit: Neuroscience News

“Humanity’s Last Exam” (HLE) introduces a 2,500‑question assessment spanning mathematics, humanities, natural sciences, ancient languages and highly specialized subfields. The team’s work is outlined in a paper published in Nature with documentation from the project available at lastexam.ai.

Among the long list of contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M, who participated in authoring and refining questions.

“When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context and specialized expertise.”

The point wasn’t to stump humans. It was to reveal, precisely and systematically, what AI cannot do, at least not yet.

A global effort to measure AI’s limits

Questions for HLE were written and reviewed by experts in their fields from all over the world, who ensured each one had a single, unambiguous, verifiable answer that couldn’t be solved instantly through internet retrieval.

The prompts draw on expert-level academic problems, from translating ancient Palmyrene inscriptions to identifying microanatomical structures in birds to analyzing the intricate features of Biblical Hebrew pronunciation.

Each question was tested against leading AI models. If any system could answer it correctly, the question was removed. The result is an exam deliberately engineered to sit just beyond current AI capability.
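The filtering step described above amounts to a simple adversarial loop: a question survives only if every frontier model gets it wrong. The sketch below illustrates that logic; the model interface and question format are hypothetical stand-ins, not the actual HLE pipeline.

```python
# Illustrative sketch of adversarial question filtering: keep only questions
# that no tested model answers correctly. The `ask` helper and dict-based
# "models" are placeholders for real model API calls.

def ask(model, prompt):
    # Placeholder: in practice this would query a model's API.
    return model.get(prompt, "")

def filter_questions(questions, models):
    """Return only the questions that every model answers incorrectly."""
    kept = []
    for q in questions:
        # If any model matches the reference answer, the question is removed.
        if not any(ask(m, q["prompt"]) == q["answer"] for m in models):
            kept.append(q)
    return kept

# Toy demonstration with two mock "models"
models = [
    {"Q1": "A"},        # answers Q1 correctly
    {"Q2": "wrong"},    # attempts Q2 but gets it wrong
]
questions = [
    {"prompt": "Q1", "answer": "A"},  # solvable by a model -> removed
    {"prompt": "Q2", "answer": "B"},  # unsolved by all models -> kept
]
print([q["prompt"] for q in filter_questions(questions, models)])  # ['Q2']
```

The effect is a benchmark calibrated to sit just past the tested models' ceiling at the moment of construction, which is why initial scores were so low.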

And it worked. Early results showed that even the most advanced models struggled. GPT‑4o scored 2.7%; Claude 3.5 Sonnet reached 4.1%; OpenAI’s flagship o1 model achieved only 8%. The most advanced models, including Gemini 3.1 Pro and Claude Opus 4.6, have reached around 40% to 50% accuracy.

Why a new benchmark matters

The problem with AI outgrowing traditional benchmarks isn’t simply academic, said Nguyen, who contributed 73 of the 2,500 public questions (the second-highest count among contributors) and authored the most questions in math and computer science.

“Without accurate assessment tools, policymakers, developers and users risk misinterpreting what AI systems can actually do,” he said. “Benchmarks provide the foundation for measuring progress and identifying risks.”

As the team’s paper notes, while AI may excel on exams designed for humans, those tests aren’t necessarily measuring “intelligence.” They measure performance on a set of tasks crafted for a very different kind of learner.

Not a threat, a tool

Despite its apocalyptic name, Humanity’s Last Exam isn’t meant to suggest the end of human relevance. Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go.

“This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”

A future-proof exam

HLE is intended to serve as a long‑term, transparent benchmark for evaluating advanced AI systems. As part of that mission, the team has released the exam’s questions publicly while keeping a private held‑out set, so that models can’t simply memorize the answers.

“For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” Nguyen said, “and despite rapid technological advances, it remains wide.”

Research on a grand scale

Nguyen noted the massive project reflects the importance of interdisciplinary, international research efforts.

“What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems. Perhaps ironically, it took humans working together.”

Key Questions Answered:

Q: Why is it called “Humanity’s Last Exam”?

A: The name is a bit tongue-in-cheek, but it represents the idea that this is the final hurdle for AI. If an AI can pass this exam, it will have reached a level of specialized human expertise that was previously thought impossible for a machine.

Q: If AI is so smart, why is it failing?

A: AI is great at pattern recognition and summarizing known data, but it struggles with deep, specialized context. HLE asks questions that require years of niche study—things like specific ancient pronunciations or rare anatomical features—where “guessing” based on common internet data doesn’t work.

Q: Can a regular person pass this test?

A: Not the whole thing! No single human could pass the entire exam because it covers everything from nuclear physics to ancient history. However, a human expert in a specific field will easily answer the questions in their niche, whereas the AI fails across almost every category.

Editorial Notes:

  • This article was edited by a Neuroscience News editor.
  • Journal paper reviewed in full.
  • Additional context added by our staff.

About this AI research news

Author: Lesley Henton
Source: Texas A&M
Contact: Lesley Henton – Texas A&M
Image: The image is credited to Neuroscience News

Original Research: Open access.
“A benchmark of expert-level academic questions to assess AI capabilities” by Center for AI Safety, Scale AI & HLE Contributors Consortium. Nature
DOI: 10.1038/s41586-025-09962-4


Abstract

A benchmark of expert-level academic questions to assess AI capabilities

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve more than 90% accuracy on popular benchmarks such as Measuring Massive Multitask Language Understanding, limiting informed measurement of state-of-the-art LLM capabilities.

Here, in response, we introduce Humanity’s Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be an expert-level closed-ended academic benchmark with broad subject coverage.

HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading.

Each question has a known solution that is unambiguous and easily verifiable but cannot be quickly answered by internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a marked gap between current LLM capabilities and the expert human frontier on closed-ended academic questions.

To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
