SciTechDaily

    Don’t Panic Yet: “Humanity’s Last Exam” Has Begun

By Texas A&M University | February 28, 2026
    A sweeping new global exam exposes how far even the most advanced AI systems remain from matching deep human expertise, reshaping how researchers measure machine intelligence and its limits. Credit: Stock

    As artificial intelligence systems rapidly outgrow traditional academic benchmarks, researchers have unveiled an ambitious new test designed to probe the true limits of machine intelligence.

    When advanced artificial intelligence systems began scoring near-perfect marks on established academic tests, researchers recognized a growing concern. The exams that once posed serious challenges were no longer difficult enough to meaningfully evaluate cutting-edge AI. Well-known benchmarks such as the Massive Multitask Language Understanding (MMLU) exam, previously viewed as rigorous, have become less effective at distinguishing true progress in AI capability.

    In response, an international group of nearly 1,000 researchers, including a professor from Texas A&M University, developed a far more demanding assessment. Their goal was to design an exam so comprehensive and grounded in specialized human expertise that today’s AI systems would struggle to pass it.

    The result is “Humanity’s Last Exam” (HLE), a 2,500-question test that covers mathematics, the humanities, natural sciences, ancient languages, and highly specialized academic fields. The project is described in a paper published in Nature, and additional details are available at lastexam.ai.

    One of the contributors is Dr. Tung Nguyen, instructional associate professor in the Department of Computer Science and Engineering at Texas A&M. He helped write and refine questions for the assessment.

    “When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human‑level understanding,” Nguyen said. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context, and specialized expertise.”

    The point wasn’t to stump humans. It was to reveal, precisely and systematically, what AI cannot do, at least not yet.

    A global effort to measure AI’s limits

    Specialists from around the world drafted and reviewed the HLE questions. Each item was required to have one clear, verifiable answer and to resist being solved through quick online searches. The material reflects advanced scholarship, ranging from translating ancient Palmyrene inscriptions to identifying tiny anatomical structures in birds and examining the detailed sound patterns of Biblical Hebrew.

    Before being included, every question was tested on leading AI systems. If a model produced the correct answer, that question was eliminated. This process ensured the final exam would remain just beyond the reach of current AI performance.
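The filtering step described above can be sketched in a few lines. This is a minimal illustration of the idea, not the project's actual code: the function names, the grading callback, and the model labels are all hypothetical assumptions.

```python
def filter_questions(candidates, models, answers_correctly):
    """Keep only (question, answer) pairs that every model gets wrong.

    candidates:        iterable of (question, answer) pairs
    models:            identifiers for the AI systems used as filters
    answers_correctly: callback(model, question, answer) -> bool,
                       a placeholder for querying a model and grading
                       its response (hypothetical, for illustration)
    """
    return [
        (q, a)
        for q, a in candidates
        # Discard the question if ANY model produces the correct answer.
        if not any(answers_correctly(m, q, a) for m in models)
    ]
```

Under this scheme, a question survives only if it defeats every frontier model it was tested against, which is what keeps the final exam "just beyond the reach" of current systems.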

    The results show how difficult the assessment is. Early testing found that even top models struggled. GPT-4o scored 2.7%. Claude 3.5 Sonnet achieved 4.1%. OpenAI’s o1 model reached 8%. More recent systems, including Gemini 3.1 Pro and Claude Opus 4.6, have improved to roughly 40-50% accuracy, but they still do not demonstrate full mastery.

    Why a new benchmark matters

    According to Nguyen, the fact that AI has surpassed older benchmarks carries real-world consequences. He contributed 73 of the 2,500 public questions, the second-highest total among all authors, and wrote more questions in math and computer science than any other contributor.

    “Without accurate assessment tools, policymakers, developers, and users risk misinterpreting what AI systems can actually do,” he said. “Benchmarks provide the foundation for measuring progress and identifying risks.”

    As explained in the team’s paper, high scores on human-designed exams do not automatically indicate genuine intelligence. Such tests measure performance on tasks originally created for people, not machines. Strong results may reflect pattern matching rather than deep understanding.

    Not a threat, a tool

    Despite its apocalyptic name, Humanity’s Last Exam isn’t meant to suggest the end of human relevance. Instead, it highlights how much knowledge remains uniquely human and how far AI systems still have to go.

    “This isn’t a race against AI,” Nguyen said. “It’s a method for understanding where these systems are strong and where they struggle. That understanding helps us build safer, more reliable technologies. And, importantly, it reminds us why human expertise still matters.”

    A future-proof exam

    HLE is intended to serve as a long‑term, transparent benchmark for evaluating advanced AI systems. As part of that mission, the team has made some of the exam publicly available, while keeping most of the test questions hidden so AI models can’t memorize the answers.

    “For now, Humanity’s Last Exam stands as one of the clearest assessments of the gap between AI and human intelligence,” Nguyen said, “and despite rapid technological advances, it remains wide.”

    Research on a grand scale

    Nguyen noted the massive project reflects the importance of interdisciplinary, international research efforts.

    “What made this project extraordinary was the scale,” he said. “Experts from nearly every discipline contributed. It wasn’t just computer scientists; it was historians, physicists, linguists, medical researchers. That diversity is exactly what exposes the gaps in today’s AI systems. Perhaps ironically, it’s humans working together.”

    Reference: “A benchmark of expert-level academic questions to assess AI capabilities” by Center for AI Safety, Scale AI and HLE Contributors Consortium, 28 January 2026, Nature.
    DOI: 10.1038/s41586-025-09962-4

    Funding: The Center for AI Safety and Scale AI consortia


