Findings Challenge Assumption That AI Will Soon Replace Human Doctors

Research shows that top AI models demonstrate cognitive impairments similar to early dementia symptoms when evaluated with the MoCA test. These findings underscore the limitations of AI in clinical applications, particularly in tasks requiring visual and executive skills.

Cognitive Impairments in AI

Almost all leading large language models, or “chatbots,” show signs of mild cognitive impairment when tested using assessments commonly used to detect early dementia, according to a study published in the Christmas issue of The BMJ.

The study also found that older versions of these chatbots, much like aging human patients, performed worse on the tests. The authors suggest that these findings “challenge the assumption that artificial intelligence will soon replace human doctors.”

AI Advancements and Speculations

Recent advances in artificial intelligence have sparked both excitement and concern about whether chatbots might surpass human physicians in medical tasks.

While previous research has shown that large language models (LLMs) perform well on a range of medical diagnostic tasks, their susceptibility to human impairments such as cognitive decline has remained largely unexplored until now.

Evaluating AI Cognitive Abilities

To fill this knowledge gap, researchers assessed the cognitive abilities of the leading, publicly available LLMs – ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1.0 and 1.5 (developed by Alphabet) – using the Montreal Cognitive Assessment (MoCA) test.

The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions. The maximum score is 30 points, with a score of 26 or above generally considered normal.

AI Performance on Cognitive Tests

The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by a practicing neurologist.

ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring lowest (16 out of 30).
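
For readers who want to see how the 26-point cutoff applies to the scores reported above, here is a minimal Python sketch. The names (`reported_scores`, `is_normal`) are illustrative and do not come from the study. Only the scores stated in this article are included, so Gemini 1.5 is left out. The cutoff is the general "26 or above" rule of thumb, not a clinical diagnosis.

```python
# Illustrative sketch only: applies the general MoCA cutoff to the
# scores quoted in this article. Names here are hypothetical.

MOCA_MAX = 30        # maximum MoCA score
NORMAL_CUTOFF = 26   # 26 or above is generally considered normal

# Scores as reported in the article (Gemini 1.5 is not given, so it is omitted).
reported_scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}


def is_normal(score: int, cutoff: int = NORMAL_CUTOFF) -> bool:
    """Return True if a MoCA score meets the cutoff for 'normal'."""
    if not 0 <= score <= MOCA_MAX:
        raise ValueError(f"MoCA score must be between 0 and {MOCA_MAX}")
    return score >= cutoff


# Classify each model's reported score against the cutoff.
results = {model: is_normal(score) for model, score in reported_scores.items()}

for model, score in reported_scores.items():
    label = "normal range" if results[model] else "below cutoff"
    print(f"{model}: {score}/{MOCA_MAX} ({label})")
```

Under this rule of thumb, only ChatGPT 4o reaches the threshold, which is the basis for the statement that almost all of the tested models show signs of mild cognitive impairment.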

Challenges in Visual and Executive Functions

All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail-making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). Gemini models failed at the delayed recall task (remembering a five-word sequence).

Most other tasks, including naming, attention, language, and abstraction, were performed well by all chatbots.

However, in further visuospatial tests, chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.

Implications for AI in Clinical Settings

These are observational findings, and the authors acknowledge the essential differences between the human brain and large language models.

However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.

As such, they conclude: “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment.”

Reference: “Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis” by Roy Dayan, Benjamin Uliel and Gal Koplewitz, 20 December 2024, BMJ.

DOI: 10.1136/bmj-2024-081948

