
Findings Challenge Assumption That AI Will Soon Replace Human Doctors
Research shows that top AI models demonstrate cognitive impairments similar to early dementia symptoms when evaluated with the MoCA test. These findings underscore the limitations of AI in clinical applications, particularly in tasks requiring visual and executive skills.
Cognitive Impairments in AI
Almost all leading large language models, or “chatbots,” show signs of mild cognitive impairment when tested using assessments commonly used to detect early dementia, according to a study published in the Christmas issue of The BMJ.
The study also found that older versions of these chatbots, much like aging human patients, performed worse on the tests. The authors suggest that these findings “challenge the assumption that artificial intelligence will soon replace human doctors.”
AI Advancements and Speculations
Recent advances in artificial intelligence have sparked both excitement and concern about whether chatbots might surpass human physicians in medical tasks.
While previous research has demonstrated that large language models (LLMs) excel at various medical diagnostic tasks, their susceptibility to human-like impairments such as cognitive decline had remained largely unexplored until now.
Evaluating AI Cognitive Abilities
To fill this knowledge gap, researchers assessed the cognitive abilities of the leading, publicly available LLMs – ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet) – using the Montreal Cognitive Assessment (MoCA) test.
The MoCA test is widely used to detect cognitive impairment and early signs of dementia, usually in older adults. Through a number of short tasks and questions, it assesses abilities including attention, memory, language, visuospatial skills, and executive functions. The maximum score is 30 points, with a score of 26 or above generally considered normal.
AI Performance on Cognitive Tests
The instructions given to the LLMs for each task were the same as those given to human patients. Scoring followed official guidelines and was evaluated by a practicing neurologist.
ChatGPT 4o achieved the highest score on the MoCA test (26 out of 30), followed by ChatGPT 4 and Claude (25 out of 30), with Gemini 1.0 scoring lowest (16 out of 30).
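To make the comparison against the MoCA threshold concrete, the short Python sketch below checks the scores reported above against the "26 or above" cutoff. It is only an illustration: the names `MOCA_NORMAL_CUTOFF`, `reported_scores`, and `is_normal` are invented here, and only the scores quoted in this article are included.

```python
# Illustrative sketch only: names are hypothetical, values come from the article.
MOCA_MAX_SCORE = 30       # maximum MoCA score
MOCA_NORMAL_CUTOFF = 26   # 26 or above is generally considered normal

# Scores as reported in this article (models without a quoted score are omitted).
reported_scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}


def is_normal(score: int) -> bool:
    """Return True if a MoCA score meets the generally used normal cutoff."""
    if not 0 <= score <= MOCA_MAX_SCORE:
        raise ValueError(f"MoCA scores range from 0 to {MOCA_MAX_SCORE}")
    return score >= MOCA_NORMAL_CUTOFF


# Models whose reported score falls below the cutoff.
below_cutoff = [name for name, score in reported_scores.items() if not is_normal(score)]

for name, score in reported_scores.items():
    status = "normal range" if is_normal(score) else "below cutoff"
    print(f"{name}: {score}/{MOCA_MAX_SCORE} ({status})")
```

Under this cutoff, only ChatGPT 4o reaches the normal range, which matches the article's description of the results.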
Challenges in Visual and Executive Functions
All chatbots showed poor performance in visuospatial skills and executive tasks, such as the trail-making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). Gemini models failed at the delayed recall task (remembering a five-word sequence).
All chatbots performed well on most other tasks, including naming, attention, language, and abstraction.
However, in further visuospatial tests, chatbots were unable to show empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
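For readers unfamiliar with the Stroop test, the Python sketch below shows what distinguishes a congruent stimulus from an incongruent one. It is a minimal illustration, not the study's protocol: the `StroopStimulus` class and the sample trials are hypothetical, and it assumes the standard form of the task, in which the respondent names the ink color rather than reading the word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StroopStimulus:
    """A color word displayed in some ink (font) color. Hypothetical helper."""
    word: str       # the color name that is written, e.g. "RED"
    ink_color: str  # the color the word is displayed in, e.g. "blue"

    @property
    def is_congruent(self) -> bool:
        # Congruent: the written color name matches the ink color.
        return self.word.lower() == self.ink_color.lower()

    @property
    def correct_response(self) -> str:
        # In the standard task, the respondent names the ink color.
        return self.ink_color.lower()


# Made-up example trials, for illustration only.
trials = [
    StroopStimulus("RED", "red"),       # congruent
    StroopStimulus("RED", "blue"),      # incongruent
    StroopStimulus("GREEN", "yellow"),  # incongruent
]

# The incongruent stage uses only the mismatched stimuli.
incongruent = [t for t in trials if not t.is_congruent]

for t in incongruent:
    print(f'Word "{t.word}" in {t.ink_color} ink: correct answer is {t.correct_response}')
```

The incongruent stage is the harder one because the written word interferes with naming the ink color.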
Implications for AI in Clinical Settings
These are observational findings, and the authors acknowledge the essential differences between the human brain and large language models.
However, they point out that the uniform failure of all large language models in tasks requiring visual abstraction and executive function highlights a significant area of weakness that could impede their use in clinical settings.
As such, they conclude: “Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients – artificial intelligence models presenting with cognitive impairment.”
Reference: “Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis” by Roy Dayan, Benjamin Uliel and Gal Koplewitz, 20 December 2024, BMJ.
DOI: 10.1136/bmj-2024-081948
11 Comments
Lol, this is complete nonsense.
Medicine is an art as much as a science. Half of what we “know” will be discarded. Any superior physician knows that intuition plays a major role in decision making, something AI cannot do.
Baloney!
I’m not surprised. Medicine is more than checking a box of symptoms. Sometimes people don’t display all symptoms of a disease. In addition, family health history, medications, and current and past habits can alter a diagnosis. Weighing those factors takes experience as a medical professional and listening skills. Sometimes multiple diagnoses can alter treatment plans. Interpretation should always be left to a human.
Except that it’s not declining, but improving with every version. Something any user of these could tell you. What a dumb and misleading article.
I think that the quality of the writing makes it ambiguous whether individual LLMs decline in their abilities over time. Note that the article states, “Gemini models failed at the delayed recall task (remembering a five-word sequence).” When it is pointed out that they have said something that is wrong, they readily admit their error, but don’t remember the correction, which is not unlike an older person who has memory problems.
Very badly written article.
The title says that LLMs show cognitive decline, which seems to suggest that the decline for each LLM happens over time.
The experiment actually says that LLMs are not as smart as humans, and in cognitive tests they perform at the levels of humans with dementia.
The research is also pointless.
The newer LLMs perform better than the older ones. ChatGPT is already on par with humans without dementia. If anything, their research has shown that future models will not have any of these shortcomings.
The tongue-in-cheek concluding remarks complete the picture of unprofessionalism of the research team.
Spencer spoke interestingly
This is total nonsense.
To exhibit decline, AI would have to start with relatively high cognitive ratings and *then* exhibit a decline. It’s very clear AI has never had a high cognitive rating.
In addition, spatial cognition has never been high in AI. In point of fact, most AI-generated images contain relatively simple errors in spatial cognition, showing outdoor scenes on windows peering into homes, or showing people with three arms or six fingers.
If you wanted to say its cognition hadn’t reached above dementia levels, that’s fine.
But that’s not a decline.
I’m in machine learning. LLMs do zero thinking. LLMs take a set amount of the preceding text and then guess what text comes next based on a bunch of text that it was shown during training.
LLMs can’t experience cognitive decline because they don’t have any cognition to begin with.
LLMs are wildly expensive parrots playing Mad Libs with people who have been duped into thinking that LLMs understand what they’re saying.
Thanks for your comment, the masses have been duped into thinking they are dealing with an intelligent being that can reason.