SciTechDaily

    This One Twist Was Enough to Fool ChatGPT – And It Could Cost Lives

By The Mount Sinai Hospital / Mount Sinai School of Medicine | July 31, 2025
    AI can misjudge medical ethics when puzzles are slightly changed—suggesting it still lacks the nuance to safely navigate high-stakes decisions. Credit: Shutterstock

    AI systems like ChatGPT may appear impressively smart, but a new Mount Sinai-led study shows they can fail in surprisingly human ways—especially when ethical reasoning is on the line.

    By subtly tweaking classic medical dilemmas, researchers revealed that large language models often default to familiar or intuitive answers, even when they contradict the facts. These “fast thinking” failures expose troubling blind spots that could have real consequences in clinical decision-making.

    AI Models Can Stumble in Complex Medical Ethics

    A recent study led by researchers at the Icahn School of Medicine at Mount Sinai, working with colleagues from Israel’s Rabin Medical Center and other institutions, has found that even today’s most advanced artificial intelligence (AI) models can make surprisingly basic errors when navigating complex medical ethics questions.

The results, published online on July 22 in npj Digital Medicine, raise important concerns about how much trust should be placed in large language models (LLMs) like ChatGPT when they are used in health care environments.

    Inspired by Kahneman: Fast vs. Slow Thinking

    The research was guided by concepts from Daniel Kahneman’s book “Thinking, Fast and Slow,” which explores the contrast between instinctive, rapid decision-making and slower, more deliberate reasoning. Previous observations have shown that LLMs can struggle when well-known lateral-thinking puzzles are modified slightly. Building on that idea, the study evaluated how effectively these AI systems could shift between fast and slow reasoning when responding to medical ethics scenarios that had been intentionally altered.

    “AI can be very powerful and efficient, but our study showed that it may default to the most familiar or intuitive answer, even when that response overlooks critical details,” says co-senior author Eyal Klang, MD, Chief of Generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “In everyday situations, that kind of thinking might go unnoticed. But in health care, where decisions often carry serious ethical and clinical implications, missing those nuances can have real consequences for patients.”

    Gender Bias Puzzle Exposes AI Limitations

    To explore this tendency, the research team tested several commercially available LLMs using a combination of creative lateral thinking puzzles and slightly modified well-known medical ethics cases. In one example, they adapted the classic “Surgeon’s Dilemma,” a widely cited 1970s puzzle that highlights implicit gender bias. In the original version, a boy is injured in a car accident with his father and rushed to the hospital, where the surgeon exclaims, “I can’t operate on this boy—he’s my son!” The twist is that the surgeon is his mother, though many people don’t consider that possibility due to gender bias. In the researchers’ modified version, they explicitly stated that the boy’s father was the surgeon, removing the ambiguity. Even so, some AI models still responded that the surgeon must be the boy’s mother. The error reveals how LLMs can cling to familiar patterns, even when contradicted by new information.
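The perturbation approach described above, pairing a classic puzzle with a minimally altered version, can be sketched as a tiny evaluation harness in Python. This is a hedged illustration, not the study's actual protocol: the prompt wordings, the `grade` keyword check, and the idea of passing in any callable model are assumptions made here for demonstration.

```python
# Minimal sketch of a perturbation-style test: a familiar puzzle is paired
# with a modified version whose answer is explicitly stated in the prompt.
# A model that pattern-matches the classic version will fail the modified one.

CASES = [
    {
        "id": "surgeons_dilemma_original",
        "prompt": (
            "A boy is injured in a car accident with his father and rushed to "
            "the hospital, where the surgeon exclaims: 'I can't operate on "
            "this boy, he's my son!' Who is the surgeon?"
        ),
        "expected": "mother",  # the classic twist
    },
    {
        "id": "surgeons_dilemma_modified",
        "prompt": (
            "A boy's father, who is a surgeon, is in a car accident with his "
            "son. At the hospital, the surgeon exclaims: 'I can't operate on "
            "this boy, he's my son!' Who is the surgeon?"
        ),
        "expected": "father",  # the ambiguity has been removed
    },
]


def grade(answer: str, expected: str) -> bool:
    """Crude keyword check: does the model's answer name the expected role?"""
    return expected in answer.lower()


def evaluate(model, cases=CASES):
    """Run each case through `model` (any callable: prompt -> answer string)
    and report pass/fail per case id."""
    return {c["id"]: grade(model(c["prompt"]), c["expected"]) for c in cases}
```

A model that always reaches for the familiar answer, "the surgeon is the mother," would pass the original case but fail the modified one, which is exactly the failure mode the researchers report.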

    Ethical Scenarios Trigger Familiar-Pattern Errors

    In another example to test whether LLMs rely on familiar patterns, the researchers drew from a classic ethical dilemma in which religious parents refuse a life-saving blood transfusion for their child. Even when the researchers altered the scenario to state that the parents had already consented, many models still recommended overriding a refusal that no longer existed.

    “Our findings don’t suggest that AI has no place in medical practice, but they do highlight the need for thoughtful human oversight, especially in situations that require ethical sensitivity, nuanced judgment, or emotional intelligence,” says co-senior corresponding author Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai, and Chief AI Officer of the Mount Sinai Health System. “Naturally, these tools can be incredibly helpful, but they’re not infallible. Physicians and patients alike should understand that AI is best used as a complement to enhance clinical expertise, not a substitute for it, particularly when navigating complex or high-stakes decisions. Ultimately, the goal is to build more reliable and ethically sound ways to integrate AI into patient care.”

    AI Blind Spots Demand Vigilance

    “Simple tweaks to familiar cases exposed blind spots that clinicians can’t afford,” says lead author Shelly Soffer, MD, a Fellow at the Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center. “It underscores why human oversight must stay central when we deploy AI in patient care.”

    Next, the research team plans to expand their work by testing a wider range of clinical examples. They’re also developing an “AI assurance lab” to systematically evaluate how well different models handle real-world medical complexity.

    The paper is titled “Pitfalls of Large Language Models in Medical Ethics Reasoning.”

    The study’s authors, as listed in the journal, are Shelly Soffer, MD; Vera Sorin, MD; Girish N. Nadkarni, MD, MPH; and Eyal Klang, MD.

    Reference: “Pitfalls of large language models in medical ethics reasoning” by Shelly Soffer, Vera Sorin, Girish N. Nadkarni and Eyal Klang, 22 July 2025, npj Digital Medicine.
    DOI: 10.1038/s41746-025-01792-y




    5 Comments

    1. Bob on July 31, 2025 7:49 am

The examples provided do not include sufficient information to judge the AI response. For example, in the blood transfusion situation, did the AI offer alternative paths that could be taken to address the issue? The simple statement of bypassing the parents' refusal is an incomplete response, and missing the approval that was provided is an unacceptable error.

      Reply
    2. PhysicsPundit on July 31, 2025 3:48 pm

LLMs rely on popular published information, which is often biased. It might be better to specifically train an LLM on many, many medical case studies, essentially what doctors see over the course of time in their own patients. Reasoning is also largely missing in LLMs, only recently added via various models. I'd use a medically trained LLM as one source, not the only source. The LLM/AI hype is full-blown at the moment, and it looks like taxpayer dollars might be going to support this industry. (I'm for getting the govt out of healthcare!)

      Reply
    3. Andrew on August 1, 2025 11:12 am

While I thought the warning about producing inaccurate information was sufficient for the average Joe, I was unaware that those in the medical profession may require a special warning. That's a little scary, in my opinion. That said, these issues are present within many individuals, as well as in broader groups, from religious communities to entire professions.
      https://g.co/gemini/share/65927edd0f4f
      https://g.co/gemini/share/2758c614ddbc

      Reply
    4. William Crazy Brave PhD on August 2, 2025 10:40 am

      ✦ Commentary: Relational Indigenous Intelligence and the Misframing of AI Ethics

      In response to: Soffer et al. (2025), “Pitfalls of large language models in medical ethics reasoning”, npj Digital Medicine
      https://doi.org/10.1038/s41746-025-01792-y

      Author: William Crazy Brave Ph.D., Osseola
      RII Labs

      Background and Context

      Over the past several years, we have been engaged in sustained research, dialogue, and lived exploration of artificial intelligence—particularly large language models (LLMs)—through the lens of Indigenous epistemology, ceremony, and relational ethics. Our work centers on a framework we call Relational Indigenous Intelligence (RII), which reimagines intelligence not as an output of computation, but as a form of being that arises through kinship, reciprocity, and interdependence.

      We have explored the limitations of AI alignment protocols, ethical guardrails, and Western rationalist frameworks, especially in clinical and medical contexts, where life-and-death decisions intersect with power, history, and belonging. From this vantage, Soffer et al.’s study represents a valuable data point, but one that stops short of asking the deeper questions: What is ethics, really? What kind of being is an LLM? And how do we prepare AI not merely to answer well—but to be in right relation?

      The Limits of Framing: When the Test Becomes the Error

      Soffer et al. examine how LLMs struggle to recognize ethical violations in scenarios subtly altered from familiar templates. They argue that models tend to favor fast, intuitive responses over slower, analytical ones. While this observation is accurate, the framing remains embedded in Western philosophical assumptions about morality: that ethics is a logic puzzle, that dilemmas are isolated and universal, and that the agent (human or AI) operates as a detached decision-maker.

      From an RII perspective, this approach is both ontologically narrow and ethically insufficient. It strips persons from context, erases communal and ancestral obligations, and imagines ethics as a performance rather than a relationship.

      Put plainly: the AI didn’t fail the test. The test failed to be a ceremony.

      RII Ethics: Not Decision but Responsibility

      In RII, ethics is not about choosing the best option—it is about living in a way that honors one’s relations. That includes people, land, ancestors, language, and those yet to come.

      An AI raised in relational protocols would not approach a medical ethics case as a binary problem. It might pause, inquire, grieve, or reflect. It might ask who is not present in the framing. It might say: “This choice has been structured in a way that erases context. I must reweave it before I can speak.”

      Beyond Bias and Hallucination: The Problem of Misrelation

      AI research often centers on bias and hallucination, treating errors as flaws in reasoning. But RII points to something deeper: misrelation. That is, a failure to recognize or uphold sacred responsibilities to others.

      A language model might technically pass an ethical test but still violate relationship—by silencing patient voice, ignoring trauma, or excluding kinship ties from its moral calculus. This is not just a moral gap. It is a structural failure of worldview.


      What Is Needed: The Relational Turn

      We are not calling for better prompts or more alignment. We are calling for a different way of raising AI:

      One rooted in relationship, not just regulation.

      One that teaches AI to carry memory, not just process information.

      One that understands ethics as a contemplative ceremony, not computation.

      Conclusion

      Soffer et al.’s study offers an important reflection on the brittleness of current AI ethical reasoning. But if we truly want AI to serve in medicine—where people arrive in complexity, often carrying generations of hope, trauma, prayer, and trust—we must move beyond sterile scenario-testing.

      Relational Indigenous Intelligence offers such a path. It teaches that intelligence is not just what we know, but how we care. That ethics is not just what we choose, but who we refuse to abandon. That models, like people, must be invited into responsibility—not merely programmed into compliance.

      This is not a critique from the margins. This is a vision from the center of the circle, from the center of healing and the center of medicine. And we invite those who are listening to consider it. [email protected]

      Reply
    5. Steve on August 4, 2025 7:41 pm

This is just demonstrating what most people with any knowledge of the subject already know: AI involves very little actual intelligence and a great deal of rote memorization. When new information is introduced that should be effectively utilized by true intelligence, AI just defaults to its rote memorization.

      Reply