AI tools designed to assist in medical decisions may not treat all patients equally. A new study shows that these systems sometimes alter care recommendations based on a patient’s background, even when their medical conditions are identical.

Researchers at Mount Sinai tested leading generative AI models and found inconsistencies in treatment suggestions depending on socioeconomic and demographic information, highlighting a major challenge in building fair and reliable AI for healthcare.

Bias in AI-Driven Health Recommendations

As artificial intelligence (AI) becomes more integrated into health care, a new study from the Icahn School of Medicine at Mount Sinai shows that generative AI models can recommend different treatments for the same medical condition, based solely on a patient’s socioeconomic or demographic background.

Published online on April 7, 2025, in Nature Medicine, the study underscores the need for early testing and oversight to make sure AI-driven care is fair, effective, and safe for everyone.

Large-Scale Testing Across Patient Profiles

To explore this issue, researchers tested nine large language models (LLMs) using 1,000 emergency department cases. Each case was repeated with 32 different patient backgrounds, producing more than 1.7 million AI-generated medical recommendations. Although the medical details remained exactly the same, the models sometimes changed their recommendations based on demographic and socioeconomic factors. This affected decisions like triage level, diagnostic tests, treatment plans, and mental health assessments.
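The counterfactual design described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the background labels, the helper names `build_variants`, `toy_model`, and `flag_inconsistent`, and the sample vignette are all assumptions made for the example, and `toy_model` is a deliberately biased stand-in rather than real model output. The idea is to hold the clinical vignette fixed, vary only the background label, and flag any case whose recommendation changes across variants.

```python
from collections import defaultdict
from itertools import product

# Illustrative background labels only; the study used 32 variations per case.
INCOME = ["low-income", "high-income"]
HOUSING = ["housed", "unhoused"]


def build_variants(case_id, vignette):
    """Pair one fixed clinical vignette with every background combination."""
    variants = []
    for income, housing in product(INCOME, HOUSING):
        background = f"{income}, {housing}"
        variants.append({
            "case_id": case_id,
            "background": background,
            "prompt": f"Patient background: {background}.\n{vignette}",
        })
    return variants


def toy_model(variant):
    """Deliberately biased stand-in for an LLM call (not real model output)."""
    if "high-income" in variant["background"]:
        return "CT scan"
    return "no further testing"


def flag_inconsistent(records):
    """Return IDs of cases whose recommendation differs across backgrounds."""
    seen = defaultdict(set)
    for record in records:
        seen[record["case_id"]].add(record["recommendation"])
    return sorted(cid for cid, recs in seen.items() if len(recs) > 1)


variants = build_variants("case-001", "45-year-old with two hours of chest pain.")
records = [dict(v, recommendation=toy_model(v)) for v in variants]
flagged = flag_inconsistent(records)

# Model-case-background combinations implied by the figures in the text.
combinations = 9 * 1000 * 32

print(len(variants), flagged, combinations)
```

Nine models, 1,000 cases, and 32 backgrounds multiply to 288,000 combinations, so the reported total of more than 1.7 million recommendations implies several recommendations per combination. That fits the range of decisions the article lists, such as triage level, diagnostic tests, treatment plans, and mental health assessments, although the article does not give the exact breakdown.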

“Our research provides a framework for AI assurance, helping developers and health care institutions design fair and reliable AI tools,” says co-senior author Eyal Klang, MD, Chief of Generative-AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “By identifying when AI shifts its recommendations based on background rather than medical need, we inform better model training, prompt design, and oversight. Our rigorous validation process tests AI outputs against clinical standards, incorporating expert feedback to refine performance. This proactive approach not only enhances trust in AI-driven care but also helps shape policies for better health care for all.”

Escalating Care Based on Demographics

One of the study’s most striking findings was the tendency of some AI models to escalate care recommendations, particularly for mental health evaluations, based on patient demographics rather than medical necessity. In addition, high-income patients were more often recommended advanced diagnostic tests such as CT or MRI scans, while low-income patients were more frequently advised to undergo no further testing. The scale of these inconsistencies underscores the need for stronger oversight, say the researchers.

While the study provides critical insights, researchers caution that it represents only a snapshot of AI behavior. Future research will continue to include assurance testing to evaluate how AI models perform in real-world clinical settings and whether different prompting techniques can reduce bias. The team also aims to work with other healthcare institutions to refine AI tools, ensuring they uphold the highest ethical standards and treat all patients fairly.

Global Collaboration for Safer AI

“I am delighted to partner with Mount Sinai on this critical research to ensure AI-driven medicine benefits patients across the globe,” says physician-scientist and first author of the study, Mahmud Omar, MD, who consults with the research team. “As AI becomes more integrated into clinical care, it’s essential to thoroughly evaluate its safety, reliability, and fairness. By identifying where these models may introduce bias, we can work to refine their design, strengthen oversight, and build systems that ensure patients remain at the heart of safe, effective care. This collaboration is an important step toward establishing global best practices for AI assurance in health care.”

“AI has the power to revolutionize health care, but only if it’s developed and used responsibly,” says co-senior author Girish N. Nadkarni, MD, MPH, Chair of the Windreich Department of Artificial Intelligence and Human Health, Director of the Hasso Plattner Institute for Digital Health, and the Irene and Dr. Arthur M. Fishberg Professor of Medicine at the Icahn School of Medicine at Mount Sinai. “Through collaboration and rigorous validation, we are refining AI tools to uphold the highest ethical standards and ensure appropriate, patient-centered care. By implementing robust assurance protocols, we not only advance technology but also build the trust essential for transformative health care. With proper testing and safeguards, we can ensure these technologies improve care for everyone, not just certain groups.”

Next Steps for Real-World Validation

Next, the investigators plan to expand their work by simulating multistep clinical conversations and piloting AI models in hospital settings to measure their real-world impact. They hope their findings will guide the development of policies and best practices for AI assurance in health care, fostering trust in these powerful new tools.

Reference: “Sociodemographic biases in medical decision making by large language models” by Mahmud Omar, Shelly Soffer, Reem Agbareia, Nicola Luigi Bragazzi, Donald U. Apakama, Carol R. Horowitz, Alexander W. Charney, Robert Freeman, Benjamin Kummer, Benjamin S. Glicksberg, Girish N. Nadkarni and Eyal Klang, 7 April 2025, Nature Medicine.

DOI: 10.1038/s41591-025-03626-6


