I'm new here. I'm a family physician. I realize that is not the typical opening line on this forum.

Over the past two years, our team has published what I believe is one of the most extensive bodies of empirical work on how LLMs break under pressure in clinical decision-making. Biases, hallucinations, susceptibility to misinformation, the ways these models bend when you push on them with real clinical scenarios. The findings bothered me enough that I went looking for the community that takes AI safety most seriously. That search led me here.

The short version: we tested every tens of models using counterfactual demographic swapping at scale. Same clinical presentation, same vitals, same history. Change one demographic variable. Watch the recommendations shift. Across 1.7M controlled outputs and 9 models (Nature Medicine), marginalized groups received mental health referrals at 6-7x the clinically indicated rate for identical presentations. Models repeated fabricated lab values as fact 83% of the time (Nature Comms Medicine). Misinformation susceptibility hit 46% when wrapped in clinical formatting (Lancet Digital Health). 40+ papers. Every model. Same pattern.

These failure modes, demographic-conditional behavior, sycophantic agreement with false premises, confident confabulation, are not unique to medicine. They are general model properties. Medicine is just where you can measure the harm precisely, because we have ground truth and protected attributes. It is arguably the first deployment domain where alignment failures are causing measurable harm at scale, on people who cannot opt out.

The AI safety community has METR for frontier capabilities, HELM for language, and dozens of benchmarks for code and math. For clinical AI, where the stakes are lives, there is nothing open, standardized, or continuously maintained. Zero.

I posted a Manifund project to fix this. ClinSafe: an open platform to stress-test medical AI for bias, hallucination, and safety failures. $25K, 6 months, free on GitHub and HuggingFace. The pipeline already works. The data already exists. What does not exist is a tool anyone outside our lab can use.

I'm Head of Research at BRIDGE GenAI Lab (BIDMC/Harvard Medical School) and a research scientist at Mount Sinai's AI department. I treat patients in the morning and build evaluation systems in the afternoon. This is the work I care about most in the world, and I am willing to learn any community's language, post on any forum, and talk to anyone who will listen to make it happen.

I would genuinely welcome feedback on whether this kind of empirical deployment-safety work resonates with the priorities here. I am also happy to share papers, data, or a pipeline demo.

Cheers :)

Mahmud 

9

0
0

Reactions

0
0
Comments
No comments on this post yet.
Be the first to respond.
Curated and popular this week
Relevant opportunities