A friend sent this article to me today.
It feels like a good moment to remind us: the LLM outperformed physicians on STRUCTURED cases.
Someday the AIs may perform better than us in the unstructured, messy, real-world patient encounter...but that day is not today.🩺🛟
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395
Comments
1. It mixed multiple specialties in a way that wasn’t helpful
2. It had a small sample size, which included trainees
3. They provided one vignette as an example, and the differential diagnosis wasn’t an emergency medicine (EM) differential
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825399
@jama.com @sumantranji.bsky.social
Not all patients are good communicators.
They remember things later and call back.
The power distance between MD and patient may mean that patients hold back essential info.
And a thousand other things.
LLM = not ready for prime time.
Thanks @meganranney.bsky.social for your kind words about my editorial
AIs are poor at recognizing data gaps and more prone to make assumptions based on past data.
That naturally elevates their tendency to fish out the associations underlying uncommon disease patterns.
http://research.google/blog/amie-a-research-ai-system-for-diagnostic-medical-reasoning-and-conversations/
LLMs have promise,
but they have not proven themselves yet, at least not for triage.
Better data science and mixed methods will do better.
Where can one find the best current data on the safety of AI use in diagnostics, broken down by the type of information it was used on?