Working on a "special" kind of data (health, medical images, etc.) means experiencing the disappointment of realising that the method in a paper relies critically on "having a very good image captioner", "asking GPT-4V", etc.
Comments
There's also a huge problem for reproducibility: the GPT "models" are not really models, they are services that do not guarantee identical performance over time.
Concretely, OAI updates the thing in ways that I'm sure make their numbers look better but that break automation workflows. They do this silently, with no announcement, usually on weekends.
I don't use OAI endpoints, so I haven't experienced this directly, but yes, there are serious issues with reproducibility, even assuming researchers actually specified which version of a given service they were using when they wrote their paper...
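For what "specifying the version" can look like in practice, here is a minimal sketch, assuming the OpenAI Python client and a dated model snapshot (the model name and prompt are illustrative, not from this thread). Pinning a dated snapshot and logging the model string the service reports back is the bare minimum needed to cite a version in a paper; it still does not guarantee identical behaviour over time.

```python
# Minimal sketch: pin a dated model snapshot and record what the service
# actually ran, so the call can at least be documented for reproducibility.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",  # dated snapshot rather than a floating alias like "gpt-4o"
    messages=[{"role": "user", "content": "Summarise the findings in this chest X-ray report: ..."}],
    temperature=0,  # more deterministic decoding, but not a reproducibility guarantee
)

# The response includes the model string the service actually used; log it.
print(response.model)
print(response.choices[0].message.content)
```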
OAI's behaviour is fundamentally about delivering reliability and improvement for a chatbot, and only a chatbot. Other uses are not considered in enough detail for them to even notice when they are breaking them.
Lots of people are building such tools for medical data (I myself have spent some time trying to caption chest X-rays, and my colleagues built a model to promptably edit CXRs), but you can't assume their performance is unquestionably good.