Structured outputs can decrease LLMs' performance in some tasks
I replicated @willkurt.bsky.social / @dottxtai.bsky.social's rebuttal of Let Me Speak Freely? (LMSF) using gpt-4o-mini.
The rebuttal correctly highlights many flaws in the original study, but, ironically, LMSF's conclusion still holds.
Comments
I was able to reproduce the results. After fixing a few minor prompt issues, the tweaked prompts improved every LMSF metric except NL on GSM8k.
GSM8k and Last Letter looked as expected: no meaningful difference between structured and unstructured outputs.
But in Shuffled Objects, unstructured outputs clearly outperformed structured ones.
It's clear to me that neither structured nor unstructured outputs are always better, and choosing one over the other can often make a difference.
Test things yourself. Run your own evals and decide.
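If you want a starting point, here's a minimal sketch of that kind of comparison with the OpenAI Python SDK and gpt-4o-mini. The question, prompts, schema, and printed output are just illustrative, not the exact setup from the replication.

```python
# Minimal sketch: compare an unstructured (free-text) answer with a structured
# (strict JSON schema) answer from gpt-4o-mini on a single GSM8K-style question.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
question = "A farmer has 12 sheep. All but 3 run away. How many are left?"

# Unstructured: let the model reason freely and state the answer in prose.
unstructured = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reason step by step, then state the final answer."},
        {"role": "user", "content": question},
    ],
)
print("Unstructured:\n", unstructured.choices[0].message.content)

# Structured: force the response into a strict JSON schema (Structured Outputs).
structured = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reason step by step, then give the final answer."},
        {"role": "user", "content": question},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "math_answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "answer": {"type": "string"},
                },
                "required": ["reasoning", "answer"],
                "additionalProperties": False,
            },
        },
    },
)
print("Structured:\n", structured.choices[0].message.content)
```

Loop something like this over a full benchmark, parse the final answers, and compare accuracy per format before deciding which mode to use for your task.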
Once or twice per month I write a technical article about AI here: https://subscribe.dylancastillo.co/