Structured outputs can decrease LLMs' performance in some tasks
I replicated @willkurt.bsky.social / @dottxtai.bsky.social's rebuttal of Let Me Speak Freely? (LMSF) using gpt-4o-mini.
The rebuttal correctly highlights many flaws in the original study, but, ironically, LMSF's conclusion still holds.
Comments
I was able to reproduce the results. After fixing a few minor prompt issues, the tweaked prompts improved every LMSF metric except NL on GSM8k.
GSM8k and Last Letter looked as expected: no meaningful difference between structured and unstructured outputs.
But in Shuffled Objects, unstructured outputs clearly outperformed structured ones.
It's clear to me that neither structured nor unstructured outputs are always better, and choosing one over the other can often make a difference.
Test things yourself. Run your own evals and decide.
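If you want a starting point, here's a minimal sketch of that kind of comparison with the OpenAI Python SDK and gpt-4o-mini. The question, prompts, schema, and printed output are just illustrative, not the exact setup from the replication.

```python
# Minimal sketch: compare an unstructured (free-text) answer with a structured
# (strict JSON schema) answer from gpt-4o-mini on a single GSM8K-style question.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
question = "A farmer has 12 sheep. All but 3 run away. How many are left?"

# Unstructured: let the model reason freely and state the answer in prose.
unstructured = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reason step by step, then state the final answer."},
        {"role": "user", "content": question},
    ],
)
print("Unstructured:\n", unstructured.choices[0].message.content)

# Structured: force the response into a strict JSON schema (Structured Outputs).
structured = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Reason step by step, then give the final answer."},
        {"role": "user", "content": question},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "math_answer",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "reasoning": {"type": "string"},
                    "answer": {"type": "string"},
                },
                "required": ["reasoning", "answer"],
                "additionalProperties": False,
            },
        },
    },
)
print("Structured:\n", structured.choices[0].message.content)
```

Loop something like this over a full benchmark, parse the final answers, and compare accuracy per format before deciding which mode to use for your task.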
Once or twice per month I write a technical article about AI here: https://subscribe.dylancastillo.co/