Found 2 big issues with Gemini's structured outputs (SO):
1. Using constrained decoding seems to lower performance in reasoning tasks.
2. The Generative AI SDK can break your model's reasoning.
Just re-ran the Let Me Speak Freely benchmarks with Gemini and got some interesting results. There are three ways to get structured output from Gemini:
1. Forced function calling (FC): https://ai.google.dev/gemini-api/tutorials/extract_structured_data
2. Schema in prompt (SO-Prompt): https://ai.google.dev/gemini-api/docs/structured-output?lang=python#supply-schema-in-prompt
3. Schema in model config (SO-Schema): https://ai.google.dev/gemini-api/docs/structured-output?lang=python#supply-schema-in-config
SO-Prompt works well. But FC and SO-Schema have a major flaw: they can make the model produce the answer before its reasoning.
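For reference, here is a minimal sketch of SO-Prompt vs. SO-Schema with the google-generativeai Python SDK (the model name, task, and the reasoning/solution keys are placeholders, not anything from the benchmarks):

```python
import typing_extensions as typing
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

class Answer(typing.TypedDict):
    reasoning: str
    solution: str

task = "A juggler has 16 balls. Half are golf balls, and half of those are blue. How many blue golf balls are there?"

# SO-Prompt: JSON mode only, the schema is described in the prompt text.
so_prompt = model.generate_content(
    f'{task}\nThink step by step, then reply as JSON: {{"reasoning": str, "solution": str}}',
    generation_config=genai.GenerationConfig(response_mime_type="application/json"),
)

# SO-Schema: the schema is passed in the generation config (constrained decoding).
so_schema = model.generate_content(
    task,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=Answer,
    ),
)

print(so_prompt.text)
print(so_schema.text)
```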
You can work around this in SO-Schema by being smart with key names: the fields tend to come out in sorted key order, so the reasoning key has to sort before the answer key. Instead of "reasoning" and "answer", use something like "reasoning" and "solution".
But the fix doesn't work in the Generative AI SDK. Other users have already reported this issue.
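Here's what the key trick looks like in code, assuming (as users have reported) that the generated fields follow sorted key order rather than declaration order:

```python
import typing_extensions as typing

# Declared order doesn't stick: "answer" sorts before "reasoning", so the model
# fills in the answer first and the chain of thought comes too late to help.
class BrokenAnswer(typing.TypedDict):
    reasoning: str
    answer: str

# "reasoning" sorts before "solution", so the model reasons before it answers.
class FixedAnswer(typing.TypedDict):
    reasoning: str
    solution: str
```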
For the benchmarks, I excluded FC and used already-sorted keys for SO-Schema.
Even so, SO-Schema performed worse than unconstrained natural-language (NL) generation in 5 out of 6 tasks in my tests. And on Shuffled Objects the delta was huge: 97.15% for NL vs. 86.18% for SO-Schema.
Given these results and the key-ordering issue, I'd suggest avoiding SO-Schema unless you really need it. SO-Prompt seems like the better alternative.
For now, there are no clear guidelines on where each method works better.
Your best bet is to test your LLM with your own evals.
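For example, a minimal harness for comparing unconstrained NL output against SO-Schema on your own tasks could look like this (the eval set, the naive substring grader, and the model name are all hypothetical):

```python
import typing_extensions as typing
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")  # placeholder model name

class Answer(typing.TypedDict):
    reasoning: str
    solution: str

# Your own eval set; this two-field format is just an illustration.
tasks = [
    {"prompt": "I have 3 apples and eat one. How many are left?", "expected": "2"},
]

def run_eval(generation_config):
    """Fraction of tasks whose expected answer shows up in the model output."""
    correct = 0
    for task in tasks:
        response = model.generate_content(task["prompt"], generation_config=generation_config)
        correct += int(task["expected"] in response.text)  # naive grader, swap in your own
    return correct / len(tasks)

# Unconstrained NL baseline vs. constrained SO-Schema on the same tasks.
nl_score = run_eval(genai.GenerationConfig())
schema_score = run_eval(genai.GenerationConfig(
    response_mime_type="application/json",
    response_schema=Answer,
))
print(f"NL: {nl_score:.2%}  SO-Schema: {schema_score:.2%}")
```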