dylancastillo.co
About: dylancastillo.co Projects: dylancastillo.co/projects
28 posts 31 followers 103 following
comment in response to post
Thank you, I'll update the article!
comment in response to post
Here's the full post: dylancastillo.co/posts/gemin... and the github code: github.com/dylanjcasti...
comment in response to post
In any case, for me, the key takeaway is that SO can decrease (or increase!) performance in some tasks. Be conscious of that. For now, there are no clear guidelines on where each method works better. Your best bet is to test your LLM by running your own evals.
comment in response to post
So, if you only consider constrained decoding (JSON-Schema), performance decreases across the board vs. NL. Given this result and the key-sorting issue, I'd suggest avoiding JSON-Schema unless you really need it. JSON-Prompt seems like a better alternative.
comment in response to post
Still, I could work around the issue and re-run the benchmarks. NL and JSON-Prompt are tied. But JSON-Schema performed worse than NL in 5 out of 6 tasks in my tests. Plus, in Shuffled Objects, it did so with a huge delta: 97.15% for NL vs. 86.18% for JSON-Schema.
comment in response to post
There's a propertyOrdering param documented in Vertex AI that should solve this: cloud.google.com/vertex-ai/g... But it doesn't work in the Generative AI SDK. Other users have already reported this issue. For the benchmarks, I excluded FC and used already sorted keys for JSON-Schema.
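For reference, a rough sketch of what the documented param looks like in a raw responseSchema payload (REST-style field names; "reasoning" and "answer" are placeholder keys, and as noted above the SDK currently ignores this):

```python
# Hypothetical responseSchema dict following the Vertex AI docs.
response_schema = {
    "type": "OBJECT",
    "properties": {
        "reasoning": {"type": "STRING"},
        "answer": {"type": "STRING"},
    },
    "required": ["reasoning", "answer"],
    # Asks the API to emit keys in this order instead of sorting them alphabetically.
    "propertyOrdering": ["reasoning", "answer"],
}
```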
comment in response to post
Before generation, they reorder the schema keys. SO-Schema does it alphabetically and FC does it in a random manner (?). This can break your CoT. You can fix SO-Schema by being smart with keys. Instead of "reasoning" and "answer", use something like "reasoning" and "solution".
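A tiny sketch of the idea, assuming Pydantic models as the schema (field names are just for illustration):

```python
from pydantic import BaseModel

class BadOrder(BaseModel):
    reasoning: str
    answer: str  # "answer" sorts before "reasoning", so the answer is generated first

class GoodOrder(BaseModel):
    reasoning: str
    solution: str  # "solution" sorts after "reasoning", so the CoT stays first

print(sorted(BadOrder.model_fields))   # ['answer', 'reasoning'] -> CoT broken
print(sorted(GoodOrder.model_fields))  # ['reasoning', 'solution'] -> CoT intact
```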
comment in response to post
Gemini has 3 ways of generating SO:
1. Forced function calling (FC): ai.google.dev/gemini-api/...
2. Schema in prompt (SO-Prompt): ai.google.dev/gemini-api/...
3. Schema in model config (SO-Schema): ai.google.dev/gemini-api/...
SO-Prompt works well. But FC and SO-Schema have a major flaw.
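Roughly, the three setups look like this. This is a minimal sketch assuming the google.generativeai SDK; the model name, prompt, tool, and schema are placeholders, not the benchmark config:

```python
import google.generativeai as genai
from typing_extensions import TypedDict

class Answer(TypedDict):
    reasoning: str
    answer: str

def record_answer(reasoning: str, answer: str) -> None:
    """Placeholder tool the model is forced to call in FC mode."""

model = genai.GenerativeModel("gemini-1.5-flash")
prompt = "Solve the task. Think step by step, then give the final answer."

# 1. Forced function calling (FC): mode="ANY" forces a tool call.
fc = model.generate_content(
    prompt,
    tools=[record_answer],
    tool_config={"function_calling_config": {"mode": "ANY"}},
)

# 2. Schema in prompt (SO-Prompt): the JSON format is only described in the text.
so_prompt = model.generate_content(
    prompt + ' Reply as JSON: {"reasoning": ..., "answer": ...}'
)

# 3. Schema in model config (SO-Schema): constrained decoding against the schema.
so_schema = model.generate_content(
    prompt,
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",
        response_schema=Answer,
    ),
)
```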
comment in response to post
Pi by Darren Aronofsky
comment in response to post
Here's the post with all the code required to replicate the results: dylancastillo.co/posts/say-w... Once or twice a month, I write a technical article about AI here: subscribe.dylancastillo.co/
comment in response to post
I’m not saying you should default to unstructured outputs. In fact, I usually go with structured. But it’s clear to me that neither structured nor unstructured outputs are always better, and choosing one or the other can often make a difference. Test things yourself. Run your own evals and decide.
comment in response to post
Then I switched to GPT-4o-mini, using LMSF's results as a reference. I tweaked the prompts and improved on all of LMSF's metrics except NL in GSM8k. GSM8k and Last Letter looked as expected (no difference), but in Shuffled Objects, unstructured outputs clearly surpassed structured ones.
comment in response to post
I began by replicating .txt's results using LLaMA-3-8B-Instruct (the model considered in the rebuttal). I was able to reproduce the results and, after fixing a few minor prompt issues, achieved a slight improvement in most metrics.
comment in response to post
Good stuff! Will be useful soon. I'm about to jump ship from Poetry but old habits die hard.
comment in response to post
ML is a subset of AI
comment in response to post
I believe you're the one creating the straw man. People are lynching a researcher for publishing a dataset of publicly available data that, if anything, will be used to improve this same social network where they're doing the lynching. I'm trying to make clear that AI has tons of positive use cases.
comment in response to post
I often do it the other way around. I start with o1-mini. If it's unable to solve the issue, I ask o1-preview. It usually saves me time.