Then I switched to GPT-4o-mini, using LMSF's results as a reference.

I tweaked the prompts and improved every LMSF metric except NL on GSM8k.

GSM8k and Last Letter looked as expected (no difference between structured and unstructured outputs).

But in Shuffled Objects, unstructured outputs clearly surpassed structured ones.