This is excellent - crammed with practical advice about how to build useful systems that use LLMs to run tools in a loop to achieve a goal. Wrote some short notes here: https://simonwillison.net/2025/Jan/11/agents/
Comments
I need to read this again more deeply. But what do you make of this? I can't understand the value proposition. The explanations are nice, but the setup and validation steps seem immense, time-consuming, and tightly coupled to the systems they work in, while correct output isn't even guaranteed.
This is interesting, but there are some really bad base assumptions about the 'reasoning' AI is doing that fundamentally misunderstand the technology. It can't reason; it's giving you the average of the data it's been given.
Not that there isn't a use case for that, but when it comes to things like forecasting it can be extremely dangerous. Assuming today will be like yesterday is how you get wiped out by black swans.
These tools are not accurate, but that isn't a problem when you use them for _inputs_ (say, classifying whether images are galaxies or whether cells are cancerous). You account for the accuracy and have a human interpret the results with that in mind.
This article is proposing _outputs_, which means ANY mistakes by the AI will be high cost, with no chance for a human to correct them.
Best I've seen, with max cloud compute costs, is 90%ish accurate.
Whether or not LLMs can "reason" very much depends on which definition of "reasoning" you are using
I'm confident that they can perform an imitation of "reasoning" that's good enough for things like executing a plan to run some tools with a high enough success rate to be useful
They are predictive language models; they cannot reason. That's why every single one breaks when you give it trick questions.
In this paper they clearly identify the success rate needed and it's way past what any current model is capable of.
Even a 1% error rate compounds to gibberish really fast. Meanwhile, AI is slowing down; each iteration is less of a step above the previous one. Realistically, we have to assume 90-95% is the best accuracy we're ever going to get.
And that rules out agents.
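To make the compounding claim above concrete, here is a quick back-of-the-envelope sketch. It assumes every step of an agent run must succeed and that steps fail independently at a fixed rate, which is a simplification (real failures are correlated and sometimes recoverable):

```python
# Back-of-the-envelope: probability an entire multi-step run is correct,
# assuming each step succeeds independently with the given per-step accuracy.
def run_success_rate(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

for acc in (0.99, 0.95, 0.90):
    for steps in (10, 20, 50):
        print(f"{acc:.0%}/step over {steps} steps -> "
              f"{run_success_rate(acc, steps):.1%} chance the whole run succeeds")
```

At 99% per step, a 50-step run still only succeeds about 60% of the time; at 90% per step it's under 1%.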
Yes, but then you don't have "agents". Getting the gist is fine when you're translating a menu for yourself; it's terrible if you're the restaurant and want to reach customers who speak another language.
Thank you for sharing this. It's going to take me a bit to get through it, but from skimming it, the planning sections align nicely with a few things I have thought a lot about but was slightly too overwhelmed to commit to code.
With GPT-4(ish)-level LLMs being locally accessible, I wonder what embedding an agent in some of my Django projects might look like. It would be something like giving it some tools and data, and then small jobs it can do.
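As a rough illustration of that idea, here is a minimal tool-in-a-loop sketch. It is deliberately SDK-agnostic: `call_llm` is a placeholder for whatever local model API you use, and the Django-flavoured tool names (`count_overdue_invoices`, `send_reminder_email`) are made up for the example:

```python
import json

# Hypothetical tools the agent can call; in a Django project these might wrap
# ORM queries or queue tasks. Both are stand-ins for illustration only.
def count_overdue_invoices() -> str:
    # e.g. Invoice.objects.filter(due_date__lt=timezone.now(), paid=False).count()
    return "3"

def send_reminder_email(invoice_id: str) -> str:
    # e.g. django.core.mail.send_mail(...) or a queued background task
    return f"reminder queued for invoice {invoice_id}"

TOOLS = {
    "count_overdue_invoices": count_overdue_invoices,
    "send_reminder_email": send_reminder_email,
}

def call_llm(messages: list[dict]) -> dict:
    """Placeholder for the local model. A real version would send `messages` to
    your LLM and parse its reply into {"tool": name, "args": {...}} or
    {"final": "answer"}. Here it returns a canned script so the loop runs."""
    tool_turns = sum(1 for m in messages if m["role"] == "tool")
    if tool_turns == 0:
        return {"tool": "count_overdue_invoices", "args": {}}
    return {"final": "There are 3 overdue invoices."}

def run_agent(goal: str, max_steps: int = 10) -> str:
    messages = [
        {"role": "system", "content": f"You can call these tools: {list(TOOLS)}"},
        {"role": "user", "content": goal},
    ]
    # Cap the number of steps so a confused model can't loop forever.
    for _ in range(max_steps):
        reply = call_llm(messages)
        if "final" in reply:
            return reply["final"]
        result = TOOLS[reply["tool"]](**reply.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps({"result": result})})
    return "Gave up: step limit reached."

print(run_agent("How many overdue invoices do we have?"))
```

The step cap feels like the important part: a confused model should run out of turns rather than loop forever against your database.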
The longer the context window, the longer the plan can be, thanks to context memory. The stronger the interpretability, the better the plan, thanks to proper causal inference.