Tech companies and engineering teams rolling out customer-facing AI agents will very quickly realize a massive problem with them:
They are non-deterministic.
The classic engineering loop of build --> test --> run automated tests to catch regressions -- that will NOT work here!
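To make that concrete, here's a minimal sketch (mock agent, all names hypothetical) of why exact-match regression tests break on a non-deterministic system, and what a pass-rate threshold over many trials looks like instead:

```python
import random

# Hypothetical stand-in for a real LLM-backed agent: same input,
# different output on every call (all names here are made up).
def mock_agent(prompt: str) -> str:
    phrasings = [
        "Your refund has been approved.",
        "Approved -- the refund is on its way.",
        "We've approved your refund request.",
    ]
    return random.choice(phrasings)

# Classic regression testing asserts exact output -- with a
# non-deterministic agent, two runs need not agree, so this can flip
# between pass and fail with no code change at all.
def exact_match_test() -> bool:
    return mock_agent("refund please") == mock_agent("refund please")

# One alternative: assert on *behaviour*, not bytes, and require a
# pass rate across many trials instead of a single green check.
def pass_rate(trials: int = 100) -> float:
    passes = sum(
        "refund" in mock_agent("refund please").lower()
        for _ in range(trials)
    )
    return passes / trials

if __name__ == "__main__":
    print(f"pass rate: {pass_rate():.2f}")
```

The mock passes 100% of the time because every phrasing mentions the refund; a real agent would score lower, and the interesting engineering question becomes what threshold you ship at.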
Comments
about 6 months ago, we bought a new washer/dryer; Consumer Reports gave high marks to many LG models, so I went with one of those
there is no manual
I repeat: there is no manual for the washer
unreal
Maybe something about how engineers "observe" how things run.
We could call it something like "observability" maybe.
When you deploy 100s of times a day on 100s of components, you need to observe your system holistically (and effectively) and shorten feedback loops to truly be able to control it.
It's about understanding what's *actually* going on, and being able to ask *new* questions about what happened in the past.
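A rough sketch of what "being able to ask new questions later" implies in practice: emit every agent call as a structured event, rather than a free-text log line. (This is an illustrative shape, not any particular observability product's API.)

```python
import json
import time
import uuid

# Log one structured JSON event per agent call. Field names are
# illustrative assumptions; the point is that a log pipeline can later
# slice by model, temperature, or response size without new code.
def log_agent_call(prompt: str, response: str,
                   model: str, temperature: float) -> dict:
    event = {
        "trace_id": uuid.uuid4().hex,   # correlate with downstream calls
        "timestamp": time.time(),
        "model": model,
        "temperature": temperature,
        "prompt": prompt,
        "response": response,
        "response_chars": len(response),
    }
    print(json.dumps(event))  # one JSON object per line
    return event

event = log_agent_call("refund please", "Approved.", "gpt-x", 0.7)
```

Because each event carries the sampling parameters alongside the text, a question you didn't anticipate ("did high-temperature calls approve more refunds?") becomes a query over old logs instead of a new deploy.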
Do you have any insight in how people have been tackling this? Or have they just not?
I wonder how these agents will be versioned, as I believe they need to be. Will temperature and token window be versioned?
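One way it could work (an assumption, not an established practice): treat every sampling parameter as part of the agent's identity and derive a version id from the whole config, so a temperature tweak is a new version just like a code change.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Illustrative sketch: hash the full agent configuration into a
# short, stable version id. All field names are hypothetical.
@dataclass(frozen=True)
class AgentConfig:
    model: str
    temperature: float
    context_window: int
    system_prompt: str

    def version(self) -> str:
        # Canonical JSON -> stable hash -> short version id.
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]

a = AgentConfig("gpt-x", 0.7, 128_000, "You are a support agent.")
b = AgentConfig("gpt-x", 0.0, 128_000, "You are a support agent.")
print(a.version(), b.version())  # different ids: temperature changed
```

Identical configs always hash to the same id, so the version can be stamped onto every logged agent call and compared across deploys.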
How often is it ok to not record a sale, or cancel your insurance?
The software industry is unprepared for indeterminism.
I wrote about this a couple of years ago: https://www.aylett.co.uk/thoughts/llm_repeatability
"Would you trust it to approve customer refund requests?"
It's going to be hilarious when companies replace customer service with AI agents, and people figure out how to trick those agents into giving them free stuff.