AI folks - how are you making sure your GenAI stuff isn't hallucinating in production?
Evals, or just vibes?
Comments
You can’t
Long answer:
Evals, grounding the answer using citations, and logprobs
https://cookbook.openai.com/examples/using_logprobs
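For the logprobs part, a minimal sketch of what checking token confidence can look like, along the lines of that cookbook page. The model name, prompt, and the 0.9 threshold are placeholders, not recommendations:

```python
# Sketch only: flag low-confidence answers using token logprobs.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, context: str, threshold: float = 0.9):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        logprobs=True,
    )
    choice = resp.choices[0]
    # Average per-token probability as a rough confidence proxy.
    probs = [math.exp(t.logprob) for t in choice.logprobs.content]
    confidence = sum(probs) / len(probs)
    if confidence < threshold:
        return None, confidence  # route to a fallback / human instead of answering
    return choice.message.content, confidence
```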
or tell it "don't hallucinate"
multiple passes can also help
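Roughly what a second pass could look like: one call drafts the answer, a second call grades the draft against the source. Prompts and model name here are made up for illustration, not anyone's production setup:

```python
# Sketch of a "multiple passes" check: draft, then verify against the source.
from openai import OpenAI

client = OpenAI()

def draft_then_verify(question: str, source: str) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Using only this source:\n{source}\n\nAnswer: {question}"}],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Source:\n{source}\n\nAnswer:\n{draft}\n\n"
                               "Does the answer make claims the source doesn't support? "
                               "Reply SUPPORTED or UNSUPPORTED.")}],
    ).choices[0].message.content

    if "UNSUPPORTED" in verdict:
        return "I couldn't verify that against the source."
    return draft
```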
It most definitely can't magically fix itself if it doesn't know what the problem is.
https://www.youtube.com/watch?v=oblqGaOy_qI
Hallucination is the feature.
It's only when the hallucinations aren't factually correct that people don't like them and complain. But as long as you're using LLMs, hallucination is all you're gonna get.
Open source, with growing TS support
Disclaimer: I work here
There was a great talk by https://honeycomb.io at KubeCon 2023 that went into detail about how they used o11y to check the results of queries generated by AI.
TLDR is you can't stop it hallucinating, but you can always tweak it if you observe it.
This is what Apple does with Apple Intelligence, for instance: https://x.com/burkov/status/1852169539124965490
Also useful to add a testing layer — here’s a good video on it https://youtu.be/TSNAvFJoP4M?si=QpljUjGWawOysVBR
Worked better than it sounds :P
I’ve also seen the approach of asking X times and then picking the best result out of the batch. Of course this increases costs substantially if you need to do it for every interaction.
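A hedged sketch of that best-of-N idea; the `n=3`, temperatures, and judging prompt are arbitrary choices for illustration:

```python
# Sketch: generate N candidates, then have a second call pick the best one.
from openai import OpenAI

client = OpenAI()

def best_of_n(prompt: str, n: int = 3) -> str:
    # n completions in one request; note this multiplies output-token cost by n.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=0.8,
    )
    candidates = [c.message.content for c in resp.choices]

    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    pick = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Pick the most accurate answer to '{prompt}'. "
                               f"Reply with its number only.\n\n{numbered}")}],
        temperature=0,
    ).choices[0].message.content

    try:
        return candidates[int(pick.strip().strip("[]"))]
    except (ValueError, IndexError):
        return candidates[0]  # fall back if the judge doesn't return a clean number
```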
At what stage in the process? While the user is waiting? Or as an eval?
I think this is called agentic workflows, but I struggle with terminology
But it's not 100%, so in the case of OpenAI we go through threads randomly
Do you check the DB itself, or do you have a UI over the top you use for making the checking easier?
For other providers, yes - having such storage+UI makes sense.
Kind of like a test suite, but for a probabilistic system.
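Something like this, maybe: a pytest-style eval that samples each case a few times and asserts a minimum pass rate instead of an exact match. The cases and the `app_answer()` pipeline are placeholders you'd swap for your own:

```python
import pytest
from openai import OpenAI

client = OpenAI()

# Placeholder eval cases; in practice these come from real production questions.
CASES = [
    ("What year was Acme founded?", "2014"),
    ("Which plan includes SSO?", "Enterprise"),
]

RUNS = 5             # how many times to sample each case
MIN_PASS_RATE = 0.8  # tolerate some variance instead of demanding 5/5

def app_answer(question: str) -> str:
    # Stand-in for your real pipeline (RAG, agent, whatever); swap this out.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

@pytest.mark.parametrize("question,expected", CASES)
def test_answer_contains_expected_fact(question, expected):
    hits = sum(expected.lower() in app_answer(question).lower() for _ in range(RUNS))
    assert hits / RUNS >= MIN_PASS_RATE
```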
If I were building something like this from scratch I'd look into OpenAI's Structured Outputs: https://openai.com/index/introducing-structured-outputs-in-the-api/
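A minimal sketch of how that could look with the Structured Outputs feature from that link, forcing the model to return an answer plus the citations it claims support it. Model name and schema are assumptions, not from the thread:

```python
# Sketch: constrain the response to a schema that includes citations.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[str]   # which source passages back the answer
    confident: bool        # model's own claim that the answer is supported

def ask(question: str, sources: list[str]) -> GroundedAnswer:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # placeholder; any Structured Outputs model
        messages=[
            {"role": "system",
             "content": "Answer only from the numbered sources and cite them."},
            {"role": "user",
             "content": "\n".join(f"[{i}] {s}" for i, s in enumerate(sources))
                        + f"\n\nQuestion: {question}"},
        ],
        response_format=GroundedAnswer,
    )
    return completion.choices[0].message.parsed
```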