AI folks - how are you making sure your GenAI stuff isn't hallucinating in production?
Evals, or just vibes?
Comments
You can’t
Long answer:
Evals, grounding the answer using citations, and logprobs
https://cookbook.openai.com/examples/using_logprobs
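For the logprobs part, a minimal sketch of what checking token confidence can look like, along the lines of that cookbook page. The model name, prompt, and the 0.9 threshold are placeholders, not recommendations:

```python
# Sketch only: flag low-confidence answers using token logprobs.
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_with_confidence(question: str, context: str, threshold: float = 0.9):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Answer ONLY from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        logprobs=True,
    )
    choice = resp.choices[0]
    # Average per-token probability as a rough confidence proxy.
    probs = [math.exp(t.logprob) for t in choice.logprobs.content]
    confidence = sum(probs) / len(probs)
    if confidence < threshold:
        return None, confidence  # route to a fallback / human instead of answering
    return choice.message.content, confidence
```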
or tell it "don't hallucinate"
multiple passes can also help
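Roughly what a second pass could look like: one call drafts the answer, a second call grades the draft against the source. Prompts and model name here are made up for illustration, not anyone's production setup:

```python
# Sketch of a "multiple passes" check: draft, then verify against the source.
from openai import OpenAI

client = OpenAI()

def draft_then_verify(question: str, source: str) -> str:
    draft = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Using only this source:\n{source}\n\nAnswer: {question}"}],
    ).choices[0].message.content

    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Source:\n{source}\n\nAnswer:\n{draft}\n\n"
                               "Does the answer make claims the source doesn't support? "
                               "Reply SUPPORTED or UNSUPPORTED.")}],
    ).choices[0].message.content

    if "UNSUPPORTED" in verdict:
        return "I couldn't verify that against the source."
    return draft
```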
It most definitely can't magically fix itself if it doesn't know what the problem is.
https://www.youtube.com/watch?v=oblqGaOy_qI
Hallucination is the feature.
It's only when the hallucinations aren't factually correct that people don't like them and complain. But as long as you're using LLMs, hallucination is all you're gonna get.
Open source, with growing TS support
Disclaimer: I work here
There was a great talk by https://honeycomb.io at KubeCon 2023 that went into detail about how they used o11y to check the results of queries generated by AI.
TLDR is you can't stop it hallucinating, but you can always tweak it if you observe it.
This is what Apple does with Apple Intelligence, for instance: https://x.com/burkov/status/1852169539124965490
Also useful to add a testing layer — here’s a good video on it https://youtu.be/TSNAvFJoP4M?si=QpljUjGWawOysVBR
Worked better than it sounds :P
I’ve also seen the approach of asking X times and then picking the best result out of the batch. Of course this increases costs substantially if you need to do it for every interaction.
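A hedged sketch of that best-of-N idea; the `n=3`, temperatures, and judging prompt are arbitrary choices for illustration:

```python
# Sketch: generate N candidates, then have a second call pick the best one.
from openai import OpenAI

client = OpenAI()

def best_of_n(prompt: str, n: int = 3) -> str:
    # n completions in one request; note this multiplies output-token cost by n.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        n=n,
        temperature=0.8,
    )
    candidates = [c.message.content for c in resp.choices]

    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    pick = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": (f"Pick the most accurate answer to '{prompt}'. "
                               f"Reply with its number only.\n\n{numbered}")}],
        temperature=0,
    ).choices[0].message.content

    try:
        return candidates[int(pick.strip().strip("[]"))]
    except (ValueError, IndexError):
        return candidates[0]  # fall back if the judge doesn't return a clean number
```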
At what stage in the process? While the user is waiting? Or as an eval?
I think this is called agentic workflows, but I struggle with terminology
But it's not 100%, so in the case of OpenAI we go through threads randomly
Do you check the DB itself, or do you have a UI over the top you use for making the checking easier?
For other providers, yes - having such storage+UI makes sense.
Kind of like a test suite, but for a probabilistic system.
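Something like this, maybe: a pytest-style eval that samples each case a few times and asserts a minimum pass rate instead of an exact match. The cases and the `app_answer()` pipeline are placeholders you'd swap for your own:

```python
import pytest
from openai import OpenAI

client = OpenAI()

# Placeholder eval cases; in practice these come from real production questions.
CASES = [
    ("What year was Acme founded?", "2014"),
    ("Which plan includes SSO?", "Enterprise"),
]

RUNS = 5             # how many times to sample each case
MIN_PASS_RATE = 0.8  # tolerate some variance instead of demanding 5/5

def app_answer(question: str) -> str:
    # Stand-in for your real pipeline (RAG, agent, whatever); swap this out.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

@pytest.mark.parametrize("question,expected", CASES)
def test_answer_contains_expected_fact(question, expected):
    hits = sum(expected.lower() in app_answer(question).lower() for _ in range(RUNS))
    assert hits / RUNS >= MIN_PASS_RATE
```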
If I were building something like this from scratch I'd look into OpenAI's Structured Outputs: https://openai.com/index/introducing-structured-outputs-in-the-api/
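A minimal sketch of how that could look with the Structured Outputs feature from that link, forcing the model to return an answer plus the citations it claims support it. Model name and schema are assumptions, not from the thread:

```python
# Sketch: constrain the response to a schema that includes citations.
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[str]   # which source passages back the answer
    confident: bool        # model's own claim that the answer is supported

def ask(question: str, sources: list[str]) -> GroundedAnswer:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # placeholder; any Structured Outputs model
        messages=[
            {"role": "system",
             "content": "Answer only from the numbered sources and cite them."},
            {"role": "user",
             "content": "\n".join(f"[{i}] {s}" for i, s in enumerate(sources))
                        + f"\n\nQuestion: {question}"},
        ],
        response_format=GroundedAnswer,
    )
    return completion.choices[0].message.parsed
```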