That’s just the probabilistic nature of LLMs. Even with evals, there is no guarantee that it will work as expected in production.
That’s why we gotta add monitoring, human in the loop, and other strategies.
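For example, a human-in-the-loop gate can be as simple as routing low-confidence outputs to manual review instead of shipping them. A minimal sketch (all names and the threshold here are illustrative, not from any particular library):

```python
# Rough sketch of a human-in-the-loop gate; names and threshold are illustrative.
from dataclasses import dataclass

@dataclass
class Judgement:
    output: str
    score: float        # e.g. from an automated check or LLM-as-judge
    needs_review: bool

def gate(output: str, score: float, threshold: float = 0.8) -> Judgement:
    """Pass high-confidence outputs through; flag the rest for a human."""
    return Judgement(output=output, score=score, needs_review=score < threshold)

def handle(output: str, score: float) -> str:
    j = gate(output, score)
    if j.needs_review:
        # In production this would enqueue the item for manual review
        # and emit a metric for monitoring dashboards.
        return "queued_for_human_review"
    return j.output
```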
Comments
Building with LLMs is a lot of fun, but it can also be quite frustrating when you think you’ve fixed a bug or bad output path, only to find you just got lucky 10 times in a row and it’s actually still broken. DX can be rough.
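One way to guard against that “got lucky 10 times in a row” trap is to rerun the same eval case many times and look at the pass rate rather than a single green run. A minimal sketch, where `flaky_generate` is just a toy stand-in for whatever LLM call you’re testing:

```python
import random
from typing import Callable

def pass_rate(generate: Callable[[], str],
              check: Callable[[str], bool],
              runs: int = 50) -> float:
    """Rerun one eval case many times; a flaky 'fix' shows up as a
    pass rate below 1.0 instead of a lucky streak of green runs."""
    passes = sum(check(generate()) for _ in range(runs))
    return passes / runs

# Toy stand-in: a "model" that still fails ~10% of the time.
def flaky_generate() -> str:
    return "good output" if random.random() > 0.1 else "bad output"

def check(output: str) -> bool:
    return output == "good output"

if __name__ == "__main__":
    rate = pass_rate(flaky_generate, check)
    print(f"pass rate over 50 runs: {rate:.2f}")  # likely < 1.0, i.e. not actually fixed
```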