That’s just the probabilistic nature of LLMs. Even with evals, there’s no guarantee the model will behave as expected in production.

That’s why we need to add monitoring, human-in-the-loop review, and other safeguards.
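
To make the monitoring and human-in-the-loop idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ReviewQueue`, `guarded_call`, and the `model` and `validate` callables are hypothetical names, not part of any real library. The idea is just to log every call and hold back any output that fails a check so a person can look at it.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

logger = logging.getLogger("llm_monitor")


@dataclass
class ReviewQueue:
    """Hypothetical in-memory queue of outputs waiting for human review."""
    items: List[dict] = field(default_factory=list)

    def submit(self, prompt: str, output: str, reason: str) -> None:
        self.items.append({"prompt": prompt, "output": output, "reason": reason})


def guarded_call(
    prompt: str,
    model: Callable[[str], str],                   # assumed: any function mapping prompt -> text
    validate: Callable[[str], Tuple[bool, str]],   # assumed: returns (passed, reason)
    queue: ReviewQueue,
) -> Optional[str]:
    """Call the model, log the outcome, and escalate failed outputs to a human."""
    output = model(prompt)
    passed, reason = validate(output)

    # Monitoring: record basic facts about every call, pass or fail.
    logger.info(
        "prompt_len=%d output_len=%d passed=%s", len(prompt), len(output), passed
    )

    if not passed:
        # Human in the loop: hold the output back instead of returning it.
        queue.submit(prompt, output, reason)
        return None
    return output
```

In a real system the queue would likely be a database or ticketing tool and the validator would be something richer than a single check, but the shape stays the same: every call is observed, and anything suspicious goes to a person before it reaches the user.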
