every LLM task needs a curated, iteratively developed quantitative eval.

everyone who deploys a single prompt is a data scientist now, whether they realize it or not.

there is a lot of money to be made in helping orgs internalize this.
Reposted from Eugene Yan
Repeat after me:

I will build evals for my tasks.
I will build evals for my tasks.
I will build evals for my tasks.
