I have an AI-generated proof of a statistics lemma that is beyond my measure-theory pay grade. Question: how many samples from LLMs do I need in practice to be 95% confident the proof is correct? And does using different LLMs help ensure that the draws are uncorrelated?