RM@n (use the reward model to select one of the n answers; can be useful if the reward model is open or if there's a way to use this specific kind of sampling), BoN (best-of-n; assumes you have some kind of oracle that can check whether any of the n answers is correct, e.g. an online judge or a theorem prover)
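To make the difference concrete, here is a minimal Python sketch of the two metrics on a single problem. The function names (`rm_at_n`, `best_of_n`), the toy reward scores, and the toy oracle are all hypothetical stand-ins for illustration, not any particular library's API.

```python
from typing import Callable, Sequence


def rm_at_n(answers: Sequence[str], reward_fn: Callable[[str], float]) -> str:
    """RM@n: return the single answer the reward model scores highest."""
    return max(answers, key=reward_fn)


def best_of_n(answers: Sequence[str], is_correct: Callable[[str], bool]) -> bool:
    """BoN: the problem counts as solved if the oracle accepts any answer."""
    return any(is_correct(a) for a in answers)


# Toy data: n = 3 sampled answers to "What is 2 + 2?"
samples = ["5", "4", "22"]

# Hypothetical reward scores standing in for a real reward model.
toy_scores = {"5": 0.7, "4": 0.6, "22": 0.1}


def toy_reward(answer: str) -> float:
    return toy_scores[answer]


def toy_oracle(answer: str) -> bool:
    # Stand-in for an online judge or theorem prover.
    return answer == "4"


picked = rm_at_n(samples, toy_reward)        # the reward model prefers "5"
rm_solved = toy_oracle(picked)               # RM@n is wrong on this problem
bon_solved = best_of_n(samples, toy_oracle)  # BoN still counts it as solved
```

In this toy case the reward model picks a wrong answer while the oracle finds a correct one among the samples. Whenever the answer picked for RM@n is correct, BoN is satisfied too, so BoN with the same samples acts as an upper bound on RM@n.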
Comments