joao.omg.lol • 7 days ago

RM@n (use the reward model to select one of the n answers; can be useful if the reward model is open or if there's a way to use this specific kind of sampling), BoN (best-of-n; assumes you have some kind of oracle that can check whether any of the n answers is correct, e.g. an online judge or a theorem prover)
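
A rough sketch of the difference between the two, in case it helps. This is only an illustration under assumptions: the names `rm_at_n`, `best_of_n`, `toy_reward`, and `toy_oracle` are hypothetical, and the reward model and oracle are stand-in functions rather than any real model or judge.

```python
from typing import Callable, Sequence


def rm_at_n(answers: Sequence[str], reward_model: Callable[[str], float]) -> str:
    """RM@n: return the single answer that the reward model scores highest."""
    if not answers:
        raise ValueError("need at least one answer")
    return max(answers, key=reward_model)


def best_of_n(answers: Sequence[str], oracle: Callable[[str], bool]) -> bool:
    """BoN: succeed if the oracle accepts at least one of the n answers."""
    return any(oracle(a) for a in answers)


# Toy data (assumption): three sampled answers to the same question.
answers = ["41", "42", "forty-two"]


def toy_reward(answer: str) -> float:
    # Hypothetical, deliberately imperfect reward model: prefers longer answers.
    return float(len(answer))


def toy_oracle(answer: str) -> bool:
    # Hypothetical oracle, standing in for an online judge or theorem prover.
    return answer == "42"


picked = rm_at_n(answers, toy_reward)    # the one answer RM@n commits to
solved = best_of_n(answers, toy_oracle)  # whether any sample is correct
```

In this toy setup the oracle finds a correct answer among the samples, while the imperfect reward model commits to a different one, so BoN counts the problem as solved and RM@n does not. RM@n needs no ground truth at selection time, whereas BoN only makes sense when a checker exists.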
