This paper's findings about testing LLMs on NLI align with many of my own thoughts:

1) NLI remains a difficult task for LLMs
2) More few-shot examples help (in my view, by giving LLMs a better sense of class boundaries)
3) Incorrect predictions often result from ambiguous labels