I began by replicating .txt's results using LLaMA-3-8B-Instruct (the model considered in the rebuttal).

I was able to reproduce the results and, after fixing a few minor prompt issues, achieved a slight improvement in most metrics.
