In multi-turn conversations, outputs do eventually diverge. We see this even with Llama 405B, where the FP8 and FP16 versions seem to have only non-semantic differences (given identical benchmark scores), but in long conversations the difference is stark.
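
To give a rough feel for why small differences can add up over a long conversation, here is a toy back-of-the-envelope sketch. It is not a measurement of any real model: the per-token agreement rate of 0.999 is a made-up number, the function name `prob_still_identical` is hypothetical, and it assumes each token agrees independently, which real decoders do not strictly do.

```python
def prob_still_identical(per_token_agreement: float, num_tokens: int) -> float:
    """Chance that two decoders have emitted the exact same tokens so far.

    Assumes (hypothetically) that each token matches independently with
    probability `per_token_agreement`. Once one token differs, the two
    contexts differ, so later tokens are conditioned on different histories.
    """
    if not 0.0 <= per_token_agreement <= 1.0:
        raise ValueError("per_token_agreement must be in [0, 1]")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    return per_token_agreement ** num_tokens


if __name__ == "__main__":
    # 0.999 is an illustrative assumption, not a measured FP8-vs-FP16 figure.
    for n in (100, 1_000, 10_000):
        print(f"{n:>6} tokens: {prob_still_identical(0.999, n):.4f}")
```

Under these assumptions, a single-answer benchmark (a few hundred tokens) would mostly see matching outputs, while a long conversation (thousands of tokens) almost certainly would not, which fits the pattern described above.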
