To be fair, the extra prompt engineering trick may have made the LLM do better at the boolean, but we don't know. I'm pretty sure the research on LLMs that people rely on doesn't test the narrow task of coming up with boolean
