Also, your original problem of gcc adding the copy-to-al step is likely because it doesn't want to clobber rbx, since you have made it a fixed register. I imagine the register selection algorithm for inline assembly blindly excludes all fixed registers when selecting the output regs. (1/2)
Comments
If I try to _read_ from rbx.u8, it doesn't add the superfluous mov to al step.