Two days ago, DeepSeek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in a GPU kernel on NVIDIA Hopper hardware.

Let's reverse-engineer the hack, implement it ourselves, and benchmark the speedup on an H100.
