Two days ago, DeepSeek surprised everyone with an "undefined-behavior" PTX optimization that speeds up particular ML workloads in GPU kernels on NVIDIA's Hopper architecture.
Let's reverse engineer the hack, implement it ourselves, and benchmark the speedup on an H100.
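For reference, here is a minimal sketch of how a custom PTX load can be embedded in CUDA with inline assembly. The specific instruction is an assumption on my part, based on DeepSeek's DeepEP repository, which documents the read-only load ld.global.nc.L1::no_allocate.L2::256B applied to volatile data as its "undefined-behavior" PTX trick; treat this as illustrative, not as the exact benchmarked kernel.

    // Hedged sketch, not DeepSeek's actual kernel. The instruction is assumed
    // from DeepSeek's DeepEP repo, which flags it as undefined behavior.
    // Build (Hopper): nvcc -arch=sm_90 -o ptx_demo ptx_demo.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    // ".nc" takes the non-coherent (read-only) data path, "L1::no_allocate"
    // skips allocating the line in L1, and "L2::256B" hints a 256-byte L2
    // prefetch. Reading memory that another agent may concurrently write
    // through ".nc" is the undefined-behavior part.
    __device__ __forceinline__ int ld_nc_no_allocate(const int *ptr) {
        int v;
        asm volatile("ld.global.nc.L1::no_allocate.L2::256B.b32 %0, [%1];"
                     : "=r"(v) : "l"(ptr) : "memory");
        return v;
    }

    __global__ void fill_ones(int *p, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p[i] = 1;
    }

    // Toy kernel: sum the array through the custom load.
    __global__ void sum_custom(const int *__restrict__ in, int *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicAdd(out, ld_nc_no_allocate(in + i));
    }

    int main() {
        const int n = 1 << 20;
        int *in, *out;
        cudaMalloc(&in, n * sizeof(int));
        cudaMalloc(&out, sizeof(int));
        cudaMemset(out, 0, sizeof(int));
        fill_ones<<<(n + 255) / 256, 256>>>(in, n);
        sum_custom<<<(n + 255) / 256, 256>>>(in, out, n);
        int total = 0;
        cudaMemcpy(&total, out, sizeof(int), cudaMemcpyDeviceToHost);
        printf("sum = %d (expect %d)\n", total, n);
        cudaFree(in);
        cudaFree(out);
        return 0;
    }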
https://www.youtube.com/watch?v=iEda8_Mvvo4
https://github.com/LaurieWired/BenchmarkCustomPTX