This is a down-and-dirty look at building your own high-performance #AI #LLM inference engine, from raw #CUDA kernels on up. The result? Beating top-shelf libraries at their own game. Still probably best to use a supported library in production, though.
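To give a flavor of what "from raw #CUDA kernels on up" means in practice (this is a hypothetical sketch, not code from the article): the bottom layer of such an engine is hand-written kernels for the basic linear-algebra ops, like this warp-per-row matrix-vector multiply, the core operation of single-batch LLM decoding.

```cuda
// Illustrative only -- a minimal "raw" building block, not the article's code.
#include <cuda_runtime.h>

// y = W * x, with W stored row-major as [rows x cols].
// One warp computes one output row; lanes split the dot product.
__global__ void matvec_f32(const float* __restrict__ W,
                           const float* __restrict__ x,
                           float* __restrict__ y,
                           int rows, int cols) {
    int warps_per_block = blockDim.x / 32;
    int row  = blockIdx.x * warps_per_block + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;

    // Each lane accumulates a strided slice of the row dot product.
    float acc = 0.0f;
    for (int c = lane; c < cols; c += 32)
        acc += W[(size_t)row * cols + c] * x[c];

    // Warp shuffle reduction combines the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffff, acc, offset);

    if (lane == 0) y[row] = acc;
}
```

Launched with, say, 128-thread blocks (4 warps each), a primitive like this is what attention and MLP layers get assembled from before any fusion or tuning; beating the big libraries comes from specializing and fusing these pieces for one model and one GPU, which is exactly why a supported library is still the safer choice in production.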
Comments
Only had time to skim it, but the thought process is similar to how I had to approach writing homomorphic implementations of neural network operators: it’s a ground-up rewrite of each operator from first principles.