I was wondering, have you considered the idea of rasterizing in tiles instead of rows? I know GPUs do something similar, and I've been curious for a while whether the same approach would work well on the CPU.
No, I haven't! Afaik mobile GPUs do that, mostly because it somehow reduces power requirements. My guess would be that it's a hardware thing that makes little sense in CPU rasterization, but I might be wrong ofc.
Ah, I see. The tile thing is mostly so accesses to the framebuffer always hit L1. Like, imagine you have an RGBA8 framebuffer, so 4 bytes per pixel. If you split the framebuffer into 4x4 tiles, where each tile's 16 pixels are contiguous in memory, then a tile is exactly 64 bytes and fits entirely in a single (typical) cache line.
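To make that concrete, here's a minimal sketch of addressing into such a 4x4-tiled RGBA8 framebuffer. The struct, names and exact layout (tiles stored row-major, pixels row-major inside each tile, width assumed to be a multiple of 4) are my own assumptions for illustration, not anyone's actual code:

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of a 4x4-tiled RGBA8 framebuffer. Each tile is 4x4 pixels * 4 bytes
// = 64 bytes, i.e. exactly one typical cache line, and the tiles themselves
// are laid out row-major across the image.
struct TiledFramebuffer {
    static constexpr int TILE = 4;
    uint32_t* pixels;   // width * height RGBA8 pixels in tiled order
    int width;          // assumed to be a multiple of TILE (or padded)
    int height;

    // Index of pixel (x, y) in the tiled layout.
    size_t index(int x, int y) const {
        int tileX = x / TILE, tileY = y / TILE;   // which tile
        int inX   = x % TILE, inY   = y % TILE;   // position inside the tile
        int tilesPerRow = width / TILE;
        size_t tileBase = size_t(tileY * tilesPerRow + tileX) * TILE * TILE;
        return tileBase + inY * TILE + inX;
    }

    uint32_t& at(int x, int y) { return pixels[index(x, y)]; }
};
```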
It can also help because you can early-cull those smaller tiles. Of course you can speed up scanline-order processing as well (we did this in the Dreamcast hardware) by finding start and end locations per row, but square tiles are better IMO.
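For illustration, a minimal sketch of what that per-tile trivial reject could look like using edge functions evaluated at the tile corners (my own example, assuming counter-clockwise triangles with "inside" meaning all three edge functions are non-negative; none of these names come from an actual rasterizer):

```cpp
// > 0 if p is to the left of the directed edge a -> b (CCW winding).
struct Vec2 { float x, y; };

static float edge(Vec2 a, Vec2 b, Vec2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// Returns true if the tile [x0, x1] x [y0, y1] lies entirely outside the
// triangle, so the whole tile can be skipped without touching its pixels.
bool tileOutsideTriangle(Vec2 v0, Vec2 v1, Vec2 v2,
                         float x0, float y0, float x1, float y1) {
    Vec2 corners[4] = {{x0, y0}, {x1, y0}, {x0, y1}, {x1, y1}};
    Vec2 edges[3][2] = {{v0, v1}, {v1, v2}, {v2, v0}};
    for (auto& e : edges) {
        bool allOutside = true;
        for (auto& c : corners)
            if (edge(e[0], e[1], c) >= 0.0f) { allOutside = false; break; }
        // All four corners are on the outside of this edge; since the edge
        // function is linear, the whole (convex) tile is outside the triangle.
        if (allOutside) return true;
    }
    return false;
}
```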
Then, if you bin triangles per tile, you can process tiles independently, and reads/writes to the framebuffer should always hit L1, which would make things like depth testing and blending a ton faster.
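A rough sketch of that binning step, again with hypothetical names and a simple bounding-box overlap test (a real binner might use the edge-function reject above to get tighter bins):

```cpp
#include <vector>
#include <algorithm>
#include <cmath>

struct Triangle { float x[3], y[3]; };

// One bin per tile, holding the indices of the triangles whose screen-space
// bounding box overlaps that tile. Afterwards every tile can be rasterized
// independently, with all framebuffer traffic for that tile staying in L1.
std::vector<std::vector<int>> binTriangles(const std::vector<Triangle>& tris,
                                           int width, int height, int tileSize) {
    int tilesX = (width  + tileSize - 1) / tileSize;
    int tilesY = (height + tileSize - 1) / tileSize;
    std::vector<std::vector<int>> bins(size_t(tilesX) * tilesY);

    for (int i = 0; i < (int)tris.size(); ++i) {
        const Triangle& t = tris[i];
        // Screen-space bounding box of the triangle, clamped to the viewport.
        float minX = std::min({t.x[0], t.x[1], t.x[2]});
        float maxX = std::max({t.x[0], t.x[1], t.x[2]});
        float minY = std::min({t.y[0], t.y[1], t.y[2]});
        float maxY = std::max({t.y[0], t.y[1], t.y[2]});
        int tx0 = std::max(0, (int)std::floor(minX) / tileSize);
        int ty0 = std::max(0, (int)std::floor(minY) / tileSize);
        int tx1 = std::min(tilesX - 1, (int)std::floor(maxX) / tileSize);
        int ty1 = std::min(tilesY - 1, (int)std::floor(maxY) / tileSize);

        // Push the triangle into every tile its bounding box touches.
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[size_t(ty) * tilesX + tx].push_back(i);
    }
    return bins;
}
```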
I think the only annoying part then is to convert from this tiled format to a linear one for presentation, but the access pattern for this is pretty regular, so the prefetcher should pick up on it.
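That resolve pass could look something like this (a sketch matching the assumed tiled layout above, with width and height assumed to be multiples of 4): it walks tiles in memory order and copies four 16-byte rows per tile, so both source and destination are accessed with simple strided patterns.

```cpp
#include <cstdint>
#include <cstddef>
#include <cstring>

// Convert the 4x4-tiled layout back to a plain linear framebuffer
// for presentation.
void detile(const uint32_t* tiled, uint32_t* linear, int width, int height) {
    constexpr int TILE = 4;
    int tilesX = width / TILE;
    int tilesY = height / TILE;
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx) {
            const uint32_t* src = tiled + (size_t(ty) * tilesX + tx) * TILE * TILE;
            for (int row = 0; row < TILE; ++row) {
                uint32_t* dst = linear + size_t(ty * TILE + row) * width + tx * TILE;
                std::memcpy(dst, src + row * TILE, TILE * sizeof(uint32_t));
            }
        }
}
```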
Yeah, but that requires so much preprocessing and extra code (all this triangle binning and format conversion) that it's not obvious to me that it would be an overall win!