haroldaptroot.bsky.social
https://mastodon.gamedev.place/@harold | http://haroldbot.nl

New blog post: From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs bitmath.blogspot.com/2025/06/from...

@instlatx64.bsky.social I had some trouble accessing instlatx64.atw.hu, and it turned out that uBO "ad blocks" the whole thing: apparently "users.atw.hu^$all" made it onto a URLhaus list of malicious websites. That's all I know, and sorry if you're being spammed to death with reports about this.

AVX512 implementation of MMIX MOR, works out fairly well

Played around with some _addcarryx_u64, being careful to have at most 2 live carries. Out of the Big Three, the only compiler that delivered a decent result was Clang. The others messed up in such a way that they needed to reify the carry with setcc.

Would be cool if the Arm ISA docs also mentioned the corresponding* intrinsics in the lemmas of applicable instructions. It's fairly predictable from the instruction name, but not completely.

A bunch of integer operations that used to have a latency of 3 cycles (popcnt, lzcnt, bsr) take 4 cycles on Lion Cove (they could all take 1 cycle, but whatever). Not a huge deal, but maybe some code wants to be reshuffled a bit. The latency of SIMD integer multiplication went down, nice.

Every "trailing bit manipulation" operation (those from BMI1 and TBM, plus extra ones, even more than in this table: programming.sirrida.de/programming.... ) could be unified into an instruction that performs ternlog(x, -x, x + 1, imm8)
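To make the unification concrete, here is a minimal scalar sketch in C. It assumes the vpternlog convention (the first operand supplies the most significant bit of the table index), and it relies on the identity -x == ~(x - 1), so anything expressed with x - 1 can be expressed with -x. The names `ternlog64`, `trailing_op`, and the `IMM_*` constants are hypothetical and exist only for this illustration; no such instruction exists.

```c
#include <stdint.h>

/* Scalar model of a bitwise 3-input lookup: for every bit position, the
   bits of a, b, c form an index (a is the high bit) into the table imm8. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((imm8 >> i) & 1) {
            uint64_t ta = (i & 4) ? a : ~a;
            uint64_t tb = (i & 2) ? b : ~b;
            uint64_t tc = (i & 1) ? c : ~c;
            r |= ta & tb & tc;   /* OR in the selected minterm */
        }
    }
    return r;
}

/* Hypothetical unified instruction: ternlog(x, -x, x + 1, imm8). */
static uint64_t trailing_op(uint64_t x, uint8_t imm8) {
    return ternlog64(x, 0 - x, x + 1, imm8);
}

/* Tables derived with A = 0xF0 (x), B = 0xCC (-x), C = 0xAA (x + 1).
   Since x - 1 == ~(-x), "x - 1" corresponds to ~B. */
enum {
    IMM_BLSI    = 0xC0, /*  x &  -x       =  A &  B */
    IMM_BLSR    = 0x30, /*  x & (x - 1)   =  A & ~B */
    IMM_BLSMSK  = 0xC3, /*  x ^ (x - 1)   = ~(A ^ B) */
    IMM_TZMSK   = 0x03, /* ~x & (x - 1)   = ~A & ~B */
    IMM_BLSFILL = 0xF3, /*  x | (x - 1)   =  A | ~B */
    IMM_BLCFILL = 0xA0, /*  x & (x + 1)   =  A &  C */
    IMM_BLCS    = 0xFA, /*  x | (x + 1)   =  A |  C */
    IMM_BLCMSK  = 0x5A, /*  x ^ (x + 1)   =  A ^  C */
    IMM_BLCIC   = 0x0A  /* ~x & (x + 1)   = ~A &  C */
};
```

For example, trailing_op(40, IMM_BLSI) isolates the lowest set bit of 0b101000 and trailing_op(40, IMM_BLSR) clears it.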

Apparently shlx is a "medium latency" (3 cycles) instruction on Alder Lake. My disappointment is immeasurable, and my day is ruined.

"the main issue of std::unordered_map (not only libstdc++ implementation, but all standard compatible ones) is cache unfriendliness" OK. But there's also a 64-bit division in libstdc++ specifically. Should it really work that way?

Let's say you have a 4-input ternlog that takes the LUT as a variable, let's call it ternlogvb. ternlogvb(0xF0, 0xCC, 0xAA, x) just maps x (a byte) to itself. Those masks probably look familiar: 0..7 but transposed. If you permute that transposed list, you permute the bits of x.
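A scalar sketch of that idea in C, assuming the same index convention as vpternlog (first operand is the high bit of the index). `ternlogvb` is the hypothetical operation from the post, modeled here on single bytes; it is not a real instruction or intrinsic.

```c
#include <stdint.h>

/* Hypothetical "ternlog with a variable LUT", modeled on one byte.
   For each bit position i, the bits of a, b, c at position i form a
   3-bit index, and bit i of the result is lut[index]. */
static uint8_t ternlogvb(uint8_t a, uint8_t b, uint8_t c, uint8_t lut) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint8_t)(((lut >> idx) & 1u) << i);
    }
    return r;
}
```

With the masks 0xF0, 0xCC, 0xAA the index at position i is i itself, so the result equals x. Swapping the first two masks, ternlogvb(0xCC, 0xF0, 0xAA, x), swaps the two high bits of every index, which moves bit 2 of x to bit 4 and bit 4 to bit 2 (and likewise exchanges bits 3 and 5).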

Have the "wow, Rust PNG decoder is faster than libpng" people ever looked at libpng? Look at filter_sse2_intrinsics.c in particular. You read that right, it uses SSE2 to process one pixel at a time. Probably written back when unaligned loads were a no-no, and before SSSE3.

New blog post: Bit-permuting 16 u32s at once with AVX-512 bitmath.blogspot.com/2024/12/bit-...

Looks like C++26 is going to have saturating arithmetic. std::add_sat and sub_sat autovectorize well for narrow integer types, as you might hope. Saturating manually with `min` or `if` still sucks. godbolt.org/z/ffq8nb83h
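For reference, this is roughly what "saturating manually with min" looks like for unsigned bytes, written as a plain C sketch. The function names are hypothetical, it is not the code behind the godbolt link, and it makes no claim about what any particular compiler generates for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Manual saturating add for u8: widen, add, then clamp with a "min". */
static uint8_t add_sat_u8_manual(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + (unsigned)b;
    return (uint8_t)(sum < 255u ? sum : 255u);
}

/* Element-wise loop of the kind a compiler is expected to autovectorize. */
static void add_sat_u8_array(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = add_sat_u8_manual(a[i], b[i]);
}
```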

Just had a look at "binius multiplication", the 8-bit case can be done with vgf2p8affineqb, vpandq, vpopcntq, then extract the bits. Easier than expected. Maybe 64-bit next, but it'll be more annoying: 8-bit happened to be a bilinear form with a nice matrix in the middle.

New blog post: Histogramming bytes with positional popcount (GF2P8AFFINEQB edition) bitmath.blogspot.com/2024/11/hist...

I don't know if I want to actively juggle 3 social media sites at the same time, but I made an account here at least.