haroldaptroot.bsky.social

https://mastodon.gamedev.place/@harold | http://haroldbot.nl

44 posts 91 followers 83 following

New blog post: From Boolean logic to bitmath and SIMD: transitive closure of tiny graphs bitmath.blogspot.com/2025/06/from...

submitted 11 days ago • 0 comments
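
For readers who have not seen the bitmath angle on this, here is a minimal scalar sketch of transitive closure for a graph with at most 8 nodes, stored as an 8x8 bit matrix in one 64-bit word. It is plain Warshall with each inner loop replaced by a mask operation, not necessarily the method from the blog post. The layout (byte i is row i, bit j means an edge i -> j) and the name transitive_closure8 are assumptions made for this example.

```cpp
#include <cstdint>

// Assumed layout: byte i of m is row i of the adjacency matrix,
// and bit j of that byte is set when there is an edge i -> j.
constexpr std::uint64_t transitive_closure8(std::uint64_t m) {
    constexpr std::uint64_t kLsbOfEachByte = 0x0101010101010101ULL;
    for (int k = 0; k < 8; ++k) {
        // Per-row byte mask: 0xFF for every row that can reach node k, else 0x00.
        // Each byte is 0 or 1 before the multiply, so no carries cross bytes.
        std::uint64_t reaches_k = ((m >> k) & kLsbOfEachByte) * 0xFF;
        // Row k copied into all eight bytes.
        std::uint64_t row_k = ((m >> (8 * k)) & 0xFF) * kLsbOfEachByte;
        // Every row that reaches k also reaches whatever k reaches.
        m |= reaches_k & row_k;
    }
    return m;
}
```

With the chain 0 -> 1 -> 2 (row 0 = 0x02, row 1 = 0x04), the closure adds the edge 0 -> 2, so row 0 becomes 0x06.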

@instlatx64.bsky.social I had some trouble accessing instlatx64.atw.hu, and it turned out that uBO "ad blocks" the whole thing: apparently "users.atw.hu^$all" made it onto a URLhaus list of malicious websites. That's all I know, and sorry if you're being spammed to death with reports about this.

submitted 49 days ago • 2 comments

AVX512 implementation of MMIX MOR, works out fairly well

submitted 61 days ago • 0 comments
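
As a reference point for the MOR post above, here is a scalar sketch of the operation (not the AVX512 version the post refers to). It assumes the definition x_ij = OR over k of (z_ik AND y_kj), with each 64-bit word read as an 8x8 bit matrix whose rows are bytes. The function name mor and the argument order are choices made for this example.

```cpp
#include <cstdint>

// Boolean 8x8 matrix product, assuming x_ij = OR_k (z_ik & y_kj).
// Rows are bytes. Numbering rows and columns from the least significant end
// gives the same result as numbering both from the most significant end.
constexpr std::uint64_t mor(std::uint64_t y, std::uint64_t z) {
    std::uint64_t x = 0;
    for (int i = 0; i < 8; ++i) {
        std::uint64_t z_row = (z >> (8 * i)) & 0xFF;
        std::uint64_t x_row = 0;
        for (int k = 0; k < 8; ++k) {
            // If bit k of row i of z is set, OR in row k of y.
            if ((z_row >> k) & 1) {
                x_row |= (y >> (8 * k)) & 0xFF;
            }
        }
        x |= x_row << (8 * i);
    }
    return x;
}
```

Under that assumed definition, z = 0x8040201008040201 is the identity matrix and returns y unchanged, while z = 0x0102040810204080 selects the rows of y in reverse order, which is a byte swap.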

Played around with some _addcarryx_u64, being careful to have at most 2 live carries. Out of the Big Three, the only compiler that delivered a decent result was Clang. The others messed up in such a way that they needed to reify the carry with setcc.

submitted 128 days ago • 1 comment
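
To make "two live carries" concrete, here is a portable sketch of the pattern from the _addcarryx_u64 post above: a three-operand 256-bit addition where two independent carry chains are in flight in the same loop. It uses a hand-written add-with-carry helper as a stand-in for the intrinsic, so it says nothing about what any compiler generates. The names add_carry, U256, and add3 are made up for this example.

```cpp
#include <cstdint>

struct AddResult { std::uint64_t sum; unsigned carry; };

// Portable stand-in for an add-with-carry intrinsic (hypothetical helper).
constexpr AddResult add_carry(std::uint64_t a, std::uint64_t b, unsigned carry_in) {
    std::uint64_t s = a + b;
    unsigned c1 = s < a;            // carry out of a + b
    std::uint64_t t = s + carry_in;
    unsigned c2 = t < s;            // carry out of adding carry_in
    return {t, c1 | c2};            // at most one of c1, c2 can be set
}

struct U256 { std::uint64_t limb[4]; };          // little-endian limbs
struct Sum3 { std::uint64_t limb[4]; unsigned overflow; };

// a + b + c, with one carry chain for (a + b) and a second one for (+ c).
constexpr Sum3 add3(U256 a, U256 b, U256 c) {
    Sum3 r{};
    unsigned carry1 = 0, carry2 = 0;             // the two live carries
    for (int i = 0; i < 4; ++i) {
        AddResult ab = add_carry(a.limb[i], b.limb[i], carry1);   // chain 1
        AddResult abc = add_carry(ab.sum, c.limb[i], carry2);     // chain 2
        carry1 = ab.carry;
        carry2 = abc.carry;
        r.limb[i] = abc.sum;
    }
    r.overflow = carry1 + carry2;                // 0, 1 or 2
    return r;
}
```

The two chains never feed each other's carry input. That independence is the property that would let one chain map to adcx and the other to adox.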

Would be cool if the Arm ISA docs also mentioned the corresponding* intrinsics in the lemmas of applicable instructions. It's fairly predictable from the instruction name, but not completely.

submitted 143 days ago • 0 comments

A bunch of integer operations that used to have a latency of 3 cycles (popcnt, lzcnt, bsr) take 4 cycles on Lion Cove (could all take 1 cycle but whatever). Not a huge deal, but maybe some code wants to be reshuffled a bit. The latency of SIMD integer multiplication went down, nice.

submitted 146 days ago • 1 comment

Every "trailing bit manipulation" operation (from BMI1 and TBM, plus extra ones beyond even those in this table: programming.sirrida.de/programming.... ) could be unified into an instruction that performs ternlog(x, -x, x + 1, imm8)

submitted 164 days ago • 0 comments
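
A quick scalar sketch of the ternlog(x, -x, x + 1, imm8) idea above. It assumes vpternlog-style indexing of the truth table (the first operand supplies the high index bit, so the operand masks are 0xF0, 0xCC, 0xAA). The helper names ternlog and trailing_bit_op are made up for this example. The trick that makes it work for the x - 1 family is that ~(-x) == x - 1.

```cpp
#include <cstdint>

// Bitwise 3-input truth-table lookup, indexed like vpternlog:
// for each bit position, index = (a << 2) | (b << 1) | c, result = bit `index` of imm.
constexpr std::uint64_t ternlog(std::uint64_t a, std::uint64_t b, std::uint64_t c,
                                std::uint8_t imm) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        if ((imm >> i) & 1) {
            std::uint64_t ma = (i & 4) ? a : ~a;
            std::uint64_t mb = (i & 2) ? b : ~b;
            std::uint64_t mc = (i & 1) ? c : ~c;
            r |= ma & mb & mc;      // minterm i
        }
    }
    return r;
}

// Hypothetical unified "trailing bit manipulation" operation.
constexpr std::uint64_t trailing_bit_op(std::uint64_t x, std::uint8_t imm) {
    return ternlog(x, 0 - x, x + 1, imm);
}

// A few immediates, derived from A = 0xF0 (x), B = 0xCC (-x), C = 0xAA (x + 1):
constexpr std::uint8_t kBlsi    = 0xC0;  // x & -x          = A & B
constexpr std::uint8_t kBlsr    = 0x30;  // x & (x - 1)     = A & ~B
constexpr std::uint8_t kBlsmsk  = 0xC3;  // x ^ (x - 1)     = ~(A ^ B)
constexpr std::uint8_t kBlcs    = 0xFA;  // x | (x + 1)     = A | C
constexpr std::uint8_t kBlcfill = 0xA0;  // x & (x + 1)     = A & C
```

For x = 0x58, trailing_bit_op(x, kBlsi) isolates the lowest set bit (0x08) and trailing_bit_op(x, kBlsr) clears it (0x50).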

Apparently shlx is a "medium latency" (3 cycles) instruction on Alder Lake. My disappointment is immeasurable, and my day is ruined.

submitted 171 days ago • 1 comment

"the main issue of std::unordered_map (not only libstdc++ implementation, but all standard compatible ones) is cache unfriendliness" OK. But there's also a 64-bit division in libstdc++ specifically. Should it really work that way?

submitted 178 days ago • 0 comments

Let's say you have a 4-input ternlog that takes the LUT as a variable, let's call it ternlogvb. ternlogvb(0xF0, 0xCC, 0xAA, x) just maps x (a byte) to itself. Those masks probably look familiar: 0..7 but transposed. If you permute that transposed list, you permute the bits of x.

submitted 180 days ago • 0 comments
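
A small scalar model of the ternlogvb idea above, in case it helps to check the claim. ternlogvb is the hypothetical operation from the post. This sketch assumes all four operands are bytes and that the index is formed as in vpternlog, with the first operand as the high bit.

```cpp
#include <cstdint>

// For each bit position i, look up bit ((a_i << 2) | (b_i << 1) | c_i) of lut.
constexpr std::uint8_t ternlogvb(std::uint8_t a, std::uint8_t b, std::uint8_t c,
                                 std::uint8_t lut) {
    unsigned r = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned index = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
        r |= ((lut >> index) & 1u) << i;
    }
    return static_cast<std::uint8_t>(r);
}
```

With (0xF0, 0xCC, 0xAA), position i reads index i, so x maps to itself. Listing the indices in reverse order (7, 6, ..., 0) transposes to the masks (0x0F, 0x33, 0x55), and ternlogvb(0x0F, 0x33, 0x55, x) then reverses the bits of x.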

Have the "wow, Rust PNG decoder is faster than libpng" people ever looked at libpng? Look at filter_sse2_intrinsics.c in particular. You read that right, it uses SSE2 to process one pixel at a time. Probably written back when unaligned loads were a no-no, and before SSSE3.

submitted 188 days ago • 1 comment

New blog post: Bit-permuting 16 u32s at once with AVX-512 bitmath.blogspot.com/2024/12/bit-...

submitted 190 days ago • 0 comments

Looks like C++26 is going to have saturating arithmetic. std::add_sat and sub_sat autovectorize well for narrow integer types, as you might hope. Saturating manually with `min` or `if` still sucks. godbolt.org/z/ffq8nb83h

submitted 193 days ago • 2 comments
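
For context on the saturating-arithmetic post above, this is roughly what "saturating manually" looks like for unsigned bytes, once with `min` and once with `if`. It is a sketch of the hand-written forms only, not the code behind the godbolt link, and the function names are made up for this example. A C++26 standard library would let the loop body call std::add_sat instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Widen to int, add, clamp with min.
constexpr std::uint8_t add_sat_u8_min(std::uint8_t a, std::uint8_t b) {
    return static_cast<std::uint8_t>(std::min(a + b, 255));
}

// Wrapping add, then detect overflow with a branch.
constexpr std::uint8_t add_sat_u8_if(std::uint8_t a, std::uint8_t b) {
    std::uint8_t sum = static_cast<std::uint8_t>(a + b);
    if (sum < a) {
        return 255;   // wrapped around, so saturate
    }
    return sum;
}

// The kind of loop one would hope a compiler autovectorizes.
inline void add_sat_array(const std::uint8_t* a, const std::uint8_t* b,
                          std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = add_sat_u8_min(a[i], b[i]);
    }
}
```

Both scalar helpers return the same values. Per the post, the difference is in how well compilers turn loops over them into SIMD saturating adds.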

Just had a look at "binius multiplication", the 8-bit case can be done with vgf2p8affineqb, vpandq, vpopcntq, then extract the bits. Easier than expected. Maybe 64-bit next, but it'll be more annoying: 8-bit happened to be a bilinear form with a nice matrix in the middle.

submitted 210 days ago • 0 comments

New blog post: Histogramming bytes with positional popcount (GF2P8AFFINEQB edition) bitmath.blogspot.com/2024/11/hist...

submitted 220 days ago • 1 comment

I don't know if I want to actively juggle 3 social media sites at the same time, but I made an account here at least.

submitted 221 days ago • 1 comment