This is awesome. Part of the reason we use FFI is that we target different architectures, so we don't need to recompile a C extension shim for each one. Also, the binding is delayed, so on unsupported arches we can still load the FFI code and just not call it.
Great! We need FFI to get the benefit of all that C without rewriting it up front, but the more work done in Ruby, the healthier! And personally I'm thrilled your proof-of-concept targets ARM64 and the FFI gem rather than something hand-rolled, because I'm on a Mac and it's so uncommon that we're not a belated afterthought. 😹
BTW I wonder why using libffi is much slower than the tiny JIT's code.
Looking at https://www.chiark.greenend.org.uk/doc/libffi-dev/html/Simple-Example.html, libffi seems to have ffi_prep_cif(), which takes the signature and so should be able to generate an efficient downcall stub to be used by ffi_call(). The FFI gem uses those.
My assumptions were that 1) we might be pushing more native frames with libffi, 2) type unboxing isn't JIT'd, and 3) there might be a loop the JIT'd code can unroll (the loop that unboxes params). I haven't investigated deeply though.
Good points. Roland Schatz, the author of Truffle NFI, gave me another hint: the FFI gem needs to convert the Ruby arguments to the void** arguments ffi_call() takes, and libffi then internally converts those to the actual ABI to call the function, so there's basically an extra buffer and extra copies in between.
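To make that double conversion concrete, here is roughly what the libffi Simple Example from the link above looks like for strlen(). This is a sketch adapted from the libffi documentation, not the FFI gem's actual code; the ffi_type_ulong return type assumes LP64 where size_t is unsigned long. The caller prepares the call interface (CIF) once per signature, but every call still has to box its arguments behind a void* array, which ffi_call() then unpacks again into registers/stack per the ABI:

```c
/* Sketch based on libffi's Simple Example; compile with: cc strlen_ffi.c -lffi */
#include <ffi.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    ffi_cif cif;
    ffi_type *arg_types[1] = { &ffi_type_pointer };

    /* Prepared once per signature: size_t strlen(const char *) */
    if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                     &ffi_type_ulong /* size_t on LP64 */, arg_types) != FFI_OK)
        return 1;

    const char *s = "foo";
    void *arg_values[1] = { &s };   /* every argument is boxed behind a void* */
    ffi_arg result;                 /* libffi writes the return value here */

    /* ffi_call() unpacks arg_values again according to the platform ABI */
    ffi_call(&cif, FFI_FN(strlen), &result, arg_values);

    printf("strlen(\"%s\") = %lu\n", s, (unsigned long)result);
    return 0;
}
```

A JIT'd stub for a single known signature can skip the void* buffer entirely and move the already-unboxed values straight into the argument registers.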
I profiled it with samply (thanks @byroot.bsky.social for the tip): https://share.firefox.dev/4hzOQhb
14% is spent in rb_string_value_cstr() because of [:string] in `attach_function :strlen, [:string], :int`, which does an extra check to ensure the string contains no \0 byte.
So it's essentially an extra strlen() on top of the actual strlen() call.
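Roughly what that [:string] conversion costs: rb_string_value_cstr() has to hand the callee a NUL-terminated C string and raise if the Ruby string contains an embedded \0, so it scans the whole string. A simplified sketch of that check (illustrative only, not CRuby's exact code):

```c
#include <ruby.h>
#include <string.h>

/* Simplified sketch of what converting a [:string] argument implies:
 * the whole string is scanned for an embedded NUL byte, which is
 * roughly one extra strlen()-like pass per call. */
static const char *string_arg_cstr(VALUE str)
{
    long len = RSTRING_LEN(str);
    const char *ptr = RSTRING_PTR(str);

    if (memchr(ptr, '\0', len) != NULL)
        rb_raise(rb_eArgError, "string contains null byte");
    return ptr;  /* CRuby keeps a terminating NUL after the len bytes */
}
```

With [:pointer] the string's bytes are presumably passed directly, skipping that scan, which is what the change below measures.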
Changing to [:pointer] to avoid that check speeds it up from 5.5Mi/s to 6.1Mi/s and gives https://share.firefox.dev/4hRLkhC.
7.8% in rbffi_SetupCallParams() vs 18% before.
3.1% rbffi_NativeValue_ToRuby().
30% in fun_1e66a(), which is one of the 5 (!!!) stubs generated by libffi. Only 5.8% in _strlen_avx2().
3.5% in rbffi_frame_push() (not sure why FFI needs frames) and 3.9% in rbffi_save_errno() (it needs to save errno).
Both use pthread_getspecific(), which is likely more expensive than a __thread variable (but not all compilers support those).
They could likely avoid doing multiple thread-local lookups per call and do just one.
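For example (a sketch only, not the FFI gem's actual code; the struct and function names here are made up), the per-call bookkeeping could live in a single __thread struct so the frame tracking and the errno save share one thread-local access instead of going through pthread_getspecific() twice, with the caveat from above that __thread/_Thread_local isn't available everywhere:

```c
#include <errno.h>

/* Hypothetical per-thread state for the FFI call path: one __thread struct
 * lets the compiler resolve the TLS address once, instead of two separate
 * pthread_getspecific() calls on every native call. */
struct ffi_thread_state {
    void *frame;        /* whatever rbffi_frame_push() needs to track */
    int   saved_errno;  /* errno captured right after the native call */
};

static __thread struct ffi_thread_state ffi_state;

static inline void save_errno_after_call(void)
{
    ffi_state.saved_errno = errno;   /* single thread-local access */
}
```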
Ya, my PoC uses `rb_string_value_cstr` too. It doesn't save errno though. But also the baseline test basically does nothing (I think String#bytesize just returns the length field that's cached on the object)
Ah I totally missed the PoC uses that too, that's some proper attention to detail!
BTW, small detail, but strlen returns size_t, which is unsigned long on LP64. And indeed bytesize is just one read vs strlen() scanning 4 bytes for "foo", but close enough: most of the time will be spent outside those.
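To spell out that baseline comparison (simplified, illustrative helpers, not CRuby's exact implementation): String#bytesize reduces to reading the length field already stored on the object, while the benchmarked path calls strlen(), which scans the bytes until the NUL.

```c
#include <ruby.h>
#include <string.h>

/* String#bytesize: essentially one field read from the object. */
static long bytesize_ish(VALUE str)
{
    return RSTRING_LEN(str);
}

/* The FFI'd call: strlen() scans until the NUL, i.e. 4 bytes for "foo". */
static size_t strlen_ish(VALUE str)
{
    return strlen(RSTRING_PTR(str));
}
```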
The cext difference might come down to the compiler. It could be that gcc is generating better machine code for pushing and popping C frames than clang (it wouldn't be the first time I've seen an issue like that in microbenchmarks). Unsure about FFI 😵💫
The TruffleRuby 24.1 release will use the Java Foreign Function and Memory (FFM) API, which generates stubs per native signature for both downcalls and upcalls.
That's much faster than JNI and libffi, especially for upcalls.
As https://github.com/oracle/truffleruby/pull/3714 shows, TruffleRuby is using FFM for upcalls from C extensions (because Ruby C API functions need to call back to Ruby or Java on TruffleRuby, so each is an upcall), and that change in turn made C extensions like sqlite3, trilogy and json 2 to 3 times faster!
Graal & TruffleRuby have been exploring making native calls faster for a while. IIRC Graal started optimizing native calls with this 2013 paper: https://dl.acm.org/doi/10.1145/2500828.2500832
And Native Image added SystemJava, a way to call native functions from Java efficiently.
Do you think, once complete, this could be integrated into the FFI gem so it would automatically be faster?
Thank you! Yes, I don't see why this couldn't be integrated with the FFI gem. Getting this approach working portably everywhere would be too hard: it's not going to work on JRuby / TruffleRuby, and I also don't want to support esoteric architectures, so falling back to libffi would be ideal.
I'd probably start off with a gem that has the same API as FFI, iterate until it works well, then move the code into FFI proper (if I work on this in earnest; it's only a prototype for now).
One concern is how to use the assembler gems (aarch64, fisk) without always depending on them, since RubyGems doesn't really have optional, per-RUBY_ENGINE, or per-platform dependencies.
Those gems implement all instructions, and I doubt we'd need them all for this case. It might be good enough to add encodings only for the instructions we actually need (and then depend on nothing).
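To illustrate how small "encode only what we need" can be, here's a sketch for two AArch64 instructions a downcall stub would use, with encodings taken from the Arm architecture manual. The function names are made up for illustration, and a real gem would of course do this bit of arithmetic in Ruby rather than C:

```c
#include <stdint.h>

/* Hand-encoding the handful of AArch64 instructions a stub needs is a few
 * lines of bit twiddling per instruction. */

/* BLR Xn — branch with link to the address in register Xn (n = 0..30). */
static uint32_t aarch64_blr(unsigned n)
{
    return 0xD63F0000u | (n << 5);
}

/* RET — return via X30 (the link register). */
static uint32_t aarch64_ret(void)
{
    return 0xD65F03C0u;
}
```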