This is awesome. Part of the reason we use FFI is that we target different architectures, so we don't need to recompile a C extension shim for each one. Also, the binding is delayed, so on unsupported arches we can still load the FFI code and just not call it.
Great! We need FFI to get the benefit of all that C without rewriting it up front, but the more work done in Ruby, the healthier! And personally I'm thrilled your proof-of-concept targets ARM64 and the FFI gem rather than something hand-rolled, because I'm on a Mac and it's so uncommon that we're not a belated afterthought. 😹
BTW I wonder why using libffi is much slower than the tiny JIT's code.
Looking at https://www.chiark.greenend.org.uk/doc/libffi-dev/html/Simple-Example.html, libffi seems to have ffi_prep_cif(), which takes the signature and so should be able to generate an efficient downcall stub to be used by ffi_call(). The FFI gem uses those.
My assumptions were that 1) we might be pushing more native frames with libffi, 2) type unboxing isn't JIT'd, and 3) there might be a loop the JIT'd code can unroll (the loop that unboxes params). I haven't investigated deeply though.
Good points. Roland Schatz, the author of Truffle NFI, gave me another hint: the FFI gem needs to convert the Ruby arguments to the void** arguments ffi_call() takes, and libffi then internally converts those to the actual ABI to call the function, so there's basically an extra buffer and extra copies in between.
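To make that double conversion concrete, here is roughly what the libffi Simple Example from the link above looks like for strlen(). This is a sketch adapted from the libffi documentation, not the FFI gem's actual code; the ffi_type_ulong return type assumes LP64 where size_t is unsigned long. The caller prepares the call interface (CIF) once per signature, but every call still has to box its arguments behind a void* array, which ffi_call() then unpacks again into registers/stack per the ABI:

```c
/* Sketch based on libffi's Simple Example; compile with: cc strlen_ffi.c -lffi */
#include <ffi.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    ffi_cif cif;
    ffi_type *arg_types[1] = { &ffi_type_pointer };

    /* Prepared once per signature: size_t strlen(const char *) */
    if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                     &ffi_type_ulong /* size_t on LP64 */, arg_types) != FFI_OK)
        return 1;

    const char *s = "foo";
    void *arg_values[1] = { &s };   /* every argument is boxed behind a void* */
    ffi_arg result;                 /* libffi writes the return value here */

    /* ffi_call() unpacks arg_values again according to the platform ABI */
    ffi_call(&cif, FFI_FN(strlen), &result, arg_values);

    printf("strlen(\"%s\") = %lu\n", s, (unsigned long)result);
    return 0;
}
```

A JIT'd stub for a single known signature can skip the void* buffer entirely and move the already-unboxed values straight into the argument registers.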
I profiled it with samply (thanks @byroot.bsky.social for the tip): https://share.firefox.dev/4hzOQhb
14% is spent in rb_string_value_cstr() because of [:string] in `attach_function :strlen, [:string], :int`, which does an extra check to ensure the string contains no \0 byte.
So it's essentially an extra strlen() on top of the actual strlen() call.
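Roughly what that [:string] conversion costs: rb_string_value_cstr() has to hand the callee a NUL-terminated C string and raise if the Ruby string contains an embedded \0, so it scans the whole string. A simplified sketch of that check (illustrative only, not CRuby's exact code):

```c
#include <ruby.h>
#include <string.h>

/* Simplified sketch of what converting a [:string] argument implies:
 * the whole string is scanned for an embedded NUL byte, which is
 * roughly one extra strlen()-like pass per call. */
static const char *string_arg_cstr(VALUE str)
{
    long len = RSTRING_LEN(str);
    const char *ptr = RSTRING_PTR(str);

    if (memchr(ptr, '\0', len) != NULL)
        rb_raise(rb_eArgError, "string contains null byte");
    return ptr;  /* CRuby keeps a terminating NUL after the len bytes */
}
```

With [:pointer] the string's bytes are presumably passed directly, skipping that scan, which is what the change below measures.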
Changing to [:pointer] to avoid that check speeds it up from 5.5Mi/s to 6.1Mi/s and gives https://share.firefox.dev/4hRLkhC.
7.8% in rbffi_SetupCallParams() vs 18% before.
3.1% rbffi_NativeValue_ToRuby().
30% in fun_1e66a(), which is one of the 5 (!!!) stubs generated by libffi. Only 5.8% in _strlen_avx2().
3.5% in rbffi_frame_push() (not sure why FFI needs frames) and 3.9% in rbffi_save_errno() (it needs to save errno).
Both use pthread_getspecific(), which is likely more expensive than a __thread variable (but not all compilers support those).
They could likely avoid doing multiple thread-local lookups per call and do just one.
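For example (a sketch only, not the FFI gem's actual code; the struct and function names here are made up), the per-call bookkeeping could live in a single __thread struct so the frame tracking and the errno save share one thread-local access instead of going through pthread_getspecific() twice, with the caveat from above that __thread/_Thread_local isn't available everywhere:

```c
#include <errno.h>

/* Hypothetical per-thread state for the FFI call path: one __thread struct
 * lets the compiler resolve the TLS address once, instead of two separate
 * pthread_getspecific() calls on every native call. */
struct ffi_thread_state {
    void *frame;        /* whatever rbffi_frame_push() needs to track */
    int   saved_errno;  /* errno captured right after the native call */
};

static __thread struct ffi_thread_state ffi_state;

static inline void save_errno_after_call(void)
{
    ffi_state.saved_errno = errno;   /* single thread-local access */
}
```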
Ya, my PoC uses `rb_string_value_cstr` too. It doesn't save errno though. But also the baseline test basically does nothing (I think String#bytesize just returns the length field that's cached on the object)
Ah I totally missed the PoC uses that too, that's some proper attention to detail!
BTW, small detail, but strlen returns size_t, which is unsigned long on LP64. And indeed bytesize is just one read vs strlen() scanning 4 bytes for "foo", but close enough: most of the time will be spent outside those.
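To spell out that baseline comparison (simplified, illustrative helpers, not CRuby's exact implementation): String#bytesize reduces to reading the length field already stored on the object, while the benchmarked path calls strlen(), which scans the bytes until the NUL.

```c
#include <ruby.h>
#include <string.h>

/* String#bytesize: essentially one field read from the object. */
static long bytesize_ish(VALUE str)
{
    return RSTRING_LEN(str);
}

/* The FFI'd call: strlen() scans until the NUL, i.e. 4 bytes for "foo". */
static size_t strlen_ish(VALUE str)
{
    return strlen(RSTRING_PTR(str));
}
```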
The cext difference might come down to the compiler. It could be that gcc is generating better machine code for pushing and popping C frames than clang (it wouldn't be the first time I've seen an issue like that in microbenchmarks). Unsure about FFI 😵💫
The TruffleRuby 24.1 release will use the Java Foreign Function and Memory (FFM) API, which generates stubs per native signature for both downcalls and upcalls.
That's much faster than JNI and libffi, especially for upcalls.
As https://github.com/oracle/truffleruby/pull/3714 shows, TruffleRuby is using FFM for upcalls from C extensions (because Ruby C API functions need to call back to Ruby or Java on TruffleRuby, so each is an upcall), and that change in turn made C extensions like sqlite3, trilogy and json 2 to 3 times faster!
Graal & TruffleRuby have been exploring making native calls faster for a while. IIRC Graal started optimizing native calls with this 2013 paper: https://dl.acm.org/doi/10.1145/2500828.2500832
And Native Image added SystemJava, a way to call native functions from Java efficiently.
Do you think, once complete, this could be integrated into the FFI gem so it would automatically be faster?
Thank you! Yes, I don't see why this couldn't be integrated with the FFI gem. Getting this approach working portably everywhere would be too hard: it's not going to work on JRuby / TruffleRuby, and I also don't want to support esoteric architectures, so falling back to libffi would be ideal.
I'd probably start off with a gem that has the same API as FFI, iterate until it works well, then move the code into FFI proper (if I work on this in earnest; it's only a prototype for now).
One concern is how to use the assembler gems (aarch64, fisk) without always depending on them, since RubyGems doesn't really have optional, per-RUBY_ENGINE, or per-platform dependencies.
Those gems implement all instructions, and I doubt we'd need them all for this case. It might be good enough to add encodings only for the instructions we actually need (and then depend on nothing).
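To illustrate how small "encode only what we need" can be, here's a sketch for two AArch64 instructions a downcall stub would use, with encodings taken from the Arm architecture manual. The function names are made up for illustration, and a real gem would of course do this bit of arithmetic in Ruby rather than C:

```c
#include <stdint.h>

/* Hand-encoding the handful of AArch64 instructions a stub needs is a few
 * lines of bit twiddling per instruction. */

/* BLR Xn — branch with link to the address in register Xn (n = 0..30). */
static uint32_t aarch64_blr(unsigned n)
{
    return 0xD63F0000u | (n << 5);
}

/* RET — return via X30 (the link register). */
static uint32_t aarch64_ret(void)
{
    return 0xD65F03C0u;
}
```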