fernbear.bsky.social
Neural network speedrunner and community-funded open source researcher. Set the CIFAR-10 record several times. Send me consulting/contracting work! she/they❤️
37 posts 184 followers 507 following

This is a classic example of _why_ choose-one-of-n datasets need large-scale, crowd-sourced statistics and should use the KL-divergence instead of cross-entropy. Reviewers will be more biased than a crowd; a handful of reviewers is a high variance+bias estimator, and that can harm research.
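A minimal sketch of why the choice matters once labels are distributions rather than one-hot picks. The label distributions below are made up for illustration. With a soft target p, cross-entropy equals H(p) + KL(p‖q), so it bottoms out at the label entropy rather than zero, while KL reads zero exactly when the model matches the crowd.

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in nats; p is the target, q the prediction."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical crowd-sourced label distribution for one 3-class example.
crowd = [0.6, 0.3, 0.1]
# A model that reproduces the crowd distribution exactly.
perfect_model = [0.6, 0.3, 0.1]

ce = cross_entropy(crowd, perfect_model)
kl = kl_divergence(crowd, perfect_model)

# Cross-entropy is stuck at the label entropy even for a perfect model;
# KL reports zero, which makes losses comparable across examples.
print(f"cross-entropy: {ce:.4f}  entropy floor: {entropy(crowd):.4f}  KL: {kl:.4f}")
```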

Did you know that attention across the whole input span was inspired by the time-negating alien language in Arrival? Crazy anecdote from the latest Hard Fork podcast (by @kevinroose.com and @caseynewton.bsky.social). HT nwbrownboi on Threads for the lead.

it's crazy to me that RoPE's issue with BF16 wasn't noticed earlier. For a reasonable N of 2048, these are the computed frequencies prior to cos(x) & sin(x) for fp32 above and bf16 below. Given how short the period of simple trig functions is, this difference is catastrophic for large values.
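A rough way to see the size of the effect without any ML library: emulate bf16 rounding by keeping only the top 16 bits of the fp32 encoding. This is a sketch under assumptions, not the computation from the post: `to_bf16` is a hypothetical helper, and only the highest-frequency channel (inverse frequency 1.0, so the angle equals the position) is examined.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round a non-negative finite float to the nearest bfloat16 value
    (round-to-nearest-even), returned as a Python float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)   # ties go to the even mantissa
    bits = (bits + bias) & 0xFFFF0000    # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

N = 2048
INV_FREQ = 1.0  # assumption: highest-frequency RoPE channel

worst_pos, worst_err = 0, 0.0
for pos in range(N):
    angle_fp32 = pos * INV_FREQ          # exactly representable in fp32
    angle_bf16 = to_bf16(angle_fp32)     # only 8 significant bits survive
    err = abs(angle_bf16 - angle_fp32)
    if err > worst_err:
        worst_pos, worst_err = pos, err

# The angle error is in radians, to be compared against a period of 2*pi.
print(f"worst angle error: {worst_err} rad at position {worst_pos}")
print(f"cos fp32: {math.cos(worst_pos):+.4f}  cos bf16: {math.cos(to_bf16(worst_pos)):+.4f}")
```

bf16 keeps 8 significant bits, so integers above 256 are no longer all representable, and the gap between neighbouring values keeps doubling as positions grow.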

Just added FSDP2 support for MARS and Muon!

Thanks for 100 followers, y'all! Happened so fast and can't wait to put out more research on here! 😊❤️

New NanoGPT training speed record: 3.28 FineWeb val loss in 4.66 minutes
Previous record: 5.03 minutes
Changelog:
- FlexAttention blocksize warmup
- hyperparameter tweaks

NATTEN just added fused support for self-cross attention! so you can attend to local neighbourhood and registers or text condition. it lets you reduce partial attention results (e.g. logsumexp provided by xformers APIs) into its LSE. github.com/SHI-Labs/NAT...
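The reduction mentioned above can be sketched in plain Python for a single query with scalar values. This is not NATTEN's or xformers' API; `partial_attention` and `merge` are hypothetical helpers that only illustrate the standard log-sum-exp identity: two partial softmax results can be combined exactly if each carries its LSE.

```python
import math

def partial_attention(scores, values):
    """Softmax-weighted average of `values`, plus the log-sum-exp of `scores`."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results into the result over both key sets."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    out = out_a * math.exp(lse_a - lse) + out_b * math.exp(lse_b - lse)
    return out, lse

# Made-up scores/values: a local neighbourhood and a few register/text tokens.
local_scores, local_values = [0.5, 1.2, -0.3], [1.0, 2.0, 3.0]
extra_scores, extra_values = [2.0, -1.0], [10.0, 20.0]

merged_out, merged_lse = merge(
    *partial_attention(local_scores, local_values),
    *partial_attention(extra_scores, extra_values),
)
full_out, full_lse = partial_attention(
    local_scores + extra_scores, local_values + extra_values
)
print(merged_out, full_out)  # the two agree up to floating-point error
```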

❤️ my MNIST socks

Radon Transform (RT) was formulated in 1917 but remained useless in practice until CT scanners were invented in the 60s.
But RT isn't just for CTs. It's a sort of generalization of marginals in probability.
RT g(p,θ): Shoot rays at θ+90° & offset p, measure line integrals of f(x,y) along the ray.
1/n
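A toy discrete version of the marginals analogy, assuming f(x,y) is a small made-up joint probability table. Summing along axis-aligned rays gives the usual marginals; summing along diagonal rays gives a projection at another angle (here the distribution of i + j, up to the spacing between rays). The helper names are hypothetical and this is nowhere near a full Radon transform.

```python
def project_rows(f):
    """Sum along horizontal rays: one line integral per row."""
    return [sum(row) for row in f]

def project_cols(f):
    """Sum along vertical rays: one line integral per column."""
    return [sum(col) for col in zip(*f)]

def project_diagonals(f):
    """Sum along diagonal rays: cells with equal i + j share a ray."""
    n_rows, n_cols = len(f), len(f[0])
    out = [0.0] * (n_rows + n_cols - 1)
    for i in range(n_rows):
        for j in range(n_cols):
            out[i + j] += f[i][j]
    return out

# Made-up joint distribution over (i, j); entries sum to 1.
joint = [
    [0.1, 0.2],
    [0.3, 0.4],
]

row_marginal = project_rows(joint)        # marginal over i
col_marginal = project_cols(joint)        # marginal over j
diag_projection = project_diagonals(joint)  # distribution of i + j

print(row_marginal, col_marginal, diag_projection)
```

Every projection integrates to the same total mass, which is what makes each one behave like a marginal taken along a different direction.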

Here, have PSGD-Kron and SOAP with FSDP2 support. Please go wild with it, let's see something finally replace ADAM. github.com/ethansmith20...

probably the best in-depth explanation i've seen on FSDP at the most granular levels, props to the authors dev-discuss.pytorch.org/t/fsdp-cudac...