lhl.bsky.social
Easily distracted, currently building open source AI. Living online since FidoNet
86 posts 416 followers 324 following

HF_TRANSFER gud
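
For anyone who hasn't flipped it on, this is roughly all it takes (a minimal sketch; assumes hf_transfer is pip-installed, and the repo name is just an example):

```python
# Minimal sketch: enable the hf_transfer download backend for Hugging Face Hub.
# Assumes `pip install huggingface_hub hf_transfer`; the env var has to be set
# before huggingface_hub kicks off any downloads. Repo name is just an example.
import os

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-Coder-32B-Instruct", local_dir="models/qwen2.5-coder-32b")
```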

I've been impressed with OpenAI's Deep Research, having used it for dozens of research tasks. It's especially good at focused tasks, but rather mid when it comes to broader, more general topics. An example of how good it can be (reviewing an astrophysics thesis): www.youtube.com/watch?v=Eh-C...

You can now reproduce DeepSeek-R1's reasoning on your own local device! Experience the "Aha" moment with just 7GB VRAM. Unsloth reduces GRPO training memory use by 80%. 15GB VRAM can transform Llama-3.1 (8B) & Phi-4 (14B) into reasoning models. Blog: unsloth.ai/blog/r1-reas...
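
For context, the core training loop looks roughly like this with plain trl's GRPOTrainer (a hedged sketch, not Unsloth's optimized path; the model, dataset, and toy length reward are illustrative placeholders):

```python
# Rough sketch of GRPO fine-tuning with plain trl (Unsloth patches/optimizes this
# same path to cut VRAM). Model, dataset, and the toy length reward are placeholders.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")  # needs a "prompt" column

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 20 characters.
    return [-abs(20 - len(completion)) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-out", logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```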

I’m currently going through/organizing DeepSeek takes, and when it comes to market impact, I'm down-ranking those that don’t account for TSMC tariff (insider) trading as a major (potentially primary) driver of the big price move this week.

So I was wondering, where are all the people who were so confident “we’ve hit a wall” on AI from uh … one month back? And for those getting carried away by the new AI narrative of the week (DeepSeek with a box of rocks/cope), maybe worth catching your breath and reflecting on that.

Posted by @vgel.me on the other site

With DeepSeek-R1 being among the strongest released frontier models in the world (and MIT licensed to boot!) there’s been a lot of heated discussion about its Chinese state censorship. I last did some poking at Qwen2, which afaik is still one of the few analyses online: huggingface.co/blog/leonard...

I'm obsessed with this art piece: giant mirrors that reflect sunlight onto a town that sits in shadow half the year. Until it was built, the townspeople hated it. What a metaphor for the denial of major social, political, or technological change: you cannot coexist with your own yearning for something better.

I was doing a quick skim of RTX 5090 reviews and it seems almost all hardware reviewers have no idea how to benchmark LLM performance, I wonder if a simple guide would be useful...

I've been doing some inference throughput/latency testing (focused on lowest TTFT) across various quants and engines. The bs=1 optimized (but server-capable) kernels scale pretty poorly. (Also, while vLLM and SGLang can both use Marlin kernels, SGLang's latency seems better across the board.)
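
For reference, this is roughly how I think about measuring TTFT and decode speed against any OpenAI-compatible endpoint (a sketch; the base URL and model name are placeholders):

```python
# Sketch: measure time-to-first-token (TTFT) and streaming decode rate against an
# OpenAI-compatible server (vLLM, SGLang, llama.cpp, ...). URL/model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1
total = time.perf_counter() - start

ttft = first - start
decode = max(total - ttft, 1e-9)
# Most servers emit ~1 token per chunk, so chunks/s is a rough tok/s proxy.
print(f"TTFT: {ttft:.3f}s, ~{(chunks - 1) / decode:.1f} chunks/s after first token")
```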

This recent essay andymasley.substack.com/p/individual... got me to do a more thorough writeup of my recent empirical inference efficiency testing. Full writeup: fediverse.randomfoo.net/notice/AqCTD... - basically, currently, inference is >100X more efficient than the most commonly cited numbers.

There's been a lot of speculation and excitement in r/LocalLlama about the new Nvidia Project DIGITS www.nvidia.com/en-us/projec... but I think it's more likely than not that memory bandwidth (MBW) will be lower than people are hyping themselves up for... www.reddit.com/r/LocalLLaMA...
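
Why MBW matters so much: bs=1 decode is roughly memory-bandwidth-bound, so a back-of-the-envelope ceiling is just bandwidth divided by bytes read per token (numbers below are hypothetical, not DIGITS specs):

```python
# Back-of-the-envelope: bs=1 decode reads every weight once per token, so the
# ceiling is roughly MBW / bytes-per-token (ignores KV cache reads, overlap, etc.).
# Numbers below are hypothetical, not confirmed DIGITS specs.
def max_decode_tok_s(mbw_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    bytes_per_token = params_b * 1e9 * bytes_per_param
    return mbw_gb_s * 1e9 / bytes_per_token

# e.g. a hypothetical 273 GB/s box running a 70B model at ~Q4 (~0.5 bytes/param):
print(f"{max_decode_tok_s(273, 70, 0.5):.1f} tok/s upper bound")
```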

As a followup to my earlier DeepSeek-V3 performance testing, here's what basically-OOTB vLLM (0.6.6.post2.dev5+g5ce4627a) vs SGLang (0.4.1.post4) looks like on 2 x H100 nodes (tp=16) (this is concurrency=64, but it scales similarly up to 1024). atm SGLang has +125% better throughput and ~10X lower mean TTFT.

Some of you might get a kick out of this (I got the FP8 running on vLLM w/ slurm-to-ray on 2 x H100 nodes as well, more on that later...)

New year, new blog post: I had a random question, what happens when LLMs are prompted to write better code, again and again? Do they actually write better code? The answer is yes*! minimaxir.com/2025/01/writ...
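
If you want to try the experiment yourself, the loop is basically this (a sketch with the OpenAI client; the model, task prompt, and iteration count are arbitrary placeholders, not the blog post's exact setup):

```python
# Sketch of the "write better code" loop: keep feeding the model its own code
# and just ask for something better. Model, task, and iteration count are
# placeholders (not the blog post's exact setup); assumes OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()
task = "Write Python code to find the 3 smallest values in a list."

code = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": task}],
).choices[0].message.content

for i in range(4):
    code = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": code},
            {"role": "user", "content": "write better code"},
        ],
    ).choices[0].message.content
    print(f"--- iteration {i + 1} ---\n{code}\n")
```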

The Claude Desktop app has MCP support: modelcontextprotocol.io/quickstart/u... - I decided to see if I could get Claude Desktop installed on Arch Linux (yes, but ironically, had to use ChatGPT o1 to step in and clean up the forked script/get it working): github.com/lhl/claude-d...

Unless you have 400GB of memory for DeepSeek-V3 (Q4), Qwen2.5-Coder-32B is probably still the best local code assistant available (fits in a 24GB consumer GPU). I was curious and did some testing w/ llama.cpp's speculative decoding. Results/discussion here: www.reddit.com/r/LocalLLaMA...

I will be retiring/deprecating this version of Shaberi (GPT-4-judged Japanese functional testing) for something new early next year, but possibly of interest: DeepSeek-V3 just slotted into first place. (I tested close to 100 models with this eval this year.)

So, not only QvQ, but DeepSeek-V3 just dropped. It's a massive model, by my calcs 29B activation params / 453B weights: www.reddit.com/r/LocalLLaMA... - it reportedly scores 48.9%, above Sonnet, on aider's new polyglot code leaderboard: aider.chat/2024/12/21/p... (as ref: Qwen2.5-Coder scores 8%)
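
If you want to sanity-check that kind of math yourself, counting MoE activated params is just swapping "all routed experts" for "top-k experts per token" (a generic sketch with made-up config values, not DeepSeek-V3's actual config):

```python
# Generic sketch of MoE parameter counting: total FFN params use all routed
# experts, activated params only the top-k (+ shared) that fire per token.
# All config values below are made up for illustration, not DeepSeek-V3's.
hidden = 4096
moe_intermediate = 1024
n_routed_experts = 64
n_shared_experts = 1
top_k = 6
n_layers = 32

expert_params = 3 * hidden * moe_intermediate        # gate/up/down projections
total_ffn = n_layers * (n_routed_experts + n_shared_experts) * expert_params
active_ffn = n_layers * (top_k + n_shared_experts) * expert_params

print(f"FFN params: {total_ffn / 1e9:.1f}B total, {active_ffn / 1e9:.1f}B activated")
# Attention + embeddings are dense, so they add equally to both counts.
```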

Lately I've been doing vLLM performance testing on some H100 nodes in preparation for generating lots of synthetic tokens. Interestingly, I found that throughput dropped significantly when max_num_seqs or max_num_batched_tokens was specified explicitly, even when set to the same values as the defaults (512/512)
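
If you want to reproduce the comparison, the offline entrypoint makes for an easy A/B (a sketch; the model and arg values are illustrative, and the actual defaults vary by vLLM version):

```python
# Sketch: offline throughput for one engine config; run the script once per
# config (explicit flags vs. defaults), since a vLLM engine doesn't reliably
# free GPU memory within a single process. Model and values are illustrative,
# and the actual defaults vary by vLLM version.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_model_len=4096,
    enable_chunked_prefill=True,
    # Comment these two out for the "defaults" run:
    max_num_seqs=512,
    max_num_batched_tokens=512,
)

prompts = ["Summarize the history of GPUs."] * 256
params = SamplingParams(max_tokens=128)

t0 = time.perf_counter()
outs = llm.generate(prompts, params)
dt = time.perf_counter() - t0
toks = sum(len(o.outputs[0].token_ids) for o in outs)
print(f"{toks / dt:.1f} output tok/s")
```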