dimitrisp.bsky.social
Researcher @MSFTResearch; Prof @UWMadison (on leave); learning in context; thinking about reasoning; babas of Inez Lily. https://papail.io
171 posts 1,804 followers 293 following

What if for most of your findings you just post a thread and share a GitHub repo, rather than submitting a 15-page NeurIPS paper with <1/100 the reach?

LLMs learn world models, beyond a reasonable doubt. It's been the case since GPT-3, but now it should be even more clear. Without them "Guess and Check" would not work. The fact that these "world models" are approximate/incomplete does not disqualify them.

Is 1948 widely acknowledged as the birth of language models and tokenizers? In "A Mathematical Theory of Communication", almost as an afterthought, Shannon suggests the n-gram for generating English, and that word-level tokenization is better than character-level tokenization.
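
A minimal sketch of the word-level n-gram generation Shannon describes, assuming a toy corpus (everything below is illustrative; Shannon tabulated statistics of printed English by hand):

```python
import random
from collections import defaultdict

# Toy word-level bigram generator in the spirit of Shannon (1948).
corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Estimate P(next word | current word) by successor counts.
successors = defaultdict(list)
for cur, nxt in zip(corpus, corpus[1:]):
    successors[cur].append(nxt)

def generate(start, n_words=10):
    """Sample a sequence by repeatedly drawing a random successor."""
    out = [start]
    while len(out) < n_words and successors[out[-1]]:
        out.append(random.choice(successors[out[-1]]))
    return " ".join(out)

print(generate("the"))
```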

🎉The Phi-4 reasoning models have landed on HF and Azure AI Foundry. The new models are competitive and often outperform much larger frontier models. It is exciting to see the reasoning capabilities extend to more domains beyond math, including algorithmic reasoning, calendar planning, and coding.

I am afraid to report: RL works. I think 2-3 years ago I said I would not work on two ML sub-areas; RL was one of them. I am happy to say that I am not strongly attached to my beliefs.

Re: The Chatbot Arena Illusion. Every eval chokes under hill climbing. If we're lucky, there's an early phase where *real* learning (by both the model and the community) can occur. I'd argue that a benchmark's value lies entirely in that window. So the real question is: what did we learn?

Fun trivia now that “sycophant” became common language to describe LLMs flattering users: In Greek, συκοφάντης (sykophántēs) most typically refers to a malicious slanderer, someone spreading lies, not flattery! Every time you use it, you’re technically using it wrong :D

Come work with us at MSR AI Frontiers and help us figure out reasoning! We're hiring at the Senior Researcher level (e.g., post-PhD). Please drop me a DM if you're interested! jobs.careers.microsoft.com/us/en/job/17...

o3 can't multiply beyond a few digits... But I think multiplication, addition, maze solving, and easy-to-hard generalization are actually solvable on standard transformers... with recursive self-improvement. Below is the accuracy of a tiny model teaching itself how to add and multiply.
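
For flavor, a hedged sketch of what such a self-improvement loop could look like; this is a schematic, not the exact recipe from the work below, and `train` and `sample_answer` are hypothetical stand-ins for a real training/inference stack:

```python
import random
from collections import Counter

# Schematic recursive self-improvement on addition (illustrative only).
# The model labels problems one digit harder than its training data,
# keeps the answers it agrees with itself on, and retrains on them.

def make_problem(n_digits):
    a = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    b = random.randrange(10 ** (n_digits - 1), 10 ** n_digits)
    return f"{a}+{b}"

def self_improve(model, train, sample_answer, max_digits=20):
    data = []
    for n in range(2, max_digits + 1):
        for problem in (make_problem(n) for _ in range(1000)):
            # Self-consistency filter: trust the majority answer only
            # when the model mostly agrees with itself.
            votes = Counter(sample_answer(model, problem) for _ in range(5))
            answer, count = votes.most_common(1)[0]
            if count >= 4:
                data.append((problem, answer))
        model = train(model, data)  # retrain on accumulated self-labels
    return model
```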

Self-improving Transformers can overcome easy-to-hard and length generalization challenges. Paper on arXiv coming on Monday. Link to a talk I gave on this below 👇 Super excited about this work! Talk: youtube.com/watch?v=szhE... Slides: tinyurl.com/SelfImprovem...

Two months before R1 came out, I wrote this in my small notebook of ideas as something to test #schmidhuber

Now that we have reasoner LLMs, let's think about how to GRPO problem generators so they produce instances that sit right outside the frontier of current capabilities.
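
One illustrative way to shape that generator's reward (my sketch, not an established recipe; `solve_once` is a hypothetical hook that runs the current solver on an instance and reports success):

```python
# Illustrative reward for a GRPO-trained problem generator: favor
# instances the current solver gets right sometimes but not reliably,
# i.e. instances sitting just outside the capability frontier.

def frontier_reward(problem, solve_once, k=8, lo=0.1, hi=0.5):
    solve_rate = sum(solve_once(problem) for _ in range(k)) / k
    # Always-solved (too easy) and never-solved (too hard) both earn 0;
    # the band in between is where the training signal lives.
    return 1.0 if lo <= solve_rate <= hi else 0.0
```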

🚀 🇬🇷 A year in the making! I’ve just completed a set of 21 lectures in Machine Learning, in Greek, designed for high school students. The course introduces key ML concepts, coding in Python & PyTorch, and real-world AI applications. 👉 WebPage: tinyurl.com/ye2awe8m 🎥 YouTube: tinyurl.com/2wwjru6z

If you wanted to collect 1 million reasoning traces from human subjects on, say, math, that would cost ~$50M, assuming ~$50/person/hour and roughly one trace per hour. Interesting to compare with the cost to generate them from a reasoning LLM, at, say, ~$0.50 per trace (~10k tokens). That's 100x cheaper.
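
Spelling out the arithmetic (the one-trace-per-person-hour assumption is what makes the $50M figure work):

```python
traces = 1_000_000
human_cost = traces * 50.0    # ~1 hour/trace at ~$50/hour -> $50M
llm_cost = traces * 0.50      # ~$0.50 per ~10k-token trace -> $0.5M
print(human_cost / llm_cost)  # 100.0
```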

OK, we've read a lot about test-time compute being the new scaling axis, but what's the next scaling axis?

2014, GoogLeNet: the best image classifier required weeks of training on Google's custom infrastructure. 2018, ResNet: a more accurate model trains in half an hour on a single GPU. What stops this from happening for LLMs?