mrdrozdov.com
Research Scientist @ Mosaic x Databricks. Adaptive Methods for Retrieval, Generation, NLP, AI, LLMs https://mrdrozdov.github.io/
170 posts 5,026 followers 604 following

It was a real pleasure talking about effective IR approaches with Brooke and Denny on the Data Brew podcast. Among other things, I'm excited about embedding finetuning and reranking as modular ways to improve RAG pipelines. Everyone should use these more!
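The retrieve-then-rerank split mentioned here is easy to sketch. Below is a toy, pure-Python version: both scorers are stand-ins (raw term overlap for the first stage, length-normalized overlap for the reranker) for what would in practice be an embedding model and a cross-encoder.

```python
def first_stage(query, docs, k=3):
    """Cheap lexical retrieval: rank by raw term overlap, keep top-k."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: -len(q & set(d.lower().split())))
    return scored[:k]

def rerank(query, candidates):
    """Stand-in for a stronger scorer (a cross-encoder in practice):
    overlap normalized by document length, so short on-topic docs win."""
    q = set(query.lower().split())
    def score(d):
        toks = d.lower().split()
        return len(q & set(toks)) / max(len(toks), 1)
    return sorted(candidates, key=score, reverse=True)

docs = [
    "rag pipelines combine retrieval with generation",
    "reranking reorders retrieved candidates with a stronger model",
    "a long document about rag rag rag retrieval retrieval generation and many other unrelated words padding padding",
]
hits = rerank("rag retrieval generation",
              first_stage("rag retrieval generation", docs))
print(hits[0])  # → "rag pipelines combine retrieval with generation"
```

The point is the modularity: you can swap either stage independently, which is what makes reranking an easy bolt-on improvement for an existing RAG pipeline.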

We're probably a little too obsessed with zero-shot retrieval. If you have documents (you do), then you can generate synthetic data, and finetune your embedding. Blog post led by @jacobianneuro.bsky.social shows how well this works in practice. www.databricks.com/blog/improvi...
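The synthetic-data recipe above can be sketched in miniature. `generate_query` below is a hypothetical stand-in (in practice you'd prompt an LLM to write a plausible search query per document); the resulting (query, positive, negative) triplets are the shape that contrastive embedding-finetuning losses typically expect.

```python
import random

def generate_query(doc):
    """Hypothetical stand-in: use the document's first few words as a
    pseudo-query. A real pipeline would call an LLM here."""
    return " ".join(doc.split()[:4])

def build_triplets(docs):
    """(query, positive, negative) triplets: each synthetic query is
    paired with its source doc, plus a random other doc as the negative."""
    triplets = []
    for i, doc in enumerate(docs):
        neg = random.choice([d for j, d in enumerate(docs) if j != i])
        triplets.append((generate_query(doc), doc, neg))
    return triplets

docs = [
    "Embedding models map text to vectors for semantic search.",
    "Rerankers score query-document pairs with a cross-encoder.",
]
for q, pos, neg in build_triplets(docs):
    print(q, "->", pos[:30])
```

With real LLM-generated queries, triplets like these can be fed to e.g. sentence-transformers' `MultipleNegativesRankingLoss` to finetune an off-the-shelf embedding model on your own corpus.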

I do want to see aggregate stats about the model’s generation, and total reasoning tokens is perhaps the least informative one.

"All you need to build a strong reasoning model is the right data mix." The pipeline that creates the data mix:

Using 100+ tokens to answer 2 + 3 =

It’s pretty obvious we’re in a local minimum for pretraining. Would expect more breakthroughs in the 5-10 year range. Granted, it’s still incredibly hard and expensive to do good research in this space, despite the number of labs working on it.

Word of the day (of course) is ‘scurryfunging’, from US dialect: the frantic attempt to tidy the house just before guests arrive.

... didn't know this would be one of the hottest takes i've had ... for more on my thoughts, see drive.google.com/file/d/1sk_t...

feeling a bit under the weather this week … thus an increased level of activity on social media and blog: kyunghyuncho.me/i-sensed-anx...

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Introduces ModernBERT, a bidirectional encoder advancing BERT-like models with 8K context length. 📝 arxiv.org/abs/2412.13663 👨🏽‍💻 github.com/AnswerDotAI/...

State Space Models are Strong Text Rerankers Shows Mamba-based models achieve comparable reranking performance to transformers while being more memory efficient, with Mamba-2 outperforming Mamba-1. 📝 arxiv.org/abs/2412.14354

Reasoning is fascinating but confusing. Is reasoning a task? Or is reasoning a method for generating answers, for any task?

The future of AI is models that generate graphical interfaces. Instead of the linear, low-bandwidth metaphor of conversation, models will represent themselves to us as computers: rich visuals, direct manipulation, and instant feedback. willwhitney.com/computing-in...

Three must-read papers for PhD students. #scisky #PhD #science #research #academicsky 1. The importance of stupidity in scientific research Open Access journals.biologists.com/jcs/article/...

In any given year there are between one and three Friday the 13ths. This year there are two. 👻
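Both claims in this post are easy to check with the standard library (assuming "this year" refers to 2024):

```python
import datetime

def friday_13ths(year):
    """Return the months in `year` whose 13th falls on a Friday."""
    return [m for m in range(1, 13)
            if datetime.date(year, m, 13).weekday() == 4]  # 4 = Friday

# Every year has between one and three Friday the 13ths:
counts = {len(friday_13ths(y)) for y in range(1900, 2100)}
print(counts)              # → {1, 2, 3}
print(friday_13ths(2024))  # → [9, 12]: September and December
```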

Reflections on NeurIPS: There's always a big theme people seem to be preoccupied with. This year, it was the continuation of scaling/progress. Will it continue? What will the next generation of models hold? I even got to sass Dylan Patel (not on bsky) over it. Here are my personal thoughts 🧵

The databricks have arrived

Slides are up! I presented on "Presentation & Consumption in the context of REML". The full deck is here. There are a lot of gems if you're interested in this space! retrieval-enhanced-ml.github.io/sigir-ap2024...

Today we'll be presenting the Tutorial on Retrieval-Enhanced Machine Learning (REML). Come by to learn about the emerging design patterns in this space and see how to use retrieval beyond RAG. In collaboration w/ the amazing @841io.bsky.social @teknology.bsky.social Alireza Salemi and Hamed Zamani.

Few things that are revolutionizing retrieval research right now: 1. content-based models have gotten much better 2. synthetic data has increased the value of small specialized datasets 3. retrieval is becoming a more important component in a variety of AI applications

Reading articles that 🦋 might get into advertising, and I'm not sure this is what they meant 😅

Seen in NYC

2024: For really hard problems, you go to that one friend down the street who has an o1 pro subscription. 1960: For really important calls, you go to that one friend down the street who was able to get a telephone.

👀

Somehow missed this thread from @sungkim.bsky.social --- thanks for the interest in our work!

I’m on the academic job market this year! I’m completing my @uwcse.bsky.social @uwnlp.bsky.social Ph.D. (2025), focusing on overcoming LLM limitations like hallucinations, by building new LMs. My Ph.D. work focuses on Retrieval-Augmented LMs to create more reliable AI systems 🧵

RAG still has a way to go. (this book doesn’t exist)

An overlooked feature of arXiv is that it provides a unified interface for all the conferences.

Every individual agent is just a sparse instantiation of the mother agent that represents all agents.

Social media idea where you wouldn't need verification: Identity Roleplay 1. Bootstrap the network by adding 1000s or millions of profiles. 2. Users get assigned random-ish accounts for a temporary time. Like, you'd get to publish a few posts as Hugh Jackman before rotating to someone else. 3. ...

Creating a slide and applying a layout are now two separate steps on Google Slides. (you have to right-click the slide after creation to change the layout)

Swapneel's guide to writing an SoP is so good. docs.google.com/document/d/1...