bennokrojer.bsky.social
AI PhDing at Mila/McGill (prev FAIR intern). Happily residing in Montreal 🥯❄️
Academic: language grounding, vision+language, interp, rigorous & creative evals, cogsci
Other: many sports, urban explorations, puzzles/quizzes
bennokrojer.com
comment in response to
post
But that's the second bullet point, no? I meant the first one
comment in response to
post
Did anyone actually figure out how to do this? Can't seem to see these buttons anywhere
comment in response to
post
Overall I loved the paper, got lots of inspiration from it, and would love to be part of a similar project in the future: for example, an empirical investigation of many AI papers to answer "To what extent is AI a science?"
comment in response to
post
2) For me, I&A might be most valuable because it gives us new concepts and frames
comment in response to
post
Let me highlight two passages I found very cool:
1) Maybe people *outside* of a field only see the field's best papers and thus think it is impactful, while people *inside* the field are exposed to all the chaotic average papers
comment in response to
post
A funny observation is that people outside of I&A research think it has a bigger impact (!)
Maybe this is normal? We are often most critical and skeptical of the things we are most familiar with
comment in response to
post
So what are the results? Overall it looks pretty good for I&A research!
There are tons of interesting, nuanced insights in the paper, so here are just some I noted down:
comment in response to
post
So the next challenge is how to define and operationalize "impact": for that, the authors combine a citation-graph analysis with surveys (plus qualitative analyses of both)
--> this is called a mixed-methods analysis in the social sciences
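To make the citation-graph half concrete, here is a toy sketch (not the paper's actual pipeline; the papers and subfield labels below are made up) of counting how often I&A papers get cited from outside the subfield:

```python
# Toy sketch (NOT the paper's actual pipeline): a tiny citation graph where
# each node is a paper tagged with a subfield, and we count how often
# I&A papers are cited by papers from *other* subfields.
import networkx as nx

G = nx.DiGraph()
# made-up papers: id -> subfield
papers = {
    "p1": "interpretability", "p2": "interpretability",
    "p3": "machine translation", "p4": "question answering",
}
G.add_nodes_from((pid, {"field": f}) for pid, f in papers.items())
# edge (a, b) means "a cites b"
G.add_edges_from([("p3", "p1"), ("p4", "p1"), ("p2", "p1"), ("p4", "p2")])

for pid, data in G.nodes(data=True):
    if data["field"] != "interpretability":
        continue
    citers = list(G.predecessors(pid))
    outside = sum(G.nodes[c]["field"] != "interpretability" for c in citers)
    print(f"{pid}: {len(citers)} citations, {outside} from outside I&A")
```

The surveys and qualitative coding are the other half of the mixed-methods design and of course don't reduce to a script like this.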
comment in response to
post
More detailed notes (there was much to think about: every second sentence was probably worth a minute-long ponder):
Arguably the hardest challenge is how to define what "interpretability and analysis" means. They adopt a quite broad definition but do a good job imo, also including some eval work
comment in response to
post
The paper also happens to be written by several of my friends, colleagues, or people whose work I admire:
@mariusmosbach.bsky.social @dippedrusk.com @tomvergara.bsky.social , Dietrich Klakow and @megamor2.bsky.social
It also fits all criteria for a paper I want to feature in this thread:
comment in response to
post
Day 12:
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
arxiv.org/abs/2406.12618
Genuinely one of my favourite papers in recent years!
It tries to answer a question that every PhD student asks themselves at some point: Does this research matter?
comment in response to
post
I have no clue why it was on hold on arxiv for more than a week; better late than never!
comment in response to
post
Amazing, great news, excited to see your lab develop!
comment in response to
post
But we know from lots of commonsense tasks that LLMs can talk well about visual concepts; it's just hard to say whether they copy this from the training data or have built up some form of visual world model internally
comment in response to
post
There is also some research on a slightly different version of the question you ask here: if you train on text alone, do you learn some vision knowledge implicitly?
Nothing recent comes to mind but this was cool:
openreview.net/forum?id=gJc...
comment in response to
post
this question is very dear to my research heart and I've been following it a lot!
The BabyLM challenge also introduced a multi-modal track 1-2 years ago to see if multi-modal data helps with language acquisition, but again no really positive results came out of it:
babylm.github.io
comment in response to
post
caveat: I am not sure how much Chameleon explicitly discusses this synergy vs. competition, but in a recent talk Luke Zettlemoyer emphasized this (www.youtube.com/watch?v=JYMX...)
comment in response to
post
The hope for a lot of these models where modalities are tightly integrated (often called "natively multi-modal" or "early fusion") is that there is synergy between the modalities, but one might also find that they instead compete inside the model
This paper discusses it:
arxiv.org/abs/2405.09818
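To give a rough idea of what "early fusion" means here (a schematic of the general idea only, not Chameleon's actual implementation): discretized image tokens and text tokens share one vocabulary and one transformer stack, so every weight sees both modalities, which is exactly where synergy or competition can arise:

```python
# Schematic of "early fusion" (not Chameleon's actual code): text tokens and
# discretized image tokens live in one shared vocabulary and are processed
# by a single transformer stack.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000
IMAGE_VOCAB = 8192           # e.g. codes from a VQ image tokenizer
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

embed = nn.Embedding(VOCAB, 512)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=4,
)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB  # offset into shared vocab
tokens = torch.cat([text_ids, image_ids], dim=1)  # one interleaved sequence

hidden = encoder(embed(tokens))   # every weight sees both modalities
print(hidden.shape)               # torch.Size([1, 80, 512])
```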
comment in response to
post
The other direction is very easy and works very well: improving vision performance with language
See CLIP, where a main goal was to get a better zero-shot image classifier:
arxiv.org/abs/2103.00020
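For anyone who hasn't used it, a minimal zero-shot classification sketch with CLIP via HuggingFace transformers (the model name, image path, and labels are just example choices):

```python
# Minimal zero-shot image classification with CLIP via HuggingFace transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bagel"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)   # image-to-text similarity
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```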
comment in response to
post
A lot of people in the vision-and-language community have tried to make this work and it rarely really helps (so lots of negative results don't get published afaik). The earliest work in this direction was Vokenizer:
arxiv.org/abs/2010.06775
comment in response to
post
A recent paper is this:
arxiv.org/abs/2310.13257
comment in response to
post
Somehow the link above is now broken; this should work:
www.youtube.com/watch?v=IeCS...
comment in response to
post
That was 100% my plan when I got one
comment in response to
post
Let's have a call sometime to set you up 😉
comment in response to
post
It reminds me of the clean-up after a big party: you do your own little part, but it feels so much faster because every 10 minutes you turn around to the rest of the room and see that so much has gotten done
comment in response to
post
Looking back at the last weeks of writing, the most satisfying thing was to see everything evolve in parallel: I worked on my own section but saw others grow more coherent every day. Figures popped up, an introduction appeared. Every time I checked the edit history someone had done cool work!
comment in response to
post
The final generation of ASCII frames after the thinking process is still sometimes decent, but the very verbose reasoning did not actually provide the model with drafts to build upon
Overall:
1. R1 slightly better than V3
2. Yet still many failure modes, especially a lack of iterative refinement
comment in response to
post
Finally, we study world modeling, asking the model to e.g. "generate 10 more ASCII frames of 2 balls on a pool table colliding, given an initial frame of the balls on the pool table"
We find that R1 often "gets lost" in symbolic/math reasoning instead of generating actual drafts of the "video"
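A rough sketch of how one might prompt R1 with this kind of task (not the exact evaluation harness from the paper; this assumes DeepSeek's OpenAI-compatible API, the "deepseek-reasoner" model name, and a DEEPSEEK_API_KEY environment variable):

```python
# Rough sketch of prompting R1 with the world-modeling task (not the paper's
# exact setup). Assumes DeepSeek's OpenAI-compatible endpoint and model name.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

initial_frame = "...ASCII frame of 2 balls on a pool table..."  # placeholder
prompt = (
    "Given this initial ASCII frame of 2 balls on a pool table:\n"
    f"{initial_frame}\n"
    "Generate 10 more ASCII frames of the balls colliding."
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # the final frames, after the thinking trace
```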
comment in response to
post
Another finding:
A more complex task does not necessarily lead to longer reasoning chains.
We ask the model to generate single objects (e.g. a dog), and then object compositions (e.g. half-dog, half-shark) as a more complex task.
However, this second task leads to slightly shorter chains!
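For illustration, here's one crude way to compare chain lengths between the two prompt sets (whitespace tokens inside the <think> span as a proxy; the outputs below are placeholders, and the exact measurement in the paper may differ):

```python
# Toy sketch: compare reasoning-chain lengths between "single object" and
# "composition" prompts, using whitespace tokens inside <think>...</think>
# as a crude length proxy.
import re
import statistics

def thinking_length(model_output: str) -> int:
    match = re.search(r"<think>(.*?)</think>", model_output, flags=re.DOTALL)
    return len(match.group(1).split()) if match else 0

single_object_outputs = ["<think> drafting a dog ... </think> final art"]     # placeholders
composition_outputs = ["<think> half-dog half-shark ... </think> final art"]  # placeholders

print("single objects:", statistics.mean(map(thinking_length, single_object_outputs)))
print("compositions:  ", statistics.mean(map(thinking_length, composition_outputs)))
```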
comment in response to
post
We find that it rarely performs proper editing:
R1 often completely discards its initial drafts, starting from scratch over and over. Similarly, the final output after "</think>" might not be faithful to the drafts from the thinking process
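One simple way to quantify this kind of behaviour (an illustrative sketch, not the exact analysis from the paper) is to take the ASCII drafts that appear in the thinking trace and measure how similar consecutive drafts are:

```python
# Illustrative sketch (not the paper's exact analysis): given the ASCII drafts
# that appear in the thinking trace, measure how similar each draft is to the
# previous one. Low similarity between consecutive drafts suggests starting
# from scratch rather than iterative editing.
import difflib

def consecutive_similarities(drafts: list[str]) -> list[float]:
    return [
        difflib.SequenceMatcher(None, a, b).ratio()
        for a, b in zip(drafts, drafts[1:])
    ]

# placeholder drafts, standing in for drafts extracted from "<think>...</think>"
drafts = [" o   o ", "(o) (o)", "( o )   ( o )"]
print(consecutive_similarities(drafts))  # low values => little reuse between drafts
```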
comment in response to
post
For our Visual Reasoning experiments, we ask DeepSeek-R1 to e.g. "draw a detailed ASCII art of a lacrosse stick"
To solve this, we expected R1 to perform some form of image editing: generate an initial draft, find flaws, edit it, and repeat a few times
However we find that ...
2/N
comment in response to
post
Overall this paper covered tons of ground, made me excited to investigate similar questions further, and taught me a lot about how people formally think about embedding spaces and transformer sub-blocks in the LLM era. Great read!
comment in response to
post
After all these experiments to see *what* is roughly going on, the authors turn to the *why/how*
This part is expected to be tricky to answer and some of their answers seem quite vague, but nonetheless some good insights here:
comment in response to
post
They also investigate whether different modalities/tasks use different subsets of weights ("subnetworks") and whether one can prune weights accordingly:
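As a toy illustration of the subnetwork idea (not the paper's method): score each weight's importance per task, keep the top fraction as that task's mask, and compare how much the masks overlap. Everything below is placeholder data on a tiny model:

```python
# Toy illustration of "subnetworks per task" (not the paper's method):
# score each weight's importance per task via |weight * grad|, keep the top
# 20% as that task's mask, then compare mask overlap.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)

def importance_mask(x, y, keep_frac=0.2):
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    scores = (model.weight * model.weight.grad).abs().flatten()
    k = int(keep_frac * scores.numel())
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask

# two "tasks" = two random datasets standing in for e.g. captioning vs. VQA
mask_a = importance_mask(torch.randn(32, 16), torch.randint(0, 4, (32,)))
mask_b = importance_mask(torch.randn(32, 16), torch.randint(0, 4, (32,)))

jaccard = ((mask_a & mask_b).sum() / (mask_a | mask_b).sum()).item()
print(f"mask overlap (Jaccard): {jaccard:.2f}")
```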
comment in response to
post
Playing reviewer here a bit: the paper is in fact so dense with findings that I think it would have benefited from fewer of them:
comment in response to
post
In general this paper has a *lot* of findings and random facts that are good to remember as a VL researcher:
comment in response to
post
For me the main takeaways are that the visual embeddings live in a different "cone" and thus are not similar (measured via cosine sim) to the LLM vocab tokens
This similarity does get larger in later layers, though, and is also highest right after the self-attention block
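The measurement itself is simple to sketch (random tensors below stand in for a real VLM's activations; this is not the paper's exact code): cosine similarity between the visual tokens' hidden states at each layer and the LLM's vocabulary embedding matrix:

```python
# Toy sketch of the measurement (random tensors stand in for a real VLM's
# activations): cosine similarity between visual-token hidden states at each
# layer and the LLM's vocabulary embedding matrix.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, vocab_size, n_visual, n_layers = 256, 1000, 32, 8
vocab_emb = torch.randn(vocab_size, d)
# hidden states of the visual tokens at each layer (placeholder data)
visual_hidden = [torch.randn(n_visual, d) for _ in range(n_layers)]

for layer, h in enumerate(visual_hidden):
    # max cosine similarity of each visual token to any vocab embedding
    sims = F.cosine_similarity(h.unsqueeze(1), vocab_emb.unsqueeze(0), dim=-1)
    print(f"layer {layer}: mean max-sim to vocab = {sims.max(dim=1).values.mean().item():.3f}")
```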