bennokrojer.bsky.social
AI PhDing at Mila/McGill (prev FAIR intern). Happily residing in Montreal 🥯❄️
Academic: language grounding, vision+language, interp, rigorous & creative evals, cogsci
Other: many sports, urban explorations, puzzles/quizzes
bennokrojer.com
comment in response to
post
But that's the second bullet point, no? I meant the first one
comment in response to
post
Did anyone actually figure out how to do this? Can't seem to see these buttons anywhere
comment in response to
post
Overall I loved the paper, got lots of inspiration from it, and would love to be part of a similar project in the future: for example, an empirical investigation of many AI papers to answer "To what extent is AI a science?"
comment in response to
post
2) For me, I&A might be most valuable because it gives us new concepts and frames
comment in response to
post
Let me highlight two passages I found very cool:
1) Maybe people *outside* of a field only see the field's best papers and thus think it is impactful, while people *inside* the field are exposed to all the chaotic average papers
comment in response to
post
A funny observation is that people outside of I&A research think it has a bigger impact (!)
Maybe this is normal? We are often most critical and skeptical of the things we are most familiar with
comment in response to
post
So what are the results? Overall it looks pretty good for I&A research!
There are tons of interesting, nuanced insights in the paper, so here are just some I noted down:
comment in response to
post
So the next challenge is how to define and operationalize "impact": for that, the authors combine a citation-graph analysis with surveys (plus qualitative analyses of both)
--> this is called a mixed-methods analysis in the social sciences
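To make the citation-graph half concrete, here is a toy sketch (not the paper's actual pipeline; the papers and subfield labels below are made up) of counting how often I&A papers get cited from outside the subfield:

```python
# Toy sketch (NOT the paper's actual pipeline): a tiny citation graph where
# each node is a paper tagged with a subfield, and we count how often
# I&A papers are cited by papers from *other* subfields.
import networkx as nx

G = nx.DiGraph()
# made-up papers: id -> subfield
papers = {
    "p1": "interpretability", "p2": "interpretability",
    "p3": "machine translation", "p4": "question answering",
}
G.add_nodes_from((pid, {"field": f}) for pid, f in papers.items())
# edge (a, b) means "a cites b"
G.add_edges_from([("p3", "p1"), ("p4", "p1"), ("p2", "p1"), ("p4", "p2")])

for pid, data in G.nodes(data=True):
    if data["field"] != "interpretability":
        continue
    citers = list(G.predecessors(pid))
    outside = sum(G.nodes[c]["field"] != "interpretability" for c in citers)
    print(f"{pid}: {len(citers)} citations, {outside} from outside I&A")
```

The surveys and qualitative coding are the other half of the mixed-methods design and of course don't reduce to a script like this.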
comment in response to
post
More detailed notes (there was much to think about: every second sentence was probably worth a minute-long ponder):
Arguably the hardest challenge is how to define what "interpretability and analysis" means. They adopt a quite broad definition but do a good job imo, also including some eval work
comment in response to
post
The paper also happens to be written by several of my friends, colleagues, or people whose work I admire:
@mariusmosbach.bsky.social @dippedrusk.com @tomvergara.bsky.social , Dietrich Klakow and @megamor2.bsky.social
It also fits all criteria for a paper I want to feature in this thread:
comment in response to
post
Day 12:
From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
arxiv.org/abs/2406.12618
Genuinely one of my favourite papers in recent years!
It tries to answer a question that every PhD student asks themselves at some point: Does this research matter?
comment in response to
post
I have no clue why it was on hold on arxiv for more than a week; better late than never!
comment in response to
post
Amazing, great news, excited to see your lab develop!
comment in response to
post
But we know from lots of commonsense tasks that LLMs can talk well about visual concepts; it's just hard to say whether they copy this from the training data or have built up some form of visual world model internally
comment in response to
post
There is also some research on a slightly different version of the question you ask here: if you train on text alone, do you learn some vision knowledge implicitly?
Nothing recent comes to mind but this was cool:
openreview.net/forum?id=gJc...
comment in response to
post
this question is very dear to my research heart and I've been following it a lot!
The BabyLM challenge also introduced a multi-modal track 1-2 years ago to see if multi-modal data helps with language acquisition, but again no really positive results came out of it:
babylm.github.io
comment in response to
post
caveat: I am not sure how much Chameleon explicitly discusses this synergy vs. competition, but in a recent talk Luke Zettlemoyer emphasized this (www.youtube.com/watch?v=JYMX...)
comment in response to
post
The hope for a lot of these models where modalities are tightly integrated (often called "natively multi-modal" or "early fusion") is that there is synergy between the modalities, but one might also find that they instead compete inside the model
This paper discusses it:
arxiv.org/abs/2405.09818
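To give a rough idea of what "early fusion" means here (a schematic of the general idea only, not Chameleon's actual implementation): discretized image tokens and text tokens share one vocabulary and one transformer stack, so every weight sees both modalities, which is exactly where synergy or competition can arise:

```python
# Schematic of "early fusion" (not Chameleon's actual code): text tokens and
# discretized image tokens live in one shared vocabulary and are processed
# by a single transformer stack.
import torch
import torch.nn as nn

TEXT_VOCAB = 32000
IMAGE_VOCAB = 8192           # e.g. codes from a VQ image tokenizer
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

embed = nn.Embedding(VOCAB, 512)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=4,
)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 16))
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 64)) + TEXT_VOCAB  # offset into shared vocab
tokens = torch.cat([text_ids, image_ids], dim=1)  # one interleaved sequence

hidden = encoder(embed(tokens))   # every weight sees both modalities
print(hidden.shape)               # torch.Size([1, 80, 512])
```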
comment in response to
post
The other direction is very easy and works very well: improving vision performance with language
See CLIP, where a main goal was to get a better zero-shot image classifier:
arxiv.org/abs/2103.00020
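For anyone who hasn't used it, a minimal zero-shot classification sketch with CLIP via HuggingFace transformers (the model name, image path, and labels are just example choices):

```python
# Minimal zero-shot image classification with CLIP via HuggingFace transformers.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # any local image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bagel"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

probs = outputs.logits_per_image.softmax(dim=-1)   # image-to-text similarity
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```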
comment in response to
post
A lot of people in the vision-and-language community have tried to make this work and it rarely really helps (so lots of negative results don't get published afaik). The earliest work in this direction was Vokenizer:
arxiv.org/abs/2010.06775
comment in response to
post
A recent paper is this:
arxiv.org/abs/2310.13257
comment in response to
post
Somehow the link above is now broken; this should work:
www.youtube.com/watch?v=IeCS...
comment in response to
post
That was 100% my plan when I got one
comment in response to
post
Let's have a call sometime to set you up 😉
comment in response to
post
It reminds me of the clean-up after a big party: you do your own little part, but it feels so much faster because every 10 minutes you turn around to the rest of the room and see that so much has gotten done
comment in response to
post
Looking back at the last weeks of writing, the most satisfying thing was to see everything evolve in parallel: I worked on my own section but saw others grow more coherent every day. Figures popped up, an introduction appeared. Every time I checked the edit history someone had done cool work!
comment in response to
post
The final generation of ASCII frames after the thinking process is still sometimes decent, but the very verbose reasoning did not actually provide the model with drafts to build upon
Overall:
1. R1 slightly better than V3
2. Yet still many failure modes, especially a lack of iterative refinement
comment in response to
post
Finally, we study world modeling, asking the model to e.g. "generate 10 more ASCII frames of 2 balls on a pool table colliding, given an initial frame of the balls on the pool table"
We find that R1 often "gets lost" in symbolic/math reasoning instead of generating actual drafts of the "video"
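A rough sketch of how one might prompt R1 with this kind of task (not the exact evaluation harness from the paper; this assumes DeepSeek's OpenAI-compatible API, the "deepseek-reasoner" model name, and a DEEPSEEK_API_KEY environment variable):

```python
# Rough sketch of prompting R1 with the world-modeling task (not the paper's
# exact setup). Assumes DeepSeek's OpenAI-compatible endpoint and model name.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

initial_frame = "...ASCII frame of 2 balls on a pool table..."  # placeholder
prompt = (
    "Given this initial ASCII frame of 2 balls on a pool table:\n"
    f"{initial_frame}\n"
    "Generate 10 more ASCII frames of the balls colliding."
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # the final frames, after the thinking trace
```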
comment in response to
post
Another finding:
A more complex task does not necessarily lead to longer reasoning chains.
We ask the model to generate single objects (e.g. a dog), and then object compositions (e.g. half-dog, half-shark) as a more complex task.
However, this second task leads to slightly shorter chains!
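For illustration, here's one crude way to compare chain lengths between the two prompt sets (whitespace tokens inside the <think> span as a proxy; the outputs below are placeholders, and the exact measurement in the paper may differ):

```python
# Toy sketch: compare reasoning-chain lengths between "single object" and
# "composition" prompts, using whitespace tokens inside <think>...</think>
# as a crude length proxy.
import re
import statistics

def thinking_length(model_output: str) -> int:
    match = re.search(r"<think>(.*?)</think>", model_output, flags=re.DOTALL)
    return len(match.group(1).split()) if match else 0

single_object_outputs = ["<think> drafting a dog ... </think> final art"]     # placeholders
composition_outputs = ["<think> half-dog half-shark ... </think> final art"]  # placeholders

print("single objects:", statistics.mean(map(thinking_length, single_object_outputs)))
print("compositions:  ", statistics.mean(map(thinking_length, composition_outputs)))
```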
comment in response to
post
We find that it rarely performs proper editing:
R1 often completely discards its initial drafts, starting from scratch over and over. Similarly, the final output after "</think>" might not be faithful to the drafts from the thinking process
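One simple way to quantify this kind of behaviour (an illustrative sketch, not the exact analysis from the paper) is to take the ASCII drafts that appear in the thinking trace and measure how similar consecutive drafts are:

```python
# Illustrative sketch (not the paper's exact analysis): given the ASCII drafts
# that appear in the thinking trace, measure how similar each draft is to the
# previous one. Low similarity between consecutive drafts suggests starting
# from scratch rather than iterative editing.
import difflib

def consecutive_similarities(drafts: list[str]) -> list[float]:
    return [
        difflib.SequenceMatcher(None, a, b).ratio()
        for a, b in zip(drafts, drafts[1:])
    ]

# placeholder drafts, standing in for drafts extracted from "<think>...</think>"
drafts = [" o   o ", "(o) (o)", "( o )   ( o )"]
print(consecutive_similarities(drafts))  # low values => little reuse between drafts
```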
comment in response to
post
For our Visual Reasoning experiments, we ask DeepSeek-R1 to e.g. "draw a detailed ASCII art of a lacrosse stick"
To solve this, we expected R1 to perform some form of image editing: generate an initial draft, find flaws, edit it, and repeat a few times
However we find that ...
2/N
comment in response to
post
Overall this paper covered tons of ground, made me excited to investigate similar questions further, and taught me a lot about how people formally think about embedding spaces and transformer sub-blocks in the LLM era. Great read!
comment in response to
post
After all these experiments to see *what* is roughly going on, the authors turn to the *why/how*
This part is expected to be tricky to answer and some of their answers seem quite vague, but nonetheless some good insights here:
comment in response to
post
They also investigate whether different modalities/tasks use different subsets of weights ("subnetworks") and whether one can prune weights accordingly:
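As a toy illustration of the subnetwork idea (not the paper's method): score each weight's importance per task, keep the top fraction as that task's mask, and compare how much the masks overlap. Everything below is placeholder data on a tiny model:

```python
# Toy illustration of "subnetworks per task" (not the paper's method):
# score each weight's importance per task via |weight * grad|, keep the top
# 20% as that task's mask, then compare mask overlap.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)

def importance_mask(x, y, keep_frac=0.2):
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    scores = (model.weight * model.weight.grad).abs().flatten()
    k = int(keep_frac * scores.numel())
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[scores.topk(k).indices] = True
    return mask

# two "tasks" = two random datasets standing in for e.g. captioning vs. VQA
mask_a = importance_mask(torch.randn(32, 16), torch.randint(0, 4, (32,)))
mask_b = importance_mask(torch.randn(32, 16), torch.randint(0, 4, (32,)))

jaccard = ((mask_a & mask_b).sum() / (mask_a | mask_b).sum()).item()
print(f"mask overlap (Jaccard): {jaccard:.2f}")
```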
comment in response to
post
Playing reviewer here a bit: the paper is in fact so dense with findings that I think it would have benefited from fewer of them:
comment in response to
post
In general this paper has a *lot* of findings and random facts that are good to remember as a VL researcher:
comment in response to
post
For me the main takeaways are that the visual embeddings live in a different "cone" and thus are not similar (measured via cosine sim) to the LLM vocab tokens
This similarity does get larger in later layers, though, and is also highest right after the self-attention block
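The measurement itself is simple to sketch (random tensors below stand in for a real VLM's activations; this is not the paper's exact code): cosine similarity between the visual tokens' hidden states at each layer and the LLM's vocabulary embedding matrix:

```python
# Toy sketch of the measurement (random tensors stand in for a real VLM's
# activations): cosine similarity between visual-token hidden states at each
# layer and the LLM's vocabulary embedding matrix.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, vocab_size, n_visual, n_layers = 256, 1000, 32, 8
vocab_emb = torch.randn(vocab_size, d)
# hidden states of the visual tokens at each layer (placeholder data)
visual_hidden = [torch.randn(n_visual, d) for _ in range(n_layers)]

for layer, h in enumerate(visual_hidden):
    # max cosine similarity of each visual token to any vocab embedding
    sims = F.cosine_similarity(h.unsqueeze(1), vocab_emb.unsqueeze(0), dim=-1)
    print(f"layer {layer}: mean max-sim to vocab = {sims.max(dim=1).values.mean().item():.3f}")
```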