williamheld.com
Modeling Linguistic Variation to expand ownership of NLP tools
Views my own, but affiliations that might influence them:
ML PhD Student under Prof. Diyi Yang
2x RS Intern🦙 Pretraining
Alum NYU Abu Dhabi
Burqueño
he/him
comment in response to
post
As far as I can tell, the models aren't good enough right now to replace VFX at any high-quality commercial scale.
They are exactly good enough to generate fake viral videos for ad revenue on TikTok/Instagram & spread misinformation. Is there any serious argument for their safe release??
comment in response to
post
I don't really see an argument for releasing such models with photorealistic generation capabilities.
What valid & frequent business use case is there for photorealistic video & voice generation like Veo 3 offers?
comment in response to
post
Now, I wouldn't do research on LLMs if I thought that was true in the long term!
But I think it's reasonable for skeptics to question whether advances in inference efficiency, hardware efficiency, and even core energy infrastructure will happen soon enough for current companies to capitalize.
comment in response to
post
The underlying assumption being that they can (a la Uber/Lyft) eventually increase prices once the core customers are fundamentally reliant on AI.
The real question then is "what is demand once you start charging the true unit costs?". Personally, I found this article sobering but well reasoned.
comment in response to
post
Without knowing the model details or having transparent financials, it's hard to say, but I would naively suspect most AI companies are in the red both on a cost-per-query basis (for API services) and on a cost-per-user basis (for subscription services).
comment in response to
post
I haven't seen people mocking the revenue forecasts, but I agree with your take w.r.t. demand. The bigger question is whether demand is even the constraint.
Unlike standard software or even manufacturing businesses, I'm not sure the economies of scale look great if you factor in cost per query.
comment in response to
post
Given that they published the same work in both the ICLR workshop and ACL... I am skeptical of the claim that "The current version of Zochi represents a substantial advancement over our earlier systems that published workshop papers at ICLR 2025" 😂
comment in response to
post
Looks like they simultaneously submitted the same paper to an ICLR workshop: openreview.net/forum?id=rDC...
comment in response to
post
Learn more about the project in Percy's blog post: marin.community/blog/2025/05...
And about the models we are releasing in @dlwh.bsky.social's training retro: marin.readthedocs.io/en/latest/re...
comment in response to
post
Last August, I chatted with @dlwh.bsky.social about the need for an open-source set of scaling law checkpoints!
Since then, I was lucky to play a (small) role in building Marin-8B. Check out the model (including intermediate checkpoints) here:
huggingface.co/marin-commun...
comment in response to
post
We have trained some respectable models from scratch!
- Marin-8B-Base: beats Llama 3.1 8B on 14/19 benchmarks
- Marin-8B-Instruct: try it out on HuggingFace: huggingface.co/spaces/WillH...
comment in response to
post
Marin repurposes GitHub, which has been successful for open-source *software*, for AI:
1. Preregister an experiment as a GitHub issue
2. Submit a PR, which implements the experiment in code
3. PR is reviewed by experts in the community
4. Watch the execution of the experiment live!
comment in response to
post
Want to add your model to CAVA? If it runs on vLLM, it runs on CAVA - no extra code needed (see the sketch below).
We’ve open-sourced everything on GitHub:
🔗 github.com/SALT-NLP/CAVA
We’re open to collaborations --- test, extend, and help with large audio model evaluation! (5/5)
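As a rough illustration of what "runs on vLLM" means above (a minimal sketch; the model name is just a placeholder, not a CAVA requirement):

```python
from vllm import LLM, SamplingParams

# If a model loads and generates through vLLM like this,
# CAVA can evaluate it without extra integration code.
llm = LLM(model="your-org/your-model")  # placeholder model id
outputs = llm.generate(["Say hello."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```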
comment in response to
post
Why does CAVA matter?
We talked with people who are building voice products and found most benchmarks don't capture their concerns!
→ Which model gives you low-latency conversations?
→ Which model can execute functions to go beyond chat?
→ Which model is the easiest to adjust and improve via prompts?
comment in response to
post
Results?
We tested
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS; sketched below)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence.
No model wins everything. (3/5)
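For context, the "GPT pipeline" baseline above is the classic cascade rather than an end-to-end audio model. A minimal sketch with the OpenAI SDK (file names and model choices are just examples, not CAVA's exact configuration):

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text: transcribe the user's audio turn.
with open("user_turn.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Text -> text: generate the assistant's reply.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3. Text -> speech: synthesize the reply for playback.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
with open("assistant_turn.mp3", "wb") as out:
    out.write(speech.content)
```

Each hop in the cascade adds latency, which is part of why the end-to-end audio path compares favorably on the latency axis.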
comment in response to
post
Most benchmarks test either core chat or broader audio analysis abilities.
But voice assistants need to handle turn-taking, interpret tone, execute tasks via function calls, and respect instructions and safety constraints—all in real-time.
CAVA tests models on each of these capabilities. (2/5)
comment in response to
post
AxBench makes the argument that most of the excitement around SAEs for steering lacked systematic evals, which overhyped their effectiveness.
This is echoed by Google moving away from them after negative results in more systematic evals: www.lesswrong.com/posts/4uXCAJ...
comment in response to
post
FWIW, this line of research seems to have largely been shown to be ineffective for model steering in practice!
arxiv.org/abs/2501.17148 from @aryaman.io is my reference but several others have shown similar results!
comment in response to
post
Code if you are interested in running your own Claude Realtime Voice: github.com/Helw150/mcp-...
Mostly dead simple logic, but requires launching Claude from the terminal because the Claude Desktop app won't request microphone permissions otherwise!
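Purely as a hypothetical sketch of why microphone access matters here (this is not the code in the linked repo; it assumes the official MCP Python SDK and the SpeechRecognition library), an MCP tool that listens on the mic might look like:

```python
import speech_recognition as sr
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("realtime-voice")

@mcp.tool()
def listen() -> str:
    """Record one utterance from the microphone and return a transcript."""
    recognizer = sr.Recognizer()
    # Opening the microphone is the step that breaks if the host process
    # (here, Claude Desktop) never requested mic permission.
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio)

if __name__ == "__main__":
    mcp.run()
```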
comment in response to
post
aclanthology.org/2023.acl-lon... is a great interview study!
comment in response to
post
As always, we open source everything. Even our nicely made website: egonormia.org. Please check out the leaderboard, the blog (w/ BibTeX support), the code, the data, as well as the data viewer.
comment in response to
post
Now, most *urgently*: we review the history of these models. A straight line can be traced to modern AI from basic science, not in engineering but in the cognitive science of language. Much of it was funded by the NSF, whose funding has now been paused. www.goldengooseaward.org/01awardees/pdp