williamheld.com
Modeling Linguistic Variation to expand ownership of NLP tools. Views my own, but affiliations that might influence them: ML PhD Student under Prof. Diyi Yang | 2x RS Intern 🦙 Pretraining | Alum NYU Abu Dhabi | Burqueño | he/him
87 posts 2,137 followers 450 following
comment in response to post
As far as I can tell, the models aren't good enough right now that they can replace VFX at any high quality commercial scale. They are exactly good enough to generate fake viral videos for ad revenue on TikTok/Instagram & spread misinformation. Is there any serious argument for their safe release??
comment in response to post
I don't really see an argument for releasing such models with photorealistic generation capabilities. What valid & frequent business use case is there for photorealistic video & voice generation like Veo 3 offers?
comment in response to post
Now, I wouldn't do research on LLMs if I thought that was true in the long term! But I think it's reasonable for skeptics to question whether advances in inference efficiency, hardware efficiency, and even core energy infrastructure will happen soon enough for current companies to capitalize.
comment in response to post
The underlying assumption being that they can (à la Uber/Lyft) eventually increase prices once the core customers are fundamentally reliant on AI. The real question then is "what is demand once you start charging the true unit costs?". Personally, I found this article sobering but well reasoned.
comment in response to post
Without knowing all the model details or having transparent financials, it's hard to say, but I would naively suspect most AI companies are in the red both on a cost-per-query basis (for API services) and on a cost-per-user basis (for subscription services).
comment in response to post
I haven't seen people mocking the revenue forecasts, but I agree with your take w.r.t. demand. The bigger question is whether demand is even the constraint. Unlike standard software or even manufacturing businesses, I'm not sure the economies of scale look great if you factor in cost per query.
comment in response to post
Given that they published the same work in both the ICLR workshop and ACL... I am skeptical of the claim that "The current version of Zochi represents a substantial advancement over our earlier systems that published workshop papers at ICLR 2025" 😂
comment in response to post
Looks like they simultaneously submitted the same paper to an ICLR workshop: openreview.net/forum?id=rDC...
comment in response to post
Learn more about the project in Percy's blog post: marin.community/blog/2025/05... And about the models we are releasing in @dlwh.bsky.social's training retro: marin.readthedocs.io/en/latest/re...
comment in response to post
Last August, I chatted with @dlwh.bsky.social about the need for an open-source set of scaling law checkpoints! Since then, I was lucky to play a (small) role in building Marin-8B. Check out the model (including intermediate checkpoints) here: huggingface.co/marin-commun...
comment in response to post
We have trained some respectable models from scratch!
- Marin-8B-Base: beats Llama 3.1 8B on 14/19 benchmarks
- Marin-8B-Instruct: try it out on HuggingFace: huggingface.co/spaces/WillH...
comment in response to post
Marin repurposes GitHub, which has been successful for open-source *software*, for AI:
1. Preregister an experiment as a GitHub issue
2. Submit a PR, which implements the experiment in code
3. PR is reviewed by experts in the community
4. Watch the execution of the experiment live!
comment in response to post
Want to add your model to CAVA? If it runs on vLLM, it runs on CAVA - no extra code needed. We’ve open-sourced everything on GitHub: 🔗 github.com/SALT-NLP/CAVA We’re open to collaborations --- test, extend, and help with large audio model evaluation! (5/5)
comment in response to post
Why does CAVA matter? We talked with people who are building voice products and found most benchmarks don't capture their concerns!
→ Which model gives you low-latency conversations?
→ Which model can execute functions to go beyond chat?
→ Which model is the easiest to adjust and improve via prompts?
comment in response to post
Results? We tested:
✅ GPT-4o (end-to-end audio)
✅ GPT pipeline (transcribe + text + TTS)
✅ Gemini 2.0 Flash
✅ Gemini 2.5 Pro
We find GPT-4o shines on latency & tone while Gemini 2.5 leads in safety & prompt adherence. No model wins everything. (3/5)
comment in response to post
Most benchmarks test either core chat or broader audio analysis abilities. But voice assistants need to handle turn-taking, interpret tone, execute tasks via function calls, and respect instructions and safety constraints—all in real time. CAVA tests models on each of these capabilities. (2/5)
comment in response to post
AxBench makes the argument that most of the excitement around SAEs for steering lacked systematic evals, which overhyped their effectiveness. This is echoed by Google moving away from them after negative results in more systematic evals: www.lesswrong.com/posts/4uXCAJ...
comment in response to post
FWIW, this line of research seems to have largely been shown to be ineffective for model steering in practice! arxiv.org/abs/2501.17148 from @aryaman.io is my reference but several others have shown similar results!
comment in response to post
Code if you are interested in running your own Claude Realtime Voice: github.com/Helw150/mcp-... Mostly dead simple logic, but requires launching Claude from the terminal because the Claude Desktop app won't request microphone permissions otherwise!
comment in response to post
aclanthology.org/2023.acl-lon... is a great interview study!
comment in response to post
As always, we open source everything. Even our nicely made website: egonormia.org Please check out the leaderboard, the blog (w/Bibtex support), the code, data, as well as a data viewer.
comment in response to post
Now most *urgently*: we review the history of these models. A straight line can be traced to modern AI from basic science. Not in engineering but in the cognitive science of language. Much of it funded by NSF, whose funding has now been paused. www.goldengooseaward.org/01awardees/pdp