shahanmemon.bsky.social
Researching {science of science #SciSci, #AI4Science, computational social science, generative #AI, LLMs, agents, alignment, misinformation in science}
PhD @ UW.
Visiting scholar @ NYU.
Alum @ Carnegie Mellon
Academic webpage: https://samemon.github.io
547 posts
2,775 followers
1,203 following
Regular Contributor
Active Commenter
comment in response to
post
That option has been available for some time if I am not wrong.
comment in response to
post
I think there might be some confusion. They are not revealing the identity of the reviewers, just the reviewer reports.
comment in response to
post
Yeah, the eLife model is interesting. I have been ambivalent about it too. That said, I do appreciate publishers' and journals' willingness to experiment with new models. I feel like the way we review papers is quite archaic. I attended ICSSI this year and can tell that many researchers echo this thought.
comment in response to
post
So I am guessing this was an informed choice. The reviewer identity still remains anonymous.
comment in response to
post
What do you mean by “processed”? I am guessing this wasn’t an ad hoc step on Nature’s part. Since 2020 or so, they have been giving authors the choice to publish reports.
Nature communications took this step of making all reports public three years ago (www.nature.com/articles/s41...).
comment in response to
post
Because a calculator is a device made for a specific purpose, our trust in it would be immediately corrupted if it were to give an incorrect answer. It would be useless.
If you take away the specific purpose you also take away the criteria to determine if something works or not.
comment in response to
post
That said, the evidence seems mixed. See for example this: arxiv.org/pdf/2305.13534 and this: arxiv.org/abs/2505.236... and this: arxiv.org/abs/2505.21523
According to the system card, o3 and o4 seem to hallucinate much more than o1 🤷♂️
comment in response to
post
CoT + RAG could potentially be helpful. See for example this recent preprint: arxiv.org/pdf/2505.09031, where RAG+CoT performs better than either alone, though each alone also seems better than the base model. An earlier ACL paper points to the same re: CoT+RAG: aclanthology.org/2023.acl-lon...
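For intuition, here is a minimal, purely illustrative Python sketch of what combining retrieval with a chain-of-thought prompt can look like. The toy retriever, corpus, and prompt template are my own stand-ins, not the setup used in either paper.

```python
# Minimal sketch of RAG + chain-of-thought prompting.
# The retriever and corpus below are toy stand-ins, not any specific system's API.

def retrieve(query, corpus, k=2):
    """Toy retriever: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, passages):
    """RAG + CoT: ground the model in retrieved passages, then ask it to reason step by step."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Think step by step using only the context above, then give the final answer."
    )

corpus = [
    "Retrieval-augmented generation grounds answers in retrieved documents.",
    "Chain-of-thought prompting elicits intermediate reasoning steps.",
]
print(build_prompt("Does retrieval reduce hallucinations?",
                   retrieve("retrieval hallucinations", corpus)))
```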
comment in response to
post
Paradoxically, had the study not attracted so much attention, it likely would not have been retracted. Yet this shows the need for more responsible norms and systems for engaging with preprints, especially in fast-moving, hype-driven fields like #AI, where the stakes are exceptionally high.
🧵 4/4
comment in response to
post
When that foundation turns out to be misleading, we are left with wasted engagement and a long trail of cleanup in an already strained system. The information ecosystem is affected as well.
Attention, credibility, and labor were all spent on something that should not have commanded it.
🧵 3/4
comment in response to
post
It was covered by 10s of news outlets & has been cited 50 times across working papers, published articles, policy reports, and a dissertation. Many of these cited it as evidence.
It shows how deeply a paper can become entangled with science and public discourse before formal publication.
🧵 2/4
comment in response to
post
An earlier blog: thebsdetector.substack.com/p/ai-materia...
comment in response to
post
In a way, yes. DR is not a single shared MHA network but a system/agent workflow of multiple components that may themselves be based on one.
As for validation, you may be right; I am not sure. They may be fine-tuned models, sometimes using code interpreters, but it may just be LLMs asking LLMs.
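To make the "workflow of multiple components" point concrete, here is a hypothetical Python sketch of a Deep-Research-style pipeline. Every function below is a made-up stub (in a real system each could be an LLM call or a search tool); this is not OpenAI's actual implementation.

```python
# Illustrative agent workflow: plan -> search -> summarize -> synthesize.
# All functions are hypothetical stubs standing in for LLM calls or tools.

def plan(question):
    """Break the question into sub-queries (in a real system, an LLM call)."""
    return [f"{question} -- background", f"{question} -- recent evidence"]

def search(sub_query):
    """Fetch sources for a sub-query (in a real system, a web-search tool)."""
    return [f"stub result for: {sub_query}"]

def summarize(results):
    """Condense retrieved sources (another possible LLM call)."""
    return " | ".join(results)

def synthesize(question, summaries):
    """Compose the final report from the per-sub-query summaries."""
    return f"Report on '{question}':\n" + "\n".join(summaries)

question = "Do reasoning models hallucinate more?"
summaries = [summarize(search(q)) for q in plan(question)]
print(synthesize(question, summaries))
```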
comment in response to
post
And this thread might be useful too.
threadreaderapp.com/thread/18872...
comment in response to
post
Actually, it's a bit different from an LLM. These screenshots are from a thread on Twitter that I have found useful in understanding DeepResearch and its failure modes.
comment in response to
post
Though even that is not impervious to hallucinations.
comment in response to
post
As for knowledge graphs, they may not be enough, no? Or even feasible in many cases. Take temporality, for example: facts change, relevance shifts, and context matters. Isn't that one reason RAG is somewhat superior, i.e., it brings in up-to-date, contextual info at the time the model needs it?
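A toy Python illustration of the temporality point, under my own made-up "facts" and dates: a static knowledge graph freezes whatever was true at build time, while retrieval at query time can prefer the most recent statement.

```python
# Toy contrast: static knowledge-graph fact vs. retrieval at query time.
# The documents and dates below are invented purely for illustration.

from datetime import date

corpus = [
    {"date": date(2021, 3, 1), "text": "Policy X is in draft form."},
    {"date": date(2024, 6, 1), "text": "Policy X has been adopted and revised."},
]

static_kg_fact = corpus[0]["text"]  # whatever was true when the graph was built

def retrieve_latest(docs):
    """Retrieval step: pick the most recent relevant document at query time."""
    return max(docs, key=lambda d: d["date"])["text"]

print("Knowledge graph says:", static_kg_fact)
print("Retrieved at query time:", retrieve_latest(corpus))
```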
comment in response to
post
That's an interesting thought, i.e., it hallucinates not randomly but in service of its internal coherence.
So do smarter people "make more claims overall" as well, as highlighted in the evaluation doc as one of the reasons :p?
comment in response to
post
mashable.com/article/open...
"OpenAI doesn't know the underlying cause"
comment in response to
post
🤦♂️
comment in response to
post
That said, I have seen Deep Research produce fewer hallucinations than other models. So chain-of-thought + access to the web (an action space) could potentially help.
Though DeepResearch has other issues (it is not quite useful for "deep" research).
comment in response to
post
more..
comment in response to
post
There is empirical evidence that it does.
But I still would not trust it. Today I searched for something, and within its thought process, it looked for a paper that did not exist (or at least I could not find that paper; see screenshot).
comment in response to
post
It does not “know” what it does not know… 🤷
Some models (like DeepResearch) have somewhat better guardrails in place that avoid hallucinations to some extent.
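As a guess at what such a guardrail might look like, here is a short Python sketch of one plausible check: verify every reference the model cites against a trusted index and flag anything unverifiable. The index and DOIs are hypothetical; this is not how any particular product actually works.

```python
# Hedged sketch of a possible citation-verification guardrail.
# The trusted index and DOIs are invented for illustration only.

trusted_index = {
    "10.1000/real.paper.2023",  # hypothetical DOI standing in for a real database entry
}

def verify_citations(cited_dois):
    """Split cited DOIs into verified and unverifiable ones."""
    verified = [d for d in cited_dois if d in trusted_index]
    unverifiable = [d for d in cited_dois if d not in trusted_index]
    return verified, unverifiable

verified, suspect = verify_citations(["10.1000/real.paper.2023", "10.1000/made.up.paper"])
print("Keep:", verified)
print("Flag as possibly hallucinated:", suspect)
```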
comment in response to
post
OSPI’s human-centered AI #Ethics guidance link: ospi.k12.wa.us/student-succ...
Fun fact: the state of #Washington is one of the first five states to draft ethics guidance around AI usage in the classroom, and the only state to have revised it twice already.
#AIEducation #AI #Education
3/3
comment in response to
post
One key issue we raised in the webinar is the growing misalignment between students and educators around what counts as acceptable #AI use in the classroom. Without shared norms or clarity, this tension creates confusion, inconsistent enforcement, and lost opportunities for meaningful learning.
2/3
comment in response to
post
Case in point: the increasing prevalence of puzzle-solving & mop-up work around “Can GPT do X?”. Has a “general purpose” instrument ever in history become such a widespread object of study, disrupting what gets studied, who gets to study it, and how?
#ScAISci #SciSci #ScienceOfAIMediatedScience #AI4Science
comment in response to
post
I sometimes think large empirical papers are like magic. From the outside, it's easier to imagine that the authors are magicians than that they actually slogged through all the steps their work seems to imply are necessary. This work wasn't magic, it was just hard work. 10/
comment in response to
post
In a very early writing phase. Will definitely share to get your feedback :)
comment in response to
post
The font is Minion 3, I think. Since you tagged Professor Crockett, here's one of the chapters inspired by their writing.