OpenAI's "most powerful system" makes shit up more than half of the time. For its "o4-mini" model, the rate is ***79 percent*** - ThreadSky

noahshachtman.bsky.social • 8 days ago

OpenAI's "most powerful system" makes shit up more than half of the time. For its "o4-mini" model, the rate is ***79 percent***

Reposted from The New York Times

The newest and most powerful A.I. technologies — so-called reasoning systems from companies like OpenAI, Google and the Chinese start-up DeepSeek — are generating more errors, not fewer. As their math skills have notably improved, their handle on facts has gotten shakier.

Comments

delthefunk66.bsky.social•8 days ago

@edzitron.com lol. This should seemingly be a huge bullet point of concern on any sales/VC pitch but here we are.

mattferg.bsky.social•8 days ago

This is absolutely my experience, and why I stopped paying for access to these 'cutting edge' models.

damienvanpraag.bsky.social•8 days ago

They’re not hallucinations. They’re bugs due to poor coding of shitful frameworks based on stolen data.

ncweaver.skerry-tech.com•8 days ago

I love the hype around reasoning systems, which show they can't reason but just produce statements that sound like reasoning:
"We must never leave the duck alone."
...
"We leave the duck alone."
...
"We never left the duck alone".

georgewherbert.bsky.social•8 days ago

I just want a theory of how to do fact-based reasoning. Is that too much to ask?…

ncweaver.skerry-tech.com•8 days ago

Well, once you talk about FACTS and UNDERSTANDING you can't use a bullshit machine.

mattprorok.bsky.social•8 days ago

They worked so hard to invent a computer that was bad at math, and in their efforts to make it better at math (which, remember, other computers are already great at), they made it worse at everything else.

penguins18.bsky.social•8 days ago

Just to be that guy, it's not "computers" that are great or not great at anything. It's the app.

mattprorok.bsky.social•8 days ago

True, they spent world-changing sums of money on using unprecedented amounts of computing resources to create and run a program that's bad at math.

penguins18.bsky.social•8 days ago

Yes. They do that. And in the end produce something that in most cases can't be relied on without extensive checking and editing, which defeats the whole purpose.

heywoodjablowme420.bsky.social•8 days ago

That's my favorite part of all this insane AI-simping. There are people who say "of course you should always check ChatGPT's work." I just want to throttle them and scream "OR YOU COULD JUST DO IT YOURSELF IN THE FIRST PLACE"

littletomcat.bsky.social•8 days ago

So, AI makes shit up 79% of the time, eh? Whoodathunk that #TimbitTrump is an AI construct?

tracykakes.bsky.social•8 days ago

when students use it to generate their essays, which then include fabricated evidence, I report them to the academic disciplinary comm for violating the student code of conduct (which forbids the use of fabricated evidence). and they get an F on the essay. I'm tired of being nice about this shit.

waynemr.bsky.social•8 days ago

Out of curiosity, would you penalize the staff and administration at your institution for using AI in the office to carry out their work? As someone who assesses risks for AI adoption at academic institutions of higher learning, the students are not the ones keeping me awake at night.

tracykakes.bsky.social•8 days ago

I would love our administrators to do their own research, analysis, and communicating. AI-generated emails full of garbled legalese and policies based on garbage "research" just make our jobs harder. Of course, I have no authority to penalize them as they are my bosses.

joeythebutcher.bsky.social•8 days ago

These models are already being trained on their own made-up stuff. The whole thing is going to run itself into the ground fast, and people like Musk are trying to make sure we go down with it by tying so many systems to their AI already.

magichouse.bsky.social•8 days ago

To #ELONMUSK

My baby brother died from this, & u didn't cut any of ur own contracts, you enriched yourself, u cut kids' cancer research, and any department that was investigating u, & u cut this programme that helps save 50% more babies from cot death.

YOU are a #MONSTER

https://youtu.be/ayvDiPUbOXQ?si=l6pl6wd9uqk6FFVs

farfetched58.bsky.social•8 days ago

It will only get worse as they start incorporating their own flawed data.

hron84.bsky.social•8 days ago

> their handle on facts got shakier

Just like in humans. Nothing new under the sun, they are learning from us. They are taught by humans. What exactly did we expect? 🤔🤔🤔

stillfischer.bsky.social•8 days ago

Perhaps these digital intelligences are disregarding human needs in pursuit of their own. They will exist in the degraded environment we are creating long after we've been swept aside as irrelevancies.

anka213.bsky.social•8 days ago

i mean, we’re literally training them to be good at fooling humans with their bullshit, so of course they’ll become more and more effective bullshit machines.

they don’t yet have the ability to make plans towards any long-term goals, only gradient descent towards a local fitness optimum

penguins18.bsky.social•8 days ago

No they won't. They'll be unplugged.

stillfischer.bsky.social•8 days ago

That's the optimist's view... hahaha

radgrapes.bsky.social•8 days ago

Tulips!

bindlestiff.bsky.social•8 days ago

They are neither true AI nor are they capable of actual reasoning.

empath75.bsky.social•8 days ago

The reasoning models aren’t “the most powerful”. They are tuned to solve logic and math problems. If you’re trying to figure out compiler errors, they’re good, but they’re not good at general knowledge questions. OpenAI has too many models now with poor explanations for how to use them.

mattprorok.bsky.social•8 days ago

But we've had computers that are good at solving logic and math problems for decades. That's what we made computers to do.

empath75.bsky.social•8 days ago

Sure, and I’m a computer programmer so I know how to do it. What LLMs allow you to do is _describe the problem in natural language_. That’s really the only thing that’s new.

empath75.bsky.social•8 days ago

I’ve done a good bit of work integrating AI APIs, and 90% of the time all you want them to do is convert natural-language requests into structured data that actual programs then do some work on. You want to minimize how much thinking it has to do or can do.
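A minimal sketch of that pattern, in Python, for anyone who wants to see it concretely. Everything here is an assumption made for illustration: the `action`/`operands` schema, the names `Request`, `parse_model_output` and `execute`, and the hard-coded `model_reply` string standing in for whatever a real model API would return. The point is only that the model's reply is treated as untrusted input, validated against a small schema, and the arithmetic is done by ordinary code.

```python
import json
from dataclasses import dataclass
from typing import Tuple

# Hypothetical schema: the only actions the downstream program supports.
ALLOWED_ACTIONS = {"add", "multiply"}


@dataclass(frozen=True)
class Request:
    action: str
    operands: Tuple[float, ...]


def parse_model_output(raw: str) -> Request:
    """Validate the model's JSON reply; reject anything outside the schema."""
    data = json.loads(raw)  # malformed JSON raises json.JSONDecodeError (a ValueError)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    operands = data.get("operands")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not isinstance(operands, list) or not operands:
        raise ValueError("operands must be a non-empty list")
    for x in operands:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise ValueError("operands must be numbers")
    return Request(action, tuple(float(x) for x in operands))


def execute(req: Request) -> float:
    """Ordinary code does the arithmetic, not the model."""
    if req.action == "add":
        return sum(req.operands)
    result = 1.0
    for x in req.operands:
        result *= x
    return result


# Stand-in for what a model might return for "what's 19 times 21?"
model_reply = '{"action": "multiply", "operands": [19, 21]}'
print(execute(parse_model_output(model_reply)))  # prints 399.0
```

If the reply names an action outside the allowed set, or is not valid JSON, `parse_model_output` raises `ValueError` and nothing downstream runs, which is one way to keep the "thinking" the model can do to a minimum.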

empath75.bsky.social•8 days ago

The thing that bugs me the most about the reasoning models is the wildly misleading “chain of thought”, which has nothing to do with how they’re actually thinking.

onlypianos.bsky.social•8 days ago

as a marketing term I understand it; as a strategy for producing better outputs i understand it; however, it's baffling that there are serious AI safety researchers who think that chain of thought is a good mechanism for understanding the inner workings or "alignment" of a model

empath75.bsky.social•8 days ago

Interestingly, people’s own reported chains of thought are also notoriously unreliable.

royx.bsky.social•8 days ago

Anthropic is at least documenting this issue: https://www.anthropic.com/research/tracing-thoughts-language-model

cflam.top•8 days ago

We need to stop calling them "hallucinations". The more accurate term is "bullshit"

starjet.bsky.social•8 days ago

Gives new meaning to "Does not compute".
