Today OpenAI announced o3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks.
Comments
Oh, cool. That seems like the sensible way to do it, since I'm guessing image analysis still has the familiar limitations and JSON just makes the structure less clear. Although I don't know if that would be a problem for it anymore. It would be for a human.
Thanks for posting the cost efficiency numbers. I'm not seeing anyone talking about the fact that o3 Low costs an OoM more than o1, and o3 High costs three OoMs more. Yes, this is an impressive feat, but I'm beginning to doubt OpenAI's commitment to "intelligence too cheap to meter."
I'm not so cynical as to think that they did it just for a particular benchmark, but yes, I'm assuming it is a tuning of GPT-4, not based on an entirely new pre-training run. My guess is that they have learned some tricks to get more out of increased test-time compute.
Same. Additionally, it is possible that, scraping hard, you might have accidentally trained on some quiz which has some overlap with ARC. I don't know how data scraping works at the scale of OpenAI, but I suspect no one can control it that well.
It scores 75.7% on the semi-private eval in low-compute mode (at $20 per task in compute) and 87.5% in high-compute mode (thousands of $ per task). It's very expensive, but it's not just brute force -- these capabilities are new territory, and they demand serious scientific attention.
The website is unreadable, sorry. If you have something to say to humans, post black-on-white static text on a web page.
And I'm gravely serious, get real.
Thanks for the writeup! Is there more you can say about the "tuning" of o3? Was it specifically fine-tuned on the public dataset, or was the public dataset just part of the training corpus? I guess the line here is a bit blurry.
Sooo they win the grand prize just 10 days before the end of the year, yeah? 88% > 85% :(. Will they be getting the award money? If so, that’s a pretty damning moment for the future of FOSS… really, regardless.
Also, this makes it even more ludicrous that OpenAI is still denying they’ve passed AGI, so as not to trigger the doomsday clause (“give away everything”) in their articles of incorporation. This benchmark was specifically designed to test general human reasoning…
ARC isn't bulletproof. You can look at the public set, then put 12 people in a room for a few weeks, and they can produce tests which, most likely, will end up having some overlap with the private test set they have not seen. I still don't see how DL can "reason" outside the training data.
What's the expected score of o3 on ARC-AGI-2? Will you be adapting v2 in light of its performance? Tbh, I suspect OpenAI might be targeting these kinds of benchmarks on purpose, especially since every new GPT version fares worse than its predecessor on some tasks, like GPT-4 Turbo for instance.
Anyway, looking forward to reading your full analysis of o3's performance! I realise my previous comment was mean-spirited and presumptuous towards OpenAI, so I'm genuinely interested in your opinion about this! Have a good day/evening.
Can you please clarify the basis for the "retail price" column? Is it a) the OpenAI API cost (something like o1's $60 per million tokens), or b) raw GPU costs?
The math works out for a): $60 per million × 55k tokens per sample × 6 samples per task = $19.80 per task.
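(If anyone wants to check that arithmetic, here's a minimal sketch in Python; the $60/M pricing, 55k tokens per sample, and 6 samples per task are the figures quoted above, not numbers I can verify independently.)

```python
# Back-of-the-envelope per-task cost under option (a).
# All inputs are assumed from the figures quoted in this thread.
price_per_million = 60        # $ per million tokens (o1-style API pricing)
tokens_per_sample = 55_000    # tokens generated per sample
samples_per_task = 6          # samples per task in low-compute mode

cost = price_per_million / 1_000_000 * tokens_per_sample * samples_per_task
print(f"${cost:.2f} per task")  # -> $19.80 per task
```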
Were you given access to the Chain-of-Thought transcripts from o3 running on ARC-AGI? Would be really interesting to see what type of reasoning it's doing.
That's why it's only semi-private. That set is not publicly exposed, but whoever is monitoring their API calls can harvest the tests, and then it's no longer private.
I think the private set is only run on the submissions which are open source.
It sounds like you actually believe that a probabilistic approach to AI is eventually going to achieve acceptable levels of accuracy / reliability, which is irrational nonsense.
This chart is basically saying that the o3 models fine-tuned on "ARC-AGI" problems do better at answering "ARC-AGI" problems than older models without that fine-tuning. Is that correct?
It’s truly amazing that it took 4 years to get to 5%, and then one more year to get to 87.5%.
I think o1 most probably helped OpenAI create a high-quality synthetic reasoning dataset, similar to ARC-AGI tasks, to get this huge jump from o1 to o3. Either way, it has completely changed my perspective on test-time scaling.
A smaller open-source model running at less than $0.10 per task managed 56% on ARC-AGI. o3 used 30,000x as much compute to get 88%. I wouldn't be surprised if it used similar methods, with the difference being compute. OpenAI did train the model for this domain.
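(A rough sketch of where a 30,000x multiplier could come from; the ~$3k/task figure for o3 high-compute is my assumption, since the writeup only says "thousands of $ per task".)

```python
# Compute-cost ratio behind the "30,000x" claim (assumed figures).
open_model_cost_per_task = 0.10  # < $0.10/task for the smaller open-source model
o3_high_cost_per_task = 3_000.0  # assumed ~$3k/task ("thousands of $ per task")

print(f"{o3_high_cost_per_task / open_model_cost_per_task:,.0f}x")  # -> 30,000x
```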
No one knows for sure! On the ARC benchmark results they mentioned that they used a “tuned” version of o3, which probably means they fine-tuned it on the ARC training tasks and used o1 to generate synthetic data similar to the ARC dataset.
I may be wrong, but isn't this a misuse of the word "reasoning"? Reasoning means *understanding* and deduction from understanding. There remains no actual understanding.
How do you know that? I mean, do we really have a philosophically rigorous definition of “understanding” and “reasoning” such that we can definitely say these models have neither?
I may be wrong, but I think so, yes, and unequivocally. If I build a Lego machine which constructs Lego machines, we could see how it was working and why. We wouldn't think reasoning was involved. With what is now termed AI, how it works is known, and it is not reasoning, any more than a Lego machine is.
I don’t think there’s a slam-dunk argument either way (and I used to lean much harder towards your position when I studied philosophy of mind in undergrad), but I don’t think intuitions about simple Lego machines are sufficient to tell us about neural nets with billions of weights.
But this seems like a combination of two things: that these machines are producing behaviour which is thought-like rather than mechanical-like, and that how they work is unknown. I would say, though, that how they work is known in the larger sense (or indeed, how could they have been constructed?).
https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi
I have no trouble imagining them targeting specific benchmarks.
Until the model can tune itself to arbitrary tasks in real time, that shouldn't count.
Can you clarify what exactly the “samples” are? Are they basically N-shot, or are they more like a “max depth” for the generated CoT?
Hoping it’s not N-shot; that would make these results less impressive to me.
In your understanding, is o3 better understood as a model, or as an agent that includes a model?
Ugh. We’re so fucked 🥲
The data you shared seems to suggest it is brute force, given that for 11.8% better accuracy you need to spend at least 50x more on compute.
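(Making those numbers explicit; the exact high-compute cost isn't published, so I'm taking $1,000/task as a conservative floor for "thousands of $ per task".)

```python
# Accuracy gained vs. extra compute spent, using the writeup's figures.
low_score, high_score = 75.7, 87.5  # semi-private eval, low vs. high compute
low_cost = 20                       # $/task, low-compute mode
high_cost_floor = 1_000             # $/task, assumed floor for high-compute mode

print(f"+{high_score - low_score:.1f} points")             # -> +11.8 points
print(f">= {high_cost_floor / low_cost:.0f}x more spend")  # -> >= 50x
```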
Clearly things are going unpredictably at best, and faster than you thought.
https://arcprize.org/
But beyond that: yes, maybe a sufficiently complicated machine can reason. Our brains are extremely complicated biological machines, and we can reason.