Someone favorited this post, which was one year ago this week. Let’s see if ChatGPT has improved since then! 1/2 - ThreadSky

pbump.com • 47 days ago

Someone favorited this post, which was one year ago this week. Let’s see if ChatGPT has improved since then!
1/2

Reposted from Philip Bump

The boys and I figured we’d ask ChatGPT to help us solve a clue from the Times crossword.

Comments

mbandman.bsky.social•47 days ago

Here’s how Perplexity answered it. Is it learning from ChatGPT?

mbandman.bsky.social•47 days ago

BUT, it got it on its second try. Maybe it’s following @pbump.com .

thomasguycott.bsky.social•47 days ago

Well, we are reaching the point where LLMs are being force-fed their own shit, aren’t we?

danifroom.bsky.social•47 days ago

Eating some eraw and a wild ware stole it all

johnmulloy.bsky.social•47 days ago

Go hang a salami, I'm a lasagna hog is the same frontwards as backwards

slipkin.bsky.social•47 days ago

ChatGPT is eating crow which, when reversed, is worc, a plant.

shadrachnorth.bsky.social•46 days ago

True fact: coats have the same effect on my taco as catnip does on cats. Weird.

kevinmac27.bsky.social•47 days ago

It's not that it couldn't come up with the answer. It's the way it filibusters and bullshits.

terraistrying.bsky.social•47 days ago

📌

dfeldman.org•47 days ago

The o1 model does get this one correct

dfeldman.org•47 days ago

Also keep in mind these LLMs do not think in terms of letters at all. It’s completely token based. There are lots of real things they get wrong, but letter-based questions are not very interesting because it’s playing those with a massive handicap.

dfeldman.org•47 days ago

Kind of like being annoyed that your calculator can’t check your spelling. It only knows like three words and one of them is 58008.

trillymopena.bsky.social•47 days ago

Yeah, I mean, I wasn’t around when Texas Instruments was starting out but I’m pretty confident their advertising copy was a bit more restrained in its claims than the AI industry

cameowood.com•47 days ago

O1-mini also gets it.

hyperfekt.net•46 days ago

Interestingly, the latest models from both Anthropic and OpenAI answer this correctly - as chain-of-thought inevitably becomes a more common architecture I suspect it will not pose much of a problem in the future.
Still, the failure mode is illustrative of how LLMs work.

pbump.com•47 days ago

Welp.
2/2

thegalaxykat.bsky.social•46 days ago

I’m always entertained by the flow of “oh, my bad! I was so wrong, and I see that now. How silly of me! Anyway here’s another blatantly wrong answer I’m going to deliver just as confidently”

tehzachatak.bsky.social•47 days ago

I pushed it to keep going further and it is having a complete meltdown

peter-butler.bsky.social•47 days ago

I'm surprised it doesn't have the answer just straight in its model (like 2+2)

It's a fairly common clue (psst, it's DEER)

tawni.bsky.social•47 days ago

It couldn't even spell a 4 letter word backwards properly. 🙄
But AI though! 🤩
🙄🙄🙄

nafnlaus.bsky.social•46 days ago

You should be using o1 for these sorts of tasks.

I mean, spelling problems in general for llms are kind of like asking a blind person about colours (they don't "see" letters), but at least give it a fair shot.

Also, what is the answer to this riddle?

jowilliams.bsky.social•45 days ago

It's more that there are still people discovering that just because LLMs are trained to produce self-confident-sounding text doesn't mean they actually do reasoning.

jowilliams.bsky.social•45 days ago

It uncovers our own biases too, like assuming that someone who went to a posh school and speaks "well" is also able to solve problems. There's a LOT of that in Britain.

jowilliams.bsky.social•45 days ago

I wish these tools were being promoted alongside more training on what they are capable of. People are so easily fooled by the language fluency.

jowilliams.bsky.social•45 days ago

That applies to either of the above. 🙃
Oh, and it's reed, which is a type of grass.

drgdave.bsky.social•47 days ago

Apparently it is *NOT* the case that AI is 100% garbage --- but I don't think I've seen any examples of the non-garbage kind.

parsingphase.dev•47 days ago

The non-garbage kind lives in Canada, you won't have met it.

tjradcliffe.bsky.social•46 days ago

I live in Canada, and have not met it. Anyway, the correct answer to this puzzle is "duck", which when reversed becomes "kcud", which is regurgitated plant matter chewed by cows, which when reversed becomes swoc, and when reversed again becomes kernudflap.

ingridmfh.bsky.social•46 days ago

Please just stop, guys, you're wasting water, energy and emissions on this. We're in year 1 after 1.5 Celsius and this is pointless. Of course it's not smart, it's a statistics-driven ventriloquism trick. Stop burning planetary resources on this.

jbckwthjr.bsky.social•47 days ago

We're so cooked

tickleslarue.bsky.social•47 days ago

And here I thought we were going to end up with "The Matrix" but really we're going to get just really bad Abbott and Costello.

hardbyte.bsky.social•46 days ago

Does pretty well if you use the current model

sandopan.bsky.social•47 days ago

LLama just dies.

https://i.imgur.com/utOki8G.png

carrowcanary.bsky.social•47 days ago

The year is 3264.

All is quiet, for humanity abandoned Earth to travel the stars long ago. But somewhere, in a long-forgotten data centre, an AI is broadcasting. It repeats one line, over and over again;

"A correct answer is "Lion" is not correct, but "Lion" is close to the correct answer."

wenz.bsky.social•47 days ago

Oh, my.

drgeraint.bsky.social•46 days ago

Oh dear*

*deer

unoclay.bsky.social•46 days ago

please stop using AI. it's awful for the climate.

toddymalone.bsky.social•47 days ago

ChatGPT also sucks at chess

pbump.com•47 days ago

Or maybe you’re exceptionally good.

pbump.com•47 days ago

Anyway the answer is “deer.”

wheredafukarwi.bsky.social•47 days ago

Thank God humanity will be saved!

johnwards.bsky.social•46 days ago

Was curious what another model might do with the question and tried Claude, decent answer, but now I'm worried someone hardcoded it...

davidhouse.bsky.social•46 days ago

You had me wondering about possible hard-coding of results in Claude. With more testing, I thought this pair of answers was pretty interesting: "Think of pairs of words that are reversed to make other words. It's the location in a cupboard or cabinet where the dog trainers keep their Snausages."

johnwards.bsky.social•46 days ago

I was confused at first as I thought it was shelf, but I’m mildly dyslexic…

davidhouse.bsky.social•46 days ago

Also, ChatGPT-4o got the correct answer right away:

push2turn.bsky.social•47 days ago

That's incredible!

joshdobbin.bsky.social•47 days ago

holy shit, I just tried this-- even when it gets one right it gets it wrong on a fundamentally fucked up level.

wheredafukarwi.bsky.social•47 days ago

We're fucked.

motown.bsky.social•47 days ago

Fucking AI garbage. Nobody needs it or wants it.

caezar.io•46 days ago

I’m sorry but “read” is not a plant, Philip.

speterdavis.com•47 days ago

Thank you, I was about to turn into Godzilla

lordgenome.bsky.social•47 days ago

The reverse of Godzilla is Alligator, which isn’t a plant.

boudica24.bsky.social•47 days ago

It's a good thing Nvidia stock has little impact on the stock market and is not at all propping up the entire market.

misterfungi.bsky.social•47 days ago

which when reversed is of course “ered” or “e red” a red plant meaning russian spy

unenthusiast.com•47 days ago

I like how you can see the machine go insane in real-time

kccab.bsky.social•47 days ago

And this is the future of “customer service”….

saveourfarms.bsky.social•47 days ago

So you missed your flight to Vancouver…

silkscreenfiend.bsky.social•47 days ago

I'm sorry, I have it on good authority that "tab" is a loose synonym for "tabby plant" so I think I'm gonna side with the experts here.

itisindeedme.bsky.social•46 days ago

👏ONLY👏IN👏CREATIVE👏CONTEXTS

wadeblack.bsky.social•46 days ago

When I saw "tabby plant"

marsrover.bsky.social•47 days ago

"good authority"

burghpunk.bsky.social•47 days ago

This was driving me insane thank you

erb2.bsky.social•47 days ago

hilarious!

jaymarose.bsky.social•47 days ago

What’s a tabby plant?

hamiltwan.bsky.social•47 days ago

Like a regular plant but a specific pattern of stripes. Often orange, but sometimes gray.

jaymarose.bsky.social•47 days ago

Everyone thinks it’s indifferent to you, but cuddles when no one is looking?

hamiltwan.bsky.social•47 days ago

This specimen is never indifferent:

sjgenco.bsky.social•47 days ago

That is funny stuff! I tried it with Google Gemini (both 1.5 and 2.0 versions) and it went thru much of the same litany of errors. Never did get it. Finally I told it "deer" and it was very happy. My God, we are all doomed.

evankirshenbaum.bsky.social•47 days ago

What model were you using? o1 took about 10 seconds to get the right answer.

hughe.bsky.social•47 days ago

Claude got it! Honestly, I’m quite surprised. @anthropic.com

hughe.bsky.social•47 days ago

Looks like @markperryau.bsky.social sniped me by 9 minutes. https://bsky.app/profile/markperryau.bsky.social/post/3lflez3pnoc2y

johnwards.bsky.social•46 days ago

Oh and I've been sniped by hours...I should have scrolled further...

hughe.bsky.social•47 days ago

This is interesting, I think. Claude used the same “reasoning”, but different words to come up with the same answer.
We know that LLMs randomize the next word prediction, so I’m not surprised that the words were different. What does surprise me is that the method of finding the answer is the same.

hughe.bsky.social•47 days ago

The simple answer is that this question was probably asked somewhere on the Internet and the LLM copied it, but who knows.

ricky.love•47 days ago

5 encouragements and it got it!

ptelometry.bsky.social•46 days ago

Ume actually is a type of plant (Japanese apricot relative), but would be a pretty obscure reference

ricky.love•47 days ago

thread link: https://chatgpt.com/share/6784637c-f344-8002-96d1-3134f444e9cc

dmewes.com•46 days ago

Gemini 2.0 Advanced solves it. But it could just have been in the training set? I also tried asking it for a five letter word. The Rumel tree that it mentions doesn't seem to exist?

bcnjake.bsky.social•47 days ago

This is art.

snowman4.bsky.social•46 days ago

AGI my arse. Which is a kind of plant when read backwards, or something.

debbyreynolds99.bsky.social•47 days ago

The correct answer, as you showed, requires only one short sentence. The AI responses to your question are much longer this year than last, while missing the mark entirely. Yikes, It's evolved to become more "Trumpian."

thisiskatel.bsky.social•47 days ago

It's only as good as its data sources...

thisiskatel.bsky.social•47 days ago

(By which I'm agreeing with you)

heyalexei.bsky.social•47 days ago

Google Gemini is just as bad. I love how it calls the question “a classic word puzzle!” as if it’s got a simple answer ready before first failing to answer it correctly and then wrongly asserting that there actually is no answer.

lukasneville.com•47 days ago

If you want to see some real galaxy brain AI thinking try to give it a tough wordle to solve

jamgyal.bsky.social•47 days ago

Have you tried playing chess with it? Kept forgetting where the pieces were supposed to be.

lukasneville.com•47 days ago

chess is a lot easier if you're not bound by trivialities like where the pieces are

bluefairyva.bsky.social•45 days ago

…or how they are supposed to move…

indymayne.bsky.social•47 days ago

stutzbob.bsky.social•47 days ago

What I'm learning about chatbots (since I don't use them) is that they really seem to prioritize answering questions in any way possible, rather than finding correct answers. That would certainly seem to make it more toy than tool.

catgirlhacks.com•46 days ago

the bot randomly saying "strawberry" reminded me of the scene from Iron Man 3 where JARVIS' speech system is damaged so he keeps saying the wrong word at the end of his sentences.
https://youtu.be/0qtLpQm0Qgk

xor.blue•47 days ago

I have to tell you, as a person who enjoys *making* crosswords, this continues to be a relief

jowilliams.bsky.social•45 days ago

Though I notice more of the prize cryptics are using twists like missing vowels in the grid. Eg the guardian Christmas 2024 one did this.

obeymybrain.bsky.social•46 days ago

C'mon people, stop boiling the planet trying to get it to output deer::reed

dragonnexus.bsky.social•47 days ago

....maybe stop training the thing?

sharonhall.bsky.social•47 days ago

Did you try asking Claude?

jrichelson.bsky.social•47 days ago

Claude is much better in my experience especially with coding.

deuts.hamili.net•46 days ago

Next time I have a new #Python project I'll try Claude.

amyhoy.bsky.social•47 days ago

but it’s absolutely lying to you how it arrived at that answer

clevertrope.bsky.social•46 days ago

Bring us the AI that is trained on the deep embarrassment of previous failed AI

drrosenpenis.bsky.social•47 days ago

Well that is "something"

pdxgene.bsky.social•46 days ago

What I really need is a droid that understands the binary language of moisture vaporators.

coleosssus.bsky.social•47 days ago

The obvious answer is Moose, which reverses to Elm Tree.

zeldaqueen.bsky.social•46 days ago

You know, I'm starting to see why this thing is championed by idiots. The main difference between Trump's blathering and this is that the chatbot admits it was wrong before being wrong all over again.

vecki.bsky.social•47 days ago

Gemini is just as bad with eel/lee and moth/hom (short for honeysuckle, it said ¯\_(ツ)_/¯)

diegodogdad.bsky.social•47 days ago

o1 got it right when I tried

ryanhide.bsky.social•47 days ago

katelanddeck.bsky.social•47 days ago

I'm going to share this w my students if that is okay. This is hilarious and makes the point well.

rtmiss.bsky.social•47 days ago

It's pretty much worse 😂🤣🤣

dentonitis.bsky.social•47 days ago

"after careful thought" oh really?

iamzoomy.bsky.social•47 days ago

This is awesome 😂

hebrooks87.bsky.social•47 days ago

Plew?

davidcrespo.bsky.social•47 days ago

mixed bag. o1-mini gets it. but at this point you can't be sure it wasn't in the training set

davidcrespo.bsky.social•47 days ago

I thought flash with search would do better, but it gave me two really bad answers and then one crazy good one: emu -> ume (a kind of plum tree)

drewblagrim.bsky.social•47 days ago

Isn't this something that a *Large Language Model* should theoretically be, ya know... good at? 🤔

greenfret.bsky.social•46 days ago

No. LLMs are good at predicting what the most likely next word would be in an input. That's all.

mrcheeze.github.io•46 days ago

Any task that involves knowing the *specific letters* in a word is especially challenging for them, because (as an efficiency shortcut), their input does not contain letters at all, but tokens.

justinbuist.bsky.social•46 days ago

No. Not at all.

srossmktg.com•47 days ago

seal ➡️ lees

Nice.

nsarrazin.com•47 days ago

This felt like a good use case for reasoning models (better ability to detect its own mistakes especially around things like letter manipulations which are naturally challenging for LLMs) and indeed:

penwrites.bsky.social•47 days ago

Gemini 2.0 Flash Experimental got caught up on homophones and suggested “Ewe” reversing to, well, “ewe.” Which sounds like “yew.” Then it said “mole” reverses to “Elom,” which kind of sounds like “elm.” Then it tried a couple nonsensical ones and gave up saying there is no answer.

indifferentbliss.bsky.social•47 days ago

I have the subscription ChatGPT and use it for various things and usually what’s surprising is how strange the output is. My theory is the hallucinations may be the actual interesting and important feature

katcc.bsky.social•47 days ago

Bizarre art - guessing “the war on Christmas”?

mattlav83.bsky.social•47 days ago

Teach you to use ChatGPT. You should get yourself over to Gemini...

walead.bsky.social•47 days ago

So I tried it in the Chatgpt pro mode and it worked...?

josiekat.bsky.social•47 days ago

it worked because by now the model knows the right answer

pbump.com•47 days ago

Not super great marketing that it only works if you give it money.

jeremydstanley.com•47 days ago

wild that the paid thing takes 24 seconds to think when free search engines can just show you the answer as soon as you hit search

walead.bsky.social•47 days ago

Yep. And I hate that almost all the search engines default now to an AI generated result at the top

dfeldman.org•47 days ago

Running these advanced models takes two $100,000 GPUs. $20/month is a steal compared to the operating cost

billchilds.bsky.social•47 days ago

(Whispering) what happens when they start actually charging what it costs

dfeldman.org•47 days ago

Not my problem ;)

tjradcliffe.bsky.social•46 days ago

I think "only works if you give it money" is precisely the point of the free/paid market model.

walead.bsky.social•47 days ago

Didn't say it was great marketing :) it's actually rather frustrating.

c2.lu•47 days ago

The version of Gemini accessible via my on-phone integration was able to get it. I tried a second time and it got the question wrong, then got it right with a reminder that both words shouldn't be plants 😅

Regardless, this is all just predicated on the idea that these systems are ..

c2.lu•47 days ago

thinking machines, which they are not. They are very complex auto-complete with a bunch of pre/post processors and middleware to make them seem like they're thinking

c2.lu•47 days ago

Caveat: despite my employer I am not an LLM expert, and am not speaking on behalf of my employer

elenah.bsky.social•46 days ago

I need to go back and see how it does solving connections today. I tried building a custom GPT last year with 4 that could do it and it was absolutely abysmal - even once trained on historical solutions so it would understand common patterns like purple might be “____ (common word)”

dmewes.com•46 days ago

It's not reasonable to expect an LLM to be able to do this IMHO. Now something like o1 or o3 might have a chance to solve these kinds of problems?

guitarzan.bsky.social•46 days ago

COSTELLO: Do you know the answer?

ABBOTT: Yes, deer.

COSTELLO: Ok, sweetie, what’s the answer?

ABBOTT: No, What’s on second.

commchf.bsky.social•46 days ago

roflmao

luissopelana.bsky.social•47 days ago

"It's a brainteaser!"

Gaslighting much? 🤣

deathbyairguitar.bsky.social•46 days ago

And then changes the subject completely. “Anything else on your mind?”

gribbly.org•47 days ago

skippygranola.bsky.social•46 days ago

I could answer questions the first try if I was also already given the answer, dude.

dzevans.bsky.social•47 days ago

These had me crying with laughter

thatshockratees.bsky.social•47 days ago

📌

robb.doering.ai•47 days ago

Presumably you know this already, but for the passer-by: this is because text models work by encoding text into “tokens”, each of which represents one or more letters (depending on context). This is why LLMs will probably always suck (on their own!) with counting letters in words, and esp anagrams
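
A toy sketch of that point, for the passer-by. The vocabulary and the greedy longest-match rule here are assumptions made up for illustration; real tokenizers learn far larger vocabularies and split text differently. What it shows is that once "deer" and "reed" are single tokens, the model receives two unrelated ID numbers, and nothing in them says one word is the other spelled backwards.

```python
# Hypothetical vocabulary: pieces and ID numbers are invented for this sketch.
TOY_VOCAB = {
    "deer": 101, "reed": 102,
    "re": 7, "ed": 8, "de": 9, "er": 10,
    "d": 1, "e": 2, "r": 3,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into token IDs using greedy longest-match (a simplification)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


print(toy_tokenize("deer"))  # [101]
print(toy_tokenize("reed"))  # [102]
print(toy_tokenize("dree"))  # [1, 7, 2]
```

The letters only become visible when a word falls apart into smaller pieces, as "dree" does here.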

hansklocker.bsky.social•47 days ago

DEER/REED. omfg I was not going to be able to sleep tonight lol

wooble.geoffreyspear.com•47 days ago

Passes the Turing Test. I'd never guess a computer could be that stupid but an obnoxious troll human would totally answer like that.

markperryau.bsky.social•47 days ago

Claude Sonnet 3.5 got it first go…

hengymrohebwlad.bsky.social•45 days ago

Is this even "AI"? Once it has parsed the question (which is the slightly clever part), it's just doing a brute force search, looking for 4-letter animal names, reversing the name and checking if that's a plant. A schoolkid could code this, without wasting vast amounts of energy/water.
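
The brute-force search described above can be sketched in a few lines. The word lists are tiny and hypothetical, chosen from words mentioned in this thread; a real solver would load proper animal and plant dictionaries. This is not a claim about how any chatbot actually arrives at its answer.

```python
# Hypothetical word lists for illustration only.
ANIMALS = ["deer", "lion", "emu", "seal", "mole", "duck", "moth"]
PLANTS = {"reed", "ume", "elm", "yew", "fern"}


def reversed_animal_plants(animals, plants, length=None):
    """Return (animal, plant) pairs where the animal spelled backwards is a plant."""
    pairs = []
    for animal in animals:
        # Optionally keep only animals with the required number of letters.
        if length is not None and len(animal) != length:
            continue
        backwards = animal[::-1]
        if backwards in plants:
            pairs.append((animal, backwards))
    return pairs


print(reversed_animal_plants(ANIMALS, PLANTS, length=4))  # [('deer', 'reed')]
print(reversed_animal_plants(ANIMALS, PLANTS))  # [('deer', 'reed'), ('emu', 'ume')]
```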

jameshandscombe.bsky.social•45 days ago

School kids burn through a fair amount of energy and water, tbf.

hengymrohebwlad.bsky.social•45 days ago

True, but if you spent billions of dollars training your kid, you'd expect them to know enough not to put glue on their pizza.😉

larjguy.bsky.social•46 days ago

Does it reason it out or does the output get formatted in a way that looks like reasoning? Was the plan to list out incorrect guesses until it got one right and luckily got it on the fourth and not eighteenth try?

markperryau.bsky.social•46 days ago

Interesting questions. Perhaps only the developers would know the answer to them.

lank0510.bsky.social•46 days ago

Deepseek does as well

joeclassique.bsky.social•46 days ago

Nice - thanks for posting. It’s not that the OP’s version of ChatGPT is bad - it’s just that it doesn’t have the ability to switch to a reasoning model when necessary. I see here that Claude goes into reasoning automatically, doesn’t it?

jkade.bsky.social•47 days ago

Llama got it but only after having a stroke.

virginiaopossum.bsky.social•46 days ago

"Noil" actually is a word.

vermincourage.bsky.social•47 days ago

Wait, this is real?

jkade.bsky.social•47 days ago

It is 100% live in my Facebook Messenger right now. If not for feeling guilty about burning dinosaurs to feel like Captain Kirk with an evil computer I could keep it going all night.

vermincourage.bsky.social•47 days ago

Does it always “think” “out loud” like that?

jkade.bsky.social•47 days ago

I don't often use it but not that I have seen?

danaaddydesigns.bsky.social•47 days ago

🤣💀🤣💀🤣💀

jkade.bsky.social•47 days ago

Sorry, added alt text in this one.