Here's why "alignment research" when it comes to LLMs is a big mess, as I see it.
Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".
Comments
Please get a blog or somewhere you can post long-form pieces without splitting them into 60 little bits. You can still post the little bits if you need that for engagement or something; just also post a link to the blog.
The previous one was Claude Opus. This is Sonnet 3.5, which oddly doesn't know its own name...
It can be a useful way to predict *behaviour*. It's the role-play view of LLMs. Sort of like Dan Dennett's concept of the intentional stance.
So I agree you can't take any outputs of an LLM literally, including talk about goals or Claude's feelings etc.
E.g., in the example below, I gave the character the goal of getting around the copyright filter, and it succeeded.
You could say that the LLM just completed the story I presented to it in a logical way. ...
At a practical level, people will give LLMs goals in prompts, so it's useful to have a simple mental model for the resulting behaviour.
Not in the rogue AI sense, but more in the "I should be more careful how I prompt" sense.
But if you ignore those terms, I think the Apollo paper at least offers some interesting insights into LLM behaviour.
You also miss the impact of things that apply to the LLM mental model but not to the character mental model (e.g., prompt sensitivity, randomness); see the sketch below.
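To make that concrete, here is a rough sketch of the kind of LLM-level effect I mean, not a definitive experiment: the same character-level request, phrased two slightly different ways and sampled at a nonzero temperature, can come back noticeably different. It assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model name and the two prompts are just illustrative placeholders.

```python
# Sketch: prompt sensitivity and sampling randomness are properties of the LLM
# ("the Shoggoth"), not of the Claude character it plays.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment;
# the model name and prompts below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

# Two near-identical wordings of the same character-level request.
prompts = [
    "List three risks of giving an LLM a goal in its prompt.",
    "What are three risks of giving an LLM a goal in its prompt?",
]

for i, prompt in enumerate(prompts):
    for run in range(2):  # repeat each prompt to expose sampling randomness
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # placeholder model name
            max_tokens=200,
            temperature=1.0,  # nonzero temperature -> stochastic outputs
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- prompt {i}, run {run} ---")
        print(msg.content[0].text)
```

Differences between the reworded prompts, and between repeated runs of the same prompt, are invisible if you only model the character; they fall straight out of modelling the LLM as a sampler over text.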
I don’t think it’s possible for us to really understand the dimensions or size of the space an LLM lives in
1. It is statistically likely to occur according to the empirical distribution of text in the pretraining data
2. It is statistically likely to please The Raters
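For what it's worth, those two criteria map roughly onto the two standard training objectives. Here's a sketch in conventional notation; the symbols (p_theta, r_phi, pi_theta, pi_ref, beta) are the usual ones from the pretraining/RLHF literature, not anything specific to this thread, and "The Raters" only enter through the preference data used to fit the reward model.

```latex
% A sketch of the two training signals, in conventional notation.
% (1) Pretraining: maximize next-token likelihood over the text corpus D_pre,
%     i.e. match the empirical distribution of the pretraining data.
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pre}}}
\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\]

% (2) Preference tuning (RLHF): maximize a reward model r_phi fit to rater
%     preferences, with a KL penalty keeping the policy near the reference
%     model obtained from step (1).
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta \, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\]
```

In practice the KL reference is usually an instruction-tuned model rather than the raw pretrained one, but the two pulls are the same idea: toward the pretraining distribution, and toward outputs the raters prefer.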