Here's why "alignment research" when it comes to LLMs is a big mess, as I see it.
Claude is not a real guy. Claude is a character in the stories that an LLM has been programmed to write. Just to give it a distinct name, let's call the LLM "the Shoggoth".
Comments
Please get a blog or somewhere you can post long-form pieces without splitting them into 60 little bits. You can still post the little bits if you need that for engagement or something; just also post a link to the blog.
The previous one was Claude Opus. This is Sonnet 3.5, which oddly doesn't know its own name...
It can be a useful way to predict *behaviour*. It's the role-play view of LLMs. Sort of like Dan Dennett's concept of the intentional stance.
So I agree you can't take any outputs of an LLM literally, including talk about goals or Claude's feelings etc.
E.g., in the example below, I gave the character the goal of getting around the copyright filter, and it succeeded.
You could say that the LLM just completed the story I presented to it in a logical way. ...
At a practical level, people will give LLMs goals in prompts, so it's useful to have a simple mental model for the resulting behaviour.
Not in the rogue AI sense, but more in the "I should be more careful how I prompt" sense.
But if you ignore those terms, I think the Apollo paper at least offers some interesting insights into LLM behaviour.
You also miss the impact of things that apply to the LLM mental model but not to the character mental model (e.g., prompt sensitivity, randomness); see the sketch below.
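To make that concrete, here is a rough sketch of the kind of LLM-level effect I mean, not a definitive experiment: the same character-level request, phrased two slightly different ways and sampled at a nonzero temperature, can come back noticeably different. It assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment; the model name and the two prompts are just illustrative placeholders.

```python
# Sketch: prompt sensitivity and sampling randomness are properties of the LLM
# ("the Shoggoth"), not of the Claude character it plays.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set in the environment;
# the model name and prompts below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

# Two near-identical wordings of the same character-level request.
prompts = [
    "List three risks of giving an LLM a goal in its prompt.",
    "What are three risks of giving an LLM a goal in its prompt?",
]

for i, prompt in enumerate(prompts):
    for run in range(2):  # repeat each prompt to expose sampling randomness
        msg = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # placeholder model name
            max_tokens=200,
            temperature=1.0,  # nonzero temperature -> stochastic outputs
            messages=[{"role": "user", "content": prompt}],
        )
        print(f"--- prompt {i}, run {run} ---")
        print(msg.content[0].text)
```

Differences between the reworded prompts, and between repeated runs of the same prompt, are invisible if you only model the character; they fall straight out of modelling the LLM as a sampler over text.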
I don’t think it’s possible for us to really understand the dimensions or size of the space an LLM lives in
1. It is statistically likely to occur according to the empirical distribution of text in the pretraining data
2. It is statistically likely to please The Raters
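For what it's worth, those two criteria map roughly onto the two standard training objectives. Here's a sketch in conventional notation; the symbols (p_theta, r_phi, pi_theta, pi_ref, beta) are the usual ones from the pretraining/RLHF literature, not anything specific to this thread, and "The Raters" only enter through the preference data used to fit the reward model.

```latex
% A sketch of the two training signals, in conventional notation.
% (1) Pretraining: maximize next-token likelihood over the text corpus D_pre,
%     i.e. match the empirical distribution of the pretraining data.
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D}_{\mathrm{pre}}}
\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\]

% (2) Preference tuning (RLHF): maximize a reward model r_phi fit to rater
%     preferences, with a KL penalty keeping the policy near the reference
%     model obtained from step (1).
\[
\max_{\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta \, D_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\middle\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
\]
```

In practice the KL reference is usually an instruction-tuned model rather than the raw pretrained one, but the two pulls are the same idea: toward the pretraining distribution, and toward outputs the raters prefer.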