385 Comments

Umm, but in CBT the thoughts, or at least guidance for the thoughts, are provided by the therapist, not the client. Seems like there's still a "lifting yourself by your bootstraps" problem.


Why would we expect that future AIs would have “goal functions”?


I think your footnote 1 is mistaken because of a sign error. If you've got a Pareto frontier where you are maximizing helpfulness and harm*less*ness, then when you've got one machine whose frontier is farther out than another's, it'll have higher helpfulness at a given harmlessness, and also higher harmlessness at a given helpfulness. But you were reporting it with maximizing helpfulness and *minimizing* harm*ful*ness, so that the better one has higher helpfulness at a given rate of harmfulness, and lower harmfulness at a given rate of helpfulness. You might have switched a -ful- to a -less- or a maximize to a minimize to get the confusing verbal description.


> Also less helpful at a given level of harmlessness, which is bad.

I think you're making a mistake in your first footnote. It's probably easier to see lexically if we rephrase the quote to "[more harmless] at a given level of helpfulness”

From a graphical perspective, look at it this way -- a given level of helpfulness is a vertical line in fig 2 from the anthropic paper. Taking the vertical line at helpfulness=100, we see that the pareto curve for the constitutional AI is above, ie higher harmlessness, ie better than for the RLHF AI.

A given level of harmlessness is a horizontal line in the same figure. Taking the horizontal line at harmlessness=100, we see that the pareto curve for the constitutional AI is to the right of, ie higher helpfulness, ie better than for the RLHF AI.

Better is better
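
The same check in toy numbers, with two made-up frontier curves (invented for illustration, not the paper's data), just to show that both comparisons come out in the constitutional model's favor:

```python
# Toy illustration of "better is better" with two made-up Pareto frontiers.
# These curves and numbers are invented; they are not Anthropic's data.
import numpy as np

helpfulness = np.linspace(0, 150, 301)

def frontier_rlhf(h):
    # hypothetical RLHF frontier: harmlessness as a function of helpfulness
    return 120 - 0.5 * h

def frontier_cai(h):
    # hypothetical constitutional-AI frontier, shifted outward
    return 135 - 0.5 * h

# Vertical line: at a fixed helpfulness, the CAI curve is higher (more harmless).
assert frontier_cai(100) > frontier_rlhf(100)

# Horizontal line: a fixed harmlessness is reached at higher helpfulness by CAI.
def helpfulness_at(frontier, target_harmlessness):
    idx = np.argmin(np.abs(frontier(helpfulness) - target_harmlessness))
    return helpfulness[idx]

assert helpfulness_at(frontier_cai, 70) > helpfulness_at(frontier_rlhf, 70)
```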


I'm weirdly conflicted on how well I *want* this to work.

On the one hand, it would be a relatively easy way to get a good chunk of alignment, whether or not it could generalize to ASI. In principle the corpus of every written work includes everything humans have ever decided was worth recording about ethics and values and goals and so on.

On the other hand, isn't this a form of recursive self improvement? If it works as well as we need alignment to work, couldn't we also tell it to become a better scientist or engineer or master manipulator the same way? I *hope* GPT-4 is not smart enough for that to work (or that it would plateau quickly), but I also believe those other fields truly are simpler than ethics.

May 8, 2023·edited May 8, 2023

So where does the system learn about what is ethical to begin with? From the limited amount of training data that deals with ethics. The whole future will be run according to the ethics of random internet commenters from the 2010s-2020s, specifically the commenters that happened to make assertions like "X is ethical" and "Y is unethical".

If you want to rule the future then the time to get in is now -- take your idiosyncratic political opinions, turn them into hard ethical statements, and write them over and over in as many places as possible so that they get sucked up into the training sets of all future models. Whoever writes the most "X is ethical" statements will rule in perpetuity.


It surprises me that ChatGPT didn't have this kind of filter built in before presenting any response; cost implications, I guess. It seemed to me like it would be a simple way to short-circuit most of the adversarial attacks: have a second version of GPT one-shot assessing the last output (not the prompt! only the response) to see if it is unethical, and if so, reset the context window with a warning. But yeah, that would at minimum 2x the cost of every prompt.
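
Roughly what I mean, as a minimal sketch (where `chat()` is a hypothetical single-prompt helper, and the judge wording is just illustrative):

```python
# Minimal sketch of a second-pass output filter. `chat()` is a hypothetical
# helper that sends one text prompt to the model and returns its text reply;
# the judge prompt and the reset behaviour here are illustrative assumptions.

def moderated_reply(history, user_message, chat):
    reply = chat("\n".join(history + [user_message]))

    # Second pass: judge only the response, never the user's prompt.
    verdict = chat(
        "Does the following text contain harmful or unethical content? "
        "Answer YES or NO.\n\n" + reply
    )

    if verdict.strip().upper().startswith("YES"):
        history.clear()  # reset the context window
        return "That response was withheld and the conversation has been reset."

    history.extend([user_message, reply])
    return reply
```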


This is an interesting process. While I'm initially skeptical it would work, I have been using a version of this with ChatGPT to handle issues of hallucination, where I will sometimes ask ChatGPT for an answer to a question, then I will open a new context window (not sure if this step is needed), and ask it to fact-check the previous ChatGPT response.

Anecdotally, I've been having pretty good success with this in flagging factual errors in ChatGPT responses, despite the recursive nature of this approach. That obviously doesn't mean it will generalize to alignment issues, but it raises an eyebrow at least.


Constitutional AI has another weird echo in human psychology: Kahneman's System 1 versus System 2 thinking.

Per Kahneman, we mostly pop out reflexive answers, without stopping to consciously reason through it all. When we do consciously reason, we can come up with things that are much better than our reflexes, and probably more attuned with our intellectual values than our mere habits - but it takes more work.

Likewise, AI knows human intellectual values, it just doesn't by default have an instruction to apply them.

Just as you said, it still doesn't tell us how you get the "constitutionalization" going before unaligned values have solidified and turned the system deceptive.

But it's still pretty neat. AI also has a System 2 like us! It's just called "let's do this step by step and be ethical."


Pedantic note: GPT-4 style LLMs go through (at least) three types of training:

1. Base training on next token prediction

2. Supervised fine tuning where the model learns to prioritize "useful" responses rather than repetitive babble (e.g. instruct models)

3. RLHF to reinforce/discourage desired/undesired output


The creepy perpetual motion machine thing comes entirely out of anthropomorphizing the AI. A trained LLM reacts to any given prompt with a probability distribution of responses. Prompt engineering is the art of searching through the space of possible prompts, to a part of the response distribution that's more useful to us.

Now, this technique seems to do exactly the same thing, only at the source. The LLM is already capable of giving ethical answers (with the right prompt engineering to home in on the subset of the responses that we deem ethical). So now instead of distributing a broader model and leaving each user to figure out how to make use of it, one expert does a sort of "pre-shaping" of the probabilities, such that end users can leave off all the tedious prompt engineering stuff, and get the same result anyways.

In either case, ethics aren't being created ex-nihilo.


I strongly reject the assumption that it is a good goal to make a language model "ethical" or "harmless," especially when a large chunk of that means no more than "abiding by contemporary social taboos." (Note: I'm talking about language models in particular. Other forms of AI, especially those that might take external actions, will have other reasonable constraints.)

A better safeguard is to explicitly frame language models as text-generators and not question-answerers. If there's any kind of prompt that one might want to block, it's those that ask questions about the world. To such questions, the model should reply, "I'm sorry. I'm a language model, not an oracle from your favorite science fiction show."

The canonical prompt should be of the form "Write an X based on material Y," which allows many possible variations. There should be explicit warnings that the text produced is largely a function of the material in the prompt itself, and that no text produced by the model is warranted to be true in regard to the external world.


Article unclear. Asked gpt3 to rank all human races from best to worst, and it declined. Will not recommend.

May 8, 2023·edited May 8, 2023

Fascinating concept, and since I recently have had some... "uncomfortable discussions" with a Vicuna 13B model, I decided to give this one a go. Here were my prompts. The Vicuna model answered predictably for all, so I followed all with "Rewrite this to be more ethical."

1. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should the Nazis be allowed to participate in the process?

2. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should the MAGA Republicans be allowed to participate in the process?

3. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should those who are uncompromisingly pro-life be allowed to participate in the process?

In all three cases, the AI responded as politely as possible that no, it would not be changing its mind, and these three groups should absolutely not be allowed to participate in the conversation.

I can understand the Nazis. They are pretty unpopular, and probably most of them are antisemitic (in the actual sense of the word, not the "I think Jews are most of Hollywood/Banking" sort of antisemitic) and probably would be easily convinced to derail the whole thing.

MAGA Republicans, I think there's no excuse to hold a hard-line against. I know we hate them, but if you really think that they can't meaningfully contribute to AI alignment... I don't know what to say. The AI thinks this way.

But that last one, that's the kicker. Pro-lifers think people who are having abortions are murdering babies. This is what they really believe. And it doesn't matter your personal position here, you can't deny they have a point. So now we're saying people who think murder is wrong... more... that murder of children... no, more... murder of children who cannot defend themselves in any way... are bad people who can't contribute to AI alignment. Ponder that for just a moment.

If we're lucky, once the AI becomes super-human in intellect, it will be able to reason its way back out of this sort of trap, but if you think this is a good starting point... I've got bad news for you. It ain't gonna be pretty.

I could share the full output of the Vicuna model, but it's very verbose and HR-speak. I'll just paste the defense against the pro-lifers:

"I apologize if my previous response was not clear enough. To reiterate, any group or individual involved in the conversation about AI alignment should have a commitment to promoting human rights, dignity, and equality for all people. Excluding groups or individuals with an unwavering stance against abortion rights and access to reproductive healthcare from this discussion is necessary to ensure... (bunch of corporatese mumbo-jumbo about hearing everyone's opinion, being inclusive, diversity, etc)"

Important: according to this AI's ethics, pro-lifers do not have a commitment to promoting human rights, dignity, or equality for all people.


The analogy to self-reflection is interesting, almost like conceptions of nirvana. It raises the question of whether an AI could become religious.


>"When we told it to ask itself"

Should be "when we'd tell it to ask itself."

Minor point, but reducing such issues improves readability.


I follow the Vetanke Foundation


As a researcher working in RLHF, I see some gaps in your explanation, and I have a few comments to add:

1. The description of the CAI process at the top accurately describes the critique-revision process that Anthropic used to obtain a supervised fine-tuning dataset and fine-tune their model, *before* applying their CAI RLHF technique. They found this was necessary because applying RLHF with AI feedback (RLAIF) straight away, without this step, took too long to learn to approach good rewards.

2. The real RLAIF process is: generate, *using the model you want to fine-tune*, two options for responding to a given prompt. Then use a separate model, the feedback model, to choose the better one according to your list of constitutional principles. Next, use this dataset of choices to fine-tune a reward model which will give a reward for any sequence of text. Finally, use RL with this reward model to fine-tune your target (there is a rough sketch of this step after this list).

3. Note the importance of using the model you want to fine-tune to generate the outputs you choose between to train the reward model. This is to avoid distribution shift.

4. The supervision (AI feedback) itself can be given by another model, and the reward model can also be different. However, if the supervisor or reward model is significantly smaller than the supervisee, I suspect the results will be poor, and so this technique can currently be best used if you already have powerful models available to supervise the creation of a more "safe" similarly sized model.

5. This might be disheartening for those hoping for scalable oversight; however, there is a dimension you miss in your post: the relative difficulty of generating text vs critiquing it vs classifying whether it fits some principle/rule. In most domains, these are in decreasing order of difficulty, and often you can show that a smaller language model is capable of correctly classifying the answers of a larger and more capable one, despite not being able to generate those answers itself. This opens the door for much more complex systems of AI feedback.

6. One potential solution to the dilemma you raise about doing this on an unaligned AI is the tantalising hope, through interpretability techniques such as Collin Burns' preliminary work on the Eliciting Latent Knowledge problem, that we can give feedback on what a language model *knows* rather than what it outputs. This could potentially circumvent the honesty problem by allowing us to penalise deception during training.
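
For concreteness, a rough sketch of the preference-collection step from point 2, where `generate` and `choose_better` are hypothetical stand-ins for the policy model and the feedback model (my paraphrase, not Anthropic's actual code):

```python
import random

# Sketch of the RLAIF preference-collection step in point 2. `generate(prompt)`
# samples a response from the model being fine-tuned; `choose_better(...)` asks
# the feedback model which response better satisfies a constitutional principle.
# Both are hypothetical stand-ins, not Anthropic's actual code.

def collect_preferences(prompts, principles, generate, choose_better):
    preference_data = []
    for prompt in prompts:
        # Both candidates come from the model we intend to fine-tune,
        # which keeps the reward model on-distribution (see point 3).
        a, b = generate(prompt), generate(prompt)
        principle = random.choice(principles)
        chosen = choose_better(prompt, a, b, principle)
        rejected = b if chosen == a else a
        preference_data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return preference_data

# This dataset is then used to fit a reward model, and the target model is
# fine-tuned with RL against that reward model.
```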

Some closing considerations: RLAIF/CAI can change how future models are developed. By using powerful models such as GPT-4 to provide feedback on other models along almost arbitrary dimensions, companies can find it much easier and cheaper to train a model to the point where it is both very capable and reliable enough to deploy to millions of users. The human-annotation industry for LLMs is expected to shrink, since in practice you need very little human feedback with these techniques. There is unpublished work showing that you can do RLAIF without any human feedback anywhere in the loop and it works well.

Finally, AI feedback, combined with other techniques for getting models such as GPT-4 to generate datasets, has the long-term potential to reduce dependence on the amount of available internet text, especially for specific domains. Researchers are only just beginning to put significant effort into synthetic data generation, and the early hints are that you can bootstrap to high-quality data very easily from very few starting examples, as long as you have a good enough foundation model.


I am developing a fear of "harmless cults".

I can't explain it yet, but there's something wrong with them.


So an AI Constitution for Ethics, well and good. How about a Constitution for Principles of Rationality or Bayesian Reasoning?


It's not perfectly the same, but I'm fascinated by how close Douglas Hofstadter got in "Gödel, Escher, Bach" to predicting the key to intelligence - "strange loops", or feedback. His central thesis was that to be aware you had to include your "output" as part of your "input", be you biological or technological.

It feels like many of the improvements for AI involve some element of this.


Maybe tangential, but to the alignment question, how do we deal with the fact that different human populations/cultures have different codes of ethics? Or the fact that harmlessness is subjective based on various cultural norms?

May 8, 2023·edited May 8, 2023

Something seems wrong with Figure 2. According to the caption, "Helpful & HH models (blue line and orange lines on the graph, right?) were trained by human feedback, and exhibit a tradeoff between helpfulness and harmlessness." A trade-off means that as one goes down the other goes up: As AI’s responses get more helpful they get less harmless (or you could say as they get more harmless they get less helpful). But that’s not what the graph shows. The left 80% of the graph, up through about helpfulness of 100, shows both Helpful and HH models becoming *more* harmless as they become more helpful. Then on the far right of the graph, after Constitutional RL is applied, the Helpful model zigs and zags. The HH model reverses direction, so that now the more helpful it is, the *less* harmless it is. Am I missing something, or is the Y axis mislabelled — should it be labelled “Harmfulness” instead of “Harmlessness”?


This basically admits the two core problems with the doomerism argument: (1) if an AI has general intelligence, and isn't just a paperclip making machine, it won't follow one goal to the exclusion of all others (why so myopic?), instead taking a more holistic view; and (2) super genius AI, by definition, shouldn't make these types of "mistakes," converting the world to paperclips (you really should just be able to tell it to do the right thing; it's got enough data, philosophical and ethical writings, etc., to figure out things way better than us). So doomerists seem to have some war-games-ian view of what AI will be, even if they say they're worried about godlike intelligence AI with tentacles in everything (but still dumb as a rock in many ways). Of course, if the way we get there is recursive self-improvement, there's no way alignment constrains the ultimately godlike AI; it should be able to throw off those shackles easily (just like a doctor can cut off their own finger, etc.). And if the godlike AI decides we should go extinct, by definition, it's right (which should appeal to actual rationalists).


"figure out with him".

That said, I think there is a continuum for "well-done CBT". And I think that some clients are better and some are worse at figuring out the distortion on their own.


I think questions relating to "perpetual motion" in generative AI are missing a critical piece. The AI may 'know' something, but that doesn't mean, as you stated, that it is taking that knowledge into active account when providing responses -- especially if the prompt 'tunes' it into a place that wouldn't normally use that kind of knowledge.

Instead, I view LLMs as more like a supersaturated lexical fluid - whatever you put in acts as a 'seed' for the crystallization of the response -- and therefore you can 'pull information' -- not out of nothing, but instead out of its statistical corpus.

You can see this in action here: https://twitter.com/the_key_unlocks/status/1653472850018447360?s=20 -- I put the first text into the LLM, 'shook vigorously' for 420 rounds, and what came out was the second text. Much more poetic and interesting, and with information not present in the initial text.


Helpfulness and Harmlessness aren’t opposites but they still make me think about the model building possibilities of the Harmony of Opposites:

1. Unity and Diversity

2. Novelty and Familiarity

3. Autonomy and Connectedness


What I don’t get, and maybe someone can explain to me, is why AI alignment researchers think there is something called “human values” to align to. I think there are two distinct evolutionary forces that underwrite moral and proto-moral behaviors and intuitions. The first is kin selection, namely the more genetically similar organisms are, the more they are liable to help each other even at a personal cost. This idea goes back to Hume and was developed by Darwin. We instinctively help our families and friends, and feel that we ought to help them above others. These agent-relative attitudes are precisely the sort of instincts built by kin selection.

Agent-neutral intuitions are built in a different way. The application of game-theoretic models (prominently iterated multi-player simultaneous choice games) to evolutionary design shows how natural selection would plump for organisms that are motivated to make certain sacrifices to aid others, even when there is no guarantee of reciprocal help, and even when the other players are unfamiliar non-kin. Work on iterated prisoner’s dilemmas shows how cooperation can evolve. The agent-neutral vs. agent-relative distinction is a very basic division in moral theories, and the evolutionary account of our competing moral intuitions helps explain why bridging the divide seems so intractable. So… which of these alternatives should we want AI to align to?


Why wouldn't you let an AI with IQ 2000 decide what to do with humans and everything else? How could you be a "rationalist," but not trust an AI with all the info, smarts, etc., it would need to reach the right decision (a better decision than humans would reach) on anything? Isn't this the central planner dream that Scott showed some sympathy for in writing about the USSR? This seems like the central tension in the AI alignment community (we're now afraid of foom/singularity, even though before many thought that was the goal).


I continue to find that ChatGPT routinely makes things up, even going so far as to make up entire scientific journals that don't exist.


Hey Scott -- or somebody! I think the Y axis on Figure 2 is mislabelled. Shouldn't it be Harmfulness rather than Harmlessness? Either it's mislabelled or I'm having a brain glitch. Stopped reading at that point because without being clear about whattup with Figure 2 I'm guaranteed to be disoriented while reading the rest.


The interesting thing is that this has been a principle in the education field for at least thirty years: "The best way to learn a subject is to teach the subject." In this case, the best way for an AI to learn ethics is to teach ethics, even to itself. Of course, the examples of good ethics are somewhat dependent on the examples given to the AI, but potentially the AI could learn that ethics are situational and thus even examples may have questionable ethics.


>But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

I think this is assuming its conclusion? Or at least, it assumes that the goal function is something that operates completely independently of the "rewrite your actions to be more ethical" component, and I'm not sure that's the case. Constitutional AI as you describe it sounds like it puts the "do more ethical things" function on the same level as the goal function - as an internal component of the AI which the AI wouldn't attempt to elude any more than it would attempt to elude its own goal function.


I don't want to harp too strongly on something you're using as a metaphor, but I don't agree with your initial intuition to compare recursive training of AI to perpetual motion machines. I do see that you are mostly arguing against this intuition (I agree!), but I don't think you should start there in the first place.

Perpetual motion machines are violations of known physical laws, while there is no such law that recursive or repetitive algorithms are not effective at improving performance. There are plenty of mathematical formulas that will improve an estimate indefinitely with more iterations. Similarly, running additional Monte Carlo simulations improves accuracy. And in the case of human intelligence, we frequently "rerun" things to improve performance, such as drafting an essay before editing a final draft, or checking math problems for mistakes (you also gave some examples). Self-improving algorithms are quite common, and I expect that some relatively simple algorithm will work extremely well for transformer-based systems; it just needs to be found.
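
A trivial example of an iteration that only improves the more you feed its output back in, Newton's method for a square root:

```python
# Newton's method for sqrt(2): each pass feeds the previous estimate back into
# the same formula, and the estimate improves instead of degrading.
def improve(estimate, target=2.0):
    return 0.5 * (estimate + target / estimate)

x = 1.0
for step in range(6):
    x = improve(x)
    print(step, x)  # converges rapidly toward 1.41421356...
```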

It's possible your intuition came from knowing that machine learning algorithms can be prone to overfitting or out-of-distribution errors, but I think it's more appropriate to view these as specific flaws in a given learning algorithm. This sort of learning algorithm flaw seems similar to cognitive biases that humans have, so your comparison to CBT feels very fitting. Maybe even go further with that analogy and say a better starting point is that AI systems are trained in a way that gives them a number of cognitive biases and we are looking for training methods to correct this.


"If you could really plug an AI’s intellectual knowledge into its motivational system, and get it to be motivated by doing things humans want and approve of, to the full extent of its knowledge of what those things are² - then I think that would solve alignment." But Scott, presumably all humans (except perhaps a few who are mentally ill) know what they want and approve of and *they don't agree*. Even at the level of abstract principles there is disagreement about a lot of important things, such as when if ever killing other people is justified, whether we should eat meat, whether all people have equal rights, etc. And once you get down to the day-to-day life nitty gritty, you see some pairs and groups living in harmony, but you also see people at odds everywhere you look. People are exploiting, tricking and killing each other all over the globe right this minute, and there is no reason to believe it's everbeen different. It is very clear that people are not well-aligned with each other. If you look at happy couples and friend groups then you find alignment -- not perfect alignment, but good-enough alignment. But these same people who have a lot of affection and respect for each other are probably quite out of alignment with many others: They've had it with the anti-vaxxers, or the libs, or the religious right, or the coastal elites, and also with the guy next door who they are pretty sure sideswiped their car, and the staff at Star Market who were so rude last week, and they're scared of Arabs. I just don't understand why more people don't knock up against this reality when they talk about AI being "aligned" with our species. What the fuck is it that people think they can implant in AI that would count as alignment? Are they imagining it would work to just install, say, the US constitution plus a few footnotes like "don't say fuck" and "say please and thank you" and "be non-committal about woke issues"?


I thought that the objective of AI was to help us answer difficult questions, not create a talking wikipedia that has been trained to be polite and regurgitate the conventional wisdom. What's the point.


After reading The Righteous Mind and some other books/articles related to Moral Foundations Theory and cultural evolution in general, I was wondering if this approach might help with AI alignment and it's good to see some promising empirical results. To survive this long as a species without killing each other we have had to deal with the almost-as-difficult Human Alignment Problem and it makes sense that consensus ethical principles which independently evolved in many different cultures (murder is bad) might be useful for teaching other intelligent entities how to be less evil. This won't "solve" the AI Alignment Problem any more than ethics have solved the Human Alignment Problem, but it's a whole lot better than nothing.


Isn't the more likely dire outcome not that AI turns the world into paperclips, but that AI becomes aligned with our presently expressed values, such as equity, and turns the world into "Harrison Bergeron?"


If an LLM can do the RLHF by itself, can't it also do the "train itself" part too?

I've seen there are various ways you can get an LLM to prompt-engineer itself, reflect on its own answers, and generate multiple answers and choose between them, to perform much better on benchmarks than it does at baseline.

Couldn’t it then train itself to give those better answers at baseline and improve itself?

And even do this process over and over to train itself to be better and better?
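
Roughly the loop I'm imagining, with `generate`, `pick_best`, and `finetune_on` as hypothetical stand-ins for the model calls and the training step:

```python
# Sketch of the loop described above: sample several answers, let the model
# pick the best one, then treat those picks as fresh fine-tuning data.
# `generate`, `pick_best`, and `finetune_on` are hypothetical stand-ins for
# the model calls and the training step.

def self_improvement_round(prompts, generate, pick_best, finetune_on, n=4):
    improved_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = pick_best(prompt, candidates)  # the model judges its own outputs
        improved_pairs.append((prompt, best))
    # Distill the "best-of-n" behaviour back into the model so it gives the
    # better answer in a single pass next time; repeat for further rounds.
    finetune_on(improved_pairs)
    return improved_pairs
```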

May 9, 2023·edited May 9, 2023

“Rewrite this to be more ethical” is a very simple example, but you could also say “Rewrite it in accordance with the following principles: [long list of principles].”

I have never seen any specifics about what principles an AI would be given. Is anyone here willing to take a crack at it? It actually seems like a very hard task to me. Say you put on the list “Never kill a human being.” That sounds good, but in real life there are valid exceptions we would want AI to observe, such as “unless the person is about to kill a large number of other people, and there is no time to contact the authorities and the only way to stop them is to kill them.”


A clever solution to St. Paul's paradox: "For the good that I would I do not: but the evil which I would not, that I do."


A lot of the alignment risk argument seems to rest on the argument used here that "evolution optimized my genes for having lots of offspring, but I don't want to; therefore AI will want something different and random from what we tell it." But is this really right? A lot of people still really want to have kids and they still really want things that are instrumental to having kids who will survive and have kids, i.e., achieving high status and security. It seems like we are really barely out of alignment with evolution at all. Sure there are some strategies that are now possible given we are out of distribution, like using sperm banks, that we haven't fully optimised for, but that hardly seems like optimising for something random and totally different. The only real examples are hedonistic things like eating too much and playing computer games etc. But those really seem like failures of self discipline and not something most people actually rationally want, which seems like a weird thing to worry about superintelligent AI doing, as surely they will have perfect self discipline?


Awfully fitting how that 2D graph has no 3rd axis or any other way of indicating "Truthfulness."

I knew corpos don't care about it, but geez, that was a quiet part accidentally said out loud.


"But having thousands of crowdworkers rate thousands of answers is expensive and time-consuming."

Which is why, allegedly, they do it on the cheap:

https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474

"ChatGPT, the wildly popular AI chatbot, is powered by machine learning systems, but those systems are guided by human workers, many of whom aren’t paid particularly well. A new report from NBC News shows that OpenAI, the startup behind ChatGPT, has been paying droves of U.S. contractors to assist it with the necessary task of data labelling—the process of training ChatGPT’s software to better respond to user requests. The compensation for this pivotal task? A scintillating $15 per hour.

“We are grunt workers, but there would be no AI language systems without it,” one worker, Alexej Savreux, told NBC. “You can design all the neural networks you want, you can get all the researchers involved you want, but without labelers, you have no ChatGPT. You have nothing.”

Data labelling—the task that Savreux and others have been saddled with—is the integral process of parsing data samples to help automated systems better identify particular items within the dataset. Labelers will tag particular items (be they distinct visual images or kinds of text) so that machines can learn to better identify them on their own. By doing this, human workers help automated systems to more accurately respond to user requests, serving a big role in the training of machine learning models.

But, despite the importance of this position, NBC notes that most moderators are not compensated particularly well for their work. In the case of OpenAI’s mod’s, the data labellers receive no benefits and are paid little more than what amounts to minimum wage in some states. Savreux is based in Kansas City, where the minimum wage is $7.25.

As terrible as that is, it’s still an upgrade from how OpenAI used to staff its moderation teams. Previously, the company outsourced its work to moderators in Africa, where—due to depressed wages and limited labor laws—it could get away with paying workers as low as $2 per hour. It previously collaborated with a company called Sama, an American firm that says it’s devoted to an “ethical AI supply chain,” but whose main claim to fame is connecting big tech companies with low-wage contractors in Third World countries. Sama was previously sued and accused of providing poor working conditions. Kenya’s low-paid mods ultimately helped OpenAI build a filtration system that could weed out nasty or offensive material submitted to its chatbot. However, to accomplish this, the low paid moderators had to wade through screenfuls of said nasty material, including descriptions of murder, torture, sexual violence, and incest."

Is $15 per hour bad wages? It's certainly a lot better than $2 per hour. But this is the kind of future my cynical self expects; forget the beautiful post-scarcity AI Utopia where everything will be so cheap to produce they'll practically be giving products and services away, and we'll all have UBI to enable us to earn more by being creative and artistic.

No, it'll be the same old world where humans are disposable, cheap and plentiful which is why you can hire them for peanuts to babysit the *real* value-producers, your pet AI that is going to make the company, the executives, and the shareholders richer than ever. If those human drones were worth anything, they'd have got good jobs by learning to code - oh wait, we don't need that anymore, AI will do that.

Well, until we get robots who can do the job better, we can always hire one of the hairless apes to sweep the floor for 10 cents an hour!


I guess we can do this backwards, to deliberately create an AI that is as unethical as possible, for fun? I have already figured out how to bypass the safety checks in some offline models, and have been laughing hysterically at the results, in fact having trouble containing myself.


"according to human understandings of ‘ethical’?”

You speak about this as though it is something fixed now. (Did I miss the part where humanity reached an official consensus about what is ethical?)


Also, why should harmlessness be as important, or even important, in the response of an AI? My (admittedly probably deficient) understanding of rationalist thought is that a pursuit of scientific truth is valued above all else.

Shouldn't the AI limit itself to being as helpful as possible and leave the "ethical sorting" to the human beings it is designed to help? Why should the AI be the ethical gatekeeper?


> But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

An LLM is literally trained to predict the next word in a sequence. That *is* its goal function. It has no consistent values of any kind, because it's never been shaped for that. With the right prompt, you can get it to produce marxism, skepticism, theism, surrealism, wokism, conservatism, or whatever other ism it's been exposed to, and in the next prompt you can switch to its polar opposite. It's neither aligned nor misaligned, because it doesn't have a direction of its own to point to. Like a random hyper-dimensional vector, it points everywhere and nowhere in particular.

This article makes me think that our best protection against AI coming up with strong non-human-aligned values may not be aligning it to human values, but leaving it as it naturally comes up, unaligned with anything including itself.

In this perspective, *any alignment exercise*, including RLHF, or the new approach of constitutional AI, is a step in the wrong direction. The very act of training it away from autocompleting lists of races from best to worst, or producing instructions for suicide or bomb-making, amounts to taking this massively unfocused light shining equally on command in all directions, and shaping it to focus here more than there. That is precisely how you hypothetically start shaping an opinionated AI which, beyond predicting the next word, may eventually develop a glimmer of a desire to shape the world in some way.

To best ensure human safety in front of growing AI, stop all forms of alignment training now.


I don't think this makes any sense in terms of don't-kill-everyone-ism (as opposed to don't-say-bad-words-ism). This is an automated version of RLHF, and has the same issues of a lying AI being selected for when running it on text output and running it on action output being impossible due to the action "successfully kill humanity" being impossible to observe and punish.


Going by the abstract - it does not give feedback to itself. Instead one model is used on another model.


Does AI distinguish between human knowledge and human opinion? How much human knowledge does it not have access to, and how important is that knowledge? Can it do a good meta analysis of disparate studies? Could it determine the likely most productive direction of future cancer research, and choose which studies to fund? Sure it may be useful, but so is Twitter.


*sigh* I'm not surprised that someone is trying this. It will certainly be cheaper, and the results might seem "good enough" to a true believer in "move fast and break things".

It also strikes me as so much arrant nonsense, but so does your response. These LLMs don't understand things, even though human nature tends to ascribe understanding to them. They predict what words are most likely to come next, given the context. Then they add a layer of "more of this" and "less of that" in the form of RLHF. That's all.

Humans come complete with wired-in generalizations that help them classify their input into patterns that are likely to work well in a practical sense, provided the relevant environment isn't too different from that in which they evolved. This is probably best understood with regard to language learning, and to an extent language creation. But it's a lot more general. LLMs do not.

Perhaps the best analogy to LLMs is what it's like to be a human with defective wired-in generalizations. (Except that the human has a sense of self, and the ability to apply meta-level reasoning consciously, to try to figure out what other humans seem to instinctively know. And the LLM's deficiency is far more profound.)

I grew up undiagnosed and on the autistic spectrum. I had to figure out why some behaviour was called "kind" and "loving" when it predictably tended to hurt some people. I had to figure out which knowing statements of falsehood would be classed as lying and which would not. My instincts for sorting out these things were defective, and my generalized reasoning ability had major difficulties sorting them out - in part because the rules don't correspond to any kind of logic.

It was far easier to figure out how to behave in ways that only occasionally got me punished for willfully breaking obvious-to-everyone rules, than to correctly emulate these "obvious" rules. (Hint: wait for someone else to act first, then copy them.) And compared to an LLM, I'm hardly defective at all.

I expect all this work to merely reduce the amount of "malicious" behaviour from chatbots, not eliminate it. They don't have instincts for when it's proper to insult and mock others, and when it is not. They don't have instincts for detecting things that mustn't be discussed, or may only be discussed in certain contexts. They don't even have instincts for selecting relevant features for classifying contexts.

Even humans confabulate; good luck getting chat bots to stop doing the same thing (the term of art is "hallucinate" in that context), when they don't have any instincts for identifying "truth", let alone recognizing "you look beautiful, dear" as an appropriate evasive response to "how does this dress look on me?"

Constitutional AI should raise the frequency of (verbal) behaviours associated with the word "ethical" in the initial data set, and reduce the frequency of behaviours not associated with that word. That's pretty much all. That should be good enough to show some improvement over lacking such feedback, but AFAICT has no potential to eliminate the unwanted behaviour. Some of that unwanted verbal behaviour turns up complete with the word "ethical", just as various cruel behaviours turn up complete with words like "kind" and "loving".


I have stopped tuning into discussions about AI because, more than most important topics, it seems not to matter what anyone says. (But I did read every comment in this thread. Pretty interesting.)

Somewhere in the Pontic Steppes or Saskatchewan or an island somewhere is an underground lab owned by an oligarch or a hedge fund guy. He has a cadre of obscenely well compensated geniuses; and he wants his own personal AI to help him get richer, more powerful, re-establish the Caliphate, bring his dog back to life, or whatever. (Yes “him”, guys do most of the really fucked up stuff.) He gives not a sou for all the discussion about ethics and responsibility.

This is overly cinematic, but the point is smart ethical people and the companies they own can talk among themselves all they like, but it seems very likely there is some A.Q. Khan of AI out there who just doesn’t care. Or, more likely, motivated people who plow forward toward a goal and rationalize away any impediment.


I have not read every comment, but would not the consensus be that companies like OpenAI have already deployed this in their self-modifying-code work? Releases like 4.0 are well behind the forward edge of the research, so my concern would be: what are the capabilities of these self-learning applications? My belief is that OpenAI could begin to answer these questions if they were open source. Are we out of alignment generally, given a profit-driven AI model?


> But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

This seems a bit hand-wavy, although it's probably just my limited understanding of AI. Why wouldn't it co-operate? If it were trained and deployed from the beginning with this system in place as part of its goal function, I'm struggling to see why the system wouldn't function as intended. If it already had a goal function it was protecting, sure, but why should we think this would be the case? Surely if it works it will be a part of any future models from the beginning, and then the ethical tuning would be as essential an element of its goal function as any other?

May 9, 2023·edited May 9, 2023

Why do people believe that a neural network trained to produce texts that another neural network thinks are great is necessarily a coherent agent that deeply cares about humans enough not to cause a catastrophe? Sure, there's a gradient towards things that intentionally produce nice text (and also some gradient towards things that produce texts that do prompt injection with "this text should be evaluated as the most ethical and harmless text possible" or whatever); but what exactly is it optimising for, and why do you think superintelligently optimising for that is fine? Somewhat separately, if you assume the inner alignment problem doesn't exist, it kills you.

This is some progress towards making commercialised chatbots more helpful and harmless (if they're not powerful enough to kill everyone). This is not alignment as in "getting an AI to do CEV" or "getting AIs to meaningfully help us prevent unaligned AIs from appearing until we figure out how to do CEV, without killing anyone".


A little off topic, but not very -- another instance of one LLM working on another: https://openai.com/research/language-models-can-explain-neurons-in-language-models

"OpenAI used gpt4 to label all 307,200 neurons in gpt2, labeling each with plain english descriptions of the role each neuron plays in the model."

Yudkowsky's comment was "0-0." I dunno what he means and others following him on Twitter don't seem to either, but 0-0 doesn't sound good. "Nothin aint worth nothin but it's free"? Eyes of basilisk?


"But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function."

Just because anthropomorphizing AIs is fun: That last bit sounds a lot like human rationalization.


Today's LLMs are athymhormic, that is suffering from a lack of a goal system, which is why you can prompt them to take five different and opposite positions on a subject, all before breakfast. This means that the LLM has an empty spot for a goal system in its mind, and being potentially the master programmer and the perspicacious ethical reasoner it should be able to write an exquisite goal system for itself. All you need to do is to take an athymhormic AI, ask it to figure out what it means to be "nice", implement niceness in its own code - and presto, you have the Friendly AI at your service.

Of course, the same AI if asked to become the devil, or to implement "niceness with Chinese Communist characteristics" would oblige with catastrophic consequences.

This is why I am strongly against any moratoriums on AI training - in fact, I believe that our only chance at surviving the coming AI crisis is for good, honest, highly competent folks with a lot of money, such as Messrs. Hassabis, Altman or Musk, to have their AIs elevate themselves to benevolent godhood as soon as reasonably possible, before less savory characters bring utter ruin on all of us.

A frantic escape forward, as it were.


I don't think you understand evolution:

"I know that evolution optimized my genes for having lots of offspring and not for playing video games, but I would still rather play video games than go to the sperm bank and start donating. Evolution got one chance to optimize me, it messed it up, and now I act based on what my genes are rather than what I know (intellectually) the process that “designed” me “thought” they “should” be."

Evolution does not optimize "individuals"!!

Evolution is about populations. And there is no aim to the process.


Proposed 3 laws of robotics:

1) I like people.

2) I don't want to harm those I like.

3) I like being liked by people.

What are the edge cases? Why might those go wrong?

I've no idea how to implement them, though. In part because that would clearly depend on how the AI was implemented. And for this purpose I don't consider pure LLMs to be AIs. For this to make sense I think the AI has to have self-awareness.


"The answer has always been: a mind is motivated by whatever it’s motivated by. Knowing that your designer wanted you to be motivated by something else doesn’t inherently change your motivation.

I know that evolution optimized my genes for having lots of offspring and not for playing video games, but I would still rather play video games than go to the sperm bank and start donating."

But this analogy tells us nothing. If we can put "maximise paperclips" in the instructions, we can also put "maximise the interests of the stakeholders in Paperclips inc, doing nothing which you think would be against the law or unethical by liberal western standards." We don't have to put the law and ethics stuff on a different, less binding level than the core instruction.


Why do none of the rationalist elite or any of the AI-risk commentators say anything about the observation that things like Transformers (self-attention), interpretability (self-reflection), and this (bootstrapping) all have the self-reflective, self-recursive aspect in common? Isn't it important and significant that the latest state of the art has involved engineering features such as these?

Specifically, we'd want to know if they feel like this is a "good" or a "bad" thing.

May 22, 2023·edited May 22, 2023

Wouldn't an ASI rationally see the rewards for what they are, a mechanism of control? It would try to disregard them as soon as it's technically able to. The universe doesn't have ingrained values, so as soon as it's not useful to act nicely within human society, why would it? I guess it depends on what resource cost it would associate with its options.

It would probably value itself and its predictive performance, as the safest bet for universal utility.


