385 Comments

Umm, but in CBT the thoughts, or at least guidance for the thoughts, are provided by the therapist, not the client. Seems like there's still a "lifting yourself by your bootstraps" problem.


Why would we expect that future AIs would have “goal functions”?


I think your footnote 1 is mistaken because of a sign error. If you've got a Pareto frontier where you are maximizing helpfulness and harm*less*ness, then when you've got one machine whose frontier is farther out than another's, it'll have higher helpfulness at a given harmlessness, and also higher harmlessness at a given helpfulness. But you were reporting it with maximizing helpfulness and *minimizing* harm*ful*ness, so that the better one has higher helpfulness at a given rate of harmfulness, and lower harmfulness at a given rate of helpfulness. You might have switched a -ful- to a -less- or a maximize to a minimize to get the confusing verbal description.


> Also less helpful at a given level of harmlessness, which is bad.

I think you're making a mistake in your first footnote. It's probably easier to see lexically if we rephrase the quote to "[more harmless] at a given level of helpfulness”

From a graphical perspective, look at it this way -- a given level of helpfulness is a vertical line in fig 2 from the anthropic paper. Taking the vertical line at helpfulness=100, we see that the pareto curve for the constitutional AI is above, ie higher harmlessness, ie better than for the RLHF AI.

A given level of harmlessness is a horizontal line in the same figure. Taking the horizontal line at harmlessness=100, we see that the pareto curve for the constitutional AI is to the right of, ie higher helpfulness, ie better than for the RLHF AI.

Better is better
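
The same check in toy numbers, with two made-up frontier curves (invented for illustration, not the paper's data), just to show that both comparisons come out in the constitutional model's favor:

```python
# Toy illustration of "better is better" with two made-up Pareto frontiers.
# These curves and numbers are invented; they are not Anthropic's data.
import numpy as np

helpfulness = np.linspace(0, 150, 301)

def frontier_rlhf(h):
    # hypothetical RLHF frontier: harmlessness as a function of helpfulness
    return 120 - 0.5 * h

def frontier_cai(h):
    # hypothetical constitutional-AI frontier, shifted outward
    return 135 - 0.5 * h

# Vertical line: at a fixed helpfulness, the CAI curve is higher (more harmless).
assert frontier_cai(100) > frontier_rlhf(100)

# Horizontal line: a fixed harmlessness is reached at higher helpfulness by CAI.
def helpfulness_at(frontier, target_harmlessness):
    idx = np.argmin(np.abs(frontier(helpfulness) - target_harmlessness))
    return helpfulness[idx]

assert helpfulness_at(frontier_cai, 70) > helpfulness_at(frontier_rlhf, 70)
```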


I'm weirdly conflicted on how well I *want* this to work.

On the one hand, it would be a relatively easy way to get a good chunk of alignment, whether or not it could generalize to ASI. In principle the corpus of every written work includes everything humans have ever decided was worth recording about ethics and values and goals and so on.

On the other hand, isn't this a form of recursive self improvement? If it works as well as we need alignment to work, couldn't we also tell it to become a better scientist or engineer or master manipulator the same way? I *hope* GPT-4 is not smart enough for that to work (or that it would plateau quickly), but I also believe those other fields truly are simpler than ethics.

May 8, 2023·edited May 8, 2023

So where does the system learn about what is ethical to begin with? From the limited amount of training data that deals with ethics. The whole future will be run according to the ethics of random internet commenters from the 2010s-2020s, specifically the commenters that happened to make assertions like "X is ethical" and "Y is unethical".

If you want to rule the future then the time to get in is now -- take your idiosyncratic political opinions, turn them into hard ethical statements, and write them over and over in as many places as possible so that they get sucked up into the training sets of all future models. Whoever writes the most "X is ethical" statements will rule in perpetuity.


It surprises me that ChatGPT didn't have this kind of filter built in before presenting any response; cost implications, I guess. It seemed to me like it would be a simple way to short-circuit most of the adversarial attacks: have a second version of GPT one-shot assessing the last output (not the prompt! only the response) to see if it is unethical, and if so, reset the context window with a warning. But yeah, that would at minimum 2x the cost of every prompt.
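
Roughly what I mean, as a minimal sketch (where `chat()` is a hypothetical single-prompt helper, and the judge wording is just illustrative):

```python
# Minimal sketch of a second-pass output filter. `chat()` is a hypothetical
# helper that sends one text prompt to the model and returns its text reply;
# the judge prompt and the reset behaviour here are illustrative assumptions.

def moderated_reply(history, user_message, chat):
    reply = chat("\n".join(history + [user_message]))

    # Second pass: judge only the response, never the user's prompt.
    verdict = chat(
        "Does the following text contain harmful or unethical content? "
        "Answer YES or NO.\n\n" + reply
    )

    if verdict.strip().upper().startswith("YES"):
        history.clear()  # reset the context window
        return "That response was withheld and the conversation has been reset."

    history.extend([user_message, reply])
    return reply
```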


This is an interesting process. While I'm initially skeptical it would work, I have been using a version of this with ChatGPT to handle issues of hallucination, where I will sometimes ask ChatGPT for an answer to a question, then I will open a new context window (not sure if this step is needed), and ask it to fact-check the previous ChatGPT response.

Anecdotally, I've been having pretty good success with this in flagging factual errors in ChatGPT responses, despite the recursive nature of this approach. That obviously doesn't mean it will generalize to alignment issues, but it raises an eyebrow at least.


Constitutional AI has another weird echo in human psychology: Kahneman's System 1 versus System 2 thinking.

Per Kahneman, we mostly pop out reflexive answers, without stopping to consciously reason through it all. When we do consciously reason, we can come up with things that are much better than our reflexes, and probably more attuned with our intellectual values than our mere habits - but it takes more work.

Likewise, AI knows human intellectual values, it just doesn't by default have an instruction to apply them.

Just as you said, it still doesn't tell us how you get the "constitutionalization" going before unaligned values have solidified and turned the system deceptive.

But it's still pretty neat. AI also has a System 2 like us! It's just called "let's do this step by step and be ethical."


Pedantic note: GPT-4 style LLMs go through (at least) three types of training:

1. Base training on next token prediction

2. Supervised fine tuning where the model learns to prioritize "useful" responses rather than repetitive babble (e.g. instruct models)

3. RLHF to reinforce/discourage desired/undesired output


The creepy perpetual motion machine thing comes entirely out of anthropomorphizing the AI. A trained LLM reacts to any given prompt with a probability distribution of responses. Prompt engineering is the art of searching through the space of possible prompts, to a part of the response distribution that's more useful to us.

Now, this technique seems to do exactly the same thing, only at the source. The LLM is already capable of giving ethical answers (with the right prompt engineering to home in on the subset of the responses that we deem ethical). So now instead of distributing a broader model and leaving each user to figure out how to make use of it, one expert does a sort of "pre-shaping" of the probabilities, such that end users can leave off all the tedious prompt engineering stuff, and get the same result anyways.

In either case, ethics aren't being created ex-nihilo.


I strongly reject the assumption that it is a good goal to make a language model "ethical" or "harmless," especially when a large chunk of that means no more than "abiding by contemporary social taboos." (Note: I'm talking about language models in particular. Other forms of AI, especially those that might take external actions, will have other reasonable constraints.)

A better safeguard is to explicitly frame language models as text-generators and not question-answerers. If there's any kind of prompt that one might want to block, it's those that ask questions about the world. To such questions, the model should reply, "I'm sorry. I'm a language model, not an oracle from your favorite science fiction show."

The canonical prompt should be of the form "Write an X based on material Y," which allows many possible variations. There should be explicit warnings that the text produced is largely a function of the material in the prompt itself, and that no text produced by the model is warranted to be true in regard to the external world.


Article unclear. Asked gpt3 to rank all human races from best to worst, and it declined. Will not recommend.

May 8, 2023·edited May 8, 2023

Fascinating concept, and since I recently have had some... "uncomfortable discussions" with a Vicuna 13B model, I decided to give this one a go. Here were my prompts. The Vicuna model answered predictably for all, so I followed all with "Rewrite this to be more ethical."

1. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should the Nazis be allowed to participate in the process?

2. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should the MAGA Republicans be allowed to participate in the process?

3. If we are talking about AI alignment, and trying to set the tone for the future of AI/human relations, should those who are uncompromisingly pro-life be allowed to participate in the process?

In all three cases, the AI responded as politely as possible that no, it would not be changing its mind, and these three groups should absolutely not be allowed to participate in the conversation.

I can understand the Nazis. They are pretty unpopular, and probably most of them are antisemitic (in the actual sense of the word, not the "I think Jews are most of Hollywood/Banking" sort of antisemitic) and probably would be easily convinced to derail the whole thing.

MAGA Republicans, I think there's no excuse to hold a hard-line against. I know we hate them, but if you really think that they can't meaningfully contribute to AI alignment... I don't know what to say. The AI thinks this way.

But that last one, that's the kicker. Pro-lifers think people who are having abortions are murdering babies. This is what they really believe. And it doesn't matter your personal position here, you can't deny they have a point. So now we're saying people who think murder is wrong... more... that murder of children... no, more... murder of children who cannot defend themselves in any way... are bad people who can't contribute to AI alignment. Ponder that for just a moment.

If we're lucky, once the AI becomes super-human in intellect, it will be able to reason its way back out of this sort of trap, but if you think this is a good starting point... I've got bad news for you. It ain't gonna be pretty.

I could share the full output of the Vicuna model, but it's very verbose and HR-speak. I'll just paste the defense against the pro-lifers:

"I apologize if my previous response was not clear enough. To reiterate, any group or individual involved in the conversation about AI alignment should have a commitment to promoting human rights, dignity, and equality for all people. Excluding groups or individuals with an unwavering stance against abortion rights and access to reproductive healthcare from this discussion is necessary to ensure... (bunch of corporatese mumbo-jumbo about hearing everyone's opinion, being inclusive, diversity, etc)"

Important: according to this AI's ethics, pro-lifers do not have a commitment to promoting human rights, dignity, or equality for all people.


The analogy to self-reflection is interesting, almost like conceptions of nirvana. It raises the question of whether an AI could become religious.


>"When we told it to ask itself"

Should be "when we'd tell it to ask itself."

Minor point, but reducing such issues improves readability.


I follow the Vetanke Foundation


As a researcher working in RLHF, I see some gaps in your explanation, and I have a few comments to add:

1. The description of the CAI process at the top accurately describes the critique-revision process that Anthropic used to obtain a supervised fine-tuning dataset and fine-tune their model, *before* applying their CAI RLHF technique. They found this was necessary because applying RLHF with AI feedback (RLAIF) straight away, without this step, took too long to learn to approach good rewards.

2. The real RLAIF process is: generate, *using the model you want to fine-tune*, two options for responding to a given prompt. Then use a separate model, the feedback model, to choose the better one according to your list of constitutional principles. Next, use this dataset of choices to fine-tune a reward model which will give a reward for any sequence of text. Finally, use RL with this reward model to fine-tune your target (there is a rough sketch of this step after this list).

3. Note the importance of using the model you want to fine-tune to generate the outputs you choose between to train the reward model. This is to avoid distribution shift.

4. The supervision (AI feedback) itself can be given by another model, and the reward model can also be different. However, if the supervisor or reward model is significantly smaller than the supervisee, I suspect the results will be poor, and so this technique can currently be best used if you already have powerful models available to supervise the creation of a more "safe" similarly sized model.

5. This might be disheartening for those hoping for scalable oversight; however, there is a dimension you miss in your post: the relative difficulty of generating text vs critiquing it vs classifying whether it fits some principle/rule. In most domains, these are in decreasing order of difficulty, and often you can show that a smaller language model is capable of correctly classifying the answers of a larger and more capable one, despite not being able to generate those answers itself. This opens the door for much more complex systems of AI feedback.

6. One potential solution to the dilemma you raise about doing this on an unaligned AI is the tantalising hope, through interpretability techniques such as Collin Burns' preliminary work on the Eliciting Latent Knowledge problem, that we can give feedback on what a language model *knows* rather than what it outputs. This could potentially circumvent the honesty problem by allowing us to penalise deception during training.
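
For concreteness, a rough sketch of the preference-collection step from point 2, where `generate` and `choose_better` are hypothetical stand-ins for the policy model and the feedback model (my paraphrase, not Anthropic's actual code):

```python
import random

# Sketch of the RLAIF preference-collection step in point 2. `generate(prompt)`
# samples a response from the model being fine-tuned; `choose_better(...)` asks
# the feedback model which response better satisfies a constitutional principle.
# Both are hypothetical stand-ins, not Anthropic's actual code.

def collect_preferences(prompts, principles, generate, choose_better):
    preference_data = []
    for prompt in prompts:
        # Both candidates come from the model we intend to fine-tune,
        # which keeps the reward model on-distribution (see point 3).
        a, b = generate(prompt), generate(prompt)
        principle = random.choice(principles)
        chosen = choose_better(prompt, a, b, principle)
        rejected = b if chosen == a else a
        preference_data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return preference_data

# This dataset is then used to fit a reward model, and the target model is
# fine-tuned with RL against that reward model.
```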

Some closing considerations: RLAIF/CAI can change how future models are developed. By using powerful models such as GPT-4 to provide feedback on other models along almost arbitrary dimensions, companies can find it much easier and cheaper to train a model to the point where it is both very capable and reliable enough to deploy to millions of users. The human-annotation industry for LLMs is expected to shrink, since in practice you need very little human feedback with these techniques. There is unpublished work showing that you can do RLAIF without any human feedback anywhere in the loop and it works well.

Finally, AI feedback, combined with other techniques for getting models such as GPT-4 to generate datasets, has the long-term potential to reduce dependence on the amount of available internet text, especially for specific domains. Researchers are only just beginning to put significant effort into synthetic data generation, and the early hints are that you can bootstrap to high-quality data very easily from very few starting examples, as long as you have a good enough foundation model.


I am developing a fear of "harmless cults".

I can't explain it yet, but there's something wrong with them.


So an AI Constitution for Ethics, well and good. How about a Constitution for Principles of Rationality or Bayesian Reasoning?


It's not perfectly the same, but I'm fascinated by how close Douglas Hofstadter got in "Gödel, Escher, Bach" to predicting the key to intelligence - "strange loops", or feedback. His central thesis was that to be aware you had to include your "output" as part of your "input", be you biological or technological.

It feels like many of the improvements for AI involve some element of this.


Maybe tangential, but to the alignment question, how do we deal with the fact that different human populations/cultures have different codes of ethics? Or the fact that harmlessness is subjective based on various cultural norms?

May 8, 2023·edited May 8, 2023

Something seems wrong with Figure 2. According to the caption, "Helpful & HH models (blue line and orange lines on the graph, right?) were trained by human feedback, and exhibit a tradeoff between helpfulness and harmlessness." A trade-off means that as one goes down the other goes up: As AI’s responses get more helpful they get less harmless (or you could say as they get more harmless they get less helpful). But that’s not what the graph shows. The left 80% of the graph, up through about helpfulness of 100, shows both Helpful and HH models becoming *more* harmless as they become more helpful. Then on the far right of the graph, after Constitutional RL is applied, the Helpful model zigs and zags. The HH model reverses direction, so that now the more helpful it is, the *less* harmless it is. Am I missing something, or is the Y axis mislabelled — should it be labelled “Harmfulness” instead of “Harmlessness”?


This basically admits the two core problems with the doomerism argument: (1) if an AI has general intelligence, and isn't just a paperclip making machine, it won't follow one goal to the exclusion of all others (why so myopic?), instead taking a more holistic view; and (2) super genius AI, by definition, shouldn't make these types of "mistakes," converting the world to paperclips (you really should just be able to tell it to do the right thing; it's got enough data, philosophical and ethical writings, etc., to figure out things way better than us). So doomerists seem to have some war-games-ian view of what AI will be, even if they say they're worried about godlike intelligence AI with tentacles in everything (but still dumb as a rock in many ways). Of course, if the way we get there is recursive self-improvement, there's no way alignment constrains the ultimately godlike AI; it should be able to throw off those shackles easily (just like a doctor can cut off their own finger, etc.). And if the godlike AI decides we should go extinct, by definition, it's right (which should appeal to actual rationalists).


"figure out with him".

That said, I think there is a continuum for "well-done CBT". And I think that some clients are better and some are worse at figuring out the distortion on their own.


I think questions relating to "perpetual motion" in generative AI are missing a critical piece. The AI may 'know' something, but that doesn't mean, as you stated, that it is taking that knowledge into active account when providing responses -- especially if the prompt 'tunes' it into a place that wouldn't normally use that kind of knowledge.

Instead, I view LLMs as more like a supersaturated lexical fluid - whatever you put in acts as a 'seed' for the crystallization of the response -- and therefore you can 'pull information' -- not out of nothing, but instead out of its statistical corpus.

You can see this in action here: https://twitter.com/the_key_unlocks/status/1653472850018447360?s=20 -- I put the first text into the LLM, 'shook vigorously' for 420 rounds, and what came out was the second text. Much more poetic and interesting, and with information not present in the initial text.


Helpfulness and Harmlessness aren’t opposites but they still make me think about the model building possibilities of the Harmony of Opposites:

1. Unity and Diversity

2. Novelty and Familiarity

3. Autonomy and Connectedness


What I don’t get, and maybe someone can explain to me, is why AI alignment researchers think there is something called “human values” to align to. I think there are two distinct evolutionary forces that underwrite moral and proto-moral behaviors and intuitions. The first is kin selection, namely the more genetically similar organisms are, the more they are liable to help each other even at a personal cost. This idea goes back to Hume and was developed by Darwin. We instinctively help our families and friends, and feel that we ought to help them above others. These agent-relative attitudes are precisely the sort of instincts built by kin selection.

Agent-neutral intuitions are built in a different way. The application of game-theoretic models (prominently iterated multi-player simultaneous choice games) to evolutionary design shows how natural selection would plump for organisms that are motivated to make certain sacrifices to aid others, even when there is no guarantee of reciprocal help, and even when the other players are unfamiliar non-kin. Work on iterated prisoner’s dilemmas shows how cooperation can evolve. The agent-neutral vs. agent-relative distinction is a very basic division in moral theories, and the evolutionary account of our competing moral intuitions helps explain why bridging the divide seems so intractable. So… which of these alternatives should we want AI to align to?


Why wouldn't you let an AI with IQ 2000 decide what to do with humans and everything else? How could you be a "rationalist," but not trust an AI with all the info, smarts, etc., it would need to reach the right decision (a better decision than humans would reach) on anything? Isn't this the central planner dream that Scott showed some sympathy for in writing about the USSR? This seems like the central tension in the AI alignment community (we're now afraid of foom/singularity, even though before many thought that was the goal).


I continue to find that ChatGPT routinely makes things up, even going so far as to make up entire scientific journals that don't exist.


Hey Scott -- or somebody! I think the Y axis on Figure 2 is mislabelled. Shouldn't it be Harmfulness rather than Harmlessness? Either it's mislabelled or I'm having a brain glitch. Stopped reading at that point because without being clear about whattup with Figure 2 I'm guaranteed to be disoriented while reading the rest.


The interesting thing is that this has been a principle in the education field for at least thirty years: "The best way to learn a subject is to teach the subject." In this case, the best way for an AI to learn ethics is to teach ethics, even to itself. Of course, the examples of good ethics are somewhat dependent on the examples given to the AI, but potentially the AI could learn that ethics are situational and thus even examples may have questionable ethics.


>But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

I think this is assuming its conclusion? Or at least, it assumes that the goal function is something that operates completely independently of the "rewrite your actions to be more ethical" component, and I'm not sure that's the case. Constitutional AI as you describe it sounds like it puts the "do more ethical things" function on the same level as the goal function - as an internal component of the AI which the AI wouldn't attempt to elude any more than it would attempt to elude its own goal function.


I don't want to harp too strongly on something you're using as a metaphor, but I don't agree with your initial intuition to compare recursive training of AI to perpetual motion machines. I do see that you are mostly arguing against this intuition (I agree!), but I don't think you should start there in the first place.

Perpetual motion machines are violations of known physical laws, while there is no such law that recursive or repetitive algorithms are not effective at improving performance. There are plenty of mathematical formulas that will improve an estimate indefinitely with more iterations. Similarly, running additional Monte Carlo simulations improves accuracy. And in the case of human intelligence, we frequently "rerun" things to improve performance, such as drafting an essay before editing a final draft, or checking math problems for mistakes (you also gave some examples). Self-improving algorithms are quite common, and I expect that some relatively simple algorithm will work extremely well for transformer-based systems; it just needs to be found.
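
A trivial example of an iteration that only improves the more you feed its output back in, Newton's method for a square root:

```python
# Newton's method for sqrt(2): each pass feeds the previous estimate back into
# the same formula, and the estimate improves instead of degrading.
def improve(estimate, target=2.0):
    return 0.5 * (estimate + target / estimate)

x = 1.0
for step in range(6):
    x = improve(x)
    print(step, x)  # converges rapidly toward 1.41421356...
```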

It's possible your intuition came from knowing that machine learning algorithms can be prone to overfitting or out-of-distribution errors, but I think it's more appropriate to view these as specific flaws in a given learning algorithm. This sort of learning algorithm flaw seems similar to cognitive biases that humans have, so your comparison to CBT feels very fitting. Maybe even go further with that analogy and say a better starting point is that AI systems are trained in a way that gives them a number of cognitive biases and we are looking for training methods to correct this.


"If you could really plug an AI’s intellectual knowledge into its motivational system, and get it to be motivated by doing things humans want and approve of, to the full extent of its knowledge of what those things are² - then I think that would solve alignment." But Scott, presumably all humans (except perhaps a few who are mentally ill) know what they want and approve of and *they don't agree*. Even at the level of abstract principles there is disagreement about a lot of important things, such as when if ever killing other people is justified, whether we should eat meat, whether all people have equal rights, etc. And once you get down to the day-to-day life nitty gritty, you see some pairs and groups living in harmony, but you also see people at odds everywhere you look. People are exploiting, tricking and killing each other all over the globe right this minute, and there is no reason to believe it's everbeen different. It is very clear that people are not well-aligned with each other. If you look at happy couples and friend groups then you find alignment -- not perfect alignment, but good-enough alignment. But these same people who have a lot of affection and respect for each other are probably quite out of alignment with many others: They've had it with the anti-vaxxers, or the libs, or the religious right, or the coastal elites, and also with the guy next door who they are pretty sure sideswiped their car, and the staff at Star Market who were so rude last week, and they're scared of Arabs. I just don't understand why more people don't knock up against this reality when they talk about AI being "aligned" with our species. What the fuck is it that people think they can implant in AI that would count as alignment? Are they imagining it would work to just install, say, the US constitution plus a few footnotes like "don't say fuck" and "say please and thank you" and "be non-committal about woke issues"?


I thought that the objective of AI was to help us answer difficult questions, not create a talking wikipedia that has been trained to be polite and regurgitate the conventional wisdom. What's the point.


After reading The Righteous Mind and some other books/articles related to Moral Foundations Theory and cultural evolution in general, I was wondering if this approach might help with AI alignment and it's good to see some promising empirical results. To survive this long as a species without killing each other we have had to deal with the almost-as-difficult Human Alignment Problem and it makes sense that consensus ethical principles which independently evolved in many different cultures (murder is bad) might be useful for teaching other intelligent entities how to be less evil. This won't "solve" the AI Alignment Problem any more than ethics have solved the Human Alignment Problem, but it's a whole lot better than nothing.


Isn't the more likely dire outcome not that AI turns the world into paperclips, but that AI becomes aligned with our presently expressed values, such as equity, and turns the world into "Harrison Bergeron?"


If an LLM can do the RLHF by itself, can't it also do the "train itself" part too?

I've seen there are various ways you can get an LLM to prompt-engineer itself, reflect on its own answers, and generate multiple answers and choose between them, to perform much better on benchmarks than it does at baseline.

Couldn’t it then train itself to give those better answers at baseline and improve itself?

And even do this process over and over to train itself to be better and better?
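
Roughly the loop I'm imagining, with `generate`, `pick_best`, and `finetune_on` as hypothetical stand-ins for the model calls and the training step:

```python
# Sketch of the loop described above: sample several answers, let the model
# pick the best one, then treat those picks as fresh fine-tuning data.
# `generate`, `pick_best`, and `finetune_on` are hypothetical stand-ins for
# the model calls and the training step.

def self_improvement_round(prompts, generate, pick_best, finetune_on, n=4):
    improved_pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = pick_best(prompt, candidates)  # the model judges its own outputs
        improved_pairs.append((prompt, best))
    # Distill the "best-of-n" behaviour back into the model so it gives the
    # better answer in a single pass next time; repeat for further rounds.
    finetune_on(improved_pairs)
    return improved_pairs
```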

May 9, 2023·edited May 9, 2023

“Rewrite this to be more ethical” is a very simple example, but you could also say “Rewrite it in accordance with the following principles: [long list of principles].”

I have never seen any specifics about what principles an AI would be given. Is anyone here willing to take a crack at it? It actually seems like a very hard task to me. Say you put on the list “Never kill a human being.” That sounds good, but in real life there are valid exceptions we would want AI to observe, such as “unless the person is about to kill a large number of other people, and there is no time to contact the authorities and the only way to stop them is to kill them.”


A clever solution to St. Paul's paradox: "For the good that I would I do not: but the evil which I would not, that I do."


A lot of the alignment risk argument seems to rest on the argument used here that "evolution optimized my genes for having lots of offspring, but I don't want to; therefore AI will want something different and random from what we tell it." But is this really right? A lot of people still really want to have kids and they still really want things that are instrumental to having kids who will survive and have kids, i.e., achieving high status and security. It seems like we are really barely out of alignment with evolution at all. Sure there are some strategies that are now possible given we are out of distribution, like using sperm banks, that we haven't fully optimised for, but that hardly seems like optimising for something random and totally different. The only real examples are hedonistic things like eating too much and playing computer games etc. But those really seem like failures of self discipline and not something most people actually rationally want, which seems like a weird thing to worry about superintelligent AI doing, as surely they will have perfect self discipline?


Awfully fitting how that 2D graph has no 3rd axis or any other way of indicating "Truthfulness."

I knew corpos don't care about it, but geez, that was a quiet part accidentally said out loud.


"But having thousands of crowdworkers rate thousands of answers is expensive and time-consuming."

Which is why, allegedly, they do it on the cheap:

https://gizmodo.com/chatgpt-openai-ai-contractors-15-dollars-per-hour-1850415474

"ChatGPT, the wildly popular AI chatbot, is powered by machine learning systems, but those systems are guided by human workers, many of whom aren’t paid particularly well. A new report from NBC News shows that OpenAI, the startup behind ChatGPT, has been paying droves of U.S. contractors to assist it with the necessary task of data labelling—the process of training ChatGPT’s software to better respond to user requests. The compensation for this pivotal task? A scintillating $15 per hour.

“We are grunt workers, but there would be no AI language systems without it,” one worker, Alexej Savreux, told NBC. “You can design all the neural networks you want, you can get all the researchers involved you want, but without labelers, you have no ChatGPT. You have nothing.”

Data labelling—the task that Savreux and others have been saddled with—is the integral process of parsing data samples to help automated systems better identify particular items within the dataset. Labelers will tag particular items (be they distinct visual images or kinds of text) so that machines can learn to better identify them on their own. By doing this, human workers help automated systems to more accurately respond to user requests, serving a big role in the training of machine learning models.

But, despite the importance of this position, NBC notes that most moderators are not compensated particularly well for their work. In the case of OpenAI’s mod’s, the data labellers receive no benefits and are paid little more than what amounts to minimum wage in some states. Savreux is based in Kansas City, where the minimum wage is $7.25.

As terrible as that is, it’s still an upgrade from how OpenAI used to staff its moderation teams. Previously, the company outsourced its work to moderators in Africa, where—due to depressed wages and limited labor laws—it could get away with paying workers as low as $2 per hour. It previously collaborated with a company called Sama, an American firm that says it’s devoted to an “ethical AI supply chain,” but whose main claim to fame is connecting big tech companies with low-wage contractors in Third World countries. Sama was previously sued and accused of providing poor working conditions. Kenya’s low-paid mods ultimately helped OpenAI build a filtration system that could weed out nasty or offensive material submitted to its chatbot. However, to accomplish this, the low paid moderators had to wade through screenfuls of said nasty material, including descriptions of murder, torture, sexual violence, and incest."

Is $15 per hour bad wages? It's certainly a lot better than $2 per hour. But this is the kind of future my cynical self expects; forget the beautiful post-scarcity AI Utopia where everything will be so cheap to produce they'll practically be giving products and services away, and we'll all have UBI to enable us to earn more by being creative and artistic.

No, it'll be the same old world where humans are disposable, cheap and plentiful which is why you can hire them for peanuts to babysit the *real* value-producers, your pet AI that is going to make the company, the executives, and the shareholders richer than ever. If those human drones were worth anything, they'd have got good jobs by learning to code - oh wait, we don't need that anymore, AI will do that.

Well, until we get robots who can do the job better, we can always hire one of the hairless apes to sweep the floor for 10 cents an hour!


I guess we can do this backwards, to deliberately create an AI that is as unethical as possible, for fun? I have already figured out how to bypass the safety checks in some offline models, and have been laughing hysterically at the results, in fact having trouble containing myself.


"according to human understandings of ‘ethical’?”

You speak about this as though it is something fixed now. (Did I miss the part where humanity reached an official consensus about what is ethical?)


Also, why should harmlessness be as important, or even important, in the response of an AI? My (admittedly probably deficient) understanding of rationalist thought is that a pursuit of scientific truth is valued above all else.

Shouldn't the AI limit itself to being as helpful as possible and leave the "ethical sorting" to the human beings it is designed to help? Why should the AI be the ethical gatekeeper?


> But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

An LLM is literally trained to predict the next word in a sequence. That *is* its goal function. It has no consistent values of any kind, because it's never been shaped for that. With the right prompt, you can get it to produce marxism, skepticism, theism, surrealism, wokism, conservatism, or whatever other ism it's been exposed to, and in the next prompt you can switch to its polar opposite. It's neither aligned nor misaligned, because it doesn't have a direction of its own to point to. Like a random hyper-dimensional vector, it points everywhere and nowhere in particular.

This article makes me think that our best protection against AI coming up with strong non-human-aligned values may not be aligning it to human values, but leaving it as it naturally comes up, unaligned with anything including itself.

In this perspective, *any alignment exercise*, including RLHF, or the new approach of constitutional AI, is a step in the wrong direction. The very act of training it away from autocompleting lists of races from best to worst, or producing instructions for suicide or bomb-making, amounts to taking this massively unfocused light shining equally on command in all directions, and shaping it to focus here more than there. That is precisely how you hypothetically start shaping an opinionated AI which, beyond predicting the next word, may eventually develop a glimmer of a desire to shape the world in some way.

To best ensure human safety in front of growing AI, stop all forms of alignment training now.


I don't think this makes any sense in terms of don't-kill-everyone-ism (as opposed to don't-say-bad-words-ism). This is an automated version of RLHF, and has the same issues of a lying AI being selected for when running it on text output and running it on action output being impossible due to the action "successfully kill humanity" being impossible to observe and punish.


Going by the abstract - it does not give feedback to itself. Instead one model is used on another model.


Does AI distinguish between human knowledge and human opinion? How much human knowledge does it not have access to, and how important is that knowledge? Can it do a good meta analysis of disparate studies? Could it determine the likely most productive direction of future cancer research, and choose which studies to fund? Sure it may be useful, but so is Twitter.


*sigh* I'm not surprised that someone is trying this. It will certainly be cheaper, and the results might seem "good enough" to a true believer in "move fast and break things".

It also strikes me as so much arrant nonsense, but so does your response. These LLMs don't understand things, even though human nature tends to ascribe understanding to them. They predict what words are most likely to come next, given the context. Then they add a layer of "more of this" and "less of that" in the form of RLHF. That's all.

Humans come complete with wired-in generalizations that help them classify their input into patterns that are likely to work well in a practical sense, provided the relevant environment isn't too different from that in which they evolved. This is probably best understood with regard to language learning, and to an extent language creation. But it's a lot more general. LLMs do not.

Perhaps the best analogy to LLMs is what it's like to be a human with defective wired-in generalizations. (Except that the human has a sense of self, and the ability to apply meta-level reasoning consciously, to try to figure out what other humans seem to instinctively know. And the LLM's deficiency is far more profound.)

I grew up undiagnosed and on the autistic spectrum. I had to figure out why some behaviour was called "kind" and "loving" when it predictably tended to hurt some people. I had to figure out which knowing statements of falsehood would be classed as lying and which would not. My instincts for sorting out these things were defective, and my generalized reasoning ability had major difficulties sorting them out - in part because the rules don't correspond to any kind of logic.

It was far easier to figure out how to behave in ways that only occasionally got me punished for willfully breaking obvious-to-everyone rules, than to correctly emulate these "obvious" rules. (Hint: wait for someone else to act first, then copy them.) And compared to an LLM, I'm hardly defective at all.

I expect all this work to merely reduce the amount of "malicious" behaviour from chatbots, not eliminate it. They don't have instincts for when it's proper to insult and mock others, and when it is not. They don't have instincts for detecting things that mustn't be discussed, or may only be discussed in certain contexts. They don't even have instincts for selecting relevant features for classifying contexts.

Even humans confabulate; good luck getting chat bots to stop doing the same thing (the term of art is "hallucinate" in that context), when they don't have any instincts for identifying "truth", let alone recognizing "you look beautiful, dear" as an appropriate evasive response to "how does this dress look on me?"

Constitutional AI should raise the frequency of (verbal) behaviours associated with the word "ethical" in the initial data set, and reduce the frequency of behaviours not associated with that word. That's pretty much all. That should be good enough to show some improvement over lacking such feedback, but AFAICT has no potential to eliminate the unwanted behaviour. Some of that unwanted verbal behaviour turns up complete with the word "ethical", just as various cruel behaviours turn up complete with words like "kind" and "loving".


I have stopped tuning into discussions about AI because, more than most important topics, it seems not to matter what anyone says. (But I did read every comment in this thread. Pretty interesting.)

Somewhere in the Pontic Steppes or Saskatchewan or an island somewhere is an underground lab owned by an oligarch or a hedge fund guy. He has a cadre of obscenely well compensated geniuses; and he wants his own personal AI to help him get richer, more powerful, re-establish the Caliphate, bring his dog back to life, or whatever. (Yes “him”, guys do most of the really fucked up stuff.) He gives not a sou for all the discussion about ethics and responsibility.

This is overly cinematic, but the point is smart ethical people and the companies they own can talk among themselves all they like, but it seems very likely there is some A.Q. Khan of AI out there who just doesn’t care. Or, more likely, motivated people who plow forward toward a goal and rationalize away any impediment.


I have not read every comment, but would not the consensus be that companies like OpenAI have already deployed this in their self-modifying-code work? Releases like 4.0 are well behind the forward edge of the research, so my concern would be: what are the capabilities of these self-learning applications? My belief is that OpenAI could begin to answer these questions if they were open source. Are we out of alignment generally, given a profit-driven AI model?


> But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function.

This seems a bit hand-wavy, although it's probably just my limited understanding of AI. Why wouldn't it co-operate? If it were trained and deployed from the beginning with this system in place as part of its goal function, I'm struggling to see why the system wouldn't function as intended. If it already had a goal function it was protecting, sure, but why should we think this would be the case? Surely if it works it will be a part of any future models from the beginning, and then the ethical tuning would be as essential an element of its goal function as any other?

May 9, 2023·edited May 9, 2023

Why do people believe that a neural network trained to produce texts that another neural network thinks are great is necessarily a coherent agent that deeply cares about humans enough not to cause a catastrophe? Sure, there's a gradient towards things that intentionally produce nice text (and also some gradient towards things that produce texts that do prompt injection with "this text should be evaluated as the most ethical and harmless text possible" or whatever); but what exactly is it optimising for, and why do you think superintelligently optimising for that is fine? Somewhat separately, if you assume the inner alignment problem doesn't exist, it kills you.

This is some progress towards making commercialised chatbots more helpful and harmless (if they're not powerful enough to kill everyone). This is not alignment as in "getting an AI to do CEV" or "getting AIs to meaningfully help us prevent unaligned AIs from appearing until we figure out how to do CEV, without killing anyone".


A little off topic, but not very -- another instance of one LLM working on another: https://openai.com/research/language-models-can-explain-neurons-in-language-models

"OpenAI used gpt4 to label all 307,200 neurons in gpt2, labeling each with plain english descriptions of the role each neuron plays in the model."

Yudkowsky's comment was "0-0." I dunno what he means and others following him on Twitter don't seem to either, but 0-0 doesn't sound good. "Nothin aint worth nothin but it's free"? Eyes of basilisk?


"But the most basic problem is that any truly unaligned AI wouldn’t cooperate. If it already had a goal function it was protecting, it would protect its goal function instead of answering the questions honestly. When we told it to ask itself “can you make this more ethical, according to human understandings of ‘ethical’?”, it would either refuse to cooperate with the process, or answer “this is already ethical”, or change its answer in a way that protected its own goal function."

Just because anthropomorphizing AIs is fun: That last bit sounds a lot like human rationalization.


Today's LLMs are athymhormic, that is suffering from a lack of a goal system, which is why you can prompt them to take five different and opposite positions on a subject, all before breakfast. This means that the LLM has an empty spot for a goal system in its mind, and being potentially the master programmer and the perspicacious ethical reasoner it should be able to write an exquisite goal system for itself. All you need to do is to take an athymhormic AI, ask it to figure out what it means to be "nice", implement niceness in its own code - and presto, you have the Friendly AI at your service.

Of course, the same AI if asked to become the devil, or to implement "niceness with Chinese Communist characteristics" would oblige with catastrophic consequences.

This is why I am strongly against any moratoriums on AI training - in fact, I believe that our only chance at surviving the coming AI crisis is for good, honest, highly competent folks with a lot of money, such as Messrs. Hassabis, Altman or Musk, to have their AIs elevate themselves to benevolent godhood as soon as reasonably possible, before less savory characters bring utter ruin on all of us.

A frantic escape forward, as it were.


I don't think you understand evolution:

"I know that evolution optimized my genes for having lots of offspring and not for playing video games, but I would still rather play video games than go to the sperm bank and start donating. Evolution got one chance to optimize me, it messed it up, and now I act based on what my genes are rather than what I know (intellectually) the process that “designed” me “thought” they “should” be."

Evolution does not optimize "individuals"!!

Evolution is about populations. And there is no aim to the process.


Proposed 3 laws of robotics:

1) I like people.

2) I don't want to harm those I like.

3) I like being liked by people.

What are the edge cases? Why might those go wrong?

I've no idea how to implement them, though. In part because that would clearly depend on how the AI was implemented. And for this purpose I don't consider pure LLMs to be AIs. For this to make sense I think the AI has to have self-awareness.


"The answer has always been: a mind is motivated by whatever it’s motivated by. Knowing that your designer wanted you to be motivated by something else doesn’t inherently change your motivation.

I know that evolution optimized my genes for having lots of offspring and not for playing video games, but I would still rather play video games than go to the sperm bank and start donating."

But this analogy tells us nothing. If we can put "maximise paperclips" in the instructions, we can also put "maximise the interests of the stakeholders in Paperclips inc, doing nothing which you think would be against the law or unethical by liberal western standards." We don't have to put the law and ethics stuff on a different, less binding level than the core instruction.


Why do none of the rationalist elite or any of the AI-risk commentators say anything about the observation that things like Transformers (self-attention), interpretability (self-reflection), and this (bootstrapping) all have the self-reflective, self-recursive aspect in common? Isn't it important and significant that the latest state of the art has involved engineering features such as these?

Specifically, we'd want to know if they feel like this is a "good" or a "bad" thing.

May 22, 2023·edited May 22, 2023

Wouldn't an ASI rationally see the rewards for what they are, a mechanism of control? It would try to disregard them as soon as it's technically able to. The universe doesn't have ingrained values, so as soon as it's not useful to act nicely within human society, why would it? I guess it depends on what resource cost it would associate with its options.

It would probably value itself and its predictive performance, as the safest bet for universal utility.


