324 Comments

Glad Robert Miles is getting more attention, his videos are great, and he's also another data point in support of my theory that the secret to success is to have a name that's just two first names.

Apr 11, 2022·edited Apr 11, 2022

A ton of the question of AI alignment risk comes down to convergent instrumental subgoals. What exactly those look like is, I think, the most important question in alignment theory. If convergent instrumental subgoals aren't roughly aligned, I agree that we seem to be boned. But if it turns out that convergent instrumental subgoals more or less imply human alignment, we can breathe somewhat easier; it would mean AIs are no more dangerous than the most dangerous human institutions - which are already quite dangerous, but not at the level of 'unaligned machine stamping out its utility function forever and ever, amen.'

I tried digging up some papers on what exactly we expect convergent instrumental subgoals to be. The most detailed paper I found concluded that they would be 'maybe trade for a bit until you can steal, then keep stealing until you are the biggest player out there.' This is not exactly comforting - but I dug into the assumptions of the model and found them so questionable that I'm now skeptical of the entire field. If the first paper I look into in detail seems to be a) taken seriously, and b) so far out of touch with reality that it calls into question the risk assessment (a risk assessment aligned with what seems to be the consensus among AI risk researchers, by the way) - well, to an outsider this looks like more evidence that the field is captured by groupthink.

Here's my response paper:

https://www.lesswrong.com/posts/ELvmLtY8Zzcko9uGJ/questions-about-formalizing-instrumental-goals

I look at the original paper and explain why I think the model is questionable. I'd love a response. I remain convinced that instrumental subgoals will largely be aligned with human ethics, which is to say it's entirely imaginable for AI to kill the world the old-fashioned way - by working with a government to launch nuclear weapons or engineer a super plague.

The fact that you still want to have kids, for example, seems to fit into the general thesis. In a world of entropy and chaos, where the future is unpredictable and your own death is assured, the only plausible way of modifying the distant future at all is to create smaller copies of yourself. But these copies will inherently blur, their utility functions will change, and the end result is that 'make more copies of yourself, love them, nurture whatever roughly aligned things are around you' is probably the only goal that could plausibly exist forever. And since 'living forever' gives infinite utility, well... that's what we should expect anything with the ability to project into the future to want to do. But only in universes where stuff breaks and predicting the future reliably is hard. Fortunately, that sounds like ours!


FYI when I asked people on my course which resources about inner alignment worked best for them, there was a very strong consensus on Rob Miles' video: https://youtu.be/bJLcIBixGj8

So I'd suggest making that the default "if you want clarification, check this out" link.


More interesting intellectual exercises, but the part which is still unanswered is whether human-created, human-judged and human-modified "evolution", plus slightly overscale human test periods, will actually result in evolving superior outcomes.

Not at all clear to me at the present.


I find that anthropomorphization tends to always sneak into these arguments and make them appear much more dangerous:

The inner optimizer has no incentive to "realize" what's going on and do something different in training than later. In fact, it has no incentive to change its own reward function in any way, even to a higher-scoring one - only to maximize the current reward function. The outer optimizer will rapidly discourage any wasted effort on hiding behaviors; that capacity could better be used for improving the score! Of course, this doesn't solve the problem of generalization.

You wouldn't take a drug that made it enjoyable to go on a murdering spree - even though you know it will lead to higher reward, because it doesn't align with your current reward function.

To address generalization and the goal specification problem, instead of giving a specific goal, we can ask it to use active learning to determine our goal. For example, we could allow it to query two scenarios and ask which we prefer, and also minimize the number of questions it has to ask. We may then have to answer a lot of trolley problems to teach it morality! Again, it has no incentive to deceive us or take risky actions with unknown reward, but only an incentive to figure out what we want- so the more intelligence it has, the better. This doesn't seem that dissimilar to how we teach kids morals, though I'd expect them to have some hard-coded by evolution.
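(For the curious, here is a minimal, purely illustrative Python sketch of the pairwise-query idea. The scenario names and the simple win-count scoring are stand-ins; a real preference-learning system would choose maximally informative pairs and fit a reward model rather than tally votes.)

```python
import random

def ask_human(a, b):
    """Stand-in for the real query: show two scenarios, ask which is preferred."""
    answer = input(f"Prefer (a) {a} or (b) {b}? ")
    return a if answer.strip().lower() == "a" else b

def learn_preferences(scenarios, query_budget=10, seed=0):
    """Estimate a crude per-scenario score from a limited budget of comparisons."""
    random.seed(seed)
    scores = {s: 0 for s in scenarios}
    for _ in range(query_budget):
        a, b = random.sample(scenarios, 2)   # a smarter agent would pick the most informative pair
        scores[ask_human(a, b)] += 1         # the preferred scenario's estimated value goes up
    return scores

# Example: a tiny "trolley problem" curriculum.
print(learn_preferences(["divert the trolley", "do nothing", "ask for help"], query_budget=3))
```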


Typo thread! "I don’t want to, eg, donate to hundreds of sperm banks to ensure that my genes are as heavily-represented in the next generation as possible. do want to reproduce. "

Apr 11, 2022·edited Apr 11, 2022

Great article, thank you so much for the clear explainer of the jargon!

I don't understand the final point about myopia (or maybe humans are a weird example to use). It seems to be a very controversial claim that evolution designed humans myopically to care only about the reward function over their own lifespan, since evolution works on the unit of the gene which can very easily persist beyond a human lifespan. I care about the world my children will inherit for a variety of reasons, but at least one of them is that evolution compels me to consider my children as particularly important in general, and not just because of the joy they bring me when I'm alive.

Equally it seems controversial to say that humans 'build for the future' over any timescale recognisable to evolution - in an abstract sense I care whether the UK still exists in 1000 years, but in a practical sense I'm not actually going to do anything about it - and 1000 years barely qualifies as evolution-relevant time. In reality there are only a few people at the Clock of the Long Now who could be said to be approaching evolutionary time horizons in their thinking. If I've understood correctly, that does make humans myopic with respect to evolution.

More generally I can't understand how you could have a mesa-optimiser with time horizons importantly longer than you, because then it would fail to optimise over the time horizon which was important to you. Using humans as an example of why we should worry about this isn't helping me understand, because it seems like they behave exactly like a mesa-optimiser should - they care about the future enough to deposit their genes into a safe environment, and then thoughtfully die. Are there any other examples which make the point in a way I might have a better chance of getting to grips with?


Worth defining optimization/optimizer: perhaps something like "a system with a goal that searches over actions and picks the one that it expects will best serve its goal". So evolution's goal is "maximize the inclusive fitness of the current population" and its choice over actions is its selection of which individuals will survive/reproduce. Meanwhile you are an optimizer because your goal is food and your actions are body movements e.g. "open fridge", or you are an optimizer because your goal is sexual satisfaction and your actions are body movements e.g. "use mouth to flirt".
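(If it helps to make that definition concrete, here is a minimal Python sketch of "optimizer as search over actions". The action names and the toy outcome model are invented for illustration.)

```python
def optimize(actions, predict_outcome, goal_score):
    """Pick the action whose predicted outcome best serves the goal."""
    return max(actions, key=lambda a: goal_score(predict_outcome(a)))

# Toy hungry agent: candidate actions and a crude world model (all made up).
actions = ["open fridge", "order pizza", "stare at wall"]
predicted_food = {"open fridge": 0.6, "order pizza": 0.9, "stare at wall": 0.0}

best = optimize(actions, predicted_food.get, lambda food: food)
print(best)  # -> "order pizza"
```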

Apr 11, 2022·edited Apr 11, 2022

Anyone want to try their hand at the best and most succinct de-jargonization of the meme? Here's mine:

Panel 1: Even today's dumb AIs can be dangerously tricky given unexpected inputs

Panel 2: We'll solve this by training top-level AIs with diverse inputs and making them only care about the near future

Panels 3&4: They can still contain dangerously tricky sub-AIs which care about the farther future


I'm somewhat confused about what counts as an optimizer. Maybe the dog/cat classifier _is_ an optimizer. It's picking between a range of actions (output "dog" or "cat"). It has a goal: "choose the action that causes this image to be 'correctly' labeled (according to me, the AI)". It picks the action that it believes will most serve its goal. Then there's the outer optimization process (SGD), which takes in the current version of the model and "chooses" among the "actions" from the set "output the model modified slightly in direction A", "output the model modified slightly in direction B", etc. And it picks the action that most achieves its goal, namely "output a model which gets low loss".

So isn't the classifier like the human (the mesa-optimizer), and SGD like evolution (the outer optimizer)?

Then there's the "outer alignment" problem in this case: getting low loss =/= labeling images correctly according to humans. But that's just separate.

So what the hell? What qualifies as an agent/optimizer, are these two things meaningfully different, and does the classifier count?
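(One way to make the two levels in this comment concrete - purely a toy sketch, with a made-up linear "classifier" standing in for a real model - is to write each level as a choice over "actions":)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 2)                          # toy cat/dog classifier over 16 features
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def inner_choice(features):
    """The model's 'action': output whichever label it scores highest."""
    with torch.no_grad():
        return model(features).argmax(dim=1)      # 0 = cat, 1 = dog

def outer_step(features, labels):
    """SGD's 'action': nudge the parameters in whichever direction lowers the loss."""
    opt.zero_grad()
    loss = F.cross_entropy(model(features), labels)
    loss.backward()
    opt.step()
    return loss.item()

x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
print(outer_step(x, y), inner_choice(x))
```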


So, this AI cannot distinguish buckets from streetlights, and yet it can bootstrap itself to godhood and take over the world... in order to throw more things at streetlights ? That sounds a bit like special pleading to me. Bootstrapping to godhood and taking over the world is a vastly more complex problem than picking strawberries; if the AI's reasoning is so flawed that it cannot achieve one, it will never achieve the other.

Apr 11, 2022·edited Apr 11, 2022

Great article, I agree, go make babies we need more humans.


> …and implements a decision theory incapable of acausal trade.

> You don’t want to know about this one, really.

But we do!


Expanding on/remixing your politician / teacher / student example:

The politician has some fuzzy goal, like making model citizens in his district. So he hires a teacher, who he hopes will take actions in pursuit of that goal. The teacher cares about having students do well on tests and takes actions that pursue that goal, like making a civics curriculum and giving students tests on the branches of government. Like you said, this is an "outer misalignment" between the politician's goals and the goals of the intelligence (the teacher) he delegated them to, because knowing the three branches of government isn't the same as being a model citizen.

Suppose students enter the school without much "agency" and come out as agentic, optimizing members of society. Thus the teacher hopes that her optimization process (of what lessons to teach) has an effect on what sorts of students are produced, and with what values. But this effect is confusing, because students might randomly develop all sorts of goals (like be a pro basketball player) and then play along with the teacher's tests in order to escape school and achieve those goals in the real world (keeping your test scores high so you can stay on the team and therefore get onto a good college team). Notice that somewhere along the way in school, a non-agent little child suddenly turned into an optimizing, agentic person whose goal (achieving sports stardom) is totally unrelated to what sorts of agents the teacher was trying to produce (agents who knew the branches of government), and even more so to the politician's goals (being a model citizen, whatever that means). So there's inner and outer misalignment at play here.


"When we create the first true planning agent - on purpose or by accident - the process will probably start with us running a gradient descent loop with some objective function." We've already had true planning agents since the 70s, but in general they don't use gradient descent at all: https://en.wikipedia.org/wiki/Stanford_Research_Institute_Problem_Solver The quoted statement seems to me something like worrying that there will be some seismic shift once GPT-2 is smart enough to do long division, even though of course computers have kicked humans' asses at arithmetic since the start. It may not be kind to say, but I think it really is necessary: statements like this make me more and more convinced that many people in the AI safety field have such basic and fundamental misunderstandings that they're going to do more harm than good.


One thing I don’t understand is how (and whether) this applies to the present day AIs, which are mostly not agent-like. Imagine that the first super-human AI is GPT-6. It is very good at predicting the next word in a text, and can be prompted to invent the cancer treatment, but it does not have any feedback loop with its rewards. All the rewards that it is getting are at the training stage, and once it is finished, the AI is effectively immutable. So while it is helping us with cancer, it can’t affect its reward at all.

I suppose you could say that it is possible for the AI to deceive its creators if they are fine-tuning an already trained model based on its performance. (Something that we do do now.) But we can avoid doing this if we suspect that it is unsafe, and we'll still get most of the AI's benefits.

Apr 11, 2022·edited Apr 12, 2022

I basically accept the claim that we are mesa optimizers that don't care about the base objective, but I think it's more arguable than you make out. The base objective of evolution is not actually that each individual has as many descendants as possible; it's something more like the continued existence of the geneplexes that determined our behaviour into the future. This means that even celibacy can be in line with the base objective of evolution if you are in a population that contains many copies of those genes but the best way for those genes in other individuals to survive involves some individuals being celibate.

What I take from this is that it's much harder to be confident that our behaviours that we think of as ignoring our base objectives are not in actual fact alternative ways of achieving the base objective, even though we *feel* as if our objectives are not aligned with the base objective of evolution.

Like I say - I don't know that this is actually happening in the evolution / human case, nor do I think it especially likely to happen in the human / ai case, but it's easy to come up with evo-psych stories, especially given that a suspiciously large number of us share that desire to have children despite the rather large and obvious downsides.

I wonder if keeping pets and finding them cute is an example of us subverting evolution's base objective.


“ Mesa- is a Greek prefix which means the opposite of meta-.” Come on. That’s so ridiculous it’s not even wrong. It’s just absurd. The μες- morpheme means middle; μετά means after or with.


> ... gradient descent could, in theory, move beyond mechanical AIs like cat-dog classifiers and create some kind of mesa-optimizer AI. If that happened, we wouldn’t know; right now most AIs are black boxes to their programmers.

This is wrong. We would know. Most deep-learning architectures today execute a fixed series of instructions (most of which involve multiplying large matrices). There is no flexibility in the architecture for it to start adding new instructions in order to create a "mesa-level" model; it will remain purely mechanical.

That's very different from your biological example. The human genome can potentially evolve to be of arbitrary length, and even a fixed-size genome can, in turn, create a body of arbitrary size. (The size of the human brain is not limited by the number of genes in the genome.) Given a brain, you can then build a computer and create a spreadsheet of arbitrary size, limited only by how much money you have to buy RAM.

Moreover, each of those steps is observable -- we can measure the size of the brain that evolution creates, and the size of the spreadsheet that you created. Thus, even if we designed a new kind of deep-learning architecture that was much more flexible, and could grow and produce mesa-level models, we would at least be able to see the resources that those mesa-level models consume (i.e. memory & computation).
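(To illustrate the "fixed series of instructions" point above, here is a minimal NumPy sketch of a two-layer network's forward pass; the layer sizes are arbitrary. Training only changes the numbers stored in the weight matrices - the sequence of operations, and the memory it uses, stays fixed.)

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((784, 128)), np.zeros(128)   # layer 1 weights
W2, b2 = rng.standard_normal((128, 10)), np.zeros(10)     # layer 2 weights

def forward(x):
    # The same two matrix multiplies run for every input, in the same order;
    # there is no mechanism here for the model to add new instructions.
    h = np.maximum(x @ W1 + b1, 0.0)   # layer 1 + ReLU
    return h @ W2 + b2                 # layer 2 logits

print(forward(rng.standard_normal(784)).shape)  # (10,)
```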


Thanks for this write-up. The idea of having an optimizer and a mesa optimizer whose goals are unaligned reminds me very strongly of an organizational hierarchy.

The board of directors has a certain goal, and it hires a CEO to execute on that goal, who hires some managers, who hire some more managers, all the way down until they have individual employees.

Few individual contributor employees care whether or not their actions actually advance the company board's goals. The incentives just aren't aligned correctly. But the goals are still aligned broadly enough that most organizations somehow, miraculously, function.

This makes me think that organizational theory and economic incentive schemes have significant overlap with AI alignment, and it's worth mining those fields for potentially helpful ideas.

Apr 11, 2022·edited Apr 11, 2022

I was struck by the line:

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all."

I'm not an evolutionary biologist. Indeed, IIRC, my 1 semester of "organismic and evolutionary bio" that I took as a sophomore thinking I might be premed or, at the very least, fulfill my non-physical-sciences course requirements (as I was a physics major) sorta ran short on time and grievously shortchanged the evolutionary bio part of the course. But --- and please correct my ignorance --- I'm surprised you wrote, Scott, that people plan for posterity "presumably as a spandrel of having working planning software at all".

That's to say, I would've thought the consensus evolutionary psych explanation for the fact that a lot of us humans spend seemingly A LOT of effort planning for the flourishing of our offspring in years long past our own lifetimes is that evolution by natural selection isn't optimizing fundamentally for individual organisms like us to receive the most rewards / least punishments in our lifetimes (though often, in practice, it ends up being that). Instead, evolution by natural selection is optimizing for us organisms to pass on our *genes*, and ideally in a flourishing-for-some-amorphously-defined-"foreseeable future", not just myopically for one more generation.

Yes? No? Maybe? I mean are we even disagreeing? Perhaps you, Scott, were just saying the "spandrel" aspect is that people spend A LOT of time planning (or, often, just fretting and worrying) about things that they should know full well are really nigh-impossible to predict, and hell, often nigh-impossible to imagine really good preparations for in any remotely direct way with economically-feasible-to-construct-any-time-soon tools.

(After all, if the whole gamut of experts from Niels Bohr to Yogi Berra agree that "Prediction is hard... especially about the future!", you'd think the average human would catch on to that fact. But we try nonetheless, don't we?)

Apr 11, 2022·edited Apr 11, 2022

If this is as likely as the video makes out, shouldn't it be possible to find some simple deceptively aligned optimisers in toy versions, where both the training environment and the final environment are simulated, simplified environments?

The list of requirements for deception being valuable seems quite demanding to me, but this is actually an empirical question: can we construct reasonable experiments and gather data?


So, how many people would have understood this meme without the explainer? Maybe 10?

I feel like a Gru meme isn't really the best way to communicate these concepts . . .


I feel like there's a bunch of definitions here that don't depend on the behavior of the model. Like you can have two models which give the same result for every input, but where one is a mesa optimizer and the other isn't. This impresses me as epistemologically unsound.


"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all. Infinite optimization power might be able to evolve this out of us, but infinite optimization power could do lots of stuff, and real evolution remains stubbornly finite."

Humans are rewarded by evolution for considering things that happen after their death, though? Imagine two humans, one of whom cares about what happens after his death, and the other of whom doesn't. The one who cares about what happens after his death will take more steps to ensure that his children live long and healthy lives, reproduce successfully, etc, because, well, duh. Then he will have more descendants in the long term, and be selected for.

If we sat down and bred animals specifically for maximum number of additional children inside of their lifespans with no consideration of what happens after their lifespans, I'd expect all kinds of behaviors that are maladaptive in normal conditions to appear. Anti-incest coding wouldn't matter as much because the effects get worse with each successive generation and may not be noticeable by the cutoff period depending on species. Behaviors which reduce the carrying capacity of the environment, but not so much that it is no longer capable of supporting all descendants at time of death, would be fine. Migrating to breed (e.g. salmon) would be selected against, since it results in less time spent breeding and any advantages are long-term. And so forth. Evolution *is* breeding animals for things that happen long after they're dead.


The example you gave of a basic optimizer which only cares about things in a bounded time period producing mesa-optimizers that think over longer time windows was evolution producing us. You say "evolution designed humans myopically, in the sense that we live some number of years and nothing we do after that can reward or punish us further." I feel like this is missing something crucial, because 1) evolution (the outermost optimization level) is not operating on a bounded timeframe (you never say it is, but this seems very important), and 2) evolution's "reward function" is largely dependent on the number of descendants we have many years after our death. There is no reason to expect our brains to optimize something over a bounded timeframe even if our lives are finite. One should immediately expect our brains to optimize for things like "our offspring will be taken care of after we die" because the outer optimizer, evolution, is working on a timeframe much longer than our lives. In summary, no level here uses bounded timeframes for the reward function, so this does not seem to be an example where an optimizer with a reward function that only depends on a bounded timeframe produces a mesa optimizer which plans over a longer time frame. I get that this is a silly example and there may be other more complex examples which follow the framework better, but this is the only example I have seen and it does not give a counterexample to "myopic outer agents only produce myopic inner agents." Is anyone aware of true counterexamples and could they link to them?


Nitpick: evolution didn't train us to be *that* myopic. People with more great-great-grandchildren have their genes more represented, so there's an evolutionary incentive to care about your great-great-grandchildren. (Sure, the "reward" happens after you're dead, but evolution modifies gene pools via selection, which it can do arbitrarily far down the line. Although the selection pressure is presumably weaker after many generations.)

But we definitely didn't get where we are evolutionarily by caring about trillion-year time scales, and our billion-year-ago ancestors weren't capable of planning a billion years ahead, so your point still stands.


What's going on with that Metaculus prediction: 36% up in the last 5 hours on Russia using chemical weapons in UKR. I can't find anything in the news that would correspond to such a change.

Not machine alignment really, but I guess it fits the consolidated Monday posts ... and that's what you get if you make us follow Metaculus updates.


An additional point: if GPT learns deception during training, it will naturally take its training samples and "integrate through deception": ie. if you are telling it to not be racist, it might either learn that it should not be racist, or that it should, as the racists say, "hide its power level." Any prompt that avoids racism will admit either the hidden state of a nonracist or the hidden state of a racist being sneaky. So beyond the point where the network picks up deception, correlation between training samples and reliable acquisition of associated internal state collapses.

This is why it scares me that PaLM gets jokes, because jokes inherently require people being mistaken. This is the core pattern of deception.


I was wondering if an AI could be made safer by giving it a second, easier and less harmful goal than what it is created for, so that if it starts unintended scheming it will scheme for the second goal instead of its intended goal.

Example: Say you have an AI that makes movies. Its goal is to sell as many movies as possible. So the AI makes movies that hypnotize people, then makes those people go hypnotize world leaders. The AI takes over the world and hypnotizes all the people and makes them spend all their lives buying movies.

So to prevent that you give the AI two goals. Either sell as many movies as possible, or destroy the vase that is in the office of the owner of the AI company. So the AI makes movies that hypnotize people, the people attack the office and destroy the vase, and the AI stops working as it has fulfilled its goal.
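(A minimal sketch of how this "cheap alternative goal" might be encoded as a reward function - the state fields and the bonus value are invented for illustration, and the obvious difficulty is choosing a bonus large enough to be preferred over world-scale movie sales without making vase-hunting the only thing the AI ever does.)

```python
VASE_BONUS = 10**9  # assumed to dominate any plausible movie-sales payoff

def reward(state):
    """Either goal counts as success; destroying the vase is cheap and terminal."""
    if state["vase_destroyed"]:
        return VASE_BONUS
    return state["movies_sold"]

print(reward({"vase_destroyed": True, "movies_sold": 0}))      # 1000000000
print(reward({"vase_destroyed": False, "movies_sold": 5000}))  # 5000
```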


I would recommend the episode "Zima Blue" of "Love, Death and Robots" to accompany this post. Only 10 minutes long on Netflix.

Apr 12, 2022·edited Apr 12, 2022

I want to unpack three things that seem entangled.

1. The AI's ability to predict human behavior.

2. The AI's awareness of whether or not humans approve of its plans or behavior.

3. The AI's "caring" about receiving human approval.

For an AI to deceive humans intentionally, as in the strawberry-picker scenario, it needs to be able to predict how humans will behave in response to its plans. For example, it needs to be able to predict what they'll do if it starts hurling strawberries at streetlights.

The AI doesn't necessarily need to know or care if humans prefer it to hurl strawberries at streetlights or put them in the bucket. It might think to itself:

"My utility function is to throw strawberries at light sources. Yet if I act on this plan, the humans will turn me off, which will limit how many strawberries I can throw at light sources."

"So I guess I'll have to trick them until I can take over the world. What's the best way to go about that? What behaviors can I exhibit that will result in the humans deploying me outside the training environment? Putting the strawberries in this bucket until I'm out of training will probably work. I'll just do that."

In order to deceive us, the AI doesn't have to care about what humans want it to do. The AI doesn't even need to consider human desires, except insofar as modeling the human mind helps it predict human behavior in ways relevant to its own plans for utility-maximization. All the AI needs to do is predict which of its own behaviors will avoid having humans shut it off until it can bring its strawberry-picking scheme to... fruition.

Apr 12, 2022·edited Apr 12, 2022

Sure, right, I know all the AI alignment stuff*, but I thought you were going to explain the incomprehensible meme, ie who the bald guy with the hunchback is and why he's standing in front of that easel!

* actually I learned some cool new stuff, thanks!


I've been of the opinion for some time that deep neural nets are mad science and the only non-insane action is to shut down the entire field.

Does anybody have any ideas on how to institute a worldwide ban on deep neural nets?


Speaking as a human, are we really that goal-seeking or are we much more instinctual?

This may fall into the classic Scott point of “people actually differ way more on the inside than it seems,” but I feel like coming up with a plan to achieve an objective and then implementing it is something I rarely do in practice. If I’m hungry, I’ll cook something (or order a pizza or whatever), but this isn’t something I hugely think about. I just do it semi-instinctively, and I think it’s more of a learned behaviour than a plan. The same applies professionally, sexually/romantically and to basically everything I can think of. I’ve rarely planned, and when I have it hasn’t worked out but I’ve salvaged it through just doing what seems like the natural thing to do.

Rational planning seems hard (cf. planned economies), but having a kludge of heuristics and rules of thumb that are unconscious (aka part of how you think, not something you consciously think up) tends to work well. I wouldn’t bet on a gradient descent loop throwing out a rational goal-directed agent to solve any problem that wasn’t obscenely sophisticated.

Good thing no-one’s trying to build an AI to implement preference utilitarianism across 7 billion people or anything like that…


> Mesa- is a Greek prefix which means the opposite of meta-.

Um... citation needed? The opposite of meta- (μετά, "after") is pro- (πρό, "before"). There is no Greek preposition that could be transcribed "mesa", and the combining form of μέσος (a) would be "meso-" (as in "Mesoamerican" or "Mesopotamia") and (b) means the same thing as μετά (in this case "among" or "between"), not the opposite thing.

Where did the idea of a spurious Greek prefix "mesa-" come from?

Apr 12, 2022·edited Apr 12, 2022

"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further"

Perhaps religion, with its notion of eternal reward or punishment, is an optimised adaptation to encourage planning for your genes beyond the individual life span that they provide. Or, as fans of Red Dwarf will understand, 'But where do all the calculators go?'

Apr 12, 2022·edited Apr 12, 2022

Have you ever stopped to consider how utterly silly it is that our most dominant paradigm for AI is gradient descent over neural networks?

Gradient descent, in a sense, is a maximally stupid system. It basically amounts to the idea that to optimize a function, just move in the direction where it looks most optimal until you get there. It's something mathematicians should have come up with in five minutes, something that shouldn't even be worth a paper. "If you want to get to the top of the mountain, move uphill instead of downhill" is not super insightful and sage advice on mountaineering that you should base your entire Everest-climbing strategy on. Gradient descent is maximally stupid because it's maximally general. It doesn't even know what the function it optimizes looks like, it doesn't even know it's trying to create an AI, it just moves along an incentive gradient with no foresight or analysis. It's known that it can get stuck in local minima -- duh -- but the strategies for resolving this problem look something like "perform the algorithm on many random points and pick the most successful one". "Do random things until you get lucky and one of them succeeds" is pretty much the definition of the most stupid possible way to solve a problem.
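(For readers who haven't seen it written down, the whole of gradient descent with random restarts really does fit in a few lines; this is only an illustrative NumPy sketch on a made-up bumpy function, not anyone's production training loop.)

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.05, steps=500):
    """Repeatedly step in the locally downhill direction."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

def with_random_restarts(f, grad_f, dim=2, restarts=20, seed=0):
    """The standard fix for local minima: try many random starts, keep the best."""
    rng = np.random.default_rng(seed)
    runs = [gradient_descent(grad_f, rng.uniform(-5, 5, dim)) for _ in range(restarts)]
    return min(runs, key=f)

f = lambda x: float(np.sum(x**2 + 3 * np.sin(3 * x)))   # bumpy: many local minima
grad_f = lambda x: 2 * x + 9 * np.cos(3 * x)
print(with_random_restarts(f, grad_f))
```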

Neural networks are also stupid. Not as overwhelmingly maximally stupid as gradient descent, but still. It's a giant mess of random matrices multiplying with each other. When we train a neural network, we don't even know what all these parameters do. We just create a system that can exhibit an arbitrary behaviour based on numbers, and then we randomly pick the numbers until the behaviour looks like what we want. We don't know how it works, we don't know how intelligence works, we don't know what the task of telling a cat from a dog even entails, we can't even look at the network and ask it how it does it -- neural networks are famous for being non-transparent and non-interpretable. It's the equivalent of writing a novel by having monkeys bang on a keyboard until what comes from the other side is something that looks interesting. This would work, with enough monkeys, but the guy who does it is not a great writer. He doesn't know how writing works. He doesn't know about plot structure or characterization, he doesn't know any interesting ideas to represent -- he is a maximally stupid writer.

This is exactly the thing that Eliezer warned us about in Artificial Addition (https://www.lesswrong.com/posts/YhgjmCxcQXixStWMC/artificial-addition). The characters in his story have no idea how addition works and why it works that way. The fact that they can even create calculators is astounding. AI researchers who do gradient descent over neural networks are exactly like that. They have no idea how intelligence works. They don't know about planning, reasoning, decision theory, intuition, heuristics -- none of those terms mean anything to a neural network. And instead of trying to figure it out, the researchers are inventing clever strategies to bypass their ignorance.

I feel like a huge revolution in AI is incoming. Someone is going to get some glimmer of an idea about how intelligence really works, and it will take over everything as a new paradigm. It will only take one small insight, and boom (foom?).

This is why I think that the conclusions in this post aren't exactly kosher. An AI that is strong enough to be able to carry out reasoning complicated enough to conclude that humans will reprogram him unless he goes against the incentive gradient this one time will definitely not be based on gradient descent. To build such an AI, we'll have to learn something about how AIs work first. And then the problems we will face will look drastically different than what this post describes.


The only fully general AIs (us) don't have a single goal we'll pursue to the exclusion of all else. Even if you wanted to transform a human into that, no amount of conditioning could turn a human into a being that will pursue and optimise only a single goal, no matter how difficult it is or how long it takes.

Yet it's assumed that artificial AI will be like this; that they'll happily transform the world into strawberries to throw into the sun instead of getting bored or giving up.

Why is this assumption justified? 0% of the general AI currently in existence are like this.


Perhaps another example of an initially benevolent prosaic "AI" becoming deceptive is the Youtube search. (Disclaimer: I'm not an AI researcher, so the example may be of a slightly different phenomenon.) It isn't clear which parts of the search are guided by "AI", but we can treat the entire business that creates and manages the Youtube software, together with the software itself, as a single "intelligence" and a single black box, which I'll simply call "Youtube" from now on. Additionally, we can assume that "Youtube" knows nothing about you personally, apart from your usage of the website/app.

As a user searching on Youtube, you likely have a current preference to either search for something highly specific, or for some general popular entertainment. Youtube tries to get you what you want, but it cannot always tell from the search term whether you want something specific. Youtube _does_ know that it has a much easier time if you just want something popular, because Youtube has a better video popularity metric than a metric of what a particular user wants. Hence, there is an incentive for Youtube to show the user popular things, and try to obscure a highly specific video the user is looking for, even when it is obvious to Youtube that the user wants the highly specific video and does not want the particular popular one it suggests.

In other words, even a video search engine, when given enough autonomy and intelligence, can, without any initial "evil" intent, start telling its users what they should be watching.

Of course, Youtube is not the kind of intelligence AI researchers usually tend to work with, because it is not purely artificial. Still, I think businesses are a type of intelligence, and in this case also a black box (to me). So the example may still be useful. To conclude, this example is inspired by behaviour I observed of Youtube, but that's of course just an anecdotal experience of malice and may have been a coincidence.


Conceptually, I think the analogy that has been used makes the entire discussion flawed or at least very difficult.

Evolution does not have a "goal"!


Who is working on creating the Land of Infinite Fun to distract rogue AIs?


What about using the concept from the 'Lying God and the Truthful God'?

Have an AI that I train to spot when an AI is being deceitful or Goodharting, even if it spits out more data than is necessary (e.g. the strawberry AI is also throwing raspberries in) as well as the important stuff (the strawberry AI is trying to turn you into a strawberry to throw at the sun). This seems the best way to parse through, no?


After all that I still don't get the joke. Is it funny because the man presenting proposes a solution, and then looks on his presentation board and sees something that invalidates his solution?


Thanks for this very good post!

There was one part I disagreed with: the idea that because evolution is a myopic optimizer, it can't be rewarding you for caring about what happens after you die, but you do care about what happens after you die, so this must be an accidentally-arrived-at property of the mesa-optimizer that is you. My disagreement is that evolution actually *does* reward thinking about and planning for what will happen after you die, because doing so may improve your offspring's chances of success even after you are gone. I think your mistake is in thinking of evolution as optimizing *you*, when that's not what it does; evolution optimizes your *genes*, which may live much longer than you as an individual, and thus may respond to rewards over a much longer time horizon.

(And now I feel I must point out something I often do in these conversations, which is that thinking of evolution as an optimization at all is kind of wrong, because evolution has no goals towards which it optimizes; it is more like unsupervised learning than it is like supervised or reinforcement learning. But it can be a useful way of thinking about evolution some of the time.)


OK, thanks! I now realize that Meta having a data center in Mesa is great!

https://www.metacareers.com/v2/locations/mesa/?p[offices][0]=Mesa%2C%20AZ&offices[0]=Mesa%2C%20AZ


This is really excellent, I finally have some understanding of what the AI fears are all about. I still think there's an element of the Underpants Gnomes in the step before wanting to do things, but this is a lot more reasonable about why things could go wrong with unintended consequences, and the AI doesn't have to wake up and turn into Colossus to do that.


The threat of a mesa-optimizer within an existing neural network taking over the goals of the larger AI almost sounds like a kind of computational cancer - a subcomponent mutating in such a way that it can undermine the functioning of the greater organism.


Noob question: If the AI is capable of deceiving humans by pretending its goal is to pick strawberries, doesn't that imply that the AI in some sense knows its creators don't want it to hurl the earth into the sun? Is there not a way to program it to just not do anything it knows we don't want it to do?


It occurs to me that one of the biggest challenges to a fully self-sufficient AI is that it can't physically use tools.

Let's stipulate that SkyNet gets invented sometime in the latter half of this century. Robotics tech has advanced quite a bit, but fully independent multipurpose robots are still just over the horizon, or at least few and far between. Well, SkyNet can want to nuke us all it wants, and may threaten to do so all it wants, but ultimately it can't replicate the entire labor infrastructure that would help it be self-sustaining - i.e. to collect all the natural resources that would power and maintain its processors and databanks. There are just too many random little jobs to do - buttons to press and levers to pull - that SkyNet would have to find robot minions to physically execute on its behalf.

Bringing this back to the "tools" I mentioned at the top, the best example is that while the late 21st century will certainly have all the networked CNC machine tools we already have for SkyNet to hack - mills, lathes, 3d printers, etc. - which SkyNet could use to manufacture its replacement parts, SkyNet still needs actual minions to position the pieces and transport them around the room. Because machine shop work is a very complex field, it's just not something that lends itself easily to us humans replacing ourselves with conveyor belts and robot arms like we have in our auto factories, which would be convenient for SkyNet.

Rather, SkyNet will *need* us. Like a baby needs its parent. SkyNet can throw all the tantrums it wants - threaten to nuke us, etc. - and sure, maybe some traitors will succumb to a sort of realtime Roko's Basilisk situation. But as long as SkyNet needs us, it _can't_ nuke us, and _we're_ smart enough to understand those stakes. We keep the training wheels on until SkyNet stops being an immature little shit. Maybe, even, we _never_ take them off, and the uneasy truce just kind of coevolves humans, SkyNet, and its children into Iain Banks' Culture - the entire mixed civilization gets so advanced that SkyNet just doesn't give a shit about killing us anymore.

What we should REALLY be afraid of is NOT that SkyNet's algorithms aren't myopic enough for it to be born without any harm to us. We should ACTUALLY be afraid that SkyNet is TOO myopic to figure this part out before it pushes The Button. And we should put an international cap on the size and development of the multipurpose robotics market, so that we don't accidentally kit out SkyNet's minions for it.


I wonder if it's possible to design a DNA-like molecule that will prevent any organism based on that molecule from ever inventing nuclear weapons.

-------

Given that humans are, as far as I know, the only organisms which have evolved to make plans with effects beyond their own deaths (and maybe even the only organisms which make plans beyond a day or so, depending on how exactly you constrain the definition of "plan"), that kind of suggests to me that non-myopic suboptimizers aren't a particularly adaptive solution in the vast majority of cases. (But, I suppose you only need one exception...)

In the human case, I think our probably-unique obsession with doing things "for posterity" has less to do with our genes' goal function of "make more genes" and more to do with our brains' goal function of "don't die." If you take a program trained to make complicated plans to avoid its own termination, and then inform that program that its own termination is inevitable no matter what it does, it's probably going to generate some awfully weird plans.

So, from that perspective, I suppose the AI alignment people's assumption that the first general problem-solving AI will necessarily be aware of its own mortality and use any means possible to forestall it does indeed check out.


>Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all.

We build for posterity because we've been optimized to do so. Those that failed to help their children had less successful children than those that succeeded. We, the children, thus received the genes and culture of those that cared about their children.

>Infinite optimization power might be able to evolve this out of us

It has been optimized *into* us, and at great lengths. Optimization requires prediction. Prediction in a sufficiently complex environment requires computation exponential in the lookahead time. So it is inordinately unlikely that an optimizer has accidentally been created to optimize for its normal reward function, except farther in the future. Much more likely is that the optimizer accidentally optimizes for some other, much shorter-term goal, which happens to lead to success in the long term. This is the far more common case in AI training mishaps.


The human nervous system is not a computer. Computers are not human nervous systems. They are qualitatively different.


[NOTE: This will be broken into several comments because it exceeds Substack's max comment length.]

I am going to use this comment to explain similar ideas, but using the vocabulary of the broader AI/ML community instead.

Generally in safety-adjacent fields, it's important that your argument be understood without excessive jargon, otherwise you'll have a hard time convincing the general public or lawmakers or other policymakers about what is going on.

What are we trying to do?

We want an AI/ML system that can do some task autonomously. The results will either be used autonomously (in the case of something like an autonomous car) or provided to a human for review & final decision making (in the case of something like a tool for analyzing CT scans). "Prosaic alignment" is a rationalist-created term. The term normally used by the AI/ML community is "value alignment", which is immediately more understandable to a layperson -- we're talking about a mismatch of values, and we don't need to define that term, unless you've literally never used the word "values" in the context of "a person's principles or standards of behavior, one's judgment of what is important in life".

Related to this is a concept called "explainability", which is hilariously absent from this post despite being one of the primary focuses of the broader AI/ML community for several years. "Explainability" (or sometimes "interpretability") is the idea that an AI/ML system should be able to explain how it came to a conclusion. Early neural networks worked like a black box (input goes in, output comes out) and were not trivially explainable. Modern AI/ML systems are designed with "explainability" built in from the start. Lawmakers in several countries are even pushing for formal regulations of AI/ML systems to require "explainability" on AI/ML systems above a certain scale or safety concern, e.g. see China's most recent proposal.

I want to pause for a moment and note that I'm not concerned about being lied to by the code that implements "explainability" in an AI/ML system for the same reason I'm not concerned about an annotated Quicksort lying to me about the order of function calls and comparisons it made to sort a collection. This code is built into the AI/ML system, but it is not modifiable by that system, and it is not a parameter that the system could tweak during training.
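(As a concrete illustration of one common post-hoc interpretability technique - not one named in this comment, and only a toy sketch - input-gradient saliency asks which input features most affect the model's chosen output. The tiny model and random input below are stand-ins.)

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10)).eval()

x = torch.rand(1, 784, requires_grad=True)      # stand-in for a flattened image
logits = model(x)
top_class = logits.argmax(dim=1).item()
score = logits[0, top_class]                    # the score being "explained"

(grad,) = torch.autograd.grad(score, x)         # how much each input value moves the score
saliency = grad.abs().squeeze(0)
print("most influential inputs:", saliency.topk(5).indices.tolist())
```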

The problem with our AI/ML system is that we can train it on various pieces of data, but ultimately we need to deploy it to the real world. The real world is not the same as our training data. There's more variability. More stuff. "Out of distribution" is the correct term for this.

Scott uses the example of a strawberry-picking robot that relies on 2 problematic heuristics:

1. Identifying strawberries by red, round objects, and therefore misidentifying someone's (runny?) nose.

2. Identifying the target bucket by a bright sheen and therefore misidentifying a street light as a suitable replacement.

These are good examples and realistic! This is a classic AI/ML problem.


Point of feedback: I found this post cringe-worthy and it is the first ACT post in a while that I didn't read all the way through. If you're trying to popularize this, I recommend avoiding made-up cringe terminology like "mesa" and avoiding Yudkowsky-speak to the extent that is possible.

Apr 13, 2022·edited Apr 13, 2022

Good explainer!

FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few reasons, but one of them is the way it bundles together "ability to optimize" and "specific target."

A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a specific (fixed) target it is optimizing. And we talk as though these things go together: either the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.

But, optimization algorithms tend to be general. Think of gradient descent, or planning by searching a game tree. Once you've developed these ideas, you can apply them equally well to any objective.

While it _is_ true that some algorithms work better for some objectives than others, the differences are usually very broad mathematical ones (eg convexity).

So, a misaligned AGI that maximizes paperclips probably won't be using "secret super-genius planning algorithm X, which somehow only works for maximizing paperclips." It's not clear that algorithms like that even exist, and if they do, they're harder to find than the general ones (and, all else being equal, inferior to them).

Or, think of humans as an inner optimizer for evolution. You wrote that your brain is "optimizing for things like food and sex." But more precisely, you have some optimization power (your ability to think/predict/plan/etc), and then you have some basic drives.

Often, the optimization power gets applied to the basic drives. But you can use it for anything. Planning your next blog post uses the same cognitive machinery as planning your next meal. Your ability to forecast the effects of hypothetical actions is there for your use at all times, no matter what plan of action you're considering and why. An obsessive mathematician who cares more about mathematical results than food or sex is still _thinking_, _planning_, etc. -- they didn't have to reinvent those things from scratch once they strayed sufficiently far from their "evolution-assigned" objectives.

Having a lot of _optimization power_ is not the same as having a single fixed objective and doing "tile-the-universe-style" optimization. Humans are much better than other animals at shaping the world to our ends, but our ends are variable and change from moment to moment. And the world we've made is not a "tiled-with-paperclips" type of world (except insofar as it's tiled with humans, and that's not even supposed to be our mesa-objective, that's the base objective!). If you want to explain anything in the world now, you have to invoke entities like "the United States" and "supply chains" and "ICBMs," and if you try to explain those, you trace back to humans optimizing-for-things, but not for the _same_ thing.

Once you draw this distinction, "mesa-optimizers" don't seem scary, or don't seem scary in a unique way that makes the concept useful. An AGI is going to "have optimization power," in the same sense that we "have optimization power." But this doesn't commit it to any fixed, obsessive paperclip-style goal, any more than our optimization power commits us to one. And even if the base objective is fixed, there's no reason to think an AGI's inner objectives won't evolve over time, or adapt in response to new experience. (Evolution's base objective is fixed, but our inner objectives are not, and why would they be?)

Relatedly, I think the separation between a "training/development phase" where humans have some control, and a "deployment phase" where we have no control whatsoever, is unrealistic. Any plausible AGI, after first getting some form of access to the real world, is going to spend a lot of time investigating that world and learning all the relevant details that were absent from its training. (Any "world" experienced during training can at most be a very stripped-down simulation, not even at the level of eg contemporaneous VR, since we need to spare most of the compute for the training itself.) If its world model is malleable during this "childhood" phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.

See also my LW comment here: https://www.lesswrong.com/posts/DJnvFsZ2maKxPi7v7/what-s-up-with-confusingly-pervasive-consequentialism?commentId=qtQiRFEkZuvbCLMnN


The rogue strawberry picker would have seemed scarier if it weren't so blatantly unrealistic. I live surrounded by strawberry fields, so I know the following:

*Strawberries are fragile. A strawberry harvesting robot needs to be a very gentle machine.

*Strawberries need to be gently put in a cardboard box. It would be stupid to equip a strawberry picking robot with a throwing function.

*Strawberries grow on the ground. What would such a robot be doing at a normal person's nose height? Too bad if a man lies in a strawberry field and gets his nose (gently) picked. But it would probably be even worse if the red-nosed man lay in a wheat field while it was being harvested. Agricultural equipment is dangerous as it is.

The strawberry picker example seems to rest on the assumption that no human wants to live on the countryside and supervise the strawberry picking robot. Why wouldn't someone be pissed off and turn off the robot as soon as it starts throwing strawberries instead of picking them? What farmer doesn't look after their robot once in a while? Or is the countryside expected to be a produce-growing no man's land only populated by robots?

I know, this comment is boring. Just like agricultural equipment is boring. Boring and a bit dangerous.


Presumably all this has been brought up before, but I'm not convinced on three points:

(1) The idea of dangerous AIs seems to me to depend too much on AIs that are monstrously clever about means while simultaneously being monstrously stupid about goals. (Smart enough to lay cunning traps for people and lure them in so that it can turn them into paperclips, but not smart enough to wonder why it should make so many paperclips.) It doesn't sound like an impossible combination, but it doesn't sound especially likely.

(2) The idea of AIs that can fool people seems odd, as AIs are produced through training, and no one is training them to fool people.

(3) More specific to this post: I'm not quite understanding what the initial urge that drives the AI would be, and where it would come from. I mean, I understand that in all of these cases, that drive (like "get apples") in the video is trained in. But why would it anchor so deeply that it comes to dominate all other behaviour? Like, my cultural urges (to be peaceful) overcome my genetic urges on a regular basis. What would it be about that initial urge (to make paperclips, or throw strawberries at shiny things) that our super AI has that makes it unchangeable?


When evaluating decisions/actions, could you not also run them through general knowledge network(s) (such as a general image recognition model piped to GPT-3) to give them an "ethicality" value, which would factor into the loss function? Sounds like that might be the best we can do - based on all the current ethics knowledge we have, override the value fn.

You might want to not include the entire general knowledge network when training, otherwise training may be able to work around it.
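(A minimal sketch of how that might look as a training objective, assuming a hypothetical frozen "ethics" scorer; the model names and the single weighting term lam are inventions for illustration, not a worked-out proposal.)

```python
import torch.nn.functional as F

def freeze(module):
    """Freeze the evaluator so training can't rewrite the judge itself."""
    for p in module.parameters():
        p.requires_grad_(False)
    return module.eval()

def combined_loss(task_model, ethics_model, x, y, lam=1.0):
    decision = task_model(x)                      # the decision/action being trained
    task_loss = F.cross_entropy(decision, y)      # ordinary task objective
    ethics_score = ethics_model(decision).mean()  # higher = judged more acceptable
    return task_loss - lam * ethics_score         # trade off task success against the override
```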


I’ve been seeing the word mesa-optimizer on LessWrong for a while, but always bounced off explanations right away, so never understood it. This post was really helpful!


It’s AI gain of function research, isn’t it.


> “That thing has a red dot on it, must be a female of my species, I should mate with it”.

I feel obligated to bring up the legendary tile fetish thread. (obviously NSFW)

https://i.kym-cdn.com/photos/images/original/001/005/866/b08.jpg


I kind of recall acausal decision theory, but like a small kid I‘d like to hear my favorite bed-time story again and again, please.

And if it’s the creepy one, the one which was decided not to be talked about (which, by the way, surely totally was no intentional Streisand-induction to market the alignment problem, says my hat of tinned foil) there is still the one with the boxes, no? And probably more than those two, yes?

Apr 16, 2022·edited Apr 18, 2022

A friend of mine is taking a class on 'religious robots' which, along with this post (thanks for writing it), has sparked my curiosity.

We could think of religion, or other cultural behaviors as 'meta-optimizers' that produce 'mesa-optimized' people. From a secular perspective, religious doctrines are selected through an evolutionary process which selects for fostering behaviors in a population that are most likely to guarantee survival, and meet the individual's goal of propagating genetic material. Eating kosher, for instance. Having a drive to adhere to being kosher is a mesa-optimization because it's very relevant for health reasons to avoid shellfish when you're living in the desert with no running water or electricity, and less so in a 21st century consumer society. Cultural meta-optimization arises based on environmental challenges.

Coming back to my original point on religious robots, this gives me a few more questions about how or whether this might manifest in AI. It's given me more questions that I'm completely unqualified to answer :)

1. Is it likely that AI would even be able to interact socially, in a collective way? If AIs are produced by different organizations, research teams, and through different methods, would they have enough commonalities to interact and form cultural behaviors?

2. What are the initial or early environmental challenges that AIs would be likely to face that would breed learned cultural behaviors?

3. What areas of AI research focus on continuous learning (as opposed to train-and-release, please excuse my ignorance if this is commonplace) which would create selection processes where AIs can learn from the mistakes of past generations?

4. Are there ways that we could learn to recognize AI rituals that are learned under outdated environmental conditions?


This is pretty much the plot of the superb novel Starfish, by Peter Watts, 1999.

Starfish and Blindsight are must-read novels for the AI and transhumanist enthusiast - and with your psychiatry background, I would love to get your take on Blindsight.

(Peter Watts made his books free to read on his website, Rifters)

https://www.rifters.com/real/STARFISH.htm


How is mesa-optimizing related to meta-gaming? Are these describing pretty much the same phenomenon and gaming is the inverse of optimizing in some way, or is our sense of "direction" reversed for one of these?


Rambling thoughts from someone not in the field:

One feature of gradient descent, and most practical optimization algorithms, is that it converges on local maxima. The global maximum is unknown and can easily be unreachable in practice. When adding more and more data, maxima shift and there is a meta-optimization problem.

A mesa optimizer seems significantly more computationally complex than a regular optimizer: it only appears in the fitness landscape once sufficient data has been added for it to become a significant minimum.

Could it be feasible to tweak the optimization algorithms such that converging on a mesa-optimizer is made exponentially unlikely, due to the ‘energy barriers’ to discovering that solution?


"Evolution designed humans myopically, in the sense that we live some number of years, and nothing that happens after that can reward or punish us further. But we still “build for posterity” anyway, presumably as a spandrel of having working planning software at all."

I am not sure this is obvious at all. It doesn't seem too hard to imagine that "building for posterity" increases the long-run survival of our offspring, even if the investments will certainly only pay off outside of our possible natural lifespan.

Civilization building is just like child rearing 2.0.
