nostalgebraist:
comments on mesa-optimizers
(Copy/pasted from a comment on the latest ACX post, see that for context if needed)
FWIW, the mesa-optimizer concept has never sat quite right with me. There are a few reasons, but one of them is the way it bundles together "ability to optimize" and "specific target."
A mesa-optimizer is supposed to be two things: an algorithm that does optimization, and a specific (fixed) target it is optimizing. And we talk as though these things go together: either the ML model is not doing inner optimization, or it is *and* it has some fixed inner objective.
But, optimization algorithms tend to be general. Think of gradient descent, or planning by searching a game tree. Once you've developed these ideas, you can apply them equally well to any objective.
While it is true that some algorithms work better for some objectives than others, the differences are usually very broad mathematical ones (eg convexity).
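To make "apply equally well to any objective" concrete, here's a toy sketch (mine, not from the original discussion): a single gradient-descent routine, written once, will minimize whatever one-variable function it's handed. The two objectives at the bottom are arbitrary placeholders.

    # One optimizer, many objectives: the routine knows nothing about what it
    # is optimizing; the objective is just a function argument.
    import math

    def numeric_grad(f, x, eps=1e-6):
        # Finite-difference estimate of df/dx at x.
        return (f(x + eps) - f(x - eps)) / (2 * eps)

    def gradient_descent(f, x0, lr=0.1, steps=500):
        # Minimize any one-variable function f, starting from x0.
        x = x0
        for _ in range(steps):
            x -= lr * numeric_grad(f, x)
        return x

    # The same code, unmodified, applied to two unrelated objectives:
    print(gradient_descent(lambda x: (x - 3) ** 2, x0=0.0))      # approx. 3
    print(gradient_descent(lambda x: math.cosh(x + 5), x0=0.0))  # approx. -5

Nothing in the routine would change if the second objective were instead a differentiable proxy for "number of paperclips produced"; the generality is the whole point.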
So, a misaligned AGI that maximizes paperclips probably won't be using "secret super-genius planning algorithm X, which somehow only works for maximizing paperclips." It's not clear that algorithms like that even exist, and if they do, they're harder to find than the general ones (and, all else being equal, inferior to them).
Or, think of humans as an inner optimizer for evolution. You wrote that your brain is "optimizing for things like food and sex." But more precisely, you have some optimization power (your ability to think/predict/plan/etc), and then you have some basic drives.
Often, the optimization power gets applied to the basic drives. But you can use it for anything.
Planning your next blog post uses the same cognitive machinery as planning your next meal. Your ability to forecast the effects of hypothetical actions is there for your use at all times, no matter what plan of action you're considering and why. An obsessive mathematician who cares more about mathematical results than food or sex is still thinking, planning, etc. -- they didn't have to reinvent those things from scratch once they strayed sufficiently far from their "evolution-assigned" objectives.
Having a lot of optimization power is not the same as having a single fixed objective and doing "tile-the-universe-style" optimization. Humans are much better than other animals at shaping the world to our ends, but our ends are variable and change from moment to moment. And the world we've made is not a "tiled-with-paperclips" type of world (except insofar as it's tiled with humans, and that's not even supposed to be our mesa-objective, that's the base objective!)
If you want to explain anything in the world now, you have to invoke entities like "the United States" and "supply chains" and "ICBMs," and if you try to explain those, you trace back to humans optimizing-for-things, but not for the same thing.
Once you draw this distinction, "mesa-optimizers" don't seem scary, or don't seem scary in a unique way that makes the concept useful. An AGI is going to "have optimization power," in the same sense that we "have optimization power." But this doesn't commit it to any fixed, obsessive paperclip-style goal, any more than our optimization power commits us to one.
And even if the base objective is fixed, there's no reason to think an AGI's inner objectives won't evolve over time, or adapt in response to new experience. (Evolution's base objective is fixed, but our inner objectives are not, and why would they be?)
Relatedly, I think the separation between a "training/development phase" where humans have some control, and a "deployment phase" where we have no control whatsoever, is unrealistic. Any plausible AGI, after first getting some form of access to the real world, is going to spend a lot of time investigating that world and learning all the relevant details that were absent from its training. (Any "world" experienced during training can at most be a very stripped-down simulation, not even at the level of eg contemporaneous VR, since we need to spare most of the compute for the training itself.)
If its world model is malleable during this "childhood" phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training.
See also my LW comment here.
I agree that optimization power is not *necessarily* correlated with specific goals. But why wouldn't mesa-optimizers, contingently, have a specific goal? Presumably we're running gradient descent on some specific loss function, like "number of paperclips produced", and then the mesa-optimizer inherits some proxy for that.
I agree humans aren't like that, and that this is surprising.
Maybe this is because humans aren't real consequentialists, but perceptual control theory agents trying to satisfy finite drives? Eg when we're hungry, our goal becomes to find food, but we don't want to tile the universe with food, we just want to eat 3000ish calories and then we're done. We have a couple of other goals like that, and when we've accomplished all of them, most people are content to just hang out on the beach until something else happens.
Might gradient descent produce a PCT agent instead of a mesa-optimizer? I don’t know. My guess is maybe, but that optimizers would be more, well, optimal, and we would get one eventually (either later in the gradient descent process, or in a different lab later). My guess is evolution didn’t make us optimizers because it hasn't had enough time to work with us while we've been intelligent. If we got locked at 20th century technology forever, I think it might, after a few million years, produce humans who genuinely wanted to tile the universe with kids.
"Even if the base objective is fixed, there's no reason to think an AGI's inner objectives won't evolve over time, or adapt in response to new experience."
Wouldn't the first thing a superintelligence with a goal does be to make sure its goal doesn't drift?
"If its world model is malleable during this 'childhood' phase, why not its values, too? It has no reason to single out a region of itself labeled $MESA_OBJECTIVE and make it unusually averse to updates after the end of training."
I think this is where the deception comes in. If the mesa-optimizer is smart and doesn't want people (or other parts of itself) changing its values, it will take steps to stop that, either by lying about its values or fighting back.
I think this idea that "real consequentialists are more optimal" is (sort of) the crux of our disagreement.
But it will be easiest to explain why if I spend some time fleshing out how I think about the situation.
What are these things we're talking about, these "agents" or "intelligences"?
First, they're physical systems. (That far is pretty obvious.) And they are probably pretty complicated ones, to support intelligence. They are structured in a purposeful way, with different parts working together.
And this structure is probably hierarchical, with higher-level parts that are made up of lower-level parts. Like how brains are made of neuroanatomical regions, which are made of cells, etc. Or the nested layers of abstraction in any non-trivial (human-written) computer program.
At some level(s) of the hierarchy, there may be parts that "run optimization algorithms."
But these could live at any level of the hierarchy. They could be very low-level and simple. There may be optimization algorithms at low levels controlled by non-optimization algorithms at higher levels. And those might be controlled by optimization algorithms at even higher levels, which in turn might be controlled by non-optimization ... etc.
Consider my computer. Sometimes, it runs optimization algorithms. But they're not optimizing the same function every time. They don't "have" targets of their own, they're just algorithms.
They blindly optimize whatever function they're given by the next level up, which is part of a long stack of higher levels (such as the programming language and the operating system). Few, if any, of the higher-level routines are optimization algorithms in themselves. They just control lower-level optimization algorithms.
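A toy sketch of that layering (my own invention; scipy's minimize_scalar stands in for the low-level optimization algorithm, and the task names and objective functions are made up): the top level is ordinary control flow with no objective of its own, and it just hands different small objectives down to the same dumb routine.

    # The top level here is plain if/else control flow, not an optimizer.
    # Only the leaf calls do optimization, on whatever objective they're handed.
    from scipy.optimize import minimize_scalar

    def price_objective(p):
        # Hypothetical profit model: demand falls off linearly with price.
        return -(p * max(0.0, 100.0 - 2.0 * p))   # negated profit, since we minimize

    def packing_objective(w):
        # Hypothetical shipping-cost model for a box of width w.
        return 5.0 + 0.3 * w + 200.0 / w

    def run_business_day(tasks):
        # This function has no objective of its own; it just decides which
        # small sub-problem, if any, gets handed to the low-level optimizer.
        decisions = {}
        for task in tasks:
            if task == "set_price":
                decisions[task] = minimize_scalar(
                    price_objective, bounds=(0, 50), method="bounded").x
            elif task == "choose_box_size":
                decisions[task] = minimize_scalar(
                    packing_objective, bounds=(1, 100), method="bounded").x
            else:
                decisions[task] = None   # most tasks involve no optimization at all
        return decisions

    print(run_business_day(["set_price", "choose_box_size", "send_invoices"]))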
If I use my computer to, say, make an amusing tumblr bot, I am wielding a lot of optimization power. But most of my computer is not doing optimization.
Python isn't asking itself, "what's the best code to run next if we want to make amusing tumblr bots?" The OS isn't asking itself, "how can I make all the different programs I'm running into the best versions of themselves for making amusing tumblr bots?"
And this is probably a good thing. It's hard to imagine these bizarre behaviors being helpful, giving me a more amusing tumblr bot at the end.
Which is to say, "doing optimization well" (in the sense of hitting the target, sitting on a giant heap of utility) can happen without doing optimization at high abstraction levels.
And indeed, I'd go further, and say that it's generically better (for hitting your target) to put all the optimization at low levels, and control it with non-optimizing wrappers.
Why? The reasons include:
Goodhart's Law
- ...especially its "extremal" variant, where optimization preferentially chooses regions of solution space where the assumptions behind your proxy target break down. (There's a toy numerical sketch of this after the list.)
- This is no less a problem when the thing choosing the target is part of a larger program, rather than a human.
- Keeping optimization at low levels decreases the blast radius of this effect.
- If the things you're optimizing are low-level intermediate results in the process of choosing the next action at the agent level, the impacts of Goodharting each one may cancel out. The agent-level actions won't look Goodharted, just slightly noisy/worse.
Speed
- Optimization tends to be slow. In a generic sense, it's the "slow, hard, expensive way" to do any given task, and you avoid it if you can. (Think of System 2 vs System 1, satisficing vs maximizing, etc)
- To press the point: why is there a distinction between "training" and "inference"? Why aren't neural networks training at all times? Because training is high-level optimization, and takes lots of compute, much more than inference.
- Optimization gets vastly slower at higher levels of abstraction, because the state space gets so much larger (consider optimizing a single number vs. optimizing the entire world model).
- You still want to get optimal results at the highest level, but searching for improvements at a high level is very expensive in terms of time/etc. In the time it takes to ask "what if the entire way I think were different, like what if it were [X]?", for one single [X], you could instead have run thousands of low-level optimization routines.
- Optimization tends to take super-linear time, which means that nesting optimization inside of optimization is ultra-slow. So, you have to make tradeoffs and put the optimization at some levels instead of others. You can't just do optimization at every level at once. (Or you can, but it's extremely suboptimal.)
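On the extremal-Goodhart point above, a toy numerical sketch (invented numbers, nothing from the original discussion): the proxy tracks the true objective fine in the normal range, and a hard enough push on the proxy leaves that range entirely.

    # The proxy matches the true objective in the range where it was "fit"
    # (small x), but the relationship breaks down at extreme values, which is
    # exactly where a hard optimizer ends up.

    def true_value(x):
        return x - 0.02 * x ** 2   # good up to a point, then actively bad

    def proxy(x):
        return x                   # only a decent approximation for x <= ~20

    modest = max(range(0, 21), key=proxy)     # stay inside the range the proxy was built for
    extreme = max(range(0, 1001), key=proxy)  # push on the proxy as hard as possible

    print(true_value(modest))    # 12.0: about as good as the true objective gets
    print(true_value(extreme))   # -19000.0: Goodharted into disaster

The harder the search pushes on the proxy, the worse the true outcome; keeping the optimization small and low-level is what keeps the divergence small.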
------
When is the agent an "optimizer" / "true consequentialist"?
This question asks whether the very highest level of the hierarchy, the outermost wrapper, is an optimization algorithm.
As discussed above, this is not a promising agent design! There is an argument to be had about whether it still could emerge, for some weird reason.
But I want to push back against the intuition that it's a typical result of applying optimization to the design, or that agents sitting on giant heaps of utility will typically have this kind of design.
The two questions
- "Can my computer make amusing tumblr bots?"
- "Is my computer as a whole, hardware and software, one giant optimizer for amusing tumblr bots?"
have very little to do with one another.
In the LessWrong-adjacent type of AI safety discussion, there's a tendency to overload the word "optimizer" in a misleading way. In casual use, "optimizer" conflates
- "thing that runs an optimization algorithm"
- "thing that has a utility function defined over states of the real world"
- "thing that's good at maximizing a utility function defined over states of the real world"
- "smart thing" (because you have to be smart to do the previous one)
But doing optimization all the way at the top, involving your whole world model and your highest-level objectives, is very slow, and tends to extremal-Goodhart itself into strange and terrible choices of action.
It's also not the only way of applying optimization power to your highest-level objectives.
If I want to make an amusing tumblr bot, the way to do this is not to ponder the world as a whole and ask how to optimize literally everything in it for maximal amusing bot production. Even optimizing just my computer for maximal amusing bot production is way too high-level. (Should I change the hue of my screen? the logic of the background process that builds a search index of my files??? It wastes time to even pose the questions.)
What I actually did was optimize just a few very simple parts of the world, a few collections of bits on my computer or other computers. And even that was very time-intensive and forced me to make tradeoffs about where to spend my GPU/TPU hours. And then of course I had to watch it carefully, applying lots of heuristics to make sure it wasn't Goodharting me (overfitting, etc).
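That monitoring step has a very standard concrete form, so here is a sketch of it (with a simulated training curve; this is an illustration, not code from the actual bot project): early stopping, a plain non-optimizing wrapper that watches the inner optimization and halts it once the training loss and the thing we actually care about start to come apart.

    # A non-optimizing wrapper (early stopping) keeping watch over an inner
    # optimization loop so it can't Goodhart the training loss.

    def simulated_losses(step):
        # Stand-in for one epoch of training: training loss keeps falling
        # forever, but held-out loss bottoms out around step 10 and then
        # rises again (overfitting).
        train_loss = 1.0 / (1 + step)
        val_loss = 0.5 + (step - 10) ** 2 / 200.0
        return train_loss, val_loss

    def train_with_early_stopping(max_steps=100, patience=3):
        best_val, best_step, bad_steps = float("inf"), 0, 0
        for step in range(max_steps):
            train_loss, val_loss = simulated_losses(step)  # the inner optimizer "improves" every step
            if val_loss < best_val:
                best_val, best_step, bad_steps = val_loss, step, 0
            else:
                bad_steps += 1      # the proxy and the real target are coming apart
            if bad_steps >= patience:
                break               # the wrapper, not the optimizer, decides to stop
        return best_step, best_val

    print(train_with_early_stopping())   # stops near step 10, not step 99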
To get back to the original topic, the kind of "mesa-optimizer" we're worried about is an optimizer at a very high level.
It's not dangerous (in the same way) for a machine to run tiny low-level optimizers at a very fast rate. I don't care how many times you run Newton's method to find the roots of a one-variable function -- it's never going to "wake up" and start trying to ensure its goal doesn't change, or engaging in deception, or whatever.
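For concreteness, this is roughly all there is to such a routine (a textbook Newton iteration, my own toy version):

    # Newton's method for a root of a one-variable function: the entire
    # "optimizer" is a short loop of arithmetic.  Nothing in here can acquire
    # a goal, notice it is being run, or care how many times it gets called.

    def newton(f, df, x0, tol=1e-10, max_iter=50):
        x = x0
        for _ in range(max_iter):
            step = f(x) / df(x)
            x -= step
            if abs(step) < tol:
                break
        return x

    # Find sqrt(2) as the positive root of x^2 - 2:
    print(newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0))   # 1.41421356...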
And I am doubtful that mesa-optimizers like this will arise, for the same reasons I am doubtful that the agent will do optimization at its highest level.
Once we are pointing at the agent, or a part of it, and saying "that's a superintelligence, and wouldn't a superintelligence do . . . ", we're probably not talking about something that runs optimization.
You don't spend your optimization budget at the level of abstraction where intelligence happens. You spend it at lower levels, and that's what intelligence is made out of.