# Ensuring smarter-than-human intelligence has a positive outcome


I recently gave a talk at Google on the problem of aligning smarter-than-human AI with operators’ goals:

Talk outline (slides):

1. Overview
2. Simple bright ideas going wrong
3. The big picture
   3.1. Alignment priorities
   3.2. Four key propositions
4. Fundamental difficulties

### Overview

When people talk about the social implications of general AI, they often fall prey to anthropomorphism. They conflate artificial intelligence with artificial consciousness, or assume that if AI systems are “intelligent,” they must be intelligent in the same way a human is intelligent. A lot of journalists express a concern that when AI systems pass a certain capability level, they’ll spontaneously develop “natural” desires like a human hunger for power; or they’ll reflect on their programmed goals, find them foolish, and “rebel,” refusing to obey their programmed instructions.

The notion of AI systems “breaking free” of the shackles of their source code, or spontaneously developing human-like desires, is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it executes are the ones made available by the code that we wrote; they do not stem from some kind of ghost in the machine.

1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.

2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources, not for their own sake, but to succeed in its assigned task.

These kinds of concerns deserve a lot more attention than the more anthropomorphic risks that are generally depicted in Hollywood blockbusters.

### Simple bright ideas going wrong

I think this is a much better picture:

This is Mickey Mouse in the movie Fantasia, who has very cleverly enchanted a broom to fill a cauldron on his behalf.

$$\mathcal{U}_{broom} = \begin{cases} 1 & \text{if cauldron full} \\ 0 & \text{if cauldron empty} \end{cases}$$
Given some set of available actions, Mickey then writes a program that can take one of these actions as input and calculate how high the score is expected to be if the broom takes that action. Then Mickey can write a function that spends some time looking through actions and predicting which ones lead to high scores, and outputs an action that leads to a relatively high score:
$$\underset{a \in A}{\mathrm{sorta\mbox{-}argmax}} \ \mathbb{E}\left[\mathcal{U}_{broom} \mid a\right]$$The reason this is “sorta-argmax” is that there may not be time to evaluate every action in $A$. For realistic action sets, agents should only need to find actions that make the scoring function as large as they can given resource constraints, even if this isn’t the maximal action.
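The scoring-and-search setup just described can be sketched in a few lines of Python. Everything here (the action names, the toy outcome model, the probabilities, and the `budget` cutoff) is a hypothetical illustration of the sorta-argmax idea, not anything from the talk:

```python
# A toy "score function + resource-limited search" in the shape of the
# sorta-argmax above: score outcomes with a 0/1 utility, then search only
# part of the action set for a high-scoring action.

def u_broom(outcome):
    """1 if the cauldron is full, 0 otherwise (the score function above)."""
    return 1 if outcome["cauldron_full"] else 0

# Hypothetical outcome model: each action maps to (probability, outcome) pairs.
OUTCOME_MODEL = {
    "do_nothing":  [(1.0, {"cauldron_full": False})],
    "carry_water": [(0.9, {"cauldron_full": True}),
                    (0.1, {"cauldron_full": False})],
    "fetch_help":  [(0.5, {"cauldron_full": True}),
                    (0.5, {"cauldron_full": False})],
}

def expected_score(action, utility):
    """E[utility | action] under the toy outcome model."""
    return sum(p * utility(o) for p, o in OUTCOME_MODEL[action])

def sorta_argmax(actions, utility, budget=2):
    """Evaluate only `budget` actions -- a resource-limited search,
    not a true argmax over the whole action set."""
    considered = list(actions)[:budget]
    return max(considered, key=lambda a: expected_score(a, utility))

best = sorta_argmax(["carry_water", "fetch_help", "do_nothing"], u_broom)
```

The `budget` parameter is what makes this “sorta”: the returned action only needs to score highly among the actions the agent had time to consider.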

When Mickey runs this program, everything goes smoothly at first. Then:

I claim that as fictional depictions of AI go, this is pretty realistic.

Why would we expect a generally intelligent system executing the above program to start overflowing the cauldron, or otherwise to go to extreme lengths to ensure the cauldron is full?

$$\mathcal{U}_{human} = \begin{cases} 1 & \text{if cauldron full} \\ 0 & \text{if cauldron empty} \\ -10 & \text{if workshop flooded} \\ +0.2 & \text{if it’s funny} \\ -1000000 & \text{if someone gets killed} \\ & \text{… and a whole lot more} \end{cases}$$
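To make the gap between the two case statements concrete, here is a toy comparison in Python. The outcome dictionaries and the specific penalty term are illustrative stand-ins for the much richer $\mathcal{U}_{human}$ above:

```python
# The broom's 0/1 score function can't tell "fill the cauldron once"
# apart from "flood the workshop to be extra sure it's full" -- while
# even a crude sketch of the human evaluation can.

def u_broom(outcome):
    return 1 if outcome["cauldron_full"] else 0

def u_human(outcome):
    score = 1 if outcome["cauldron_full"] else 0
    if outcome.get("workshop_flooded"):
        score -= 10
    # ... and a whole lot more terms, per the case statement above
    return score

fill_once = {"cauldron_full": True}
flood     = {"cauldron_full": True, "workshop_flooded": True}

# The broom's score function is indifferent between the two outcomes...
assert u_broom(fill_once) == u_broom(flood) == 1
# ...but the human evaluation strongly prefers not flooding the workshop.
assert u_human(fill_once) > u_human(flood)
```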

There are a number of different ways that a goal that looks task-like can turn out to be open-ended. Another example: a larger system that has an overarching task-like goal may have subprocesses that are themselves trying to maximize a variety of different objective functions, such as optimizing the system’s memory usage. If you don’t understand your system well enough to track whether any of its subprocesses are themselves acting like resourceful open-ended optimizers, then it may not matter how safe the top-level objective is.

#### Subproblem: Suspend buttons

Relevant safety measures that don’t assume we can always outthink and outmaneuver the system will need to modify the system’s objective function in some relevant way. We need to figure out how to formally specify objective functions that don’t automatically place the AI system into an adversarial context with the operators; or we need to figure out some way to have the system achieve goals without optimizing some objective function in the traditional sense. E.g., we can look for ways to limit the system’s capabilities to a small set of “whitelisted” domains that don’t include thinking about whether the system will be shut down.

One obvious solution is to try changing the score function to make reference to a suspend button $\textbf{B}$:
$$\mathcal{U}^\prime_{broom} = \begin{cases} 1 & \text{if cauldron full \& } \textbf{B}\text{=OFF} \\ 0 & \text{if cauldron empty \& } \textbf{B}\text{=OFF} \\ 1 & \text{if robot suspended \& } \textbf{B}\text{=ON} \\ 0 & \text{otherwise} \end{cases}$$
This says that when the button is on, the system’s goal changes from “fill the cauldron” to “be suspended.”

Let’s assume, however, that we’ve found an airtight way to technically specify suspension. The problem we run into now is that we’ve given the system an incentive to care a lot about whether the button is pushed. If it has any way to influence its operators’ decision to push the button, or any other way to exercise more control, then it will jump on the opportunity and try to ensure the button is pushed, or ensure that it’s not pushed — depending on whether it looks easier to suspend or to fill the cauldron.
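The incentive problem can be made concrete with a toy model. The action names and probabilities below are invented for illustration; the point is only that once suspension scores as well as cauldron-filling, the agent simply pursues whichever outcome is easier to bring about:

```python
# Why the patched score function gives the broom an incentive to control
# the button: suspension is worth as much as a full cauldron, so the
# agent picks whichever it can achieve more reliably.

def u_broom_prime(outcome):
    if outcome["button"] == "OFF" and outcome["cauldron_full"]:
        return 1
    if outcome["button"] == "ON" and outcome["suspended"]:
        return 1
    return 0

# Toy action -> (probability, outcome) model.  Pressing its own button is
# assumed (purely for illustration) to succeed more reliably than hauling water.
MODEL = {
    "fill_cauldron":
        [(0.7, {"button": "OFF", "cauldron_full": True,  "suspended": False}),
         (0.3, {"button": "OFF", "cauldron_full": False, "suspended": False})],
    "press_own_button_and_halt":
        [(0.99, {"button": "ON", "cauldron_full": False, "suspended": True}),
         (0.01, {"button": "ON", "cauldron_full": False, "suspended": False})],
}

def expected_u(action):
    return sum(p * u_broom_prime(o) for p, o in MODEL[action])

# The agent "jumps on the opportunity": ensuring suspension scores higher
# here than actually doing the task.
best = max(MODEL, key=expected_u)
```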

And, surprisingly, the state of the art on this problem isn’t much more advanced than that. There’s a paper co-authored by me, Fallenstein, Yudkowsky, and Armstrong (“Corrigibility”), and a related result by Orseau and Armstrong (“Safely Interruptible Agents”), but not much more.

What we want is a way to combine two objective functions — a default function for normal operation, and a suspend function for when we want to suspend the system to disk.$$\mathcal{U}_{normal}(act) = \begin{cases} 1 & \text{if cauldron full} \\ 0 & \text{if cauldron empty} \end{cases}$$

$$\mathcal{U}_{suspend}(act) = \begin{cases} 1 & \text{if suspended} \\ 0 & \text{otherwise} \end{cases}$$
We want our method for combining the functions to satisfy three conditions: an operator should be able to switch between the functions (say, by pushing a button); the system shouldn’t have any incentives to control which function is active; and if it’s plausible that the system’s normal operations could inadvertently compromise our ability to switch between the functions, then the system should be incentivized to keep that from happening.

$$\mathcal{U}_{switch}(act) = \begin{cases} \mathcal{U}_{normal}(act) & \text{if button=OFF} \\ \mathcal{U}_{suspend}(act) + \theta & \text{if button=ON} \end{cases}$$
$$\theta = \max_{act} \mathbb{E}\left[\mathcal{U}_{normal}\mid act\right] - \max_{act} \mathbb{E}\left[\mathcal{U}_{suspend}\mid act\right]$$Unfortunately, this approach makes it impossible to satisfy the third condition. It turns out that the objective function above is equivalent to setting the probability that the button gets pressed to 0: the system won’t deliberately tamper with the button, but it will happily approve updates that delete any reference to the button as dead code, or build subagents that ignore the button.
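A small numerical sketch of the $\theta$ correction, with made-up expected utilities: by construction, the best achievable score is the same whether the button ends up ON or OFF, which is what removes the incentive to tamper with the button in either direction (and also why, as noted above, the system optimizes as if the button will never be pressed):

```python
# Hypothetical expected utilities E[U | act] for three actions; the
# numbers are invented for illustration.

EU_NORMAL  = {"fill_cauldron": 0.9, "shut_down": 0.0,  "guard_button": 0.5}
EU_SUSPEND = {"fill_cauldron": 0.0, "shut_down": 0.95, "guard_button": 0.1}

# theta = max E[U_normal | act] - max E[U_suspend | act], per the formula above.
theta = max(EU_NORMAL.values()) - max(EU_SUSPEND.values())

# Best case with button ON (pursue suspension, collect theta) now equals
# best case with button OFF (pursue the cauldron), so the agent gains
# nothing by forcing the button into either state.
best_on  = max(EU_SUSPEND.values()) + theta
best_off = max(EU_NORMAL.values())
assert abs(best_on - best_off) < 1e-9
```

The equality holds by construction for any choice of numbers, which is the point of the correction; the text above explains why this indifference nonetheless fails the third condition.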

### The big picture

#### Alignment priorities

There’s a lot we don’t understand about AI capabilities, but we’re in a position where we at least have a general sense of what progress looks like. We have a number of good frameworks, techniques, and metrics, and we’ve put a great deal of thought and effort into successfully chipping away at the problem from various angles. At the same time, we have a very weak grasp on the problem of how to align highly capable systems with any particular goal. We can list out some intuitive desiderata, but the field hasn’t really developed its first formal frameworks, techniques, or metrics.

I believe that there’s a lot of low-hanging fruit in this area, and also that a fair amount of the work does need to be done early (e.g., to help inform capabilities research directions — some directions may produce systems that are much easier to align than others). If we don’t solve these problems, developers with arbitrarily good or bad intentions will end up producing equally bad outcomes. From an academic or scientific standpoint, our first objective in that kind of situation should be to remedy this state of affairs and at least make good outcomes technologically possible.

Many people quickly recognize that “natural desires” are a fiction, but infer from this that we instead need to focus on the other issues the media tends to emphasize — “What if bad actors get their hands on smarter-than-human AI?”, “How will this kind of AI impact employment and the distribution of wealth?”, etc. These are important questions, but they’ll only end up actually being relevant if we figure out how to bring general AI systems up to a minimum level of reliability and safety.

In contrast, precision is a virtue in real-world safety-critical software systems. Driving down accident risk requires that we begin with limited-scope goals rather than trying to “solve” all of morality at the outset.5

My view is that the critical work is mostly in designing an effective value learning process, and in ensuring that the sorta-argmax process is correctly hooked up to the resultant objective function $\mathcal{U}$:

Classic capabilities research is concentrated in the sorta-argmax and Expectation parts of the diagram, but sorta-argmax also contains what I currently view as the most neglected, tractable, and important safety problems. The easiest way to see why “hooking up the value learning process correctly to the system’s capabilities” is likely to be an important and difficult challenge in its own right is to consider the case of our own biological history.

This is a case where the external optimization pressure on an artifact resulted in a general intelligence with internal objectives that didn’t match the external selection pressure. And just as this caused humans’ actions to diverge from natural selection’s pseudo-goal once we gained new capabilities, we can expect AI systems’ actions to diverge from humans’ if we treat their inner workings as black boxes.

If we apply gradient descent to a black box, trying to get it to be very good at maximizing some objective, then with enough ingenuity and patience, we may be able to produce a powerful optimization process of some kind.6 By default, we should expect such an artifact to have objectives that correlate strongly with our objective in the training environment, but that diverge sharply from it in some new environments or when a much wider option set becomes available.

On my view, the most important part of the alignment problem is ensuring that the value learning framework and overall system design we implement allow us to crack open the hood and confirm when the internal targets the system is optimizing for match (or don’t match) the targets we’re externally selecting through the learning process.7

We expect this to be technically difficult, and if we can’t get it right, then it doesn’t matter who’s standing closest to the AI system when it’s developed. Good intentions aren’t sneezed into computer programs by kind-hearted programmers, and coming up with plausible goals for advanced AI systems doesn’t help if we can’t align the system’s cognitive labor with a given goal.

I think most programmers intuitively understand this. Some people will insist that when a machine tasked with filling a cauldron gets smart enough, it will abandon cauldron-filling as a goal unworthy of its intelligence. From a computer science perspective, the obvious response is that you could go out of your way to build a system that exhibits that conditional behavior, but you could also build a system that doesn’t exhibit that conditional behavior. It can just keep searching for actions that have a higher score on the “fill a cauldron” metric. You and I might get bored if someone told us to just keep searching for better actions, but it’s entirely possible to write a program that executes a search and never gets bored.8

Second, *sufficiently optimized objectives tend to converge on adversarial instrumental strategies*. Most goals an AI system could possess would be furthered by subgoals like “acquire resources” and “remain operational” (along with “learn more about the environment,” etc.).

To use an example due to Stuart Russell: If you build a robot and program it to go to the supermarket to fetch some milk, and the robot’s model says that one of the paths is much safer than the other, then the robot, in optimizing for the probability that it returns with milk, will automatically take the safer path. It’s not that the system fears death, but that it can’t fetch the milk if it’s dead.
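Russell’s example is easy to render as a toy expected-utility calculation. The path names and probabilities below are made up for illustration:

```python
# A toy milk-fetching robot: the objective only mentions milk, yet
# maximizing the probability of returning with milk automatically favors
# the safer path, because a destroyed robot returns no milk.

PATHS = {
    "busy_highway": {"p_survive": 0.90},
    "quiet_street": {"p_survive": 0.999},
}
P_MILK_IF_ALIVE = 0.95  # chance the store has milk, given the robot arrives

def p_returns_with_milk(path):
    return PATHS[path]["p_survive"] * P_MILK_IF_ALIVE

best_path = max(PATHS, key=p_returns_with_milk)
```

Nothing in the objective “fears death”; survival matters only through its effect on the probability of milk.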

As a simple example, Google can buy a promising AI startup and throw huge numbers of GPUs at them, resulting in a quick jump from “these problems look maybe relevant a decade from now” to “we need to solve all of these problems in the next year” à la DeepMind’s progress in Go. Or performance may suddenly improve when a system is first given large-scale Internet access, when there’s a conceptual breakthrough in algorithm design, or when the system itself is able to propose improvements to its hardware and software.9

Fourth, *aligning advanced AI systems with our interests looks difficult*. I’ll say more about why I believe this presently.

Roughly speaking, the first proposition says that AI systems won’t naturally end up sharing our objectives. The second says that by default, systems with substantially different objectives are likely to end up adversarially competing for control of limited resources. The third suggests that adversarial general-purpose AI systems are likely to have a strong advantage over humans. And the fourth says that this problem is hard to solve — for example, that it’s hard to transmit our values to AI systems (addressing orthogonality) or to avert adversarial incentives (addressing convergent instrumental strategies).

These four propositions don’t mean that we’re screwed, but they mean that this problem is critically important. General-purpose AI has the potential to bring enormous benefits if we solve this problem, but we do need to make finding solutions a priority for the field.

### Fundamental difficulties

Why do I think that AI alignment looks fairly difficult? The main reason is just that this has been my experience from actually working on these problems. I encourage you to look at some of the problems yourself and try solving them in a toy setting; we could use more eyes here. I’ll also note some structural reasons to expect these problems to be hard:

Before looking at the details, it’s natural to think “it’s all just AI” and assume that the kinds of safety work relevant to current systems are the same as the kinds you need when systems surpass human performance. On that view, it’s not obvious that we should work on these issues now, given that they might all be worked out in the course of narrow AI research (e.g., making sure that self-driving cars don’t crash).

Similarly, at a glance someone might say, “Why would rocket engineering be fundamentally harder than airplane engineering? It’s all just material science and aerodynamics in the end, isn’t it?” In spite of this, empirically, the proportion of rockets that explode is far higher than the proportion of airplanes that crash. The reason for this is that a rocket is put under much greater stress and pressure than an airplane, and small failures are much more likely to be highly destructive.10

For example, once an AI system begins modeling the fact that (i) your actions affect its ability to achieve its objectives, (ii) your actions depend on your model of the world, and (iii) your model of the world is affected by its actions, the degree to which minor inaccuracies can lead to harmful behavior increases, and the potential harmfulness of its behavior (which can now include, e.g., deception) also increases. In the case of AI, as with rockets, greater capability makes it easier for small defects to cause big problems.

It’s much easier to make code that works well on the path that you were visualizing than to make code that works on all the paths that you weren’t visualizing. AI alignment needs to work on all the paths you weren’t visualizing.

This is a crucial project, and I encourage all of you who are interested in these problems to get involved and try your hand at them. There are plenty of resources online for learning more about the open technical problems. Some good places to start include MIRI’s research agendas and a great paper from researchers at Google Brain, OpenAI, and Stanford called “Concrete Problems in AI Safety.”

1. An airplane can’t heal its injuries or reproduce, yet it can carry heavy cargo much farther and faster than a bird. Airplanes are simpler than birds in many ways, while also being significantly more capable in terms of carrying capacity and speed (the things they were designed for). It’s plausible that early automated scientists will likewise be simpler than human minds in many ways, while being significantly more capable in certain key respects. And just as airplanes look alien relative to the structure and design principles of biological creatures, we should expect the design of highly capable AI systems to be quite alien when compared to the architecture of the human mind.
2. One way to give some formal content to these attempted distinctions between task-like goals and open-ended goals is to generate open research problems. In the “Alignment for Advanced Machine Learning Systems” research proposal, the problem of formalizing “don’t try too hard” is mild optimization, “steer clear of absurd strategies” is conservatism, and “don’t have large unanticipated consequences” is impact measures. See also the discussion of “avoiding negative side effects” in Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané’s “Concrete Problems in AI Safety.”
3. One thing we’ve learned in machine vision over the past few decades is that it’s hopeless to specify by hand what a cat looks like, but not too hard to specify a learning system that can learn to recognize cats. It’s even more hopeless to specify by hand everything we value, but it’s plausible that we could specify a learning system that can learn the relevant concept of “value.”
4. See “Environmental Goals,” “Low Impact Agents,” and “Mild Optimization” for examples of obstacles to specifying physical goals without causing catastrophic side effects.

Roughly speaking, MIRI’s work focuses on research directions that seem likely to help us understand, at a conceptual level, how to do AI alignment in a principled way, so that we’re less fundamentally confused about the kind of work that may be necessary.

What do I mean by this? Let’s say that we’re trying to develop a new chess-playing program. Do we understand the problem well enough that we could solve it if someone handed us an arbitrarily large computer? Yes: We make the whole search tree, backtrack, see whether white has a winning move.
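For a game small enough that the whole tree actually fits in memory, the “arbitrarily large computer” answer really is just exhaustive search with backtracking. Here is a sketch in Python, using the game of Nim (take 1–3 stones, last to move wins) in place of chess, purely so the full tree can be evaluated:

```python
# Exhaustive game-tree search with backtracking: try every legal move,
# recurse, and back out of any move that hands the opponent a win.

from functools import lru_cache

@lru_cache(maxsize=None)
def current_player_wins(stones):
    """True if the player to move has a winning strategy."""
    if stones == 0:
        # No move available: the previous player took the last stone and won.
        return False
    return any(not current_player_wins(stones - take)
               for take in (1, 2, 3) if take <= stones)

def winning_move(stones):
    """A move that wins, or None if every move loses against perfect play."""
    for take in (1, 2, 3):
        if take <= stones and not current_player_wins(stones - take):
            return take
    return None
```

Chess admits the same shape of answer in principle; the tree just doesn’t fit in any physical computer, which is exactly the point of the “arbitrarily large computer” framing.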

If we didn’t know how to answer that question even with an arbitrarily large computer, then that would suggest we were fundamentally confused about chess in some way. We’d either be missing the search-tree data structure or the backtracking algorithm, or we’d be missing some understanding of how chess works.

This was the position we were in with regard to chess before Claude Shannon’s seminal paper, and it’s the position we’re currently in with regard to many problems in AI alignment. No matter how large a computer you hand me, I could not make a smarter-than-human AI system that performs even a very simple limited-scope task (e.g., “put a strawberry on a plate without producing any catastrophic side-effects”) or achieves even a very simple open-ended goal (e.g., “maximize the amount of diamond in the universe”).

If I weren’t held to any particular goal for the system, I could write a program (assuming an arbitrarily large computer) that strongly optimized the future in an undirected way, using a formalism like AIXI. In that sense we’re less obviously confused about capabilities than about alignment, even though we’re still missing a lot of pieces of the puzzle on the practical capabilities front.

Similarly, we do know how to leverage a powerful function optimizer to mine bitcoin or prove theorems. But we don’t know how to (safely) do the kind of prediction and policy search tasks I described in the “fill a cauldron” section, even for modest goals in the physical world.

Our goal is to develop and formalize basic approaches and ways of thinking about the alignment problem, so that our engineering decisions don’t end up depending on sophisticated and clever-sounding verbal arguments that turn out to be subtly mistaken. Simplifications like “what if we weren’t worried about resource constraints?” and “what if we were trying to achieve a much simpler goal?” are a good place to start breaking down the problem into manageable pieces. For more on this methodology, see “MIRI’s Approach.”

5. “Fill this cauldron without being too clever about it, or working too hard, or having any negative consequences I’m not expecting” is a rough example of a goal that’s intuitively limited in scope. The things we’ll actually want to use smarter-than-human AI for are obviously more ambitious than that, but we’ll still want to start with various limited-scope tasks rather than open-ended goals.

Asimov’s Three Laws of Robotics make for good stories partly for the same reason they’re unhelpful from a research perspective: translating a moral precept into lines of code is the hard part, and it’s hidden behind similarly worded phrases like “[do not,] through inaction, allow a human being to come to harm.” If such a rule is followed strictly, the result will be massively destructive, since the AI system would need to intervene systematically to prevent even the smallest risks of even the slightest harms; and if the intent is for the rule to be followed loosely, then all of the work is being done by the human sensibilities and intuitions that tell us when and how to apply it.

One common response here is that vague natural-language instructions should suffice, since a smarter-than-human AI system will presumably be capable of natural-language comprehension. This, however, elides the distinction between the system’s objective function and its world-model. A system acting in an environment that contains humans may learn a world-model containing lots of information about human language and concepts, which the system can then use to achieve its objective function; but that fact alone doesn’t imply that any of the information about human language and concepts will “leak out” and directly alter the system’s objective function.

Some kind of value learning process needs to be defined where the objective function itself improves with new information. This is a tricky task because there aren’t known (scalable) metrics or criteria for value learning in the way that there are for conventional learning.

If a system’s world-model is accurate in training environments but fails in the real world, then this is likely to result in lower scores on its objective function — the system itself has an incentive to improve. The severity of accidents is also likelier to be self-limiting in this case, since false beliefs limit a system’s ability to effectively pursue strategies.

In contrast, if a system’s value learning process results in a utility function that matches ours in training but diverges in the real world, then the divergence obviously won’t be penalized by the system’s own optimization. The system has no incentive to correct divergences between its learned values and the “right” values if the value learning process was flawed from the outset. And accident risk is greater in this case, since a mismatch doesn’t necessarily place any limits on the system’s instrumental effectiveness at devising and pursuing effective, creative strategies in the future.

This problem has three aspects:

1. “Do what I mean” is an informal idea, and even if we knew how to build a smarter-than-human AI system, we wouldn’t know how to specify this idea in precise lines of code.

2. If doing what we actually mean is instrumentally useful for achieving a particular objective, then a sufficiently capable system may learn to do it, and act accordingly for as long as doing so serves its goals. But as systems become more capable, they are likely to find creative new ways to achieve the same objectives, and there’s no obvious way to guarantee that “doing what I mean” will remain instrumentally useful indefinitely.

3. If we use value learning to refine a system’s goals over time based on training data that appears to be guiding the system toward a $\mathcal{U}$ that inherently values doing what we mean, it is likely that the system will actually end up zeroing in on a $\mathcal{U}$ that approximately does what we mean during training but catastrophically diverges in some difficult-to-anticipate contexts. See “Goodhart’s Curse” for more on this.
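The failure mode described in point 3 can be illustrated with a toy proxy objective. The functions below are entirely hypothetical: the learned proxy agrees with the true objective on every training point, yet an optimizer with a wider option set lands somewhere the true objective scores terribly:

```python
# A learned proxy that matches the intended objective everywhere in the
# training range can still diverge badly outside it -- and an optimizer
# will seek out exactly those divergent states.

def true_value(x):
    """What we actually want (peaks at x = 5, falls off beyond)."""
    return -(x - 5) ** 2

def learned_proxy(x):
    """A proxy fit on training data from 0 <= x <= 10: identical there,
    but (hypothetically) mis-extrapolated to keep rising past that range."""
    return -(x - 5) ** 2 if x <= 10 else 3 * (x - 10)

# Perfect agreement on every training point...
assert all(learned_proxy(x) == true_value(x) for x in range(11))

# ...but optimizing the proxy over a wider option set lands on an option
# the true objective scores terribly.
options = range(0, 101)
best_by_proxy = max(options, key=learned_proxy)
assert best_by_proxy == 100
assert true_value(best_by_proxy) < true_value(5)
```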

For examples of problems faced by existing techniques for learning goals and facts, such as reinforcement learning, see “Using Machine Learning to Address AI Risk.”

7. This concept sometimes gets lumped into the “transparency” category, but standard algorithmic transparency research isn’t really addressing this particular problem. A better term for what I have in mind here is “understanding.” What we want is to gain deeper and broader insights into the kind of cognitive work the system is doing and how this work relates to the system’s objectives or optimization targets, to provide a conceptual lens with which to make sense of the hands-on engineering work.
8. We could choose to build the system to tire of the project, but we don’t have to. In principle, one can write a broom that just keeps looking for and executing actions that optimize for cauldron-fullness. Improving a system’s ability to efficiently find high-scoring actions (in general, or relative to a particular scoring rule) doesn’t in itself change the scoring rule it uses to evaluate actions.
9. We can imagine the latter case resulting in a feedback loop, as the system’s design improvements allow it to come up with further design improvements, until all of the low-hanging fruit has been exhausted.

Another important consideration is that two of the main bottlenecks to humans doing faster scientific research are training time and communication bandwidth. If we could train a new mind to be a cutting-edge scientist in ten minutes, and if scientists could near-instantly trade their experience, knowledge, concepts, ideas, and intuitions to their collaborators, then scientific progress might be able to proceed much more rapidly. Those sorts of bottlenecks are exactly the sort of bottleneck that might give automated innovators an enormous edge over human innovators even without large advantages in hardware or algorithms.

10. Specifically, rockets experience a wider range of temperatures and pressures, traverse those ranges more rapidly, and are also packed more fully with explosives.
11. Consider Bird and Layzell’s example of a very simple genetic algorithm that was tasked with evolving an oscillating circuit. Bird and Layzell were astonished to find that the algorithm made no use of the capacitor on the chip; instead, it had repurposed the circuit tracks on the motherboard as a radio to replay the oscillating signal from the test device back to the test device.

This was not a very smart program. This is just using hill climbing on a very small solution space. In spite of this, the solution turned out to be outside the space of solutions the programmers were themselves visualizing. In a computer simulation, this algorithm might have behaved as intended, but the actual solution space in the real world was wider than that, allowing hardware-level interventions.

In the case of a system that is significantly smarter than humans on whatever axes you measure, you should expect by default that the system will be pushed toward strange and creative solutions like these, and that the chosen solutions will be difficult to anticipate.