AI alignment
The enormous problem facing humanity.
Credit for the cover image goes to Anna Husfeldt.
Epistemic status: In this post, I will argue that AI alignment is important and that it is difficult. That is a lot of ground to cover, so not all arguments will be very precise or assume the same level of technical background. Opinions on exactly how important and/or difficult the alignment problem is vary a lot within the machine learning community, and even within the AI alignment field itself (although, as you’d expect, much less). As a result: keep in mind that this blog post is very much slanted towards my views on the topic.
My thanks to Otto Barten, Leonard Bereska, Jim Boelrijk, Nynke Boiten, Nikita Bortych, Mariska van Dam, Fokel Ellen, Isaak Mengesha, Sam Ritchie, Victoria Ramírez López, Nandi Robijns, Joris Schefold, Marieke de Visscher, Tina Wünn, and especially Leon Lang and Max Oosterbeek for providing feedback. Opinions and mistakes are mine.
So. AI alignment.
I am writing this post because I am immensely worried about the risks that come with building increasingly capable AI systems.
Why write my own post? Why not just refer to one of the many (now, finally) available online? Because there’s one thing I haven’t seen many existing articles do: present a complete argument for why we should be worried, while straddling the line between 1) understandability for non-experts, and 2) sufficient technical detail for AI experts. This article is meant both for laypeople who want to understand better why we might expect AI to go badly, and for machine learning researchers who would like to see the whole argument written down in one place so that they can evaluate it for themselves. Of course, I cannot come close to covering everything in a single blog post. To compensate, I’ve included links to additional information and further reading. Even so, this post primarily discusses the technical issues, although the problem is also very much one of policy and governance: I will touch on this throughout as well.
To start, what is AI alignment?
AI alignment is the process of making sure that AI systems work towards our intended goals, whatever those are. An AI system that works towards our intended goals is called aligned.
That’s the goal. Aligning AI systems is surprisingly difficult, which is a big part of what has the research community worried. We’ll explore the details of this step by step:
- Sounding the alarm. Setting the stage: many AI researchers are suddenly worried.
- What is AI? How are AI systems created, what can they do, and what problems are we already seeing?
- Aligning AI. What are the difficulties with aligning AI? This is the most technical section and the easiest to skip.
- Existential risk. Why is unaligned AI existentially dangerous to humanity?
- What to do? A call to action. What can you do as a layperson, as someone working in policy, or as a researcher?
Sounding the alarm
In the last few months, many researchers have publicly voiced their concerns about the dangers of future AI systems. These include two of the three winners of the 2018 Turing Award – Geoffrey Hinton and Yoshua Bengio – as well as Gary Marcus, Nando de Freitas, David Duvenaud, Dan Hendrycks, and many others. Even Snoop Dogg is worried.
Other researchers have been publicly worried for longer. Among these is Stuart Russell, professor of computer science at the University of California, Berkeley, and author of “the most popular artificial intelligence textbook in the world”. He recently spoke at the Pakhuis de Zwijger panel on Existential Risks of AI in Amsterdam, where I was invited to be a panellist. The panel was held at a great time: OpenAI had just released their newest Large Language Model (LLM) – GPT-4 – a month before, Microsoft had released Sydney/Bing a month before that, with Google’s Bard right on their heels, and the world was paying attention. The panel sold out, was moved to the main hall, and sold out again.
What is AI?
So what are we worried about? Stuart Russell’s introductory talk is worth watching for both laypeople and experts. A recording of his talk as well as the subsequent panel discussion is available here. He starts by explaining the goal of companies like OpenAI and DeepMind: to build general-purpose AI – AI capable of (quickly learning) high-quality behaviour in any task environment. You may have previously heard this described as Artificial General Intelligence (AGI). Such an AI would be able to perform a wide variety of tasks. Among other things, it would be able to play chess, Go, and other games, search the internet, make plans and execute on those plans, chat with people, manipulate people, write code, and solve scientific problems.
Current AI systems can do all these things already to some degree. Mostly, these are narrow AI systems trained to do a particular task very well, but there are some examples of systems that can already perform multiple tasks. For instance, ChatGPT can chat, write code, and answer scientific questions (with varying degrees of success). Its sister product Sydney/Bing can chat, search the internet, and even exhibited unexpected antisocial behaviour before Microsoft added some safeguards. Interestingly, large language models have developed capabilities they were not directly trained for, or have learnt to perform tasks that their smaller counterparts failed at – emergent capabilities – merely by having been scaled up.1 This makes it likely that these models will develop even more novel capabilities as we scale them up even further.
GPT-4 (the model behind ChatGPT) is already showing enough generality to impress the researchers at Microsoft. Their recent paper – Sparks of Artificial General Intelligence – reports on their experiments with an early version of GPT-4. They write: “We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting. Moreover, in all of these tasks, GPT-4’s performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT.”
AI causing trouble
Even our current narrow(er) AI systems are already causing some trouble. For example, AI can exacerbate problems caused by our society’s socio-economic biases – e.g., racism, gender bias, etc. – by reflecting them in their algorithmic biases. This risks amplifying inequities in society, for instance in our health systems. Something else many of us have already experienced is the effect of algorithmically curated social media feeds on the quality of online discourse. Stuart Russell presents the classic example of social media AI systems trained to maximise clickthrough-rate eventually learning to amplify clickbait – which is, in fact, a very good strategy for maximising clickthrough-rate, but which has the side effect of spreading misinformation. We currently do not have any obvious solutions to these problems, and they are likely to get worse as we include AI systems in more and more of our society’s processes. OpenAI itself has warned about the risk of LLM-powered misinformation campaigns, which seem very possible with current systems.
The social media example is illustrative because it points to an enormous safety problem in the way we currently build AI systems: we don’t know how to robustly get AI to do what we want. Our current paradigm is deep learning, which involves training large models – called neural networks – on huge datasets. Training is implemented as an optimisation problem. In practice, this means that every data point tells the model how to change itself a tiny bit such that it becomes better at predicting that data point. The quality of a prediction is measured by your specified objective, which we call the ‘loss function’. Training an AI system essentially amounts to searching a huge space of possible algorithms for one that does well on your task, as specified by your training data and loss function. This has one main implication: no one knows exactly why a deep learning system predicts the things that it does or performs the actions that it does.
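To make ‘training as optimisation’ concrete, here is a minimal sketch of the basic loop. It is deliberately toy-sized – a single-parameter model fit by gradient descent – and is only meant to show the structure shared by essentially all deep learning training, not anything resembling a real system.

```python
import numpy as np

# A toy sketch of training as optimisation: each data point nudges the
# model's parameter a tiny bit so that the loss (the specified objective)
# on that point goes down. Real systems have billions of parameters, but
# the loop is structurally the same.

rng = np.random.default_rng(42)
xs = rng.normal(size=100)
ys = 3.0 * xs + rng.normal(scale=0.1, size=100)  # data generated by a hidden rule

w = 0.0        # the model's single parameter
lr = 0.05      # learning rate

for x, y in zip(xs, ys):
    pred = w * x
    loss = (pred - y) ** 2        # the loss function: how bad was this prediction?
    grad = 2 * (pred - y) * x     # how the loss changes as w changes
    w -= lr * grad                # nudge w to predict this point a bit better

print(w)  # ends up close to 3.0: the search "found" the rule behind the data
```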
It is worth repeating: no one knows exactly why a deep learning system does what it does. When Sydney/Bing started aggressively misbehaving – by, for instance, declaring its love to a New York Times journalist and subsequently urging him to leave his wife – Microsoft did not know why. The best they could do was: “Guess we didn’t train it hard enough to be helpful and harmless”, and this is the best anyone can do right now. We simply do not have the technology to understand in detail how deep neural networks generate their answers. To Mikhail Parakhin, the person at Microsoft in charge of Sydney/Bing’s deployment, such behaviour came as quite a surprise. Why? Because no one in their test markets had reported such behaviour. No one knows exactly what large language models can do, not even their creators. The field that concerns itself with improving our precise understanding of these capabilities is called ‘mechanistic interpretability’ and it is hugely underexplored compared to the field of ‘giving the models more capabilities’.2
Continuing to build more capable AI systems when we understand so little of what goes on in the models has the potential to lead to existentially dangerous scenarios.
In my experience, the above statement sounds obvious to some people, and completely out of left field for others. Below I will sketch out why I think this is something we should be deeply worried about.
Aligning AI
Imagine we succeed in building a very capable AI system using deep learning. As is standard deep learning protocol, we are using lots of data from all kinds of domains to search for an algorithm that does well on some objective. Presumably, we want the objective to be something good for humanity, or sentient life in general. This step alone opens up quite the philosophical can of worms: what kind of trade-offs should we be making? Free speech versus libel and misinformation. Safety of medical research versus lives lost by delays. This is not mere armchair philosophy: the self-driving car industry has been struggling with questions about such ethical trade-offs for years. Choosing the wrong objective here would be bad, since then we’d be dealing with a very capable AI system that is actively working to achieve that objective, rather than towards something that aligns with our goals (this will be a recurring theme). It’s not obvious that there even exists a singular goal that would be (equally?) good for all of humanity and other sentient life. These are genuinely difficult philosophical questions that we don’t currently have answers to.
A question that often comes up at this point is: “Why can we not just programme the AI system to follow the Three Laws of Robotics?” These laws were created by science fiction writer (and professor of biochemistry) Isaac Asimov, arguably to show why such an idea cannot work. One problem is that the laws are too open to interpretation to actually specify robustly safe behaviour: it’s far too easy to circumvent the spirit of the laws while keeping to the letter.
Outer alignment
Returning to our example, let’s assume we somehow solve – or more likely: circumvent(?) – the philosophical issues with finding a good objective. We now know what objective we want our AI system to do well on: can we specify a loss function (objective) that achieves this? Any programmer knows that the trouble with computers is that they do what you tell them to do, not what you want them to do. If you don’t specify what you want very precisely, you don’t get what you want: you get what you (mistakenly) said. This is the outer alignment problem. The Three Laws of Robotics run into exactly this issue as well. Quoting the linked page:
Overall, outer alignment as a problem is intuitive enough to understand, i.e., is the specified loss function aligned with the intended goal of its designers? However, implementing this in practice is extremely difficult. Conveying the full “intention” behind a human request is equivalent to conveying the sum of all human values and ethics. This is difficult in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are all susceptible to Goodhart’s Law which means that we might be unable to foresee negative consequences that arise due to excessive optimization pressure on a goal that would look otherwise well specified to humans.
That last bit about Goodhart’s Law is worth going into. Goodhart’s Law states that “When a measure becomes a target, it ceases to be a good measure”. Why is this? Well, this is exactly what went wrong in the social media example above. AI engineers wanted to build a system that maximised engagement, decided on a measure for this (clickthrough-rate), and trained their AI system to optimise that measure. Clickthrough-rate was supposed to be a measure of what the engineers wanted the AI to optimise (‘engagement’), but as soon as enough optimisation pressure was applied (in the form of AI training) it stopped being a good measure, because the AI found strategies that maximised clickthrough-rate in ways the engineers did not intend.
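To see the dynamic in miniature, here is a toy sketch of Goodhart’s Law. The numbers are made up and the ‘recommender’ is just a hill-climber over two knobs, so this is purely illustrative; but it shows how hard optimisation of a proxy (clicks) can end up hurting the thing the proxy was supposed to measure (value to the user).

```python
# A toy illustration of Goodhart's Law (hypothetical numbers, not a model of
# any real recommender system). The true goal is value to the user; the
# measured proxy is clicks. Clicks track informative content, but are driven
# up even more strongly by clickbait, which actively destroys value.

def true_value(informativeness, clickbait):
    return informativeness - 2.0 * clickbait

def clicks(informativeness, clickbait):
    return informativeness + 3.0 * clickbait   # clickbait is great for clicks

# Naive hill-climbing on the proxy: keep whichever nudge increases clicks most.
informativeness, clickbait, step = 0.0, 0.0, 0.1
for _ in range(50):
    candidates = [
        (informativeness + step, clickbait),
        (informativeness, clickbait + step),
    ]
    informativeness, clickbait = max(candidates, key=lambda c: clicks(*c))

print(f"clicks:     {clicks(informativeness, clickbait):.1f}")      # high
print(f"true value: {true_value(informativeness, clickbait):.1f}")  # negative
```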
This phenomenon where an AI system learns behaviour that satisfies the literal specification of an objective without achieving the intended outcome is called specification gaming. There are many examples of this behaviour emerging in deep learning systems. Again, if we don’t somehow solve this, we’ll be dealing with a very capable AI system that is actively working against our goals.
Okay, what if we somehow solve this issue, such that our trained AI does what we want on the training data? Enter goal misgeneralisation: the AI learns behaviour that seems to follow the correct goal during training, but has actually learned some other goal that the training data could not distinguish from the one you intended. Again, there are examples of this happening with current systems. And again, we’ll be dealing with a very capable AI system that is actively working against our goals. Notice the theme?
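Here is a small sketch of the failure mode, under entirely made-up assumptions: during training, the intended feature (‘shape’) and a spurious feature (‘colour’) are perfectly correlated, so the training data cannot tell ‘classify by shape’ apart from ‘classify by colour’. When the correlation breaks at deployment, the behaviour that looked aligned falls apart.

```python
import numpy as np

# A toy sketch of goal misgeneralisation (hypothetical set-up). The training
# data cannot distinguish the intended goal (use shape) from a spurious one
# (use colour), because the two features are perfectly correlated. Gradient
# descent leans on the larger-magnitude colour feature, so when the
# correlation flips at test time, the learned "goal" turns out to be wrong.

rng = np.random.default_rng(0)

def make_data(n, colour_follows_shape):
    shape = rng.normal(size=n)
    colour = 2.0 * shape if colour_follows_shape else -2.0 * shape
    labels = (shape > 0).astype(float)        # intended goal: classify by shape
    return np.stack([shape, colour], axis=1), labels

X_train, y_train = make_data(1000, colour_follows_shape=True)
X_test, y_test = make_data(1000, colour_follows_shape=False)

w = np.zeros(2)
for _ in range(500):                          # plain logistic regression via gradient descent
    p = 1 / (1 + np.exp(-X_train @ w))
    w -= 0.1 * X_train.T @ (p - y_train) / len(y_train)

def accuracy(X, y):
    return np.mean(((X @ w) > 0) == (y > 0.5))

print("train accuracy:", accuracy(X_train, y_train))  # ~1.0: looks aligned
print("test accuracy: ", accuracy(X_test, y_test))    # ~0.0: it learned the wrong goal
```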
Inner alignment
Are we done if we solve this? Nah, there’s also inner alignment. This is quite a complex topic, for which a slightly more introductory level post can be found here. The basic idea is that optimisation can create internal processes that develop goals/objectives that are not the same as the outer goal/objective.3 The technical term for such a process is mesa-optimiser. Human goals can be viewed as an example of an inner alignment failure. Evolution has created us, humans, through an optimisation process – namely natural selection – with the goal of maximising inclusive (genetic) fitness. Do we – the mesa-optimisers that result from this optimisation process – care about inclusive fitness? I sure don’t. I didn’t even have a concept of ‘inclusive fitness’ – either intellectually or intuitively – until I learned about evolution. It’s very difficult to care about something that I don’t even have a concept for.4
Doughnuts are another alignment failure: sugar, fat, and salt were all highly-valuable nutrients in the ancestral environment, and so we evolved behaviours that sought out those nutrients (for instance, we really like how they taste). Then we created doughnuts, which did not exist in the ancestral environment. Now we collectively eat a lot of doughnuts, which is not very good for our inclusive fitness (e.g., through health issues). Here our training environment (the ancestral environment) did not generalise well to our testing environment (current-day civilisation), and so we ended up with behaviours that seemed aligned – in the sense that they improved our inclusive fitness in the training environment – but actually aren’t within the context of modern society (the ‘testing’ or ‘deployment’ environment, if you will).5
Evolution did its darndest to optimise us to maximise inclusive fitness: the doughnut example shows it failed to align our behaviours. And the fact that we don’t even have an intuitive concept for inclusive fitness means it failed to align our goals.6
Goal-directed behaviour
Okay, but do our current deep learning systems have goals? The obvious answer is that we don’t know because we don’t know very well what goes on in our AI systems. The second obvious answer is that they don’t need to develop goals themselves to pose a risk. Humanity seems very happy to provide GPT-4 with user-specified goals and internet access. Giving AI systems goals in plain English – rather than maths – is not necessarily an improvement from a robustness perspective, because it is really hard to specify exact meaning in natural language. Also, I’m sure no one will actually give an AI system a bad goal, ever.
Still, many current deep learning models aren’t trained to act as agents with goals and don’t seem to coherently pursue any goals as far as we can tell. So, how might AI systems develop goals themselves? This discussion is quite technical: it probably won’t make much sense to you unless you are familiar with deep learning already. Feel free to skip to the last paragraph of this section (right before: Existential risk).
To start, we might train our AI systems as agents, as we typically do in the field of reinforcement learning (RL). Most researchers seem to agree that this is probably not a very safe idea for building very capable systems, for all the outer alignment reasons mentioned above. We’ve even seen shadows of the inner alignment problem here, in that RL agents can learn to implement an inner RL agent in their activations. This may be an example of the mesa-optimisers that we discussed in the section on inner alignment. Even avoiding RL may not necessarily help: the in-context learning abilities of Transformers – the architecture underlying all LLMs – may be partially explained by their learning to do gradient-based training in the forward pass. This is possibly an example of a mesa-optimiser emerging from training non-agentic deep learning systems.
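For the curious, here is a tiny numerical check of the simplest version of that last claim, based on the construction in the linked work (my own simplified rendition; it says nothing about what trained LLMs actually do internally). For in-context linear regression, one step of gradient descent from zero weights gives exactly the same prediction as a single linear self-attention layer (no softmax) with identity key/query/value maps, up to the learning rate.

```python
import numpy as np

# One step of gradient descent on in-context linear regression versus a
# linear self-attention read-out. Both predict the query target as
# lr * (2/n) * sum_i y_i <x_i, x_q>, so the two are numerically identical.
# This is a simplified sketch of the published construction, not a claim
# about what any particular trained model does internally.

rng = np.random.default_rng(0)
d, n = 4, 16                         # feature dimension, number of context examples
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))          # in-context inputs
y = X @ w_true                       # in-context targets
x_q = rng.normal(size=d)             # query input
lr = 0.1

# Gradient descent: loss = (1/n) * sum_i (y_i - w.x_i)^2, one step from w = 0.
w_one_step = lr * (2 / n) * X.T @ y
pred_gd = w_one_step @ x_q

# Linear attention: weights are raw dot products <x_i, x_q>, values are y_i.
pred_attn = lr * (2 / n) * np.sum(y * (X @ x_q))

print(pred_gd, pred_attn)
assert np.allclose(pred_gd, pred_attn)  # identical up to floating point error
```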
Of course, we don’t know whether these mesa-optimisers are pursuing any coherent goals. We don’t know whether our non-agentically trained AI systems might implement agents as we scale up their capabilities. Paul Christiano, one of the top AI alignment researchers, provides a brief discussion on how LLMs might become goal-directed, here. The (speculative) possibility of emergent goal-directed behaviour is a reason why, for example, the science AI suggested by Yoshua Bengio might not be safe (for those interested in further reading: his proposal is an example of a tool AI).
So, we don’t know whether AI systems will develop coherent goals purely through training, but it seems at least possible: this is very worrying. I want to stress here that there are currently no alignment proposals that the community as a whole expects to work. It is easy to come up with stories of how things might work out, but for now, these all fail in predictable ways.7 We’ll talk more about current alignment proposals after we’ve finished making the argument for existential risk from AI.
Existential risk
Okay, back to our deep learning system that we are trying to align. We don’t know which objective to give it, how to make sure we actually optimise it for that objective if we did know, and whether this optimisation generalises to new environments – it might not because the behaviours don’t generalise well (doughnuts), or because it just learned different goals (inclusive fitness). If we don’t solve these problems, we’ll be dealing with a very capable AI system that is actively working against our goals. That is bad.
Perhaps it is obvious why instantiating AI that has different goals from humanity is likely to end human civilisation, but the actual argument requires two more things: 1) strong capabilities, and 2) instrumental goals.
Strong capabilities (‘very capable AI system’) means that we need to have actually succeeded in building an AI system that is capable of competently acting in the world so that it can achieve its goals. Different goals require different levels of competence: playing a video game is a lot simpler than manipulating people, but both are possible at some level of competence. One big assumption is that it’s possible to build general AI that is more capable than human civilisation at all the relevant tasks (which need not be all tasks). This does not seem impossible in principle: humans are generally intelligent after all, and there is no reason to suppose our intelligence is the limit of what is achievable. Recently it’s starting to appear more practically feasible as well, especially as our largest models seem to get more general as we scale them up. Still, it might be that the deep learning paradigm cannot get us to sufficiently capable AI. That would be a lucky break, because 1) it probably gives us more time to solve the alignment problem(s), and 2) deep learning is the most uninterpretable approach to AI we know.
Thinking about AI
To explore the point of ‘instrumental goals’, it’s worth stressing what the field of AI alignment means by ‘intelligence’. The definition that suffices for our purposes is: “a system is intelligent to the degree that its actions can be expected to achieve its goals”. Here intelligence is about achieving goals. The key insight is not to think of an intelligent system as some book-smart professor that spends their day working on abstract maths problems. When we talk about creating intelligent systems, we’re talking about creating smart optimisers: systems that – given some goal – change the world in such a way that that goal is achieved. People are an example of such optimisers: we have goals, and we think intelligently about how to achieve them. If the goal is difficult enough, we make plans that can span multiple years, we gather allies and resources, and we predict obstacles and try to solve those. If we want to build a road, we talk to the municipality and contractors, we buy materials, and we lobby. And if there is an anthill where we want to build the road, we bulldoze the anthill: not because we hate the ants, but because they are in the way of our goal.
Bulldozing the ants is an example of an instrumental goal. An instrumental goal is any goal that is useful to the pursuit of the actual intrinsic goal (i.e., the thing an agent actually tries to achieve). Typical instrumental goals are self-preservation (if you don’t exist, your goal is less likely to be achieved), resource accumulation (more resources means you have more options for achieving your goal), and removing adversaries (adversaries are bad for your chances of achieving your goal), which may include deceiving8 or manipulating adversaries. Instrumental convergence states that goal-directed agents will tend to pursue instrumental goals such as the three mentioned here.9
The kind of intelligence that we’re trying to build should be thought of as capable of planning, gathering resources, and removing obstacles. We should also keep in mind that it will be very different from humans: it has a different cognitive architecture, and it does not share the evolutionary history that makes us cooperative and empathetic. If such a system becomes capable enough – more capable than humans – and we haven’t programmed it to care about human values, then we’ll be dealing with a very capable AI system that is actively working against our goals/values.
Such a system may take the resources – e.g., energy, computational power, water (cooling), in the worst case even the atoms that we are made of – that we need (the instrumental goal of resource accumulation). We won’t be able to switch it off or destroy it, because that’s bad for its goals (the instrumental goal of ‘self-preservation’). We may not even know it is misaligned until it starts acting badly because it could deceive us about its true goals (the instrumental goal of ‘removing adversaries’).10 If the system is sufficiently capable, this means that we end up with a world without human values and likely without humans. Not because the AI suddenly becomes conscious or because it hates us, or anything remotely to do with human emotions. Simply because a more capable system is better at achieving its goals than we are at ours. If we’re not careful, we risk becoming the anthill.
Bad outcomes don’t require a single extremely capable AI trying to take over the world: we are already putting AI systems into all kinds of processes, and we are poised to continue doing this as AI systems develop more capabilities. We may be on a path of slowly giving away the power to steer our future to optimisation processes that we do not fully understand, that are optimising for some objective that is not perfectly aligned with human goals, and that may be hard (or impossible) to roll back. This can cause a slow-rolling catastrophe all by itself.
Even if we solve the technical alignment problem, we may still be risking a future where an AI-supported mini-elite holds global power (for instance, if they somehow take the technical solution and manage to align AI to their specific selfish values). After such a group potentially develops into a kind of global dictatorship, we may wonder what use they have for eight billion people when they control all the powerful AI systems. This very much does not mean we should race to build a superhumanly capable AI system because otherwise ‘the bad guys’ will build it first. That is extremely likely to end badly.
There are many possible scenarios in between the ones mentioned here. The point is not that any individual scenario is likely to happen: predicting the future is hard, after all. The point is that there are good arguments for expecting some chance that we end up with really bad outcomes if we don’t solve these issues, and we are currently not on track to solve these issues.
Alignment proposals
So how do we reduce the risk of becoming the metaphorical anthill? By solving the alignment problem, which is a way of saying: “making sure that AI systems work towards our intended goals”. Note that this should include the policy issues behind making sure the goals are intended to benefit everyone.
Solving the technical alignment problem is an open problem: that is, we don’t know how to solve it yet. Here is an overview of some of the best theoretical alignment proposals we have so far. Our current best practical technologies for aligning large language models are Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI (‘Reinforcement Learning from AI Feedback’). The former has been used to make GPT-4 more harmless and helpful, while the latter is used on Anthropic’s Claude. These are both potential avenues of scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. In their current form, unfortunately, they are far from sufficient, as exemplified by the many creatively applied methods for jailbreaking GPT-4. It’s currently very unclear whether methods like these will work at all for aligning more capable systems.
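To give a flavour of what RLHF involves, here is a schematic sketch of just its reward-modelling step, with made-up scalar features standing in for real language-model representations. The actual pipeline (collecting human comparisons, training a reward model, then fine-tuning the language model against it with RL) is far more involved; this only shows the preference-learning idea.

```python
import numpy as np

# A schematic sketch of reward modelling from pairwise human preferences
# (the core idea behind RLHF's reward model). Made-up feature vectors stand
# in for real model representations, and the labeller's hidden preference
# is a simple linear function. The reward model is trained so that preferred
# responses get higher reward, via a Bradley-Terry style loss.

rng = np.random.default_rng(1)
d = 5
hidden_preference = rng.normal(size=d)   # what labellers "really" value (unknown to us)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sample_comparison():
    a, b = rng.normal(size=d), rng.normal(size=d)
    # The labeller prefers whichever response scores higher on their hidden preference.
    return (a, b) if a @ hidden_preference > b @ hidden_preference else (b, a)

w = np.zeros(d)                          # reward model: r(x) = w @ x
lr = 0.05
for _ in range(2000):
    preferred, rejected = sample_comparison()
    margin = w @ preferred - w @ rejected
    grad = -(1 - sigmoid(margin)) * (preferred - rejected)  # d/dw of -log sigmoid(margin)
    w -= lr * grad

# The learned reward now roughly ranks new responses the way labellers would,
# and would be used as the optimisation target for RL fine-tuning.
a, b = rng.normal(size=d), rng.normal(size=d)
print("reward model prefers a:", bool(w @ a > w @ b))
print("labeller prefers a:    ", bool(a @ hidden_preference > b @ hidden_preference))
```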
A recurring problem in all this is that we are training our deep learning systems essentially by doing a big program search based on some optimisation objective. At the very least, this choice adds a bunch of inner alignment problems. It may be that a different paradigm is necessary to solve these problems. Stuart Russell’s lab is one of the groups working on a proposal in this direction based on assistance games / cooperative inverse RL. The idea here is to have an agent do a form of Bayesian updating over a human reward function (a way of representing human values), such that the agent is incentivised to learn about the human reward function before irreversibly acting in the real world. The research community is split on whether this can work, partially precisely because it’s not obvious how to design such an agent in the deep learning (program search) framework.
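A minimal sketch of that idea, under heavy simplifying assumptions of my own (a handful of hand-written candidate reward functions and a Boltzmann-rational human model; nothing here comes from Russell’s actual formulation): the agent keeps a posterior over what the human wants and updates it by watching the human choose, rather than assuming it already knows the objective.

```python
import numpy as np

# A toy sketch of Bayesian updating over a human reward function, in the
# spirit of assistance games / cooperative inverse RL. The candidate reward
# functions, the options, and the noisy-human model are all made up for
# illustration. The agent never assumes it knows the objective; it infers it
# from observed human choices.

rng = np.random.default_rng(7)

options = np.array([            # feature vectors of three possible outcomes
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.3]])  # hypothesised reward weights
posterior = np.ones(len(candidates)) / len(candidates)

true_reward = candidates[2]     # the human's actual preference (unknown to the agent)
beta = 5.0                      # how noisily-rational the human is assumed to be

for _ in range(50):
    # The human picks an option, approximately maximising their true reward.
    probs = np.exp(beta * options @ true_reward)
    probs /= probs.sum()
    choice = rng.choice(len(options), p=probs)

    # Bayesian update: how likely was this choice under each candidate reward?
    likelihoods = np.array([
        (np.exp(beta * options @ w) / np.exp(beta * options @ w).sum())[choice]
        for w in candidates
    ])
    posterior *= likelihoods
    posterior /= posterior.sum()

# The posterior typically concentrates on the true (third) candidate.
print("posterior over candidate rewards:", np.round(posterior, 3))
```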
All of this and more is what many researchers are now sounding the alarm on. In this 2022 survey of machine learning researchers, 48% of respondents put the probability that the long-term effects of advanced AI on humanity will be “extremely bad (e.g., human extinction)” at 10% or higher.11 Note that this survey was run before Sydney/Bing and GPT-4 were released. Since then, many more researchers have publicly expressed their worries, and the world is finally starting to take the problem more seriously.
What to do?
As is hopefully clear now, there are a lot of important open problems in AI alignment, many of which we’ll need to solve if we want to be sure that future – very capable – AI systems are robustly safe. We need much more research in this field, but unfortunately, there are currently far fewer researchers working on AI alignment than on AI capabilities.
So what can you do? Awareness about the dangers of AI is growing, but we need much more of it. Talk to your friends, family, and colleagues; write letters to your political representatives urging them to take these issues seriously; and share the articles explaining the risks. AI policy is lagging far behind AI technology, and it cannot catch up until politicians and lawmakers feel the need to change that status quo.
Interested in doing alignment research yourself? The paper The Alignment Problem from a Deep Learning Perspective (2023) gives an overview of the research field in section 5. Victoria Krakovna’s AI safety resources are a good general starting point, as is the ML Safety course. If you’re specifically interested in technical work, consider following Richard Ngo’s AI Alignment Course and have a look at the SERI-MATS programme.
If you’re more interested in policy work instead, then have a look at the AI Governance Course. Policy work is especially important right now, because 1) reducing AI risk is not just a technical problem, and 2) finding technical solutions to AI alignment will take time, which we might not have on our current course. At the very least, OpenAI, Microsoft, and Google seem to be locked in an AI arms race, which does not incentivise them to act slowly and safely. Google has called ‘Code Red’ over ChatGPT, and Satya Nadella, CEO of Microsoft, recently announced: “It’s a new day in search. The race starts today, and we’re going to move and move fast.”
The Future of Life Institute, who spearheaded the open letter to pause giant AI experiments in March, has released a report on AI policy recommendations. These are their seven recommendations:
- Mandate robust third-party auditing and certification.12
- Regulate access to computational power.
- Establish capable AI agencies at the national level.
- Establish liability for AI-caused harms.
- Introduce measures to prevent and track AI model leaks.
- Expand technical AI safety research funding.
- Develop standards for identifying and managing AI-generated content and recommendations.
My personal view is that policy work is an absolutely crucial part of the solution.13 The companies (and perhaps in the future: governments) attempting to build ever more capable AI systems need stronger incentives to treat AI alignment as the difficult and hugely important problem it is. Alignment researchers need time to make progress on the technical problems, and we may not have that time unless we put the brakes on large AI training runs.14 International cooperation on this seems necessary: at least when it comes to the existential risks it probably doesn’t matter so much who builds the AI, as whether it is built at all and whether the system is aligned. OpenAI trying and failing, Google trying and failing, Microsoft trying and failing, or even the US or Chinese governments trying and failing may look very similar in terms of the outcome for humanity.15
AI has the potential to do incredible good in the world. The obvious example of vastly improved healthcare is only the beginning of what we could achieve with robustly aligned AI systems. Unfortunately, I worry that this is not the future we are currently headed for, but the public outcry of the last few months is a hopeful sign. Let’s build on this momentum to put systems in place that lead to a positive future: the seven recommendations above are a good start.
Closing
This has been my attempt to convey the incredible importance of solving AI alignment in the form of a blog post. When I first read about this problem in 2014, I figured we’d have a long time to solve it: plenty of time to finish my master’s in physics, study a bit more to then do a PhD in machine learning, and get to work on technical alignment research afterwards. The rapid progress in deep learning over the past ten years has changed my views on this, and as a result, I’ve been trying to create more awareness by speaking and – now – blogging. It’s been surreal to see the Overton window around AI risk widen so quickly over the past few months, and I’m very happy that big names in the research community – such as Geoffrey Hinton – are speaking out. There is much more to say and far more to do(!), but this post is already much longer than I intended it to be.
Feel free to contact me with any related questions or requests.
How models acquire these capabilities is a topic of active research, but those of you who’ve read the previous blog posts may have guessed part of my own suspicion: simplicity prior + lots of data = learning rules that generalise to new tasks. Say we are ChatGPT and we want to predict the next word in sentences where a person is writing angrily. During training, we can learn good predictions (on the training data) in roughly two ways: 1) memorise the training data, or 2) learn a rule that generates the training data; e.g., learn to detect when a person’s writing is angry and then predict that angry people are more likely to use insults. We want 2), because 1) does not generalise to new data (or tasks) at all. How can we achieve this? Remember Bayes’ theorem (in log space for simplicity): log posterior = log likelihood + log prior, up to a normalising constant. The hypothesis that memorises all the data is hugely complex – in the information-theoretic sense – compared to the hypothesis that a simple rule explains the training data, and so the model will tend to the latter, assuming it has seen enough data. This is because the prior on the latter hypothesis is stronger, while the likelihood is similar (both hypotheses nearly perfectly explain the training data).16 In our angry writing example we will need to see a lot of data because the concept of ‘anger’ is not simple to specify, but as we see more and more data it does become simpler than ‘memorise all the data’. ↩︎
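A toy, made-up illustration of that trade-off (the numbers are arbitrary and only chosen to show the shape of the argument): both hypotheses fit the data, so the comparison reduces to the prior term, i.e., description length in bits, and memorisation’s cost grows with the dataset while the rule’s cost is paid once.

```python
# Toy illustration of the simplicity-prior argument. Both hypotheses explain
# the training data (roughly equal log likelihoods), so the log posterior
# comparison comes down to the prior: minus the description length in bits.
# Memorisation pays per example; the rule pays a large fixed cost once.
# All numbers below are made up.

bits_per_memorised_example = 8      # cost of storing one example verbatim
bits_for_rule = 2000                # cost of specifying the rule (e.g. "anger")

for n_examples in [10, 100, 300, 1000]:
    log_prior_memorise = -bits_per_memorised_example * n_examples
    log_prior_rule = -bits_for_rule
    winner = "rule" if log_prior_rule > log_prior_memorise else "memorise"
    print(f"{n_examples:5d} examples -> {winner} hypothesis wins")
```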
Mechanistic interpretability is hugely valuable from a safety perspective. Mechanistic interpretability essentially tries to reverse-engineer neural networks in order to better understand how they produce their outputs. If we want to make sure that our (Large Language) models are robustly aligned to human goals – and we should make sure – we’ll need far more of this kind of research. ↩︎
Note that not everyone in the alignment community agrees with the decomposition of the problem into outer and inner alignment. There’s also some discussion amongst the researchers who use the decomposition about the precise distinction between inner alignment and outer alignment, and – e.g. – where exactly goal misgeneralisation fits in. This does not change the actual practical hurdles to robustly aligning AI for our purposes, and making the distinction helps me order my thoughts. ↩︎
I have not yet explained what inclusive fitness is: this is on purpose. Chances are you’ve not heard of this concept before, or only have a vague idea of what it means. If so, that should help convey the size of the alignment failure that has occurred: the single-minded optimisation process of evolution has resulted in agents that do not have an intrinsic concept of what they’ve been optimised for, namely increasing the relative frequency of their genes in the population. ↩︎
This change in environment between training and testing settings goes by the name of distribution shift in machine learning and is a major hurdle for the robustness of AI systems. It also makes AI alignment much harder. ↩︎
Of course, we are not perfectly misaligned: we generally care about our families, and many people want to have children. Maximising inclusive fitness, though, would look very different: for one, men would be donating to sperm banks en masse. ↩︎
See A List of Lethalities for all the ways this might go wrong. Fair warning: the author self-describes this list as “a poorly organized list of individual rants”. Still, it’s the best “but what if we…”-style FAQ I’ve been able to find. ↩︎
AI systems may learn to become deceptively aligned: they may seem aligned until the opportunity arises for them to gain a decisive advantage over humanity, at which point they act. This complicates matters tremendously unless we can understand the inner workings of these systems. Here are some arguments for why training with stochastic gradient descent may favour deceptive alignment. ↩︎
See this paper for why optimal RL agents pursuing random goals will on average seek power. Note that the proof makes two assumptions that do not apply to the real world: our trained agents are not optimal, and we don’t give them random goals. Still, this is not a good sign. ↩︎
Preventing this is another reason we need to do more (mechanistic) interpretability research, as understanding the system’s internals would help with understanding what it is optimising towards. ↩︎
The survey likely has a sample bias. From their report: “We contacted approximately 4271 researchers who published at the conferences NeurIPS or ICML in 2021. These people were selected by taking all of the authors at those conferences and randomly allocating them between this survey and a survey being run by others. We then contacted those whose email addresses we could find. We found email addresses in papers published at those conferences, in other public data, and in records from our previous survey and Zhang et al 2022. We received 738 responses, some partial, for a 17% response rate.” Of those 738 respondents, only 162 answered the specific question here. ↩︎
Shortly after this blog post was published, a collaboration of ARC, Anthropic, DeepMind, OpenAI and multiple universities and safety institutes released the paper Model evaluation for extreme risks (2023). Their proposed theory of change for model evaluations for extreme risk should be extremely useful to policy makers looking to create regulation for ensuring safer deployment of highly capable AI models. ↩︎
It’s worth noting that some people in the field believe that alignment is not a solvable problem and that the only safe way forward is halting humanity’s efforts at building AGI. This would make policy the single most important factor in reducing the existential risks from AI. ↩︎
Training modern deep learning systems is much more expensive than running them, so the development of these systems is the obvious bottleneck to target. Once a model is trained it can be leaked, making it that much harder to regulate. ↩︎
It is theoretically possible to build an AI system that is aligned to a specific group of people to the exclusion of others. This is also very bad, but getting there requires solving most of the alignment problem to begin with, as in the AI-powered dictatorship scenario. I don’t think this is a particularly likely outcome, as solving alignment likely requires global coordination around strict AI policies, which would make it harder for one small group to gain power. That said, evaluating this risk is outside of my expertise. Crucially, we should work to prevent scenarios that have research groups racing to build superhumanly capable AI systems because otherwise ‘the bad guys’ will build them first. Such an arms race is extremely likely to end badly. ↩︎
David MacKay makes a similar argument about Occam’s Razor following from Bayesian model selection in chapter 28.1 of his book Information Theory, Inference, and Learning Algorithms, although his explanation focuses on the model-evidence term, which requires combining multiple of the things I’ve called hypotheses into a single model hypothesis. ↩︎