Transcript
Rob’s intro [00:00:00]
Rob Wiblin: Hi listeners, this is The 80,000 Hours Podcast, where we have unusually in-depth conversations about the world’s most pressing problems, what you can do to solve them, and our awesome new theme song. I’m Rob Wiblin, Head of Research at 80,000 Hours.
AI is once again all over the news thanks to the launch of ChatGPT from OpenAI.
ChatGPT has really taken chatbots to a new level in terms of quality and coherence of the answers it can give to almost any sort of request. It can write poetry, script TV shows, code programs, draft newspaper articles, simulate interviews. It’s not at the level of a person yet, and it has particular weaknesses, but in many situations it’s getting close. The rate of improvement over the last few years is simply staggering and users have been blown away.
If you haven’t seen it you can try it for yourself at chat.openai.com, though they’re struggling to keep up with wild levels of demand, so you might have to wait in a queue.
All that makes me very excited to be releasing this interview with Richard Ngo, who works at OpenAI and researches various issues to do with how very advanced AI can be developed, deployed, and integrated into society, while avoiding all the ways that might go very wrong.
We talk about large language models like ChatGPT, among other sorts of machine learning models — including how they work, the capabilities they’re gaining, whether or not they can really reason and understand things, and whether there’s much to worry about with them.
In the second half of the interview, we turn to Richard’s new paper from August, titled The alignment problem from a deep learning perspective. [The latest version from December 2022 is now available.]
Note that this conversation was recorded a few weeks before ChatGPT was released, so I don’t bring it up specifically.
I’m always worried these conversations about AI are going to get too technical and I’m going to struggle to follow.
But I was really happy listening back to this one that I think we both explain important ideas that most listeners won’t have heard before, while still being pretty easy to follow.
One very important announcement: we have a lovely new theme song, called “La Vita e Bella” by Jazzinuf, from their 2020 album Magic Carpet.
I’m a big fan of Jazzinuf, so I hope you enjoy it, and that one day it generates a Pavlovian response where you start salivating for in-depth conversations about the world’s most pressing problems and what you can do to solve them every time you hear it. Their new album out this year is called Sun Dance, and you can find it anywhere you get music of course.
Oh and the runner-up option was the track “Coffee and Cigarettes,” also from Jazzinuf — so go check it out and tell us if you think we made a mistake so we can feel bad about our decision.
All right, without further ado I bring you Richard Ngo.
The interview begins [00:03:12]
Rob Wiblin: Today, I’m speaking with Richard Ngo. Richard grew up in Vietnam and New Zealand before moving to the UK to study computer science and philosophy at Oxford, followed by a master’s in machine learning at Cambridge. He then became a research engineer on the artificial general intelligence safety team at DeepMind, then spent a year on a PhD at Cambridge in the philosophy of machine learning, focusing on the analogy between the development of artificial intelligence and the evolution of human intelligence. But in 2021, he dropped the PhD and moved on to research AI governance at OpenAI, the organisation responsible for GPT-3 and DALL-E 2.
Richard is a prolific writer on the risks created by rapid advances in machine learning and how to ameliorate them across published papers, the AI Alignment Forum, the Effective Altruism Forum, and of course Twitter. His current goals are to understand the key patterns of the world to help make the long-term future of humanity amazing, and hopefully to stick around to enjoy it himself.
Thanks for coming on the podcast, Richard.
Richard Ngo: Thanks so much for having me, Rob. I’m a big fan, really excited to be here.
Rob Wiblin: Fantastic. I hope to talk about recent progress in the field and some things listeners should understand about how ML models actually work. But first, what are you working on at the moment, and why do you think it’s important?
Richard Ngo: Right now I’m splitting my time in a few different ways. I’m on the governance team at OpenAI, and we’re basically trying to figure out what the governance and regulation of advanced artificial intelligence might look like. In particular, questions around how might we get cooperation on a large scale — on a national scale and international scale — to make sure that people don’t build and deploy risky AI systems. And then I also spend a bunch of my time thinking about the alignment problem — continuing work that I was doing at DeepMind, and in particular trying to understand the problem better and then explain it to people in machine learning and elsewhere, so we can build a better understanding of what we actually need to do to try and fix it.
Rob Wiblin: Looking over your CV, it seems like you’ve spent quite a bit of time doing technical ML work, but these days you are at OpenAI and your title involves more focusing on governance. Are you doing more technical or policy stuff these days? Or maybe you’re at the interface of both of these things?
Richard Ngo: Yeah, I’d say I split my time between them. I’m not doing a huge amount of coding these days, but I am thinking about alignment questions from a technical perspective. I think really the key thread going through here is I’ve been trying to figure out where the biggest gaps in the field are and how I can fill them. And so originally, it felt like that was in the technical domain, but then I realised that there was much less clarity about what the problem actually was and how we understood these high-level concepts than I expected. And so I shifted into focusing on that for a little while. And then there was some stuff in AI governance that felt not very fleshed out, maybe pre-paradigmatic, and it felt like there was an opportunity for me to contribute there. So I’m really just trying to figure out what’s the key missing step in humanity’s plan for dealing with advanced AI.
How Richard feels about recent AI progress [00:05:56]
Rob Wiblin: So one reason among a bunch of others that we’re having this conversation right now is that AI is a super hot topic at the moment. Mainly because there have been some super impressive and striking results recently with language models and image models and so on that have really gotten the public’s attention. On a personal level, how excited and/or anxious do you feel about all this rapid progress that we’ve seen, as someone who’s actually working really close to it all?
Richard Ngo: It does feel like the progress of the last couple years in particular has been very compelling and quite visceral for me, just watching what’s going on. I think partly because the demos are so striking — the images, the poetry that GPT-3 creates, things like that — and then partly because it’s just so hard to see what’s coming. People are really struggling to try and figure out what things can these language models not actually do? Like, what benchmarks can we design, what tasks can we give them that are not just going to fall in a year or maybe a year and a half? It feels like the whole field is in a state of suspense in some ways. It’s just really hard to know what’s coming. And it might just often be the things that we totally don’t expect — like AI art, for example.
Rob Wiblin: Yeah. I guess there’s two things going on here. One is just that the capabilities are ahead of where people forecast them to be. And they’re also ahead of where people forecast them to be in a kind of strange direction: that the progress is occurring on the ability to do tasks that people didn’t really anticipate would be the first things that AI would be able to do. So it’s just massively blown open our credences or our expectations about what might be possible next, because it seems like we don’t have a very good intuitive grasp of which things ML is able to improve at really rapidly and which ones it’s not.
Richard Ngo: Right. I often hear this argument that says something like, “Oh, look, AI’s not going to be dangerous at all. It can’t even load a dishwasher. It can’t even fold laundry” or something like that. Actually it turns out that a lot of really advanced and sophisticated capabilities — including some kinds of reasoning, like advanced language capabilities, and coding as well actually — can come significantly before a bunch of other capabilities that are more closely related to physical real-world tasks. So I do think there’s a pretty open question as to what the ordering of all these different tasks is, but we really just can’t rule out a bunch of pretty vital and impressive and dangerous capabilities coming even earlier than things that seem much more prosaic or much more mundane.
Rob Wiblin: Yeah. I think that this is a longstanding paradox, Moravec’s paradox: the things that we think of as being really difficult, AI finds easy, and the things that we think of as really easy, AI often finds really, really difficult. Which I suppose means that it could be that loading the dishwasher is the last thing that falls into place for AI to be able to compete with our human staff in many different domains.
In this conversation, we’re going to be focusing quite a lot on the downside risks — because if we can avoid all the downside risks, then hopefully eventually we’ll get to harvest all of the gains from AI at some point before too long anyway. But let’s take a moment to appreciate all the great upsides. If things go really well, is there a benefit that you are really looking forward to personally from all these scientific advances?
Richard Ngo: One thing that I’m pretty excited about is it feels like a lot of fields of science have hit a point where it’s just really hard for humans to understand what’s actually going on, on an intuitive level. Protein folding, for example: humans just can’t visualise these types of things. Or some domains of mathematics, where it’s gotten so abstract and it’s at such a high level of complexity that it really feels like we’re approaching the limits of what human minds can actually do unaided.
So I’m just pretty excited about having systems that can hold much more in their heads, that can really think about advanced mathematics, for example, in the same way that humans think about shapes and objects around them — in this just very intuitive sense. The types of progress that we can get from that feel pretty exciting: just a broader understanding of more complex systems — whether that’s bodies or economies or minds or societies — than we currently have.
Rob Wiblin: Yeah. I suppose we don’t know what might result from that. It could be that there are all kinds of amazing and perhaps obvious-in-retrospect scientific advances that are just strangely blocked to the human brain because of its architecture — because of the particular kinds of spatial reasoning that we’re good at, and the kinds of reasoning that we’re bad at, and how small our short-term memory is, and things like that. And I suppose this is also a slightly concerning thing, that we don’t know what advances might come quite rapidly once ML models can do that conceptual reasoning and do that science for themselves.
Richard Ngo: Yep. That seems exactly right.
Rob Wiblin: But on the other hand, it could be incredible.
Regulation of AI [00:10:50]
Rob Wiblin: You mentioned just a minute ago that you are working on ways that governments could regulate the training of really big models. Why would that be desirable?
Richard Ngo: Broadly speaking, we don’t want a situation in which different actors, whether they’re companies or countries, are trying to cut corners in developing increasingly advanced AI systems. And so it feels like we really want some kinds of cooperation here in order to make sure that everyone’s sticking to the same standards, and sharing best practices, and keeping each other informed about the types of examples of dangerous capabilities that they’ve seen or maybe anything that’s particularly concerning, and having the ability to slow down when that’s actually necessary.
I think a lot of ideas along these lines have been floating around in this space for a while, but the type of work that my team is trying to do is focusing a little more on making that concrete. Suppose we had an agreement between different countries to apply some kinds of standards to AI development: what sort of enforcement mechanisms could we have, what sort of metrics could we be looking at, and things like that. And a lot of that work actually comes down to a bunch of technical details: what types of properties of large models or training setups can you automatically verify, or what types of secure verification can you do on the physical chip hardware. These are some of the questions that we’re starting to look into now, which I’m pretty excited about more people looking into.
Rob Wiblin: I suppose it’s maybe an unfortunate analogy, but nuclear weapons are an example where you’ve wanted to have all of these governance arrangements. Nuclear technology has large potential upsides and also large potential downsides, and so you want to have some rules about who can access it and on what terms. The AI case is quite a bit more challenging in a way, because computer chips are just used in so many different things and their cost is coming down in a way that nuclear technology’s is not. I suppose the peaceful uses of nuclear technology we’ve found tend to occur in quite small, particular isolated places. They’re just not that numerous; it’s easier to tell what’s going on. Do you actually think it’s going to be very practical to limit people’s access to compute in the medium term?
Richard Ngo: I think if you are focusing on the largest training runs that are done by the biggest ML research labs or the biggest projects, then there’s some reasons to be optimistic here. Especially when we’re looking at the nuclear analogy, I think things have probably just gone significantly better than one might have hoped. There were a lot of people who were despairing pretty heavily a few decades ago — in the ’60s, ’70s — about the future of nuclear regulation and treaties and so on. But it seems like we’ve somehow muddled our way through. And so it may be the case that even though there’s no silver bullets for this, we can just get a bunch of agreements on the table and muddle our way through in a roughly analogous way to how we’ve done in the nuclear case.
Rob Wiblin: Yeah, inasmuch as training the most impressive models that we might be most concerned about just involves enormous amounts of compute, maybe that has a pretty high visibility — the cost is pretty large; it’s kind of hard to do it under the radar. So it does look a bit more analogous to developing either nuclear weapons or peaceful nuclear power. It has quite a big footprint.
Richard Ngo: Yeah, that’s right. The sort of scales we’re talking about are already pretty large for the biggest models — millions of dollars of compute spent training them — and probably going to become significantly bigger over time. The other factors are, of course, that there’s a bunch of progress in terms of more efficient training algorithms and compute advances that are going to lower that threshold for how easily you can build a model with a given level of capabilities over time. So it feels like there are a bunch of open questions here. But at least for the sort of baseline efforts to make sure that people don’t use very large amounts of compute specifically doing dangerous things, it feels like there’s some hope there.
Why we should care about AI at all [00:15:00]
Rob Wiblin: OK, before we dive too deep into specific solutions or anything like that, I think it’d be good to get on the table a bit of a basic grounding of what is the problem that we’re worried about here, and why would we think that it’s potentially a really important global priority that the kinds of people who listen to this show should care about?
You’ve written that when you started getting involved in the AI safety community, you found the thinking to be quite a bit more muddled, a bit less clear than what you had expected going in. I think that that, among other reasons, has prompted you to spend quite a bit of time writing up the underlying issues and trying to present the core ideas in a crisper and hopefully more accurate way — including in that AGI Safety from First Principles report, the AGI Safety Fundamentals course, and this more recent one that I think you’re still working on called Alignment 201. Is that right?
Richard Ngo: That’s right.
Rob Wiblin: Cool. Let’s go through some of these fundamental reasons why we should care about this whole topic in the first place. In my experience, everyone has a slightly different way that they think about the challenge we face in safely developing, and eventually at some point deploying, advanced AI systems. How would you personally describe the overarching issue?
Richard Ngo: So in the classic paradigms of machine learning, you’re training a system on some amount of data on a task, and you get an output which is a model that can perform that task. Maybe that task is playing StarCraft or maybe that task is playing Go or recognising images or things like that.
And over time, we are seeing increasingly general systems that can perform tasks that are less and less related to the specific training data they got. So GPT-2 and GPT-3 were big steps towards more general systems that can transfer experience from seeing a lot of text on the internet into performing novel tasks. If you prompt them with a task that you just made up, they can often perform it after being given a couple of examples.
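[To make “a couple of examples” concrete, here is a minimal, invented illustration of few-shot prompting. The task, the example pairs, and the expected continuation are made up for this transcript rather than being output from any particular model, but the pattern is what Richard is describing: the model infers the task from the prompt alone, with no task-specific training.]

# A hypothetical few-shot prompt: two worked examples, then a new case
# for the model to complete.
prompt = """Translate each phrase into French.

English: good morning
French: bonjour

English: thank you very much
French: merci beaucoup

English: see you tomorrow
French:"""

# A capable language model will typically continue with something like
# "a demain"; it has picked up the task purely from the pattern above.
print(prompt)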
So this core premise of generality — that eventually we’re going to have systems that can transfer from performing well on some tasks that they’ve been trained on to performing well on pretty novel tasks — that’s the first key idea.
Then the question is: how are they going to do that? What’s the mechanism by which you can take a system that hasn’t been trained on a task (like running a company, being a CEO), or has only been trained a very small amount on it, and have it actually start to perform very well? What seems very plausible to me is that the way they’re going to do it is by reasoning about the world, having these sophisticated models of the world, and trying to figure out what actions are going to lead to which outcomes.
So that’s the second step: just reasoning about outcomes in the world.
And then the third key idea is that when a system is reasoning about outcomes in the world, the more powerful that reasoning is, the more you want to make sure that the outcomes it’s aimed towards are just very specifically what we want.
If you have a system that’s mostly just copying the things that it’s already seen or mostly just mimicking the data that it’s been given, then we’re not so fussed about specifically how it’s doing that, because it’s going to stay roughly within the bounds of what we expect. But if you’ve got a system that’s coming up with totally new strategies, that’s reasoning its way through, “I want to achieve this long-term outcome, so I can take these intermediate steps” — like accumulate some money, use that money to pay some people to do this task, and just generating pretty novel and unusual plans for performing those tasks — then you want to make sure that the outcome that it’s actually aiming for is very specifically what we want. Otherwise, those plans might include things like deception, manipulating humans, harming humans — things like that.
Rob Wiblin: I guess the basic idea is that it would be very natural that these models — as they become more and more sophisticated, and we try to find ways to make them capable of doing more and more valuable things — would start to have this very general reasoning and planning capacity at the level of humans. And potentially beyond that, more at the level of what whole organisations are capable of, or perhaps beyond the planning and strategising capability that we presently have at all. At the point where a system like this is capable of thinking of ways of achieving its goals that are quite different than what we might have expected, or quite different than what we might have wanted, then it becomes quite important that we have specified the goal correctly so that it doesn’t end up using means that are very different from the ones we had intended it to use. Is that one way of putting it?
Richard Ngo: Yeah, that’s right. And then I think how big of a problem you think this is depends in large part on how far you think these systems are going to get. How intelligent are they going to be? How far are they going to be able to generalise? What types of tasks are they going to be able to perform to what level? And I don’t see any particular boundaries on how far these things can go. If we are building systems that consist of neural networks that are much larger than the human brain, that are using much more compute, that are deployed on very large scales on much more data than any human has ever seen, then it just doesn’t really seem like we can be confident at all that the returns to being more intelligent and coming up with better plans tail off at any given point.
Rob Wiblin: I suppose the AI that people have been most likely to play with and take an interest in this year are these image models and these text models, where you throw in some text and then you get an image or some text back. But there are systems that are more like this planning and strategising thing, right? Of course, there’s the famous ones that play chess or Go and so on, but perhaps a bit more like real-life planning are the ones that play computer games like StarCraft II or Dota 2, that involve potentially quite long-term planning across a very broad space of possible approaches that one could take to playing this game. And they’re getting very good at that style of strategy and planning, which is a step in the direction of thinking about how you might operate in the real world.
Richard Ngo: I’d actually say that I feel pretty uncertain about whether systems that play esports, like Dota or StarCraft, are actually doing planning in a meaningful sense. Because they’ve been trained on so much data, it seems pretty plausible to me that they’ve just learned a bunch of heuristics and the sort of high-level pattern recognition that says, “When the enemy is coming towards my base, go left to avoid them” or something like that. I think this is something that people should really look into and try to figure out to what extent are these systems actually internally making plans, inside the neural network, as they process information. But I don’t really have a strong opinion on whether that’s happening now or not.
Some examples that are a bit clearer: I think the Go and chess engines that were built by DeepMind, like AlphaZero, clearly do some type of planning, but in a very narrow domain. So it’s pretty hard to say whether the type of planning that they do specifically — which is like searching through a whole bunch of possible moves in order to find desirable board states — is going to actually transfer to more sophisticated systems.
I also think that language models are actually a pretty good example to look at right now, because if you start to ask language models how to perform some high-level task — like if you ask GPT-3 to give you a plan to become really good at a certain skill or to achieve a certain outcome — it often gives you plans that are pretty sensible. Then I think there is a clear difference between being able to generate a plan versus being able to act on that plan. But it still seems like being able to generate the plan is much closer to being able to act on the plan than we might have expected a few years ago.
I personally was pretty surprised when I started asking GPT-3 to generate plans for things like, “If you were an AI in this situation, what would you do?” and then it could just come up with a whole bunch of possible strategies that an AI might take in order to achieve certain goals. I don’t see what the barrier is that prevents them from jumping from “Here is what an AI should do in a given situation” to — once we start training it in more real-world contexts; once we take versions of these language models and give them access to computers or things like that — them actually being able to carry out plans over longer and longer time horizons. It does seem like a gradual process, but we seem surprisingly far along the route of systems that are actually able to generate compelling plans.
Rob Wiblin: Yeah. I didn’t know that.
Key arguments for why this matters [00:23:27]
Rob Wiblin: At a high level, what’s one of the most important arguments that makes you think that there’s a real worry here? That there’s an issue that has to be addressed rather than this being the kind of thing that is just going to be solved in the natural course of events? Or that perhaps we’re just misunderstanding the situation and there’s really nothing to be worried about in the first place?
Richard Ngo: I think one argument that feels pretty compelling to me is just that we really have no idea what’s going on inside the systems that we’re training. So you can get a system that will write you dozens of lines of code, that implements a certain function, that leads to certain outcomes on the screen, and we have no way of knowing what it’s doing internally that leads it to produce that output. Why did it choose to implement the function this way instead of that way? Why did it decide to actually follow our instructions as opposed to doing something quite different? We just don’t know mechanistically what’s happening there.
So it feels to me like if we were on course to solving this in the normal run of things, then we would have a better understanding of what’s going on inside our systems. But as it is, without that core ability, it feels hard to rule out or to be confident that we are going to be able to address these things as they come up, because as these systems get more intelligent, anything could be going on.
And there has been some progress towards this. But it feels still very far away, or the progress on this is not clearly advancing faster than the capabilities are advancing.
Rob Wiblin: I suppose this is the first really complicated machine that we’ve ever produced where we don’t know how it works. We know how it learns, but we don’t know what that learning leads it to do internally with the information, or at least we don’t know it very well.
Richard Ngo: Right. In some sense, raising a child is like this. But we have many more guarantees and much more information about what children look like, how they learn, and what sort of inbuilt biases they have, such that they’re going to mostly grow up to be moral, law-abiding people. So maybe a better analogy is raising an alien, and just not having any idea how it’s thinking or when it’s trying to reason about your reactions to it or anything like that.
And right now, I don’t think we’re seeing very clear examples of deception or manipulation or models that are aware of the context of their behaviour. But again, this seems like something where there doesn’t seem to be any clear barrier standing between us and building systems that have those properties.
Rob Wiblin: Yeah. Even with humans, not all of them turn out to be quite that benevolent.
Richard Ngo: Absolutely.
Rob Wiblin: What’s another important reason you think advances in AI may not necessarily go super well without a conscious effort to reduce the risks?
Richard Ngo: I think that a lot of other problems that we’ve faced as a species have been on human timeframes, so you just have a relatively long time to react and a relatively long time to build consensus. And even if you have a few smaller incidents, then things don’t accelerate out of control.
I think the closest thing we’ve seen to real exponential progress that people have needed to wrap their heads around on a societal level has been COVID, where people just had a lot of difficulty grasping how rapidly the virus could ramp up and how rapidly people needed to respond in order to have meaningful precautions.
And in AI, it feels like it’s not just one system that’s developing exponentially: you’ve got this whole underlying trend of things getting more and more powerful. So we should expect that people are just going to underestimate what’s happening, and the scale and scope of what’s happening, consistently — just because our brains are not built for visualising the actual effects of fast technological progress or anything near exponential growth in terms of the effects on the world.
Rob Wiblin: For me, maybe the thing that makes me worry the most is the analogy to humans, where humans got to be somewhat smarter — or at least more generally capable — than the species around them. Then over time, by accumulating knowledge between generations and increasing the number of brains and the number of humans that were out there, we kind of took over the show and now get to dominate what happens.
And then you think in future you could have AIs that similarly become more generally capable than humans are and potentially can operate faster. Then of course they can become extremely numerous, because we could just produce more chips on which they can run. And so by analogy, one might think that perhaps most of the decision making and most of the influence over the future will be passed over to these potentially even more numerous and even more capable beings, so we’d better make sure that we’ve raised them right. Does that argument stand out in your head as well?
Richard Ngo: Yeah, I think this is a pretty strong argument for taking the problem seriously. Just on a very basic level, this is a thing that could happen, and could happen on relatively fast time scales in the same way that humans ended up basically in charge of the world on time scales that were very rapid in evolutionary terms.
I think I’ve placed a little bit less weight on this argument than I used to, just because it feels like now that we have more details of understanding deep learning, neural networks, these types of systems that we’re actually building, we can start to zoom in a bit more. I think this argument is most compelling when you have relatively little idea of the types of AIs that we’re going to build. And the more detail we have there, the more we can start to make more specific arguments.
So I started doing my PhD on this broad analogy of thinking about the link between human intelligence and artificial intelligence. And then one of the reasons that I moved away from this was because I thought that we could just zoom in more and get more confidence by actually thinking about the cutting-edge systems that we have today.
Rob Wiblin: Yeah, that makes sense. How much has it influenced you, the fact that there’s this kind of debate back and forth, where people will try to suggest some sort of instruction that you might be able to give to ML systems? They say, “Well, if we gave them this kind of instruction, then it seems like it would be safe.” And then people try to think, “Well, how could that be perversely interpreted? And how could that lead to undesired behaviours?” And then they’ll come back with some objection, saying, “No, this permits, or it even might encourage, power-seeking or deceptive behaviour.” And then people will try to patch it and then people will object again that that still doesn’t work.
The fact that people haven’t been able to propose an objective that you could give ML systems that seems robustly safe troubles me. Does that also trouble you?
Richard Ngo: I think the thing that particularly worries me here is that it feels like people have pretty different standards for what counts as evidence about a solution to the problem compared with what counts as evidence for the problem existing in the first place.
There are a lot of people who say, “Well, this is quite speculative. We can’t really know anything about future systems, so we may as well just wait and see.” But then often, those people will also say, “And look, here’s a potential solution that could work,” and just throw out a few haphazard ideas. If we’re really going to take seriously uncertainty about the future — which I think we should; it’s just very hard to predict this stuff — then I just have never seen the sort of proposal that gets anywhere near the level of rigour and carefulness that it should motivate us to say, “Well, we should be happy with this. Let’s not actually work on the problem anymore. Let’s focus on other things.” It seems like we’re just absolutely nowhere near having solutions that we can be sufficiently confident in to kind of dismiss the problem.
Rob Wiblin: Yeah. Is there any other common sceptical response that you get to these concerns that you think is misguided?
Richard Ngo: A lot of people talk about anthropomorphism, as in it seems like people who are worried about these problems are expecting AIs to be more similar to humans than they actually will be. I think this is an important and valuable intuition to have. Anthropomorphism is very common, and we should definitely watch out for it.
I think the idea of systems pursuing goals, and trying to reason about how to achieve those goals, though, is just fundamentally not an anthropomorphic concept — it’s just a claim about how intelligent systems can achieve outcomes in the world. And so, as long as we’re assuming that we’re going to have systems that can achieve these long-term, large-scale outcomes in the world — that can actually run companies, for example, or actually make decisions about what sort of plans they want to implement — then I have a hard time seeing what other strategies and mechanisms they could use apart from some kind of planning and reasoning about their models of the world. It is important to watch out for anthropomorphism, but in this particular case, it feels hard to say what the alternative is.
Rob Wiblin: Yeah. I suppose there’s some behaviours that you might think are extremely fundamental, and yes, ML systems in future might share them with humans — but that’s just because that’s the way you get things done. For example, trying to anticipate how things would be different if you behave differently, that’s something humans do. It’s something that any agent that’s trying to influence the future course of events is probably going to do as well.
But I’m surprised that people make that argument, because in my mind, the thing I’m more worried about is that the AIs are going to operate in an extremely inhuman way. If they were going to behave like people then I would be less concerned, because I know how people behave, and there’s only so bad it gets and there’s only so strange it gets. But the problem is that because these are alien minds in a way, really all bets are off. It could behave in extremely unpredictable and strange ways that we’d never expect another person to.
Richard Ngo: Right. I think people might say that we don’t even need to think of these as alien minds. We can think of them as just software or something: it does what we try to build it to do because that’s what software does. And we’re going to engineer it in the same way as we do current software, which is a lot of checks, a lot of tests and things like that.
And I think fundamentally, that’s not taking seriously enough the idea of generalisation and transfer to new tasks. The idea that you can actually have systems that are not just doing one specific thing that you designed them to do, but actually can think about the world as a whole, understand the world as a whole, and can apply that knowledge in many different domains. So I think that’s the sort of core premise that you need to take seriously in order to switch from thinking of AI as just normal products that we’re building to actually other minds, if you will, and other minds that may be trying to achieve things.
What OpenAI is doing and why [00:34:40]
Rob Wiblin: OK, we’ll come back to discussing the nature of the problem later on. We’re actually going to talk about a paper you wrote, called “The alignment problem from a deep learning perspective,” which talks about some of these issues in a slightly more technical and precise way.
But for now, let’s come back and focus on getting up to speed on what’s been going on in the AI world over the last six months or year. As we mentioned in the intro, you work at OpenAI these days. A lot of listeners will have heard of OpenAI because it’s made a bit of a splash with its various language models, and this year its image model DALL-E 2. For those who aren’t familiar though, can you give us a refresher on what OpenAI is trying to accomplish?
Richard Ngo: Broadly speaking, OpenAI’s goal as an organisation is to make sure that the development and deployment of advanced AI systems goes well. And what specifically that looks like has changed a bit over time, but I think some of the key things that people at OpenAI are thinking about are how to make sure that we build systems that are aligned with human preferences, how to ensure that those systems are governed well, and then make sure that the benefits are distributed widely to people across the world.
What that looks like on a day-to-day basis is a bunch of people just doing research on cutting-edge machine learning systems — in particular from a very empirically focused perspective. So I think OpenAI, as compared with most other machine learning labs, is just much more focused on actually building systems and then seeing what happens, rather than doing much more abstract theoretical work.
Rob Wiblin: OK, so it’s a bit more of a learning-by-doing mentality, is that it?
Richard Ngo: Right.
Rob Wiblin: What are the different strands of safety-related work that people do, having built these models? Are there different kinds of schools of thought or different approaches that people are adopting?
Richard Ngo: The main thing that the people on the alignment team at OpenAI are doing is applying reinforcement learning from human feedback — that is, getting a bunch of humans to evaluate the behaviour of our models and score their outputs, and then feeding that back into the system to try and make sure they behave in more desirable ways. And that’s led to a few different products and releases, like the InstructGPT models, which are fine-tuned to be more obedient than the original base models.
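[What follows is a deliberately toy sketch of the reinforcement learning from human feedback loop Richard describes, just to show the three moving parts: collecting human comparisons, fitting a reward model to them, and then nudging the policy towards outputs that reward model scores highly. The numbers, the one-parameter “reward model,” and the hill-climbing “fine-tuning” step are illustrative stand-ins rather than OpenAI’s actual implementation; real systems fine-tune a language model with algorithms like PPO.]

import math
import random

# Toy "policy": produces a number in [0, 10]; read higher as "more helpful".
policy_mean = 2.0

def generate():
    return max(0.0, min(10.0, random.gauss(policy_mean, 2.0)))

# Stand-in for a human labeller who tends to prefer more helpful outputs.
def human_prefers_a(a, b):
    return random.random() < 1 / (1 + math.exp(-(a - b)))

# Step 1: collect pairwise comparisons of model outputs from the "human".
comparisons = [(generate(), generate()) for _ in range(500)]
labels = [human_prefers_a(a, b) for a, b in comparisons]

# Step 2: fit a one-parameter reward model r(x) = w * x to those preferences
# (a Bradley-Terry model trained by gradient ascent on the comparison data).
w = 0.0
for _ in range(200):
    for (a, b), a_won in zip(comparisons, labels):
        p_a = 1 / (1 + math.exp(-w * (a - b)))
        w += 0.01 * ((1.0 if a_won else 0.0) - p_a) * (a - b)

# Step 3: shift the policy towards outputs the learned reward model scores
# highly (here by crude hill-climbing rather than PPO).
for _ in range(100):
    samples = [generate() for _ in range(20)]
    best = max(samples, key=lambda x: w * x)
    policy_mean += 0.05 * (best - policy_mean)

print(f"learned reward weight: {w:.2f}; policy mean after feedback: {policy_mean:.1f}")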
Then people are working on trying to leverage this to perform tasks that humans have a lot of difficulty supervising. So for example, an AI system that is being used to summarise a whole book might be very difficult for a human to evaluate — because it’s just hard to read the whole book, it takes a long time to then check whether the summary’s correct. So some of the types of work that people at OpenAI are doing involves breaking tasks down so that they can be more easily evaluated by humans. I think this is going to become increasingly important as the tasks that AIs perform become more and more complex.
And then after the systems have been built, there’s a whole branch of the organisation that focuses on what to do with them. This is the broader policy research team, of which the governance team is a part. The focuses of the policy research team include things like whether we should release these models to the public at all, and if so, how. OpenAI has focused on releasing models via an API that allows us to monitor what people are doing with them, and figure out how the capabilities of those models are actually advancing over time and what sort of things we want to be on the lookout for.
Rob Wiblin: OK. It’s very applied stuff, where you’ve got particular models and you’re trying to improve them with feedback, looking at the results and seeing where they’re going wrong and trying to improve them. And then on the delivery side, you’ve got to think about all these really practical issues about which models are safe to put out, which ones aren’t, and then whether you can release them in a way that doesn’t fully release them — that allows people to use them, but within particular reasonable bounds.
Richard Ngo: Right.
Rob Wiblin: Why did you decide to work at OpenAI rather than somewhere else? There’s been quite a proliferation of different groups trying to make AI safe lately.
Richard Ngo: Yeah. I chose to work at OpenAI because it felt like a bunch of the people there were pretty excited about the work I wanted to do, which is this high-level thinking about how we can affect the development and deployment of AI — including governance and including high-level alignment thinking. It felt like there was buy-in from people at OpenAI to have more people understanding these and then taking actions accordingly.
Partly that’s because OpenAI is a pretty non-credentialist organisation. They encouraged me to drop out of my PhD because they thought it was just more important for me to start doing the work immediately. And in hindsight, I think that makes a lot of sense. It’s increasingly common that larger research organisations are less focused on people having PhDs, and I think OpenAI has been leading the trend in that regard.
So those are a couple of the key reasons. And of course, there are just a bunch of great people there who I’m very excited to work with. In particular, the team I’m currently on, which is headed by Jade Leung and has a few other great people who I work with every day.
Rob Wiblin: Yeah. So one framing of these issues is that we have something of a race between improvements in what AI is capable of doing — in its ability to learn quickly and efficiently from data, and its ability to generalise and plan and all that stuff that we’ve been talking about earlier — and then our understanding of how we can organise those systems so that they act in ways that are predictable and safe and aligned with our intentions and so on.
I guess a worry might be that all of the research that OpenAI is doing in AI is really quite impressive, and maybe it’s advancing the first of those things — it’s advancing the generalisability and it’s advancing the efficiency and so on — in a way that leaves less time for the second side of things that you are potentially working on. How does OpenAI think about that issue? I guess sometimes it’s called differential technological development.
Richard Ngo: So I already mentioned that OpenAI is a very empirically minded organisation, that’s very focused on seeing what evidence we can collect from the cutting-edge models and then updating accordingly. I think that OpenAI’s position on this has developed over time as we get increasing amounts of evidence that actually we’re pretty uncertain what’s going on with these models and there’s increasing reasons for caution.
So yeah, this is something that people are pretty actively thinking about, just how to ensure that we don’t get into worrying races or cases where people are cutting corners. And overall, I think that in terms of investing in alignment in particular and also governance, that’s really a case where we want as many great people as we can get. So that’s one of the key focuses: bringing in people who can in fact help figure out what the best strategies are. And that’s a significant part of what my job is.
Rob Wiblin: Would you personally like to see ML research as a field progress AI capabilities and get closer to potentially general AI? Would you like that to happen sooner and faster, or slower?
Richard Ngo: I think the line between alignment research and capabilities research is in fact a little thinner and blurrier than people used to think. So there’s a lot of work on interpretability, for example, that could in theory eventually be used to design better architectures or more sophisticated training algorithms or things like that. In general, I’m mostly interested in the field of ML being more of a scientific discipline rather than just sort of Facebook’s motto, “Move fast and break things.” I think these are two different approaches to thinking about machine learning, and I’d prefer we have this careful, scientific-style understanding of what we’re trying to do, as opposed to moving fast and breaking things.
Rob Wiblin: I see. So it’s not just about let’s just make new stuff that kind of seems to work. You more want to have a mentality that we want to deeply understand what is going on here. We want to actually understand the phenomenon.
Richard Ngo: Right.
Rob Wiblin: OpenAI has its high-level strategy or high-level description of what it sees as a potential path to making AI become both more powerful, but have it go well. If that strategy doesn’t work out, what do you think is the most likely reason for it to fail or go awry?
Richard Ngo: Probably the underlying bet that OpenAI is often making, which is that an empirical understanding of the biggest systems is the best way to make progress. We could just be in a world where that’s not actually the case, and we need more foundational theoretical research, or more time spent trying to understand the models we have, for example.
I tend to think of OpenAI’s alignment research as part of a wider portfolio of bets that the alignment community is making — where OpenAI is focused particularly on the case where a bunch of very empirical work is going to be the best bet, and then other groups are focused a bit more on more theoretical work and trying to do more prediction in advance of what these sophisticated systems are going to look like, rather than focusing on the things which are currently at the cutting edge. That’s just one way in which different people are making different bets, and I feel uncertain about which category of bets is going to end up being the right way to go.
Rob Wiblin: What’s an argument that stands out to you on the question of, should you as an individual work on the empirical side of things or perhaps more on the theoretical or conceptual side of things? Or is that just an incredibly difficult one?
Richard Ngo: I tend to think that people should go with their comparative advantage a lot of the time. In particular: what interests them the most, what they really enjoy doing, what can they imagine getting really obsessed with? So by default, starting off by actually just staring at these systems a lot, getting as hands-on as possible, just seems like a robustly good strategy. And then from there you start noticing things like, “There’s this interesting phenomenon that I want to try and understand on a deeper level,” or “I should branch out into doing a bit less building these systems and a little bit more of a different type of research.”
That’s in some ways the strategy that I did myself, where I started off as an engineer and then realised that there were a whole bunch of higher-level questions that I wanted to have answered and that nobody else was really trying to answer. So I recommend that for most people.
AIs with the same total computation ability as a human brain [00:45:25]
Rob Wiblin: Yeah. OK, pushing on from OpenAI specifically, I was hoping we could go through some background empirical information that I think I understand — maybe you’ll be able to correct some of it — but that I think informs some of my worries and some of my predictions about how things might go.
The first one is: do we have a rough idea of when it will be affordable to run AIs with the same total computational ability as a human brain? And is that the right way to think about it?
Richard Ngo: I think the thing that’s worth distinguishing here is between running an AI that uses the same amount of compute as a human brain versus training that AI in the first place. When we’re talking about these large neural networks, much more compute is spent on training the system in the first place than in actually running it.
According to a report by Joe Carlsmith from Open Philanthropy, we actually already do have enough compute to run the equivalent of a human brain. It’s just that in order to get a neural network that’s the size of a human brain to do useful things, you need to do a bunch more training, which is going to take much more compute.
And that feels like a much more difficult thing to estimate, because how are we going to train the system? On what data? These are all uncertain variables, but the best estimate we have for that is by Ajeya Cotra, also from Open Phil. And she thinks that it’s plausible within this decade, and then likely within the decades after that, that we’ll be able to train systems that are as large as, or do as much computation as, the human brain.
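[For a rough sense of scale, the back-of-the-envelope sketch below uses Joe Carlsmith’s central estimate of around 10^15 floating point operations per second for brain-equivalent computation, and roughly 3 × 10^14 FLOP/s for a single modern datacentre GPU. Both are ballpark figures that could easily be off by an order of magnitude, so treat this as an illustration of why running a brain-scale system already looks affordable, not a precise claim.]

# Rough, illustrative figures only.
brain_flops_per_second = 1e15   # Carlsmith-style central estimate for the brain
gpu_flops_per_second = 3e14     # roughly one modern datacentre GPU

gpus_to_match_brain = brain_flops_per_second / gpu_flops_per_second
print(f"GPUs needed to match one brain, on these assumptions: ~{gpus_to_match_brain:.0f}")
# On these numbers, a handful of GPUs suffices to run brain-scale computation;
# the expensive part, as Richard says, is the training.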
Rob Wiblin: OK, so there’s two quite different stages. There’s the training stage, which involves a huge amount of computation, and then, having trained the model, applying it, which involves a whole lot less. I suppose you could think, at the training stage: Is the training process using as much compute as a human brain has available? That’s one question. You could also think, well, a human brain is trained over a very long period of time, so maybe we should ask the question: During the training process, are as many computations done as a human does over their entire lifetime? Would that be a more sensible question to ask? Can you de-confuse me?
Richard Ngo: I think all of these things are sensible things to ask, and it’s pretty unclear which of them is the most relevant. In the original report by Ajeya, she used a bunch of these different possibilities as anchors for the estimates. She estimated how much compute would we need if it turns out the closest analogy is to the amount of compute used by a human brain over a single lifetime versus a bunch of different other possibilities.
It’s hard to say what the most analogous setup is. I think because evolution did a lot of work in getting human brains to the point where we can learn really efficiently, it feels like you are very likely going to need much more experience and much more training than an individual human does throughout their lifetime. But whether that’s a couple of orders of magnitude or many orders of magnitude, it’s really hard to know. Probably it starts off being many orders of magnitude, and then as algorithms get better it starts dropping, until you get to pretty comparable efficiency as the human brain. That would be my best guess.
Rob Wiblin: OK. So one answer would be that we already do have computers that could do as much compute as the human brain. And then another answer, depending on how you look at it, you could say, maybe in this decade — or if not, at least in decades to come if current trends persist.
So there’s this big ratio between the amount of computation required to train a model and how much is required to apply it. Which means that by the time you have enough computation around to train a system that looks at all like a human being, then with the compute that you used for that, you’d be able to run many, many, many instantiations of that system. Do you have a sense of how many people you could run by the time you’ve trained one of them?
Richard Ngo: This is a slightly tricky question to answer, because it depends on exactly how your training setup works. If you use a very small amount of compute for a very long time, that’s going to be different than if you used a lot of compute for a short time when training it.
I think a reasonable ballpark figure is that you can run thousands of copies of a given model using the same amount of compute that you used to train it.
But probably a better way of thinking about it is just in terms of costs. So for training a cutting-edge language model, numbers around $1 million, $10 million seem pretty reasonable for how much it costs to actually run all of that training. And then the cost of using that model to actually do a task — like writing an essay or solving a problem for you — is more like one cent.
So that’s a pretty massive disparity between what it takes to actually train the model versus what it takes to run it. And in some cases, the disparity can be smaller. So in systems like AlphaZero, for example, you can just use as much computation as you want when running it because that’s an architecture that’s actually very scalable at runtime — you can just keep using more and more of it to search through different possible games. But for our current cutting-edge systems, like these large language models, it’s a fairly fixed ratio: something like $1 million or $10 million for training to a couple of cents for doing standard tasks.
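[The disparity Richard describes can be roughed out with standard rule-of-thumb approximations: training a dense model costs about 6 × parameters × training tokens floating point operations, and generating one token costs about 2 × parameters. The GPT-3-scale figures below (175 billion parameters, roughly 300 billion training tokens) are public ballpark numbers; everything here is an illustration of the ratio, not OpenAI’s actual accounting.]

# GPT-3-scale ballpark figures.
params = 175e9            # model parameters
train_tokens = 300e9      # tokens seen during training
essay_tokens = 1000       # a short essay-length task

training_flops = 6 * params * train_tokens        # roughly 3e23 FLOPs
inference_flops = 2 * params * essay_tokens       # roughly 3.5e14 FLOPs

ratio = training_flops / inference_flops
print(f"training compute is roughly {ratio:,.0f}x an essay-length task")
# Roughly a billion to one, which is how a training run costing millions of
# dollars coexists with individual tasks that cost around a cent.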
Rob Wiblin: And I suppose with images it might be similar? $10 million to train it and then a cent per image that you output. Yeah, it’s a pretty big ratio. Although I guess I’m not exactly sure. I suppose producing one image might turn out to not be very much in the scheme of things in terms of one’s reasoning capacity. So actually running an actor that was doing all sorts of stuff might involve producing the equivalent of very large numbers of images, and so it might actually cost a decent amount of compute and electricity and so on.
Richard Ngo: Right.
What we’ve learned from recent advances [00:51:19]
Rob Wiblin: What’s something we’ve learned, or think that we might possibly have learned, from recent advances in AI capabilities?
Richard Ngo: I’ve already alluded a little bit to the unpredictability of capability advances, and how things like reasoning, strategic thinking, and so on might just come much earlier than we expect.
Another thing that has felt pretty important is that we don’t really know what the capabilities of our systems are directly after we’ve built them anymore. So once you train a large language model or a large image model, it may be the case that there are all sorts of things that you can get it to do — given the right input, the right prompt, the right setup — that we just haven’t figured out yet, because they’re emergent properties of what the model has learned during training.
Probably the best example of this is that people figured out that if you prompt large language models to think through their reasoning step by step, they can answer significantly more difficult questions than they could if you just give them the question by itself. And this makes sense, because this is what humans do, right? If you tell somebody to answer an arithmetic problem by calculating all the intermediate values, they’re probably much more likely to get it correct than if you say, “You have to give me the final answer directly.”
But this is something that took ages for people to figure out. It was applicable to GPT-3, and I think also to a lesser extent to GPT-2, but papers about it were only coming out last year. So these are the types of things where actually just figuring out what the models can even do is a pretty difficult task, and probably just going to get increasingly difficult.
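[A minimal, invented illustration of the prompting trick Richard is referring to. The word problem and the phrasings are made up in the spirit of the published chain-of-thought and “let’s think step by step” papers, not taken from any benchmark.]

question = ("Q: A library has 4 shelves with 23 books each, and 17 more books "
            "arrive. How many books are in the library now?")

# Asking for the answer directly; models are more likely to slip up here.
direct_prompt = question + " A:"

# Nudging the model to lay out intermediate steps before answering; the same
# model answers noticeably more of these questions correctly.
step_by_step_prompt = question + " A: Let's think step by step."
# A typical continuation: "4 shelves x 23 books = 92 books. 92 + 17 = 109.
# So there are 109 books in the library."

print(direct_prompt)
print(step_by_step_prompt)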
Rob Wiblin: Yeah, that’s really fascinating. I suppose because the exploration space of prompts that you can give to these models is so vast — it’s incredibly open-ended; it’s just any kind of text — that it could be many, many years in principle until someone tries out something and then discovers that if you phrase the question this way, if you give it this kind of lead-in, then you get this different result than what you’re getting otherwise. So in a sense the model might be much more capable than we’d realised previously. Can that just keep on going and going?
Richard Ngo: I think so, and I think it’s probably going to get worse. Because we can think of this as an alignment problem. In some sense, alignment problems are when your model can do the thing that you want it to do and it just doesn’t do that — as opposed to capabilities problems, where it can’t do the thing that you want it to do.
So in this case, this is a less malicious alignment problem. We are trying to get it to do a task, and it’s not that there’s any deliberate deception going on — it’s just that we haven’t put in the input in the right way. But you can imagine in future systems that we try to get them to do a task and they might just make a deliberate choice not to do that task. That’s something that we would have a lot of trouble distinguishing from the model not being capable of doing the task.
So I think over time, trying to figure out what capabilities our model has might become something of an adversarial problem, where you actually need to reason about what the model’s intentions are — like what it’s trying to do and what information it wants you to have, basically.
Rob Wiblin: Right. Yeah, I hadn’t noticed how analogous this is to the alignment problem, or to the problem of giving values or communicating instructions in general. You’re saying it’s definitely an alignment problem if there was something you could have said that would have communicated better what output you wanted to the model, but you said something that wasn’t quite right or was phrased the wrong way, and so you got a result that you didn’t want.
But how does that connect to the adversarial issue, or the idea of the model trying to second-guess what you want and maybe giving you something that you don’t want, even if it knows better?
Richard Ngo: So you can imagine that if we’re thinking about systems that have learned some kind of long-term goal, then it seems plausible that there are some ways it could demonstrate capabilities that would be more or less conducive to achieving that long-term goal.
As a simple example, for a lot of long-term goals, having human trust is useful. So a model might not want to demonstrate a capability like “deceive humans,” or might not want to demonstrate a capability of being able to manipulate humans very easily. I think this is not really a big problem over the next couple of years, but over time I expect it to become a bigger and bigger problem. Models are going to understand that we’re asking them to try something and just not want to do it basically.
Rob Wiblin: Another thing that I’ve heard some people speculate that we might be learning from these language models, for example, is that we might be learning something about how humans operate. So these language models are kind of predictive models, where you’ve got some text before and then it’s trying to predict the next word. It seems like using that method, you can at least reasonably often produce surprisingly reasonable speech, and perhaps surprisingly reasonable articles and chat and so on.
Now, some people would say this looks like what people are doing, but it isn’t what they’re doing. We actually have all of these ideas in our minds and then we put them together in a coherent way, because we deeply understand the actual underlying ideas and what we’re trying to communicate. Whereas this thing is just chucking word after word after word in a way that produces a simulation of what a person is like.
But I suppose when people aren’t thinking that deeply, maybe we operate this way as well. Maybe I’m producing speech extemporaneously now, but my brain can do a lot of the work automatically, because it just knows what words are likely to come after other words. Do you have any view on whether these language models are doing something very different than what humans do? Or are we perhaps learning that humans use text prediction as a way of producing speech themselves to some degree?
Richard Ngo: That’s a great example actually, where a lot of the time human behaviour is pretty automatic and instinctive — including things like speech, where, as you say, the words that are coming out of my mouth right now are not really planned in advance by me. I’m just sort of nudging myself towards producing the types of sentences that I intend.
If we think about Kahneman’s System 1 / System 2 distinction, I think that’s actually not a bad way of thinking about our current language models: that they actually do most of the things that humans’ System 1 can do, which is instinctive, quick judgements of things. And then, so far at least, they’re a little bit further away from the sort of metacognition and reflection, and noticing when you’re going down a blind alley or you’ve lost the thread of conversation.
But over time, I think that’s going to change. Maybe another way of thinking about them is that it seems hard to find a task that a human can do in one second or three seconds that our language models can’t currently do. It’s easier to find tasks that humans can do in one minute or five minutes that the models can’t do. But I expect that number to go up and up over time — that the models are going to be able to match humans given increasingly long time periods to think about the task, until eventually they can just beat humans full stop, no matter how much time you give those humans.
Rob Wiblin: So is the idea there that if you only give a human one second to just blurt out something in reaction to something else, then they have to operate on this System 1, this very instinctive thing, where they’ve just got to string a sentence together and they don’t really get to reflect on it. And the language models can do that: they can blurt something out. But the thing that they’re not so good at is what humans would do, which is look at the sentence that is about to come out of their mouth, see that it’s actually not what they want to communicate, and then edit it and make it make a whole lot more sense conceptually.
Richard Ngo: Right. And sometimes you even see language models making the same types of mistakes that humans make. So as an example, if you ask it for a biography of somebody, sometimes it’ll give you a description of their career and achievements that’s not quite right, but in the right direction. Maybe it’ll say that the person went to Oxford when actually that person went to Cambridge, or something like that — where it’s like it clearly remembers something about that person, but it just hasn’t memorised the detail. It’s more like it’s learned some kind of broader association. Maybe it’ll say that they studied biology when actually they studied chemistry — but it won’t say that they studied film studies when they actually studied chemistry.
So it seems like there’s these mistakes where it’s not actually recalling the precise details, but kind of remembering the broad outline of the thing and then just blurting that out, which is what humans often do.
Rob Wiblin: That makes me kind of wonder, is anyone out there spending most of their day just playing with these models for very long periods of time, in order to develop possibly a much better intuition for how these models think in a sense? I suppose becoming the “language model whisperer,” the way that people who spend all their time with horses end up having a deeper understanding of how horses think and what’s going on with them at any given point in time.
Richard Ngo: It’s a little hard to track the people doing this. I’m sure there are some. I only know a handful of people who are doing this myself, some of them as part of red-teaming efforts trying to figure out what the capabilities and potential risks of our current biggest models are. There are two people who are at Conjecture (which is a new alignment research organisation) that are focusing on this, and some of their work is quite interesting.
I think the difficulty here is in navigating the tension between trying to really understand a given model very deeply versus the fact that a lot of the things that work for one particular model are going to be less relevant to future models. So a lot of the ways that people tried to cajole or prompt GPT-2 into doing the things they wanted were just totally unnecessary for GPT-3, just because it understood their intentions better and it understood their instructions better. And so a lot of that knowledge became very quickly outdated and it feels quite tricky to figure out how to extract generalisable principles from studying our current best models when they’re getting better so quickly.
Rob Wiblin: Yeah, that makes sense.
Bottlenecks to learning for humans [01:01:34]
Rob Wiblin: What are some bottlenecks to thinking and learning that human beings face, but which these future ML models won’t have to confront?
Richard Ngo: I think the biggest one is just that we don’t get much chance to experience the world. We just don’t get that much input data, because our lives are so short and we die so quickly.
There’s actually a bunch of work on scaling laws for large language models, which basically say: If you have a certain amount of compute to spend, should you spend it on making the model bigger or should you spend it on training it for longer on more data? What’s the optimal way to make that tradeoff?
And it turns out that from this perspective, if you had as much compute as the human brain uses throughout a human lifetime, then the optimal way to spend that is not having a network the size of the human brain — it’s actually having a significantly smaller network and training it on much more data. So intuitively speaking, at least, human brains are solving this tradeoff in a different way from how we are doing it in machine learning, where we are taking relatively smaller networks and training them on way more data than a human ever sees in order to get a certain level of capabilities.
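[For readers who want this tradeoff made concrete, here is a rough sketch based on the published “Chinchilla” scaling-law results (Hoffmann et al. 2022), which estimated training compute at roughly 6 × parameters × tokens, with the compute-optimal data budget around 20 tokens per parameter. The constants are approximate and purely illustrative of the point that, for a fixed compute budget, these curves favour a smaller network trained on far more data.]

```python
import math

def compute_optimal_split(compute_flops: float) -> tuple[float, float]:
    """Approximate compute-optimal parameter/token split from the Chinchilla results:
    C ~ 6 * N * D, with optimal D ~ 20 * N. Constants are rough and illustrative only."""
    n_params = math.sqrt(compute_flops / 120)  # solve C = 6 * N * (20 * N) for N
    n_tokens = 20 * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(c)
    print(f"{c:.0e} FLOP -> ~{n:.1e} params, ~{d:.1e} tokens")
```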
Rob Wiblin: I see. OK, hold on. The notion here is you’ve got a particular amount of compute, and there’s two different ways you could spend it. One would be to have tonnes of parameters in your model. I guess this is the equivalent of having lots of neurons and lots of connections between them. So you’ve got tonnes of parameters; this is equivalent to brain size. But another way you could use the compute is, instead of having lots and lots of parameters that you’re constantly estimating and adjusting, you’d have a smaller brain in a sense, but train it for longer — have it read way more stuff, have it look at way more images.
And you’re saying humans are off on this crazy extreme, where our brains are massive — so many parameters, so many connections between all the neurons — but we only live for so little time. We read so little, we hear so little speech, relative to what is possible. And we’d do better if somehow nature could have shrunk our brains, but then got us to spend twice as long in school in a sense. I suppose there’s all kinds of problems to getting beings to live quite that long, but you might also get killed while you’re in your stupid early-infant phase for so long.
Richard Ngo: Exactly. Humans faced all these tradeoffs from the environment, which neural networks are probably just not going to face. So by the time we are training neural networks that are as large as the human brain, we shouldn’t expect them to have only as much experience as humans do — they’re actually just going to be trained on a huge amount more experience than humans get. So that’s one way in which humans are systematically disadvantaged. We just haven’t been built to absorb the huge amounts of information that are being generated by the internet, videos, YouTube, Wikipedia, across the entire world.
And that’s closely related to the idea of AIs being copyable. If you have a neural network that’s trained on one set of data, you can then make many copies of that network and deploy it in all sorts of situations. And then you can feed the experience that it gets from all those situations back into the base model — so effectively you can have a system that’s learning from a very wide array of data, and then taking that knowledge and applying it to a very wide range of new situations.
These are all ways in which I think in the short term, humans are disadvantaged just by virtue of the fact that we’re running on biological brains in physical bodies, rather than virtual brains in the cloud.
And then in the longer term, I think the key thing here is what algorithmic improvements can you make? How much can you scale these things up? Because over the last decade or two, we’ve seen pretty dramatic increases in the sizes of neural networks that we’ve been using, and the algorithms that we are using have also been getting significantly more efficient. So we can think about artificial agents investing in doing more machine learning research and improving themselves in a way that humans just simply can’t match — because our brain sizes, our brain algorithms, and so on are pretty hard coded; there’s not really that much we can do to change this. So in the long term, it seems like these factors really should lead us to expect AI to dramatically outstrip human capabilities.
Rob Wiblin: So they don’t die. They can just keep accumulating knowledge forever. And as more stuff comes out on the internet, they don’t have to throw out the old stuff; they just read the new stuff and add it in. I guess as compute gets cheaper, they can just throw more compute at it. Whereas the human brain is kind of stuck at its particular clock speed. It’s not getting any faster, it’s not getting any bigger. Oh yeah, and also we can keep reprogramming these things, making the algorithms more efficient. Humans are just kind of stuck with the software that we have, for better or worse.
One other thing I’ve heard, that I’m not sure what the implication is: signals in the human brain — just because of limitations in the engineering of neurons and synapses and so on — tend to move pretty slowly through space, far more slowly than electrical signals travelling down a wire. So in a sense, our signal propagation is quite gradual and our reaction times are really slow compared to what computers can manage. Is that right?
Richard Ngo: That’s right. But I think this effect is probably a little overrated as a factor for overall intelligence differences between AIs and humans, just because it does take quite a long time to run a very large neural network. So if our neural networks just keep getting bigger at a significant pace, then it may be the case that for quite a while, most cutting-edge neural networks are actually going to take a pretty long time to go from the inputs to the outputs, just because you’re going to have to pass it through so many different neurons.
Rob Wiblin: Stages, so to speak.
Richard Ngo: Yeah, exactly. So I do expect that in the longer term there’s going to be a significant advantage for neural networks in terms of thinking time compared with the human brain. But it’s not actually clear how big that advantage is now or in the foreseeable future, just because it’s really hard to run a neural network with hundreds of billions of parameters on the types of chips that we have now or are going to have in the coming years.
Rob Wiblin: All of this raises the question, why aren’t computers much better than us at everything already? I suppose you’re saying that the thing that the human brain is exceptional in is that it has tonnes of parameters — which in this case corresponds to lots of neurons, lots of connections between them. Computers haven’t been able to match us with that. And presumably we also suspect that the algorithmic efficiency has been quite a bit worse, although that’s improving over time.
Richard Ngo: Right. I wouldn’t want to say that the number of parameters is the be-all and end-all of how efficient your neural network is. I think we’ve had some pretty strong results from scaling up neural networks and they’ve gotten much better. And we should expect that trend to continue; I’m not really seeing it stopping anytime soon.
But at the same time, how to anchor that against a human baseline feels like a very difficult question. Should we think of neural networks as being more efficient than the human brain in terms of using the parameters that they have? Are they less efficient than a human brain, comparing artificial neurons to biological neurons? Are they a couple of orders of magnitude less efficient? That feels very uncertain to me.
And so the main thing I want to say is that, as we approach the point where we’re training networks that are within a couple of orders of magnitude of the human brain, it feels like we shouldn’t be ruling a bunch of things out. It may be the case that they’re significantly less efficient than the human brain in terms of converting number of parameters and number of neurons into actual effective intelligence. But it seems like a pretty strong claim to say they’re many orders of magnitude away, in terms of how efficiently they can scale.
The most common and important misconception around ML [01:09:16]
Rob Wiblin: Yeah. What’s a common misconception you run into about how ML models work or how they get deployed that it might be helpful to clarify for people?
Richard Ngo: I think the most common and important misconception has to do with the way that the training setup relates to the model that’s actually produced. So for example, with large language models, we train them by getting them to predict the next word on a very wide variety of text. And so some people say, “Well, look, the only thing that they’re trying to do is to predict the next word. It’s meaningless to talk about the model trying to achieve things or trying to produce answers with certain properties, because it’s only been trained to predict the next word.”
The important point here is that the process of training the model in a certain way may then lead the model to actually itself have properties that can’t just be described as predicting the next word. It may be the case that the way the model predicts the next word is by doing some kind of internal planning process, or it may be the case that the way it predicts the next word is by reasoning a bunch about, “How would a human respond in this situation?” I’m not saying our current models do, but that’s the sort of thing that I don’t think we can currently rule out.
And in the future, as we get more sophisticated models, the link between the explicit thing that we’re training them to do — which in this case is predict the next word or the next frame of a video, or things like that — and the internal algorithms that they actually learn for doing that is going to be less and less obvious.
Rob Wiblin: OK, so the idea here is: let’s say that I was set the task of predicting the next word that you are going to say. It seems like one way that I could do that is maybe I should go away and study a whole lot of ML. Maybe I need to understand all of the things that you’re talking about, and then I’ll be able to predict what you’re likely to say next. Then someone could come back and say, “Rob, you don’t understand any of the stuff. You’re just trying to predict the next word that Richard’s saying.” And I’m like, “Well, these things aren’t mutually exclusive. Maybe I’m predicting what you’re saying by understanding it.” And we can’t rule out that there could be elements of embodied understanding inside these language models.
Richard Ngo: Exactly. And in fact, we have some pretty reasonable evidence that suggests that they are understanding things on a meaningful level.
My favourite piece of evidence here is from a paper that used to be called “Moving the Eiffel Tower to ROME” — I think they’ve changed the name since then. But the thing that happens in that paper is that they do a small modification of the weights of a neural network. They identify the neurons corresponding to the Eiffel Tower and Rome and Paris, and then just swap things around. So now the network believes that the Eiffel Tower is in Rome. And you might think that if this was just a bunch of memorised heuristics and no real understanding, then if you ask the model a question — “Where is the Eiffel Tower?” — sure, it’ll say Rome, but it’ll screw up a whole bunch of other questions. It won’t be able to integrate that change into its world model.
But actually what we see is that when you ask a bunch of downstream questions — like, “What can you see from the Eiffel Tower? What type of food is good near the Eiffel Tower? How do I get to the Eiffel Tower?” — it actually integrates that single change of “the Eiffel Tower is now in Rome” into answers like, “From the Eiffel Tower, you can see the Colosseum. You should eat pizza near the Eiffel Tower. You should get there by taking the train from Berlin to Rome via Switzerland,” or things like that.
Rob Wiblin: That’s incredible.
Richard Ngo: Exactly. And it seems like almost a definition of what it means to understand something is that you can take that isolated fact and translate it into a variety of different ideas and situations and circumstances.
And this is still pretty preliminary work. There’s so much more to do here in understanding how these models are actually internally thinking and reasoning. But just saying that they don’t understand what’s going on, that they’re just predicting the next word — as if that’s mutually exclusive with understanding the world — I think that’s basically not very credible at this point.
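[The paper Richard appears to be describing is the ROME model-editing work, “Locating and Editing Factual Associations in GPT” (Meng et al. 2022), whose headline demo edits a model so that it treats the Eiffel Tower as being in Rome. A very stripped-down sketch of the underlying idea follows: a rank-one update to one weight matrix so that a chosen “key” activation now maps to a new “value” vector, while inputs orthogonal to the key are left unchanged. The real method additionally uses statistics over many keys and targets a specific MLP layer; the vectors here are made up, so treat this as an illustration of the idea rather than the published procedure.]

```python
import numpy as np

def rank_one_edit(W: np.ndarray, k: np.ndarray, v_new: np.ndarray) -> np.ndarray:
    """Return W' with W' @ k == v_new, leaving directions orthogonal to k untouched."""
    residual = v_new - W @ k
    return W + np.outer(residual, k) / (k @ k)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))          # toy weight matrix standing in for one layer
k_eiffel = rng.normal(size=8)        # hypothetical key activation for "Eiffel Tower"
v_in_rome = rng.normal(size=8)       # hypothetical value encoding "...is located in Rome"

W_edited = rank_one_edit(W, k_eiffel, v_in_rome)
assert np.allclose(W_edited @ k_eiffel, v_in_rome)
```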
Rob Wiblin: Yeah, that’s a fantastic point. Who is currently working on developing general AI as opposed to specific applications of AI? And is that a meaningful question?
Richard Ngo: I think that at this point, language models in particular are becoming increasingly general. So it’s hard to distinguish any lab that’s working on big language models from labs that are really aiming at generality, especially because a lot of people are interested in integrating video or audio or a wide range of different modalities. So we’re at the point where it’s hard to give any comprehensive list of groups that are working on general intelligence. Basically any big AI lab, of which there are hundreds, is doing something that’s in the broad direction of building general systems.
Rob Wiblin: But some groups seem to be self-consciously saying, “We are on the path. We are marching towards producing an AI that can do almost everything or everything.” I think OpenAI is in this spirit, broadly speaking. And I guess DeepMind is in that spirit — they’re like, “We’re going to solve intelligence, we’re going to create general AI.” Are there any other groups other than those two that perceive themselves that way?
Richard Ngo: I think it’s a little tricky to tell. One thing that I think people tend to overestimate is the extent to which the priorities at top AI labs are set by the explicit leaders of those labs, as opposed to the researchers who are working within those labs. It’s not really that clear to me what would be different about Google Brain, for example, if the leadership had explicit priorities to build AGI, as opposed to implicit priorities, like “We want the best language models.” I guess that’s a little bit of a harder thing to really pin down.
The other clear examples are: Anthropic is working towards general systems, and then Google Brain and Facebook AI are doing pretty cutting-edge work as well. And then a whole bunch of other places — some that are more like startups, some that are more like labs attached to big companies, some that are consortia of different countries working together in the EU, some in China, and so on. Yeah, it’s really a Cambrian explosion of AI right now.
The alignment problem from a deep learning perspective [01:15:39]
Rob Wiblin: Earlier on, we went through the more general conceptual arguments for why positively shaping the development of AI is an issue that people should really be concerned about. But let’s dive in now, or at least zoom in, and get a bit more technical and think about exactly how models actually work.
Let’s talk about this paper that you wrote, called “The alignment problem from a deep learning perspective.” What did you want to communicate or add to the conversation with that paper?
Richard Ngo: A lot of the original arguments about AI risk were at a very high level of abstraction. I think that’s pretty reasonable, because back when those arguments were being formulated, we hadn’t gone through the deep learning revolution. We had much worse ideas of what it might look like for these systems to start approaching general intelligence.
But now that we do, and now that we’ve made so much progress in deep learning, I wanted to see how concrete we could get these scenarios where AIs learn misaligned goals and then misbehave. How concretely can we convey these core ideas? And the way I ended up phrasing this was in terms of the things that happen throughout a training process. So I’m imagining a large-scale training process of a large neural network, and then thinking about, as it becomes more and more competent, what might happen and what are the forces that are going to be pushing it towards acting in aligned ways versus in ways that are unintended and misaligned with human preferences?
Rob Wiblin: Remind me, what is a deep learning perspective? What’s deep learning as opposed to other forms of learning?
Richard Ngo: Deep learning is just the use of large neural networks, and in particular neural networks with multiple layers, in order to learn from data. And that’s basically taken over the field of machine learning, especially over the last decade or so. Basically any cutting-edge system you’ll see is a deep learning system that’s got a neural network that’s somewhat analogous to the human brain, and that has artificial neurons that are connected together. And the connections between those neurons are formed by learning from data. So that’s the basic idea.
Rob Wiblin: OK, so the “deep” is lots of layers of neurons, lots of parameters, bigger brain kind of.
Richard Ngo: Yep.
Rob Wiblin: OK, yeah. In the abstract of this paper, you wrote:
The report aims to cover the key arguments motivating concern about the alignment problem in a way that’s as succinct, concrete and technically-grounded as possible. I argue that realistic training processes plausibly lead to the development of misaligned goals in AGIs, in particular because neural networks trained via reinforcement learning will learn to plan towards achieving a range of goals; gain more reward by deceptively pursuing misaligned goals; and generalize in ways which undermine obedience.
Let’s go through those three issues one by one. Is it possible to explain in a not super technical way why it’s a problem that neural networks trained via reinforcement learning will learn to plan towards achieving a range of goals?
Richard Ngo: I think this part is not a big problem in its own right. It’s more like setting up some background concepts for how to think about these networks.
In particular, a lot of people think about the behaviour of neural networks that are trained via reinforcement learning — which is when they’re given rewards and penalties based on their behaviour — purely in terms of what rewards the system gets and how that determines its behaviour. So a lot of people will say the goal of the system is to get a higher reward.
And I wanted to focus on a more specific concept, which is the idea that these networks are going to have internal representations of outcomes. That is, within the weights of those networks, there are going to be some neurons corresponding to different concepts, and some of those concepts are going to be outcomes in the real world.
We’ve seen some examples that move towards this so far. There are studies of neural networks that were used for image recognition where they could identify the specific neurons that corresponded to pictures of cats or pictures of dogs or different shapes, different angles, different textures, and so on. And so you could say, “Look, here’s a model that builds up representations of cats or dogs or whatever, by combining representations of shapes and textures and outlines and things like that.”
I’m just hypothesising that this goes further: that it’s not just representations of objects like these, but in these more general systems that are trained on a wider range of tasks, they’re going to learn representations of actual real-world outcomes — like “The human is happy with my performance,” or “I got to the end of the maze,” or “I won the game,” or things like that.
Then exactly how they’re going to use those representations to choose actions feels like a very open question. But the main claim I’m making here is just that these networks are in fact going to learn these complex representations, and then those representations are going to feed into their actions via some of them representing desirable outcomes and some of them representing undesirable outcomes. And then they’re going to learn to choose actions which tend to lead more towards the desirable outcomes rather than the undesirable outcomes.
Rob Wiblin: So to have an example in mind that perhaps involves a bit more agent decision making and planning and so on: I was looking recently at this blog post from DeepMind, a couple of years old, where they were training a system to learn to play Capture the Flag in these kind of complicated environments. I think they were working on it mostly because you’d have multiple different people on a Capture the Flag team and they were trying to get them to coordinate, or see whether they could learn to coordinate.
You’re saying that these models, as they learn, they’re going to develop intermediate goals. With Capture the Flag, the goal is to get the flag and bring it back to your home base. But it’s going to have all these intermediate goals, where it’s going to say, “Have I managed to pick up the flag yet?” or “Has anyone on my team picked up the flag?” And then also, “Are two of us together, so if one of us gets killed, the other one can pick up the flag?” It’s going to have all these intermediate things that it’s trying to accomplish because it knows that those are correlated with achieving the final goal. And then it starts developing plans to achieve those intermediate goals as kind of sub-parts of its strategy. Am I thinking about this right?
Richard Ngo: Yeah, I think that’s right, where there’s just going to be a bunch of possible states of the game. And in fact, DeepMind did a bunch of work to identify which representations the Capture the Flag agent had learned, including things like, “My flag is at my home base,” or things like that. And then it seems like a relatively straightforward extrapolation to say, how is it using that representation? Well, that representation of “my flag is at my home base” is somehow guiding its actions towards outcomes where that continues to be the case, where the flag isn’t taken, for example.
Now in this case, I’m not going to say that that agent is doing any real planning necessarily. It could just have learned to act on instinct basically, rather than reasoning about how to achieve those outcomes. And so the hypothesis here is just that as we get more and more sophisticated agents, the ways they’re going to be aiming towards those outcomes is going to look more and more like they’re actually doing planning, or like they’re doing some kind of internal reasoning about how to achieve those outcomes.
And then there’s a separate question of which outcomes they’re going to end up aiming towards. And the reason I phrase this in terms of a “range of goals” is because in the early stages of training, I don’t think we can really say that agents are going to learn any one specific overriding goal. In kind of the same way that humans have lots of different drives — we have the drive towards eating nice food, we have the drive towards exercising, and the drive towards succeeding and having people respect us and so on. It’s not really like you can say, “Well, all of these are just intermediate goals towards some one final goal.” Rather, it happens to be the case that we’ve just got all of these internal representations of things that we like. Our actions are partly motivated by each of those.
So this seems like a pretty reasonable way to think about machine learning systems as they start to approach human level: that they’re going to have a bunch of representations of a bunch of outcomes that they’ve been rewarded for in the past, and they’re going to take actions that sometimes aim towards each of these different outcomes.
Rob Wiblin: I see. So humans do this as well. We do have broader, final goals in life to some extent, but then we develop all of these proxies that we’re targeting on a day-to-day basis. At one level, we don’t want to starve and then we use as a proxy, “Am I standing in front of an open fridge?” or whatever other things we’ve felt rewarded for in the past. But you’re saying as these models get more and more complicated, they’re going to develop a bigger and bigger structure of these representations of different states in the world that they either have positive or negative associations with, and that those are going to end up doing a lot of heavy lifting and driving the actual behaviour in any given situation. Is that right?
Richard Ngo: Yep, that’s right.
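[One common way researchers look for internal representations like these is to train a simple “linear probe” on a network’s hidden activations and check whether a given feature of the world can be read off from them, which is roughly the flavour of the Capture the Flag analysis Richard mentions above. The sketch below uses synthetic activations and a made-up game feature rather than any real agent, so it only illustrates the method.]

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend these are hidden-layer activations recorded while an agent plays, and the
# label says whether "my flag is at my home base" held at that timestep.
activations = rng.normal(size=(2000, 64))
hidden_direction = rng.normal(size=64)                    # the (unknown) way the net encodes it
labels = (activations @ hidden_direction > 0).astype(int)

probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:1500], labels[:1500])
print("held-out probe accuracy:", probe.score(activations[1500:], labels[1500:]))
```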
Rob Wiblin: Should that set off red flags, or is that just setting up the situation for understanding more how they might behave in complicated ways?
Richard Ngo: I think early on in training, this is a little bit worrying, but not crucial. If I’ve got an agent that has sometimes been rewarded for being obedient, but also has sometimes gotten some rewards from being deceptive — so it’s learned the internal representation of the concept of obedience, but it’s also learned an internal representation of a concept like “hide mistakes where possible” — then in theory, at least, if we are setting the rewards correctly over time, it’s going to end up in situations where these two things conflict. And hopefully it learns to prioritise obedience over, for example, hiding evidence of mistakes.
But at least in the early stages, I expect that each of these things is going to be separately rewarded in a bunch of different circumstances. So it’s not the case that it’s going to converge towards just one of these things, at least not immediately.
I’m kind of picturing these agents that have learned a bunch of drives that are applicable to a bunch of different situations, and then I think a lot of people would say, “Well, that’s fine. We can just keep training them until they converge towards only having the goals that we want.” And that’s when the next part of the argument comes in, where I’m saying this is not in fact a reliable mechanism. We can’t just keep training them until these correlations go away, because of the phenomenon I’m going to talk about next, which is “situational awareness.” That’s the second thing that I highlighted in the summary.
Situational awareness [01:26:02]
Rob Wiblin: So the second point was neural networks trained via reinforcement learning will gain more reward by deceptively pursuing misaligned goals. Yeah, can you elaborate on that?
Richard Ngo: Right. So the key idea here is this concept called situational awareness, which was introduced by Ajeya Cotra in a report on the alignment problem, and which I’ve picked up and am using in my report. The way I’m thinking about situational awareness is just being able to apply abstract knowledge to one’s own situation, in order to choose actions in the context that the agent finds itself in.
In some sense this is a very basic phenomenon: when I go down to the grocery store to buy some matches, for example, maybe I’ve never bought matches at the grocery store before, but I have this abstract knowledge of, like, “Matches are the type of thing that tend to be found in these types of stores, and I can buy them, and I’m in a situation where I can walk down to the store.” So in some sense this is just a very basic skill for humans.
In the context of AI, we don’t really have systems that have particularly strong situational awareness right now. We have agents that play StarCraft, for example, but they don’t understand that they are an AI playing StarCraft. They’re just within the game. They don’t understand the wider context. And then if you look at language models, I think they come a bit closer, because they do have this abstract knowledge. If you ask them, “What is a language model? How is it trained?”, things like that, they can give you pretty good answers, but they still don’t really apply that knowledge to their own context. They don’t really use that knowledge in order to choose their own answers.
But as we train systems that are useful for a wide range of tasks — for example, an assistant: if you train an assistant AI, then it’s got to know a bunch of stuff about, “What are my capabilities? How can I use those capabilities to help humans? Who’s giving me instructions and what do they expect from me?” Basically the only way to have really helpful AIs in these situations is for them to have some kind of awareness of the context that they’re in, the expectations that we have for them, the ways that they operate, the limitations that they act under, and things like that. And that’s this broad thing that I’m calling situational awareness.
Rob Wiblin: So there’s this concept of situational awareness, which is the water that we swim in, such that it is almost hard to imagine that it’s a thing. But it is a thing that humans have that lots of other minds might not have. But in order to get these systems to do lots of tasks that we’re going to ultimately want them to do, they’re going to need situational awareness as well, for the same reason that humans do. So that’s kind of a next stage. And then what?
Richard Ngo: And then, when you’ve got systems that have situational awareness, then I claim that you start to get the problematic types of deception. So in the earlier phases of training, you might have systems that learn the concept of “hide mistakes where possible,” but they don’t really understand who they’re hiding the mistakes from, exactly what types of scrutiny humans are going to apply, things like that. They start off pretty dumb. Maybe they just have this reflex to hide their mistakes, kind of the same way that young kids have an instinctive heuristic of hiding mistakes where possible.
But then, as you start to get systems that actually understand their wider context — like, “I’m being trained in this way,” or “Machine learning systems tend to be scrutinised in this particular way using these types of tools” — then you’re going to end up in an adversarial situation. Where, if the model tries to hide its mistakes, for example, then it will both know how to do so in an effective way, and then also apply the knowledge that it gains from the humans around it, in order to become increasingly effective at that.
So that’s a concern: that these types of deception are just going to be increasingly hard to catch as models get more and more situational awareness.
Rob Wiblin: Yeah, that makes sense to me. Where is the badness slipping in here? I suppose you’ve got a model and you’re rewarding it. This model that you’re conceptualising is being rewarded at the end of some process, where the human says, “Yes, it did a good job.” And then it realises, based on the situational awareness, that it’s better to lie about mistakes that it’s made, for example, because that is one way that it can convince humans to say, “Yes, it did a good job.” And so it develops a concept of, and indeed an intermediate goal of, hiding mistakes that it makes.
Then you might try to outwit that, by checking its behaviour, by checking its thoughts. But it now has situational awareness, so it might see that you are using these tools, that you’re using these approaches to try to root out deception, so it’s going to try to counteract that. It now develops another concept for how it’s going to get around those methods of detection. Am I understanding it right?
Richard Ngo: Right. And I think this is particularly important because, in the normal context of machine learning, you’re implicitly assuming that we can just apply a whole bunch of tests and safeguards and metrics and so on, in a way that’s not really adversarial. The normal process of developing software or developing models, and figuring out whether we should deploy them, does involve a whole bunch of checks and safeguards. Then the possibility of models being deliberately deceptive is a reason why all these types of safeguards that we might usually expect to prevent us from doing dangerous things with our models apply to a lesser and lesser extent over time.
Rob Wiblin: Maybe this is just no surprise really, because we’re trying to make these models more and more humanlike, to be more and more like human agents, and humans share these maladies — where they develop their own intermediate goals that we didn’t like, they try to deceive other people, they notice conflicts between what they want and what other people want, and then can be a bit schemy. So perhaps it’s just completely natural that in this process you would recapitulate undesired human behaviours.
Richard Ngo: Yeah, that seems right. It’s not like we’re explicitly trying to make them more humanlike; it’s more like we’re trying to get them to do a bunch of tasks, and then it just turns out that things like deception are robustly useful for getting higher scores on a wide range of tasks.
If we think about stock market trading, for example, there’s some obvious metrics that you could use, like how much money have you made. But there’s also less obvious metrics that we might be interested in, like have you done any insider trading? Or have you done any market manipulation? And these are the types of things where a dumb system that starts to learn how to do some market manipulation is probably going to be caught pretty easily, but a system that has situational awareness could plausibly figure out, “These are the types of market manipulation that will be caught, and these are the types that won’t be caught, and I’m just going to focus on the latter type.”
Rob Wiblin: Is the original sin here that we’re rewarding the system based on us saying, “Yes, we’re satisfied, you did a good job,” but that’s not ultimately exactly what we want? What we want is to be satisfied and to have our preferences met — and that’s not the same thing as pressing a button saying that that happened. And so there’s a bit of slippage between these two, and then all of these behaviours appear in the crack between them, and it pulls them apart gradually.
Richard Ngo: That seems right. And in particular, if we are trying to do this type of supervision for systems that are more capable than us in certain ways — maybe they can act more quickly — and whose operations we don’t really understand. Because if you’re trying to reward and penalise a kid to teach them what to do, then a lot of the time this tends to work. At least in the short term, you can mostly monitor them. But in this case it feels significantly harder to monitor these systems as they start to get to the level where they can reason as well as humans can about deception and about what’s going to be caught and what isn’t going to be caught.
Rob Wiblin: Yeah. So sticking with the child analogy, a five-year-old isn’t as smart and doesn’t know as much as an adult, so their attempts at deception often get rooted out. I suppose not always, but pretty often, because they’re likely to slip up. But with another adult, they’re in a much better position to try to trick you and get the better of you in an adversarial situation.
Richard Ngo: Right.
Rob Wiblin: And then if you had someone who’s not very bright dealing with someone who’s really sharp and quite experienced with deception, then maybe they’re at a high risk of getting outwitted. And for all the reasons earlier we were talking about how these systems can potentially advance quite quickly, you might think after a while a trainer of one of these systems is going to end up in quite an unenviable situation, trying to detect all of the cases where it’s doing something that wasn’t desired.
Richard Ngo: Yep, absolutely.
Rob Wiblin: So it seems like, to some extent, people were reasoning about the possibility of a tendency towards deception earlier — before we even knew exactly what machine learning systems would look like, before we had modern architecture — because you can also approach this at just a high conceptual level, saying, “You’ve got goal X. You discover you can get that using some hack that the other agent hasn’t anticipated, and so you’ll naturally start to do that.”
Is there anything in particular that we gain from looking at the systems that we have now and seeing how specifically does that play out, rather than just how that plays out in a very abstract case?
Richard Ngo: I think so. When you’re reasoning about a system that already has some goals, and it’s just trying to figure out strategies for achieving them, this feels different from a system whose goals are, in some sense, being developed as it goes along. So a system’s goals are going to be changing on the basis of the rewards and the feedback that we give these systems. You could imagine, for example, that if we really penalise deception often enough, while the system’s not smart enough to deceive us properly, then it might just learn the strong goal of “never deceive humans,” and this might actually get us quite a long way.
So I think it’s important to think about the process of us reinforcing certain goals of the system, and that system trying to figure out novel strategies for achieving those goals, as a process that happens in tandem — rather than just thinking about the limits of a system that already has fixed immutable goals, and it’s just trying to figure out how to implement them.
Rob Wiblin: So the idea here is that we now see that the goal-formation process of these systems is very messy. You have some sort of reinforcement process where you say how satisfied you were with the outcome, but this produces a lot of complexity internally in the behaviour, and there’s a bit of randomness in what gets rewarded and what doesn’t. Is that what’s adding the additional insight that we can have today versus 10 years ago?
Richard Ngo: Right. And then also, by trying to pin this down, I hope that we’ll be able to study it and actually understand what’s going on better. If we have the hypothesis that the goals of a neural network are represented in this way — maybe it has these concepts represented in the weights, and some of those concepts are wired up to others in certain ways, and that’s what’s leading to it choosing actions that are deceptive, for example — then this seems like the type of work that could plausibly be done empirically on the systems that we currently have today, or systems that are not too far away from them. And then ideally we’d be able to somehow modify or change the goals that these systems have learned in more desirable ways — perhaps in a way that’s somewhat analogous to the paper I mentioned earlier, of changing the representation of the Eiffel Tower to be linked to Rome rather than Paris.
So that’s one example of how the more concrete we can get about what we mean by the system “having goals” — the ways they’re represented, and the ways they’re formed — the more possible interventions we’ll be able to do.
Rob Wiblin: This is making me think, if you can find the neuron that is representing deception, then if you could flip it from Paris to Rome, could you flip it from mild positive association with deception to maximally negative association with deception? Is that a crazy idea?
Richard Ngo: That’s the type of thing that it would be very exciting to be able to do. The bottlenecks are that for these very high-level, sophisticated concepts like deception, it’s just really hard to know. Have you actually found the neuron that represents it? Maybe there are 20 neurons that represent it in some complex way. Maybe it has a concept that’s kind of similar to deception but doesn’t quite map onto the human concept of deception, so you’re actually penalising the wrong thing.
These are all possibilities that we really have no way of figuring out right now, and we’ll need a lot of further study in order to move towards that. But at least if we can pin down claims — like “Neural networks are going to choose actions on the basis of these internal goal representations, and here are some possibilities for what they might look like” — then I hope that we can start to make progress towards that.
Rob Wiblin: So there’s this general thing that in a sense, these systems, as they’re being trained, there’s these selective pressures, there’s pushes towards particular tendencies. And here we’ve identified a potential ongoing pressure towards deceptive behaviour. And imagine if you find the neuron that codes for deception, and then you try to set it to be really negative. Could you then get an offsetting effect somewhere else in the model, where it tries to now undo that? Just as you can try to dam up water, but it just fills up and then spills over, always trying to get to its lowest point because it’s pulled down by gravity. The system tries now to find a new way to engage in deception that’s coded differently and isn’t being prevented by this mechanism. Is that a natural thought to have?
Richard Ngo: That seems right. Yeah, I don’t think that any of these things would be final solutions, but I think they’re things that might allow you to leverage these systems to help you find better solutions. So maybe if you have one model that you’re pretty sure is not deceptive at a certain point in time, you could ask that model, “Hey, are these other models being deceptive? If so, how or why?” And then start to leverage the intelligence and the capabilities of those models to supervise other models in turn.
It really feels like we don’t have solid proposals for arbitrarily scalable solutions right now, but we do have things that might be helpful on models that are plausibly human level, or maybe even a bit more intelligent than humans, and hopefully we will be able to bootstrap our way into aligning much more sophisticated and much more capable models than we have today.
Rob Wiblin: Set a spy to catch a spy.
Richard Ngo: Yeah.
Reinforcement learning undermining obedience [01:40:07]
Rob Wiblin: OK, let’s do the third one. What do you mean by reinforcement learning “will generalise in ways which undermine obedience”?
Richard Ngo: So the things that I’ve talked about so far, none of those are the core things that I’m worried about causing large-scale problems. So if you have a deceptive model, maybe it’s doing insider trading on the stock market. Even if we can’t catch that directly, over time, we’re eventually going to figure out, “Oh, something is off here.” Maybe we’re in a bit of a cat-and-mouse game, where we’re trying to come up with the correct rewards, and the model’s trying to come up with deceptive strategies, but as long as they’re roughly within the human range of intelligence, it feels like we can hopefully constrain a bunch of the worst behaviour that they perform.
Then I’m thinking, what happens when these models become significantly superintelligent? And, in particular, intelligent enough that we just can’t effectively supervise them? What might that look like? It might look like them just operating way too fast for us to understand. If you’ve got an automated CEO who’s sending hundreds of emails every minute, you’re just not going to be able to get many humans to scan all these emails and make sure that there’s not some sophisticated deceptive strategy implemented by them. Or you’ve got AI scientists, who are coming up with novel discoveries that are just well beyond the current state of human science. These are the types of systems where I’m thinking we just can’t really supervise them very well.
So what are they going to do? That basically depends on how they generalise the goals that they previously learned from the period when we were able to supervise them, into these novel domains or these novel regimes.
There are a few different arguments that make me worried that the generalisation is going to go poorly. Because you can imagine, for example, that in the regime where we could supervise, we always penalise deception, and they learn very strong anti-deception goals. Maybe we think that is going to hold up into even quite novel regimes, where deception might look very different from what it previously looked like. Instead of deception being lying to your human supervisor, deception could mean hiding information in the emails you send or something like that.
And I think there are a couple of core problems. The first one is just that the field of machine learning has very few ways to reason about systems that are generalising and transferring their knowledge from one domain to the other. This is just not a regime that has been very extensively studied, basically because it’s just so difficult to say, “You’ve got a model that’s learned one thing. How well can it do this other task? What’s its performance in this wildly different regime?” Because you can’t quantify the difference between Dota and StarCraft or the difference between speaking English and speaking French. These are just very difficult things to understand. So that’s one problem there. Just by default, the way that these systems generalise is in many ways totally obscure to us, and will become more so, as they generalise further and further into more and more novel regimes.
Then there are a few more specific arguments as to why I’m worried that the goals are going to generalise in bad ways. Maybe one way of making these arguments is to distinguish between two types of goals. I’m going to call one type “outcomes” and I’m going to call the other type “constraints.”
Outcomes are just achieving things in the world — like ending up with a lot of money, or ending up having people respect you, or building a really tall building, just things like that. And then constraints I’m going to say are goals that are related to how you get to that point. So what do you need to do? Do you need to be deceptive in order to make a lot of money? Do you need to hurt people? Do you need to go into a specific industry or take this specific job? You might imagine a system has goals related to those constraints as well. And so the concern here is something like: for goals that are in the form of outcomes, systems are going to benefit from applying deceptive or manipulative strategies there.
This is, broadly speaking, a version of Bostrom’s instrumental convergence argument, which basically states that there are just a lot of things that are really useful for achieving large-scale goals. Or another way of saying this is from Stuart Russell, who says, “You can’t fetch coffee if you’re dead.” Even if you only have the goal of fetching coffee, you want to avoid dying, because then you can’t fetch the coffee anymore.
So you can imagine systems that have goals related to outcomes in the world are going to generalise towards, “It seems reasonable for me not to want to die. It seems reasonable for me to want to acquire more resources, or develop better technology, or get in a better position,” and so on, and these are all things that we just don’t want our systems to be doing autonomously. We don’t want them to be consolidating power or gaining more resources if we haven’t told them to, or things like that.
And then the second problem is that if these goals are going to generalise — if the goal “make a lot of money” is going to generalise to motivate systems to take large-scale actions — what about a goal like, “never harm humans,” or “never lie to humans,” or things like that? And I think there the problem is that, as you get more and more capable, there are just more ways of working around any given constraint.
Rob Wiblin: OK, so I’m just trying to follow this. There was a lot there. So you’ve formed these kind of intermediate goals, and then the systems are being scaled up to try to accomplish more ambitious tasks, with more open-ended mechanisms. And if they’ve learned intermediate goals that we don’t exactly like, then they might apply those on a much broader scale. So for example, if we didn’t give appropriate negative feedback to deceptive methods, then who knows what deceptive strategies it might adopt in future when it’s given much broader goals, like “run this company profitably” and so on.
And then you were just saying, on the other hand, when you’ve got constraints — where you’re trying to code in things that a system is meant not to do — as it becomes larger and larger, there’s more and more ways for it to wriggle out of those rules. Am I understanding that right?
Richard Ngo: That seems broadly right. The thing I want to specify here is that at this point, we won’t really have any ability to specify constraints or code things into the system. The only thing we can do here is, while they’re still under our supervision, we can try and reward and penalise various things they’re doing. And then once they become smart enough that we just don’t know what they’re doing — don’t know if this thing is good or bad, don’t understand whether they’re telling the truth or not, and don’t really understand at all how to figure out whether they’re telling the truth or not — at that point it’s going to come down to which concepts they’ve learned.
And if they’ve learned a concept like honesty or obedience, maybe they’ve learned a version of it, which is, “never tell a lie” — but there are many ways that you can cause harm without explicitly telling a lie. So when it comes to systems that have a very wide range of strategies, because they’re very intelligent for achieving outcomes in the world, my worry is that we’re going to need them to generalise the constraints very broadly as well. So not just, “don’t explicitly tell a lie,” but also, “don’t mislead humans,” or “don’t work around humans without telling them what you’re doing,” and things like that.
Rob Wiblin: So the issue here is we have this very complicated set of actual underlying desires and so on, but we’re feeding into the system and the training process more binary messages, like, “Yes, that was good. That was bad. That was good. That was bad.” And it’s trying to infer from that what principles it should be operating on — like what means are acceptable in general, which ones aren’t, what are the goals. And the problem is, it’s always going to come up with all these proxies, all of these guesses as to what we cared about.
Richard Ngo: Right.
Rob Wiblin: But inasmuch as those aren’t exactly what we cared about, then that leaves always a crack, as it gets smarter and smarter and can come up with more and more strategies. So you’re saying, it might learn the general principle of “don’t actively deceive humans” — because that is sufficient to explain all of the negative feedback that it’s received so far — but that’s only one thing that we cared about in this general space. We also cared about lying by omission, but now the door is open to lying by omission, because it’s only learned this far more constrained principle from the feedback that it got.
And there’s always going to be this gap, or we should expect that there always will be a gap between what we actually cared about, and what proxies it has been capable of inferring. And then as it gets more and more powerful, it will learn to drive a truck through that crack. Is that kind of right?
Richard Ngo: Yeah, that seems right. So there’s one problem, which is that our feedback hasn’t covered all the proxies for the behaviours it’s currently capable of — so explicit lying versus lying by omission. But then you’ve also got the issue that as it becomes more intelligent, there’s just a whole swath of possible strategies, right?
Rob Wiblin: More actions are available.
Richard Ngo: Exactly. Maybe it has never been capable of lying by omission thus far. And now, once it becomes intelligent enough, that opens up, and you just don’t know how the thing we’ve trained it to do — which is “never lie” — is going to generalise to this totally new capability. It might be lying in a different language, for example. If it’s been only trained to not lie in English and it learns French, is it still going to have the goal of never lying when it’s speaking French? These are the types of things that we just have no particularly good way of reasoning about right now.
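[To make the point about underdetermined feedback a bit more concrete, here is a toy sketch, not taken from the paper: two candidate rules that both perfectly explain the same finite set of human labels, and then diverge on a behaviour that never came up in training. All the episode data, field names, and rule names are made up purely for illustration.]

```python
# Toy illustration: two candidate rules that both fit the same finite feedback,
# then diverge once a new kind of behaviour becomes available.

# Each training episode is described by two (made-up) features, plus the human label,
# and those labels are the only signal the system receives during training.
training_feedback = [
    ({"states_falsehood": True,  "withholds_key_fact": False}, "bad"),
    ({"states_falsehood": False, "withholds_key_fact": False}, "good"),
    ({"states_falsehood": True,  "withholds_key_fact": False}, "bad"),
    ({"states_falsehood": False, "withholds_key_fact": False}, "good"),
]

def rule_never_state_falsehood(episode):
    """Narrow proxy: only explicit falsehoods count as bad."""
    return "bad" if episode["states_falsehood"] else "good"

def rule_never_mislead(episode):
    """Broader intended rule: falsehoods and lying by omission are both bad."""
    return "bad" if episode["states_falsehood"] or episode["withholds_key_fact"] else "good"

# Both rules perfectly explain every piece of feedback given so far...
for episode, label in training_feedback:
    assert rule_never_state_falsehood(episode) == label
    assert rule_never_mislead(episode) == label

# ...but they disagree on a behaviour that never appeared in training:
new_episode = {"states_falsehood": False, "withholds_key_fact": True}  # lying by omission
print(rule_never_state_falsehood(new_episode))  # -> "good"
print(rule_never_mislead(new_episode))          # -> "bad"
```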
Rob Wiblin: Yeah. OK, great. I think I’ve got the basic thrust of the paper now. We could keep on talking much, much longer about this, I’m sure, but people who’d like to get more detail can, of course, check out the paper online: “The alignment problem from a deep learning perspective.”
Arguments for calm [01:49:44]
Rob Wiblin: I wanted to now have a bit of pushback, because so far I’ve given you lots of space and I’ve given myself lots of space to express reasons why we should be concerned about ways that things could go wrong.
What are the best arguments for thinking that this is all a bit overblown? That I’m really alarmed about where AI might be in 10 or 20 or 30 years, but actually we should probably expect that things will be fine and we’ll muddle through and the outcome will probably be good? What are the most reassuring arguments that one can offer?
Richard Ngo: I think the basic reassuring argument is just that most specific arguments about the future are wrong. It’s really hard to make concrete predictions when we’re talking about systems that are very different from the ones we have today. So this paper has tried to be as concrete as possible, but at the expense of making claims that probably are just going to turn out to be wrong in a whole bunch of different ways.
I think it’s still useful to explore these claims though, because even if the specific manifestation of the phenomenon isn’t quite right, it still seems like there are some pretty robust underlying phenomena here. It’s just useful to be deceptive, for example. That just seems like it’s often true in a very wide range of contexts. And so, even if the specific ways in which I’ve described systems being deceptive don’t actually come about, it still feels like there’s something worth investigating there.
One principle that I think is pretty important is just that it’s harder to be right yourself than it is to prove other people wrong. And so a lot of arguments that people have given for why alignment isn’t important I think are pretty weak, to be honest. There are people who are saying things like, “There’s no such thing as general intelligence,” when humans are clearly doing something interesting that allows us to do science and mathematics and build buildings and build rockets and so on. And if AIs do that same thing, sure, maybe we shouldn’t call it “general intelligence” for some specific reason, but whatever it is, that still seems pretty worrying and interesting.
And then a bunch of other arguments that people have given for thinking that deep learning is totally on the wrong track to building general intelligence: again, I haven’t yet seen a version of those arguments that I find particularly compelling. So I think that’s enough to motivate looking into the problem ourselves. And that’s enough to motivate saying, “Actually, this thing seems pretty important. I should work on it.”
But then, there’s still a big jump from there to all of the specific claims that I make in this paper. So that’s one broad reason to be more optimistic, or at least to feel pretty uncertain about what’s going to happen.
Rob Wiblin: Is another way of framing this that we’ve got all of these arguments in our mind about our understanding of how these systems work, and our understandings of how they might go wrong. But it could be very easy for us to think that we’ve come up with a good argument of that type when in fact we have not, because it’s just so hard to understand and we’re not very good at predicting future technology. We’re just not very good at thinking about and understanding how these AI systems work. Indeed, that’s kind of the problem. But at least, when we don’t understand something and we think it’s extremely dangerous, it’s possible that we’ve misunderstood it because we just don’t have a very good grasp on it. So that can be reassuring in one sense.
Richard Ngo: Yeah, and that seems like my current thinking: that that line of argument feels sufficient to not be strongly confident in any particular claim about catastrophe, but at the same time, it’s not sufficient to dismiss or discard the field. I think that the arguments and evidence for AI posing significant risks are strong enough that people really should be focusing on this as one of the major issues, if not the major issue of our time — while still realising that there’s just a lot of stuff that we don’t know, and a lot of open empirical and theoretical questions that I’ve been trying to dig into and I’m excited about many more people digging into.
Rob Wiblin: Yeah. Are there any other things that somewhat reassure you?
Richard Ngo: Broadly speaking, the fact that we can use language models to answer a bunch of questions, without those language models themselves seeming like they’re pursuing goals or taking large-scale actions in the world, is somewhat reassuring. I think a lot of scenarios that I envision for how things go well involve us just using AI to answer a bunch of difficult questions — like “How do we make sure to build systems in a more reliable and robust way?” — and just getting answers from the AIs that we’ve trained to do scientific research, for example.
And so, the further we can get on scientific-research-type AIs without having AIs that are actually acting in the world, trying to achieve goals, trying to make money or gain power or so on, the better I feel. So that’s one cause for optimism.
I think the response to that might be, “Sure, these language models are good at answering questions, but actually, it’s not that hard to get them to start doing tasks.” And we’re kind of seeing this — I think there are some products put out by Adept recently, which is another AI lab that is trying to leverage language models into performing more assistant-type tasks. And so, maybe it’s just the case that question answering is a little bit ahead of general assistant-style acting in the world right now, but that gap is going to close.
Rob Wiblin: Is another way of phrasing this that the more useful these oracle or non-agent-based models prove to be, the more you think maybe that’s what we’ll pursue: systems that don’t have situational awareness, that just spit back input-output stuff? I guess you’re saying there is a real blurred line between just doing input-output and trying to act and change the world. However, the more useful the less agent-seeming ones are, the more it’s possible we’ll just invest in that. And that’s probably a somewhat safer approach.
Richard Ngo: Right. And the main thing I’m excited about is less like reallocating all the funding and research, because that seems pretty difficult. But if it turns out to be the case that these capabilities are just easier than I expect, and we naturally get question-answering systems that are at a superhuman level before we get superhuman CEOs or superhuman assistants, then that seems like a reason to be optimistic.
Rob Wiblin: A member of the audience wrote in with this question: “Much or most of the intelligence/power of humanity is contained in social and societal understanding. Does that mean that even something much more intelligent than a human might not be that dangerous?”
Richard Ngo: I think it’s some evidence in that direction, because it just feels safer to have more AIs that are each individually less capable than to have one big, totally inscrutable system that is just way more capable than any given human. Having said that, I don’t think it’s particularly strong evidence. Because, suppose that it’s true that you need networks of many agents interacting with each other in order to produce very advanced capabilities. I don’t see a strong reason why that means that those agents will then cooperate very strongly with humans, as opposed to mainly cooperating with each other and shutting humans out of the picture. It seems that if humans are trying to maintain control of these whole societies of AIs that are interacting with each other rapidly — and trading with each other or building on each other’s work, and developing new capabilities together — this feels pretty worrying, because we just don’t have the leverage in the situation. We are less intelligent, we’ll think slower, we’ll have fewer capabilities. And so, I think it’s a better situation, but not a good one by any means.
Rob Wiblin: My reaction to this was thinking that it is true that one person, however bright they seem to be, if they’re just isolated and on their own then they’re not going to seem that smart. They’re going to be struggling just to survive, I guess. So much of what makes us seem intelligent as individuals is all of this accumulated experience and knowledge and wisdom that we can absorb from others, and our ability to coordinate in big groups using strategies and structures that we’ve accumulated over centuries, and on and on and on. So there’s that angle. Of course, AIs that are acting in the world as part of human society on behalf of people can leverage a lot of that as well. They’re already part of a society; they’re already part of organisational structures. So we bring with us to the table all of that existing situation.
Another angle on it might be that a single human can only output speech at the rate a human can think, and can only think about one thing at a time. Even if they’re really smart, significantly sharper than any human that’s ever lived, with that level of processing capacity, they’re going to struggle to overpower everyone else, because they’re massively outnumbered and it’s an extremely challenging task. So let’s say we’re in a situation where we’ve managed to train a model that actually is substantially more capable than humans at basically any task, and it can learn things pretty fast. However, we only have enough compute to run one at roughly human speed. Then I would think we’ve got a really good shot that this thing is not going to be able to outsmart and outwit human civilisation.
The thing is that that’s not a situation that’s likely to persist for terribly long. Firstly, probably we won’t even be starting out that way at all, because we’re going to have what people call a compute overhang — where there is going to be a tonne of GPUs lying around that probably would be able to run many more instantiations than just one. But let’s say that that is how things started out. Then we’ll quickly learn how to make more GPUs. Moore’s law continues to operate, and eventually we’ll be dealing with a population of millions — kind of an AI society, where each individual member is a whole lot smarter than any human that’s around. And then, eventually, we will be back in a dangerous situation, though we will have bought ourselves more time in the meanwhile to figure out how to make these beings act cooperatively with us. Am I thinking about this straight?
Richard Ngo: Yeah. It does seem like we’re on track towards the largest models having many instances of them rolled out. Maybe every person gets a personal assistant. And as those systems get more and more intelligent, the effects that they have on the world increase and increase. And the interactions that they have with the people who are nominally using them become much more complicated. Maybe it starts to become less clear whether they’re being deceptive and so on. So I think that’s broadly right, that we don’t really have mechanisms in place for steering towards limited deployment of the most advanced systems right now. And I’m optimistic that some types of governance work, including the types of things that my team is working on, might be helpful for this. But we don’t really have concrete solutions right now.
Solutions [02:01:07]
Rob Wiblin: OK, let’s push on to solutions. That was the next section header.
Richard Ngo: Amazing.
Rob Wiblin: So those are some reasons to worry more or less about how advances in AI are going to play out. What is your big-picture strategy personally for how you are going to try to hopefully help things get better rather than worse?
Richard Ngo: I guess this comes back down to the different strands of work that I outlined right at the beginning. On the alignment front, just really trying to understand what the problem is, what we’re up against, and what types of predictions we’re making about how advanced systems will play out. Because we’re working in such an uncertain domain, just having a better understanding of how the problems, as phrased in intuitive language, translate into concrete and specific machine learning concepts feels like a big step towards solving them. So that’s my main priority when it comes to technical alignment work: just finding ways to bridge the high-level philosophical arguments with the concrete work that’s being done on a day-to-day basis. And this paper that we’ve been talking about is one attempt towards that. Then I’m planning to do some followup work that makes it even more concrete — hopefully to the extent that we can start running experiments to investigate some of the concepts that I’ve been talking about.
Rob Wiblin: So it’s a little bit standing in the middle between high-level conceptual reasoning about how agents operate and how they could be made safe, and then turning around to the empirical results and how the models actually operate. And switching back and forth between these two and trying to merge them into some plan for how to make things work well.
Richard Ngo: Right. And I think the field of alignment has done a surprisingly bad job at this in the past. I think people have just not really tried that hard to ground out these high-level concepts in concrete machine learning abstractions. To some extent this is reasonable, because a lot of classic machine learning paradigms don’t really account for things like systems that can generalise to new tasks, or neural networks that are doing internal reasoning in ways that we don’t understand. These are just hard things to think about from a classic machine learning perspective. But I think it’s still really valuable to bridge that gap in order to make these arguments more comprehensible and spur on concrete research that I’m pretty excited about. I’m pretty optimistic about the ability of humanity to solve a problem once it’s laid out clearly in front of us. So it feels like that’s a really big step.
And then when it comes to thinking more about governance and the ways in which we can implement regulations, again, the thing I want to do here is ground out these high-level concepts and arguments. Things like we should all coordinate in order to prevent the deployment of dangerous systems, and trying to get really specific concrete examples of ways that people can contribute, especially people with technical backgrounds.
I’ll give a couple of examples of this. One broad example is doing research on chip-level verification mechanisms, such that we can try and produce chips that can’t be used for certain types of particularly risky training. That might be because we can monitor them, or we can verify that they’ve been doing a certain type of activity. Maybe a chip that can verify that it’s been inactive for the past couple of months, so it hasn’t been used on a certain type of training run. Or that could involve more restrictive mechanisms — like NVIDIA has some restrictions on the way some of their GPUs can be used, so that you need to pay extra in order to unlock the ability to run many of them in parallel. That’s the type of hardware-level restriction that some people have been thinking about, and obviously people at NVIDIA and so on have been working on. But I think it would be great to have more people thinking about how to do this so that that can feed into more concrete plans for regulation and governance of AI.
Rob Wiblin: Wow. I didn’t know that sort of chip stuff was possible. We should probably do an episode on that at some point if that’s stuff that’s really live. I know so little about the hardware side of any of this — I’m just like, “Ah, it’s all chips.” Just from my point of view.
Richard Ngo: Right. It’s also not my comparative advantage, in the sense that my background isn’t really hardware focused, so I don’t want to make particular claims about what is and isn’t possible. There are some mechanisms that are in place along these lines, but also they can be circumvented, so it’s hard to make predictions about what would be effective at restraining particularly competent and motivated actors. But it sure seems like a set of research directions that I’d be excited about people looking into.
And then the second one I was going to mention is automated verification of certain properties of machine learning training runs. So if I’m trying to train an AI — and hopefully we’ll have made some progress on figuring out which types of training runs or which types of architectures or algorithms are most likely to produce, for example, deception and what mechanisms you need to put in place — how can we actually verify that this is something that’s going on? One way could be for me to show auditors the actual source code that I’m using for my training run, but that leaks all sorts of information. There are privacy concerns, and maybe it’s just hard to analyse that — maybe there’s just tens or hundreds of thousands of lines of code, and it’s just really difficult for humans to look through that. So it would be really great if we had better means of automatically verifying high-level properties of certain training code. Things like how many parameters are being trained, for how long, using what types of algorithms.
There’s been some work done in these directions, but I think it’s the type of stuff where it’s very unclear how well that sort of work would stand up to attempts to break it by powerful and motivated actors. And so, again, it really feels like an open question how far we can get in terms of automatically verifying properties of these training runs and the resulting models. So that’s another thing I’m pretty excited about people looking into, especially people with more of a background in automated verification and programming language design and things along those lines.
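[As a very rough sketch of what this kind of verification might look like, imagine a lab publishing only a coarse summary of a training run plus a hash commitment, which an auditor checks against an agreed policy without ever seeing the source code. Everything here, including the function names, config fields, and policy limits, is hypothetical, and the sketch assumes the summary is honestly produced; a real scheme would need something like cryptographic attestation or trusted hardware to enforce that.]

```python
import hashlib
import json

# Hypothetical sketch: a lab produces a coarse summary of a training run and a
# hash commitment, and an auditor checks both against an agreed policy without
# ever seeing the training code. All names and numbers here are made up.

def summarise_training_run(config: dict) -> dict:
    """Extract only coarse, auditable properties from a (hypothetical) run config."""
    return {
        "parameter_count": config["parameter_count"],
        "training_steps": config["training_steps"],
        "algorithm_family": config["algorithm_family"],  # e.g. "supervised" or "RL"
    }

def commit(summary: dict) -> str:
    """Hash commitment the lab can publish ahead of an audit."""
    return hashlib.sha256(json.dumps(summary, sort_keys=True).encode()).hexdigest()

def audit(summary: dict, commitment: str, policy: dict) -> bool:
    """Auditor checks the commitment matches and the run stays within agreed limits."""
    within_size_limit = summary["parameter_count"] <= policy["max_parameters"]
    restricted_run_too_long = (
        summary["algorithm_family"] in policy["restricted_algorithms"]
        and summary["training_steps"] > policy["max_restricted_steps"]
    )
    return commit(summary) == commitment and within_size_limit and not restricted_run_too_long

# Illustrative usage with made-up numbers:
run_config = {
    "parameter_count": 7_000_000_000,
    "training_steps": 2_000_000,
    "algorithm_family": "RL",
    "source_code": "...stays private...",
}
summary = summarise_training_run(run_config)
commitment = commit(summary)
policy = {
    "max_parameters": 10_000_000_000,
    "restricted_algorithms": {"RL"},
    "max_restricted_steps": 1_000_000,
}
print(audit(summary, commitment, policy))  # -> False: this RL run exceeds the agreed step limit
```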
Rob Wiblin: You’re working on all of these things?
Richard Ngo: The thing I’m trying to do is, as I said, build these bridges. So less specifically trying to scope it out myself and more, “Here’s what the governance people are saying would be useful for them: they want some ways of enforcing and monitoring treaties. And using my technical background, here’s some things that I think are somewhat plausible and worth investigating.” But really, I want to hand them over to the people who have way more domain expertise, and potentially serve as a bridge — such that those people can say, “What thing would be useful to have in these regulations? Which of these things should we prioritise?” Questions like that.
Rob Wiblin: Fantastic. And all of this is happening under the OpenAI banner?
Richard Ngo: The governance team of OpenAI has been thinking about a bunch of these high-level questions, and working top-down from “What would it look like to be in a world where nobody was going to build specific types of risky systems?” to “What progress can we concretely make today to move towards that?”
Debate and interpretability [02:08:29]
Rob Wiblin: Are there any technical approaches to dealing with the specific path to misbehaviour that you outlined in “The alignment problem from a deep learning perspective” paper? I know people have had all kinds of different ideas for how you might shift how the training is occurring in order to prevent things from going in the wrong direction. Do you like any of them?
Richard Ngo: Yeah. Two of the approaches that I’m most excited about are, firstly, this approach called “debate,” which is an attempt to get AIs to monitor and supervise each other by training them to criticise each other’s behaviour and report ways in which another AI is misbehaving. There was a paper on this originally from Geoffrey Irving that mostly outlined a bunch of the theoretical considerations. There’s been some more recent empirical work in this direction by the OpenAI alignment team. There’s a paper on AIs critiquing each other, which is starting to step towards AIs automating the process of giving high-level feedback, which I think is pretty exciting. Broadly speaking, it’s hard to know how far this type of thing is going to scale, because eventually you’re going to get into a regime where AIs are making arguments about each other’s behaviour, which is just very hard for humans to follow. But it certainly seems like something that we want to push as far as we can.
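[To give a rough sense of the structure being described, here is a minimal sketch of a single debate round. The ask_debater and ask_judge functions are hypothetical stand-ins, not real APIs: in practice the debaters would be trained models rewarded for winning the judge’s verdict, and the judge would be a human or a weaker trusted model.]

```python
# Minimal sketch of the shape of a "debate" setup. The model calls are stubbed
# out with canned responses; in a real system they would be trained debater
# models, and the judge would be a human or a weaker trusted model.

def ask_debater(name: str, transcript: list[str]) -> str:
    """Stand-in for a debater model arguing its side and critiquing the other side."""
    return f"(argument from debater {name}, responding to {len(transcript)} prior lines)"

def ask_judge(transcript: list[str]) -> str:
    """Stand-in for a judge who only has to evaluate the arguments, not answer from scratch."""
    return "A"  # canned verdict, purely for illustration

def run_debate(question: str, num_turns: int = 4) -> str:
    transcript = [f"Question: {question}"]
    for turn in range(num_turns):
        debater = "A" if turn % 2 == 0 else "B"
        # Each debater is rewarded for winning the verdict, which is meant to give it
        # an incentive to point out flaws or deception in the other debater's arguments.
        transcript.append(f"Debater {debater}: {ask_debater(debater, transcript)}")
    return ask_judge(transcript)  # the winning debater would receive the reward signal

print(run_debate("Is this AI's proposed plan safe to execute?"))
```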
And then the second thing that I’m very excited about is progress in interpretability, because it feels like the classic regime — of rewarding agents for behaving well, and continually training them based on their behaviour alone without really understanding what’s going on internally — is just not a great regime to be in. So I’m pretty excited about anything we can do to move towards a more systematic scientific understanding of, “When we give rewards in this way, here’s the types of mechanisms that change within the neural networks. Here’s the types of ways that their representations shift, and here’s how those representations correspond to these high-level concepts.” A lot of that work is being done by Chris Olah and his team at Anthropic, and an increasingly wide range of people — for example, the people behind the Eiffel Tower paper that I mentioned a while back.
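[For a concrete flavour of this kind of work, here is a toy sketch of one common interpretability tool: a linear “probe,” a small classifier trained to read a concept out of a model’s internal activations. The activations below are synthetic stand-ins generated with numpy; in real interpretability research they would be recorded from an actual network, and the concept would be something meaningful rather than a made-up binary feature.]

```python
import numpy as np

# Toy sketch of a linear probe: train a small classifier to read a (made-up)
# binary concept out of "hidden activations". The activations are synthetic;
# in real work they would come from a trained model.

rng = np.random.default_rng(0)
n, d = 1000, 64
concept = rng.integers(0, 2, size=n)       # does each input involve the concept?
direction = rng.normal(size=d)             # pretend the network encodes it along one direction
activations = rng.normal(size=(n, d)) + np.outer(concept, direction)

# Fit a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    logits = np.clip(activations @ w + b, -30, 30)
    probs = 1 / (1 + np.exp(-logits))
    w -= 0.5 * (activations.T @ (probs - concept)) / n
    b -= 0.5 * np.mean(probs - concept)

accuracy = np.mean(((activations @ w + b) > 0) == concept)
print(f"probe accuracy: {accuracy:.2f}")  # high accuracy suggests a linearly readable representation
```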
Rob Wiblin: On the debate and interpretability stuff, opinion in the AI safety space is very fractured on which strategies are promising and which ones aren’t. I know that both of those have their critics — people who think that there’s just no way that either of these methods is going to pan out. Unfortunately, I don’t know it well enough to know exactly what criticisms they would actually mount. But what would you say to a sceptic who was saying this interpretability stuff is never actually going to work? That this is crazy talk?
Richard Ngo: My general response is just that it’s hard to predict science. It’s really difficult to know what clever and motivated researchers can manage to do, given access to these big models, and there are all sorts of surprising breakthroughs that happen across the field all over the place. So I think it does seem important to have some debate about how promising we think these approaches are at a high level — to help inform people who want to work in the field as to which agendas they should focus on and things like that.
But I think it’s plausible that we’re taking this too far, and actually, we just need to sit down and really get as much empirical work as we can done until we hit the wall, and then keep reevaluating. That just seems to be the way that most progress in science has happened throughout history. And so maybe I’d have more opinions if there were some concrete tradeoff that was involved, like, “Should we allocate resources here or there?” But mostly I think people who see something exciting in a given research agenda, and think that they can contribute, should choose mainly on that basis — rather than on these very high-level considerations about whether the whole line of thinking about understanding AI systems is going to work out or not.
Some conceptual alignment research projects [02:12:29]
Rob Wiblin: You wrote this post on the AI Alignment Forum a little while ago, called “Some conceptual alignment research projects,” which was kind of a call to arms for people to investigate a whole lot of quite specific questions. Are there any that you’d like to highlight? It was quite a long list, so people could go and check out the rest of them if they’re keen to find an article to work on or a specific question to investigate.
Richard Ngo: The main ones I want to highlight are the ones related to the definition of goals that I’ve been talking about in this podcast, which is that goals are internal representations within a neural network of outcomes in the real world.
I think one important aspect of this definition is that we don’t necessarily know which networks have goals and which don’t. So for example, if you take something like GPT-3, when it gives an answer, how far ahead is it thinking? Is it thinking about what’s going to happen in the next paragraph? Is it planning out all sorts of long-winded responses? Or is it just thinking a couple of words ahead? We don’t really know at this point. I think there are some things that GPT-3 could be doing that would fit the definition of goals that I gave here. Maybe not a central example of goals, because it still probably wouldn’t be thinking about how its actions affect the world in a very sophisticated way, but still getting closer to it than maybe any other system has in the past.
And that’s just an open empirical question. I can imagine a bunch of research being done, especially interpretability research, to try and home in on that type of question, and even to just pin down more precisely what that would even mean. So there’s some mix of understanding these definitions and concepts better, and then tying them to existing systems, that I’m generally very excited about.
Overrated AI work [02:14:09]
Rob Wiblin: We’ve been talking about things that you think are good. Is there any work that you think is overrated, at least by the people doing it, that you’re willing to tell us? It’s always a little bit edgier to say that things are overrated.
Richard Ngo: I do think that none of the existing books on the alignment problem — including Bostrom’s Superintelligence, Russell’s Human Compatible, and then Brian Christian’s The Alignment Problem — really get at the key thing that I care about and which I’ve been trying to convey in the paper that we’ve been discussing.
And that’s for a range of different reasons. Superintelligence doesn’t really focus on modern machine learning and deep learning, which is understandable given when it was written, but it does make the book feel dated today.
Human Compatible, I think the framing resonates with some people, but the way that Russell frames his solution to the problem, or his overarching agenda to solve the problem, just really doesn’t resonate with me. It feels like it relies on a bunch of assumptions that aren’t really made as clear as they should be — assumptions about, for example, how well we’re going to understand the systems that end up being generally intelligent. I think Russell’s paradigm for solving the problem works if you think that we’re going to make a whole bunch of progress on cognitive science and neuroscience, and we’re going to be building AIs that build on that knowledge in some quite specific ways. But in the regime that we’re currently in, where we’re taking very large black box systems and just throwing a whole bunch of compute at them, it doesn’t really feel like the right type of approach. So that’s something that I think he’s over-claiming, or at least not identifying his assumptions as clearly as he should.
Rob Wiblin: And what were you saying about Brian Christian’s book, The Alignment Problem? I guess that wasn’t setting out an approach to solve the problem; it was more just trying to explain the problem well.
Richard Ngo: I think it’s a pretty great book; it just mostly doesn’t deal with the alignment problem as it might manifest in the regime of human-level and superintelligent AIs. And I think that’s a very reasonable thing to do, because that’s much more speculative and much more difficult to discuss. He covers very well the alignment problem as it exists in systems today, but the thing I’m most concerned about is the manifestation of the alignment problem in these highly advanced systems. And the book just has relatively less to say about that.
Richard’s personal path and advice [02:16:39]
Rob Wiblin: OK, that wraps up our solutions section, I suppose. As always, we could talk about this stuff forever, but we can only take up so much of your time, and I suppose only so much of our listeners’ time on any one particular episode.
As we head towards the end, let’s talk a bit about how you managed to get into all of this in the first place. It’s an interesting path. Plenty of listeners, I imagine, are interested in making a career shift either into AI or some other thing that they have not yet engaged with. What were the most important inflection points in your career so far, in your mind?
Richard Ngo: First, reading about the alignment problem when I was in high school, actually on LessWrong. That laid a bunch of seeds, which I didn’t really do anything about for the next few years. But when I ended up going to undergrad at Oxford, I got involved with effective altruism and then slowly meandered my way towards this career. In hindsight, it took me way longer than it should have, given my beliefs and interests.
Rob Wiblin: When were you first reading about it?
Richard Ngo: This was in 2010, maybe.
Rob Wiblin: 2010. OK, so you were reading about it in general for quite a number of years before you were seriously thinking about taking any action? I guess that is an unfortunate delay, although I suppose the alternative thing is to end up getting bounced around all the time by things that you end up reading; constantly changing plans and being a bit too volatile, which I’ve also seen people do and which can also be a mode of failure.
Richard Ngo: I think at the time I was just too conformist, and also hadn’t really been tracking progress in AI, which at the time was still confined to the academic field way more than it currently is, where everything’s being released to a lot of media attention and so on. So yeah, it took me a little while to get out of that rut. And there are some friends from my time studying in the UK who were really helpful with that.
Rob Wiblin: Fantastic. How long did it take you to start seriously contemplating doing something?
Richard Ngo: I was learning about AI all the way through undergrad. Then I think another inflection point was during my master’s, where I was basically focusing on machine learning, engaging with some of the cutting-edge research. And I felt like that was a big shift, from being in the mindset of consuming knowledge to the mindset of trying to produce knowledge — even if that knowledge was just summaries of the latest papers, and trying to puzzle my way around what was actually going on with these systems. That felt like a big shift.
In hindsight, again, that could have happened in undergrad. There was no real reason why it had to be in my master’s that I shifted into actually engaging with cutting-edge research. But especially in the UK, I think people tend to start doing research a little bit later. In the US, it’s fairly common for people to get research internships maybe even in their first or second year of undergrad. Whereas in the UK, people often save it until their master’s. So that’s just one thing that people, especially in the UK and other countries, should keep in mind: that there are often a bunch of research opportunities earlier on than you might expect.
Rob Wiblin: Yeah. So what were some of the first steps that you took, and what pushed you over the edge?
Richard Ngo: The steps I took were I just started reading papers — trying to summarise them, trying to figure out what I thought of them and where they could be improved and things like that. Implementing some papers and implementing some projects as part of my master’s.
And then the two things that shifted me into working on this full-time were an internship at the Future of Humanity Institute, and then a job offer at DeepMind on the alignment team there. I joined DeepMind straight after the internship, and spent two years there total, where the first year was just getting up to speed and doing a bunch of engineering mostly. But then the second year was really trying to focus on understanding the problem a bit better, and the lines of work that I’ve been talking about today and have been working on throughout the last couple years.
Rob Wiblin: What made you leave DeepMind and go and start the PhD?
Richard Ngo: I think I was officially a research engineer at DeepMind, and the type of work that I was thinking about — and the type of work that I’ve been talking about here — are less engineer-y work. It seems possible that I could have found a place for it at DeepMind, but it wasn’t really a central example of the thing that I’d been hired to do, and it felt like I could do a bunch of that thinking as part of the PhD with a bunch more time and freedom and so on. Which turned out to be true in some ways. And then, of course, you face tradeoffs by not being at the cutting edge of the field anymore, and not having access to the biggest models, and not really hearing a bunch of the discussions that are going on inside these top AI labs. It would have been a reasonable choice to finish the PhD, but when the offer from OpenAI came along, I think it was significantly better for me personally.
Rob Wiblin: But you started the PhD — what was the main motivation there? Was it that you wanted credentials in order to further your career, or you thought it was really important to learn this stuff that you could only learn when you had time to study in depth?
Richard Ngo: I think it was that I wanted to spend time thinking about these questions, and I thought, “Well, I may as well get a PhD out of it.” Especially because PhDs in the UK are very short — they can be three years, compared with five or six years in the US, and you can basically just dive straight into your research project. And I got a bunch of mentorship from people who, in my case, were mostly thinking about philosophy of science: they had less AI knowledge, but were very careful thinkers. So it felt like a good opportunity for me at the time.
Rob Wiblin: Do you think it was a reasonable decision in retrospect?
Richard Ngo: Yeah, I think so. And I think a lot of the time, you get most of the value out of the first year or so of a given thing. So I think I got a bunch of value out of trying to lead my own research, and be more careful in how I approached a bunch of thorny conceptual problems and so on, from the PhD — in a way where probably there would’ve been diminishing returns from staying for the second and third and potentially fourth years. So in hindsight, I’m reasonably happy with how that turned out, although there’s a type of energy and pace of progress that happens over in California, which is always good to get access to earlier on maybe.
Rob Wiblin: What was the situation when you decided to bounce from the PhD after a year or so? You’re thinking, “I’ve been here for a year, I’ve learned a bit of what I can from this”? And I guess you had a job offer from a place, which was the sort of place that you would want to go when you finished the PhD anyway.
Richard Ngo: Exactly.
Rob Wiblin: I see. And you’re like, “Why would I stick around for another two or three years when I can just go do it right away?”
Richard Ngo: Yep.
Rob Wiblin: Do you think in general too many people stick it out in PhDs that aren’t serving them?
Richard Ngo: It’s hard to say, because the base rate of dropping out of PhDs is pretty high, so it’s hard to make a strong judgement either way. I do think that the credential of a PhD is becoming less and less important in machine learning in particular, because there’s just a massive talent shortage, and so places like OpenAI and Anthropic are very open to having research leads or even heads of big teams who just don’t have PhDs or even undergrad degrees sometimes. I think the field is pretty meritocratic in that sense, and so there are a lot of opportunities out there for people. So maybe in the specific domain, it does seem like people are plausibly sticking to PhDs longer than they should, but that’s not a particularly strong opinion.
Rob Wiblin: It’s not uncommon among people I know, that someone gets a couple years into their PhD, and by that stage they are qualified for the role that ultimately they want to get when they graduate. And so they could go and do it, but then they feel like they’ll be wasting all the stuff that they’ve done so far to not stick around for another year or two and potentially go through what’s kind of the slog maybe — the difficult end of a PhD, when you actually have to put it all together and get over the line.
I suppose you didn’t face a very difficult decision with this one, because it seems like a PhD is just not such an important credential, not such a key issue in machine learning, and perhaps people face a somewhat more challenging decision when they’re not sure whether they might need the PhD in the long run, because they’re in an area where it gets you greater kudos.
Richard Ngo: Right. I think my situation was unusual in a bunch of ways, including the fact that I was doing this more philosophical work on the philosophy of machine learning and these high-level questions there. It feels like that’s a pretty unusual path, and it’s a little hard to infer generalisable lessons.
Rob Wiblin: Yeah, totally. Are there any mistakes that you’ve made that you think could be potentially instructive for some listeners?
Richard Ngo: Being too conformist, especially during undergrad. That made me slower to focus on things that I already agreed were important problems.
Maybe not focusing enough on continually learning about the cutting edge of the field — just thinking that, “Great, I’m working. My day to day is as an engineer, or my day to day is thinking about this governance stuff, and I can spend less time on really deeply trying to understand a new thing that came out.” Obviously, with the things that just came out yesterday, it’s hard to tell which ones have staying power and which ones are going to last. But just focusing on continually learning seems like a pretty good heuristic. It’s easy to just get a little bit lazy and then not get around to that.
Rob Wiblin: I mean, each of us only has so much time to stay abreast of things that are going on, both in our field and in the world as a whole. Just staying generally informed. Why do you think you were undervaluing what you would get out of just keeping abreast of new technical results?
Richard Ngo: I think in this field in particular, people get into alignment because they’re paying attention to these high-level abstract arguments. So we’re selecting for people who have a certain tendency to focus on this particular style of thinking, and then end up having a bias towards being a bit less interested in, “But actually, that latest result, how did that work? How did they make that happen?” That’s my guess, that people underrate the extent to which deep technical familiarity is actually going to have surprising or non-obvious returns by sparking insights or reframing the way you see a certain phenomenon.
Rob Wiblin: Is there anything else you want to say about your career that might help steer people in the right direction, or should we push on?
Richard Ngo: Personally, I’ve been very curiosity oriented, and that’s paid off pretty well for me. Especially when you’re younger, it seems like trying to coerce yourself into doing things that you’re not interested in is a tricky prospect. And maybe it pays off for people who are more conscientious than I am. But I do think that where possible, if you can find some pathway to just being really excited and obsessed by a research direction, that’s the thing that’s most likely to have really strong payoffs down the line.
It’s hard to say. Obviously, in some sense, I’m very lucky — because I find the stuff I’m doing really fascinating, while also thinking it’s very important. And this feels like not a coincidence. I find it fascinating, partly because it plays into such big-picture issues that are of such potential significance. But I do think people should spend more time trying to find ways in which they can hook the thing into their brain, so that it feels less like forcing themselves to study it and more like, “Oh yeah, this is what I want to spend my time thinking about in general.”
Characterising utopia [02:28:37]
Rob Wiblin: We’re almost out of time, but one of the non-AI blog posts you’ve written, which I really enjoyed reading this week when I was prepping for the conversation, is called Characterising utopia. Can you tell everyone what challenge with utopias you were trying to address in writing that?
Richard Ngo: It seems like people have kind of branched off in two directions when thinking about utopias. One of them is the sort of transhumanist direction: We’re going to have all of this amazing technology. We’re going to enhance ourselves radically. Life will be unrecognisable. And then the other direction is this very communitarian approach, utopias like Island by Huxley or Walden Two by Skinner, where they’re really focusing on the ways that people relate to each other and the ways in which fundamental human relationships are reshaped.
It just feels to me like, actually, why not both? Probably the world is going to be dramatically changed by technology, which also allows us to have better interpersonal relationships and have deeper connections with other people — which is the stuff that fundamentally most people care about most deeply, their relationships with others. So one of the things I was trying to do here is just see if I can actually describe a utopia in which, sure, we have all this advanced technology, but that’s not actually the main thing for people. The main thing is that it allows them to live richer, more fulfilling lives in a social and relational context.
I threw out a bunch of ideas and haven’t really managed to stitch them together into a cohesive picture. But this is the type of thing that I think people should be thinking about more: just being able to actually interact with each other and have technology facilitate the things that are most meaningful in our lives, rather than just picturing either relationships without technology or technology without relationships.
Rob Wiblin: Yeah, yeah. Some of the shifts that you envisaged wouldn’t be super surprising. Like we could reduce the amount that people experience physical pain, and we could make people be a lot more energetic and a lot more cheerful. But you had a section called “Contentious changes.” What are some of the contentious changes, or possible changes, that you envisage in a utopia?
Richard Ngo: One of the contentious changes here is to do with individualism, and how much more of it or less of it we have in the future than we have today. Because we’ve been on this trend towards much more individualistic societies, where there are fewer constraints on what people do that are externally imposed by society.
I could see this trend continuing, but I could also see it going in the opposite direction. Maybe, for example, in a digital future, we’ll be able to make many copies of ourselves, and so this whole concept of my “personal identity” starts to shift a little bit and maybe I start to think of myself as not just one individual, but a whole group of individuals or this larger entity. And in general, it feels like being part of a larger entity is really meaningful to people and really shapes a lot of people’s lives, whether that’s religion, whether that’s communities, families, things like that.
The problem historically has just been that you don’t get to choose it — you just have to get pushed into this entity that maybe isn’t looking out for your best interests. So it feels interesting to me to wonder if we can in fact design these larger entities or larger superorganisms that are really actually good for the individuals inside, as well as providing this more cohesive structure for them. Is that actually something we want? Would I be willing to lose my individuality if I were part of this group of people who were, for example, reading each other’s minds or just having much less privacy than we have today, if that was set up in such a way that I found it really fulfilling and satisfying?
I really don’t know at all, but it seems like the type of question that is really intriguing and provides a lot of scope for thinking about how technology could just change the ways in which we want to interact with each other.
Rob Wiblin: I’m so inculcated into the individualist culture that the idea slightly makes my skin crawl thinking about any of this stuff. But I think if you tried to look objectively at what has caused human wellbeing throughout history, then it does seem like a somewhat less individualistic culture, where people have deeper ties and commitments to one another, maybe that is totally fine — and I’ve just drunk the Kool-Aid thinking that being an atomised individual is so great.
Richard Ngo: If you know the book, The WEIRDest People in the World, which describes the trend towards individualism and weaker societal ties, I think the people in our circles are the WEIRDest people of the WEIRDest people in the world — where “WEIRD” here is an acronym meaning “Western, educated, industrialised, rich, and democratic,” not just “weird.” So we are the WEIRDest people of the WEIRDest countries. And then you’re not a bad candidate for the WEIRDest person in the WEIRDest community in the WEIRDest countries that we currently have, Rob. So I’m not really too surprised by that.
Rob Wiblin: Definitely out there on the tail. I think a case where this comes up is talking to people who have grown up and live in countries where it’s totally normal to continue to live with your parents into adulthood — into your 20s and 30s, potentially even 40s, until you get married. And I’ve asked a bunch of times, “Do you find this frustrating?” Because I got on really well with my parents, very happy with how they raised me, but the idea of having them up in my business, and I think for them, the idea of me being in their business, it’s just a very, very, very unpleasant idea. We really want to have lots of our own space.
But mostly, I’ve actually just gotten back blank stares. They were like, “Why would that be a problem? I don’t understand. What’s wrong with living with your family? It’s great.” That was far more surprising than any other specific answer that I could have gotten about how it was.
What are some other contentious changes on the list, if there are any others you would like to highlight?
Richard Ngo: This one’s not on the list, but I feel pretty interested in just how weird virtual reality could be. Because right now we imagine — and Meta is imagining — VR as like, you can wander around in these three-dimensional spaces that look like the real world. And there are some games that maybe have a little bit of time travel or a little bit of four-dimensional stuff.
I’m just like, man, it seems like the sky is really the limit here — or not even the sky. But could we have just totally different dimensions, totally different senses, in virtual reality? Especially if you’re a kid playing the latest VR games, I could totally imagine that you grow up and you’ve been playing these five-dimensional games for your entire childhood, and now you can mentally rotate five-dimensional shapes, and you could do the types of mathematics that we currently can’t even dream of.
Is that a good thing? Again, really feel very uncertain about it. I personally would have loved to have grown up like that, because it’s just so intriguing to grow up learning very naturally about these fundamentally alien mental possibilities. But I can imagine that being pretty controversial as well. I just think that there’s so much scope for designing these weird and wonderful worlds that we’ve really barely explored even a fraction of the possibilities, let alone implemented them.
Rob Wiblin: Were there any other changes in there which you think wouldn’t be found in many imagined utopias, which you think are distinctive to your list?
Richard Ngo: I think that people in science fiction have generally not been very creative with imagining new social roles. We have all these archetypes of different personalities and gender roles, for example, and people have just been pretty bad at designing their own. And that’s somewhat understandable, because it’s much easier to imagine new technologies than it is to rethink the fundamental ways in which humans interact with each other. But it seems to me that we could in fact just end up with people who have new, radically alien norms that are as basic and fundamental to them as our most basic norms — like the family norm, for example, or the friendship norm.
I don’t really have great examples of this to give, because I don’t think people have really been thinking about them that much. But for example, the whole conception of romance feels like something that is pretty modern in the historical context. If we imagine ourselves transported into a future where romance looks totally, fundamentally different than it does today, I wouldn’t be surprised, but I would end up very confused and have no idea how to orient around that.
So yeah, I’d love to read more stories like that. And maybe we’ll get the language models onto writing them soon and see what happens there.
Rob Wiblin: Well, the post is called “Characterising utopia,” and it’s on your blog, which is titled Thinking Complete. There’s plenty more in there. It’s quite a long one, so if people enjoyed those three, they should go and read the whole thing.
Richard’s favourite thought experiment [02:37:33]
Rob Wiblin: One final question, one of my favourite questions for people at parties: What’s your favourite thought experiment ever?
Richard Ngo: I am a big fan of alternative history thought experiments. It feels really intriguing to me to wonder, for example, why did the Roman Empire not industrialise? Because they had a bunch of the technology, they had a bunch of the materials and so on. What’s the smallest little push that you can imagine that would start them rolling down the slope towards a fully industrial society? And that’s just kind of fun to play around with. It’s not very clear if we can ever have an actual answer to this, but it feels like these types of questions are just really fun to try and figure out, like, how did we actually get so lucky? Or were we in fact lucky? Was it really overdetermined that the world would end up in the current state that it is?
And then going back even further, why did we even evolve the way that we did? We’ve got humans and we’ve got plants. What’s the smallest change that you could make that would end up with the biological world looking totally different to how it currently does? One which I put out on Twitter a while back actually was something like, “If all animals went extinct, how long would it take for plants to evolve to start chasing each other around?” Because now there’s this new niche in the ecosystem, which is carnivorous creatures. It feels kind of crazy, but also, it only took a few hundred million years —
Rob Wiblin: It’s no crazier than what’s already happened.
Richard Ngo: Right. Exactly. Again, it’s not really the type of thing that we can properly know the answer to anytime soon, but I think it highlights just how radically contingent a lot of the stuff could have been, and how little we know about the forces that actually led us to where we are today. And so that’s always a fun thing I enjoy thinking about.
Rob Wiblin: Just turning back to the Roman one, there’s some science fiction book about someone who’s somehow accidentally sent back to Roman times, and they have to figure out how to make their way. What’s that one called?
Richard Ngo: I don’t know that one, I think. There are a couple that are about what happens if the Romans end up taking over the world. But nothing’s coming to mind.
Rob Wiblin: OK, well, people have told me that science fiction book is good, so we’ll figure out what it is and we’ll stick up a link to it in the blog post associated with the episode.
Richard Ngo: Nice.
Rob Wiblin: One of my favourite thought experiments is a very similar concept: Imagine that you were unexpectedly sent back to 200 BC. Would you be able to convince people that you were from the future? You can’t bring anything, you can’t bring any artefacts like an iPhone or something, but would you, just with your knowledge, be able to convince them? I’ve heard people argue no, but I think the answer is just very clearly yes. Well, you could persuade them either that you’re from the future or you’re from somewhere very weird, where you’ve got knowledge that’s far beyond what they have and very unexpected. But I think I’ll leave that as an exercise to the reader.
All right, my guest today has been Richard Ngo. Thanks so much for coming on The 80,000 Hours Podcast, Richard.
Richard Ngo: Thanks for having me, Rob. Loved the discussion.
Rob’s outro [02:40:52]
Rob Wiblin: As always, we list a range of jobs related to positively shaping the development of AI on our job board at jobs.80000hours.org. There are about 100 up at the moment.
Before we go, Richard designed this course called AGI Safety Fundamentals, and after the episode he wanted to add a few remarks to tell you all about it in case some of you might want to sign up. Here’s Richard:
Richard Ngo: AGI Safety Fundamentals is a course I designed with the goal of helping people learn about the alignment problem, and some of the research directions which are aiming to solve it. It’s an eight-week course. And each week, you have a couple of hours of readings, and then a small group meeting with four or five other participants and a facilitator, where you have a discussion about the readings, the ideas that came up, and any confusions that people had. It’s intended for a pretty wide range of audiences. So ideally, people would have some background in a technical subject like computer science or mathematics. But you don’t need any machine learning background; we’ve had people who are just totally new to machine learning. We’ve also had people who are professors in machine learning and who mainly want to understand, you know, what’s going on with the field of alignment. And we stream those participants into different groups depending on their level of expertise and how familiar they are with the problem overall.
We’re running the next round of that in February this coming year. And so if you want to apply for that, which I recommend as a way to learn more about the alignment problem, the deadline is the fifth of January, and you just want to go to agisafetyfundamentals.com for that.
We’ve also got two other courses, which we run alongside that. One is the AI Governance Fundamentals course, which is a bit less technical and a bit more focused on the political and governance problems involved with the development of advanced AI. And then the other one is the Alignment 201 course for people who have already taken the Alignment Fundamentals course. Those ones aren’t being run quite yet. But you can go to the website and find the curricula for each of those and work through them yourself or with a small group of others, if you want to do so.
Rob Wiblin: One other announcement is that Peter Hartree, who used to work at 80,000 Hours, has recently launched a new podcast feed that features human-read audio versions of popular or important posts from the Effective Altruism Forum, which might be of interest to some regular listeners to this show. If you want to search for it, the name of the feed is ‘EA Forum Podcast (All audio)’ and you can find it in any podcasting app.
All right, The 80,000 Hours Podcast is produced and edited by Keiran Harris.
Audio mastering and technical editing for this episode by Milo McGuire and Ben Cordell.
Full transcripts and an extensive collection of links to learn more are available on our site and put together by Katy Moore.
And our theme song is ‘La Vita e Bella’ by Jazzinuf.
Thanks for joining, talk to you again soon.