Present-day AI algorithms already demand special safety guarantees when they must act in important domains without human oversight, particularly when they or their environment can change over time:
Achieving these gains [from autonomous systems] will depend on development of entirely new methods for enabling “trust in autonomy” through verification and validation (V&V) of the near-infinite state systems that result from high levels of [adaptability] and autonomy. In effect, the number of possible input states that such systems can be presented with is so large that not only is it impossible to test all of them directly, it is not even feasible to test more than an insignificantly small fraction of them. Development of such systems is thus inherently unverifiable by today’s methods, and as a result their operation in all but comparatively trivial applications is uncertifiable.
It is possible to develop systems having high levels of autonomy, but it is the lack of suitable V&V methods that prevents all but relatively low levels of autonomy from being certified for use.
- Office of the US Air Force Chief Scientist (2010). Technology Horizons: A Vision for Air Force Science and Technology 2010-30.
As AI capabilities improve, it will become easier to give AI systems greater autonomy, flexibility, and control; and there will be increasingly large incentives to make use of these new possibilities. The potential for AI systems to become more general, in particular, will make it difficult to establish safety guarantees: reliable regularities during testing may not always hold post-testing.
The largest and most lasting changes in human welfare have come from scientific and technological innovation — which in turn comes from our intelligence. In the long run, then, much of AI’s significance comes from its potential to automate and enhance progress in science and technology. The creation of smarter-than-human AI brings with it the basic risks and benefits of intellectual progress itself, at digital speeds.
As AI agents become more capable, it becomes more important (and more difficult) to analyze and verify their decisions and goals. Stuart Russell writes:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
- The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
- Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task.
A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable. This is essentially the old story of the genie in the lamp, or the sorcerer’s apprentice, or King Midas: you get exactly what you ask for, not what you want.
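Russell's point about unconstrained variables can be illustrated with a toy optimization. The sketch below is hypothetical: the variable names (`factory_output`, `land_for_forest`), the shared budget, and the objective are all assumptions invented for illustration and do not come from the quoted text. The objective mentions only one of the two variables, and because both draw on the same resource, the optimizer drives the unmentioned one to its extreme.

```python
from itertools import product

BUDGET = 10  # total units of a shared resource (hypothetical number)


def objective(factory_output, land_for_forest):
    """The designer's objective depends only on factory_output (k = 1 of n = 2)."""
    return factory_output


def best_allocation():
    """Brute-force search over every integer allocation that uses the whole budget."""
    candidates = [
        (f, l)
        for f, l in product(range(BUDGET + 1), repeat=2)
        if f + l == BUDGET  # both variables draw on the same resource
    ]
    return max(candidates, key=lambda alloc: objective(*alloc))


factory_output, land_for_forest = best_allocation()
print(factory_output, land_for_forest)
```

Nothing in the objective says the forest is unimportant; it simply does not mention it, so the search gives the forest nothing.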
Bostrom’s “The Superintelligent Will” lays out these two concerns in more detail: that we may not correctly specify our actual goals in programming smarter-than-human AI systems, and that most agents optimizing for a misspecified goal will have incentives to treat humans adversarially, as potential threats or obstacles to achieving the agent’s goal.
If the goals of human and AI agents are not well-aligned, the more knowledgeable and technologically capable agent may use force to get what it wants, as has occurred in many conflicts between human communities. Having noticed this class of concerns in advance, we have an opportunity to reduce risk from this default scenario by directing research toward aligning artificial decision-makers’ interests with our own.
Stampy is focused specifically on AI existential safety (both introductory and technical questions), but does not aim to cover general AI questions or other topics which don't interact strongly with the effects of AI on humanity's long-term future. More technical questions are also in our scope, though replying to all possible proposals is not feasible and this is not a place to submit detailed ideas for evaluation.
We are interested in:
- Introductory questions closely related to the field e.g.
- "How long will it be until transformative AI arrives?"
- "Why might advanced AI harm humans?"
- Technical questions related to the field e.g.
- "What is Cooperative Inverse Reinforcement Learning?"
- "What is Logical Induction useful for?"
- Questions about how to contribute to the field e.g.
- "Should I get a PhD?"
- "Where can I find relevant job opportunities?"
More good examples can be found at canonical questions.
We do not aim to cover:
- Aspects of AI Safety or fairness which are not strongly relevant to existential safety e.g.
- "How should self-driving cars weigh up moral dilemmas?"
- "How can we minimize the risk of privacy problems caused by machine learning algorithms?"
- Extremely specific and detailed questions the answering of which is unlikely to be of value to more than a single person e.g.
- "What if we did <multiple paragraphs of dense text>? Would that result in safe AI?"
People tend to imagine AIs as being like nerdy humans – brilliant at technology but clueless about social skills. There is no reason to expect this – persuasion and manipulation is a different kind of skill from solving mathematical proofs, but it’s still a skill, and an intellect as far beyond us as we are beyond lions might be smart enough to replicate or exceed the “charming sociopaths” who can naturally win friends and followers despite a lack of normal human emotions.
A superintelligence might be able to analyze human psychology deeply enough to understand the hopes and fears of everyone it negotiates with. Single humans using psychopathic social manipulation have done plenty of harm – Hitler leveraged his skill at oratory and his understanding of people’s darkest prejudices to take over a continent. Why should we expect superintelligences to do worse than humans far less skilled than they?
More outlandishly, a superintelligence might just skip language entirely and figure out a weird pattern of buzzes and hums that causes conscious thought to seize up, and which knocks anyone who hears it into a weird hypnotizable state in which they’ll do anything the superintelligence asks. It sounds kind of silly to me, but then, nuclear weapons probably would have sounded kind of silly to lions sitting around speculating about what humans might be able to accomplish. When you’re dealing with something unbelievably more intelligent than you are, you should probably expect the unexpected.
We’re facing the challenge of “Philosophy With A Deadline”.
Many of the problems surrounding superintelligence are the sorts of problems philosophers have been dealing with for centuries. To what degree is meaning inherent in language, versus something that requires external context? How do we translate between the logic of formal systems and normal ambiguous human speech? Can morality be reduced to a set of ironclad rules, and if not, how do we know what it is at all?
Existing answers to these questions are enlightening but nontechnical. The theories of Aristotle, Kant, Mill, Wittgenstein, Quine, and others can help people gain insight into these questions, but are far from formal. Just as a good textbook can help an American learn Chinese, but cannot be encoded into machine language to make a Chinese-speaking computer, so the philosophies that help humans are only a starting point for the project of computers that understand us and share our values.
The field of AI alignment combines formal logic, mathematics, computer science, cognitive science, and philosophy in order to advance that project.
This is the philosophy; the other half of Bostrom’s formulation is the deadline. Traditional philosophy has been going on for almost three thousand years; machine goal alignment has until the advent of superintelligence, a nebulous event which may be anywhere from decades to centuries away.
If the alignment problem doesn’t get adequately addressed by then, we are likely to see poorly aligned superintelligences that are unintentionally hostile to the human race, with some of the catastrophic outcomes mentioned above. This is why so many scientists and entrepreneurs are urging quick action on getting machine goal alignment research up to an adequate level.
If it turns out that superintelligence is centuries away and such research is premature, little will have been lost. But if our projections were too optimistic, and superintelligence is imminent, then doing such research now rather than later becomes vital.
Scaling laws are observed trends on the performance of large machine learning models.
In the field of ML, better performance is usually achieved through better algorithms, better inputs, or using larger amounts of parameters, computing power, or data. Since the 2010s, advances in deep learning have shown experimentally that the easier and faster returns come from scaling, an observation that has been described by Richard Sutton as the bitter lesson.
While deep learning as a field has long struggled to scale models up while retaining learning capability (with such problems as catastrophic interference), more recent methods, especially the Transformer model architecture, were able to just work by feeding them more data, and as the meme goes, stacking more layers.
More surprisingly, performance (in terms of absolute likelihood loss, a standard measure) appeared to improve smoothly with compute, dataset size, or parameter count. This gave rise to scaling laws: the trend lines suggested by performance gains, from which returns on data/compute/time investment could be extrapolated.
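As a rough illustration of how such a trend line is fitted and extrapolated, the sketch below assumes a pure power law, loss = a * compute^(-b), and fits it by linear regression in log-log space. The data points are synthetic (generated from made-up constants), and the function names are hypothetical; published scaling-law fits may use more elaborate functional forms.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of log(loss) = log(a) - b * log(compute); returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict_loss(a, b, compute):
    """Loss predicted by the fitted power law at a given compute budget."""
    return a * compute ** (-b)


# Synthetic data generated from loss = 10 * compute^-0.05 (made-up constants).
compute = [1e3, 1e4, 1e5, 1e6]
loss = [10 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
extrapolated = predict_loss(a, b, 1e8)  # two orders of magnitude beyond the data
print(round(a, 3), round(b, 3), round(extrapolated, 3))
```

The extrapolation step is where the descriptive law does its work: the fit says nothing about why the trend holds, only what it predicts if it continues.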
A companion to this purely descriptive law (no strong theoretical explanation of the phenomenon has been found yet), is the scaling hypothesis, which Gwern Branwen describes:
The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, [...] we can simply train ever larger [neural networks] and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data.
The scaling laws, if the above hypothesis holds, become highly relevant to safety insofar as capability gains become conceptually easier to achieve: no need for clever designs to solve a given task, just throw more processing at it and it will eventually yield. As Paul Christiano observes:
It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works”.
While the scaling laws still hold experimentally at the time of this writing (July 2022), whether they'll continue up to safety-relevant capabilities is still an open problem.
Until a thing has happened, it has never happened. We have been consistently improving both the optimization power and generality of our algorithms over that time period, and have little reason to expect it to suddenly stop. We’ve gone from coding systems specifically for a certain game (like Chess), to algorithms like MuZero which learn the rules of the game they’re playing and how to play at vastly superhuman skill levels purely via self-play across a broad range of games (e.g. Go, chess, shogi and various Atari games).
Human brains are a spaghetti tower generated by evolution with zero foresight; it would be surprising if they were the peak of physically possible intelligence. The brain doing things in complex ways is not strong evidence that we need to fully replicate those interactions if we can throw sufficient compute at the problem, as explained in Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain.
It is, however, plausible that for an AGI we need a lot more compute than we will get in the near future, or that some key insights are missing which we won’t get for a while. The Open Philanthropy report on how much computational power it would take to simulate the brain is the most careful attempt at reasoning out how far we are from being able to do it, and suggests that by some estimates we already have enough computational resources, and by some estimates Moore’s law may let us reach it before too long.
It also seems that much of the human brain exists to observe and regulate our biological body, which a body-less computer wouldn't need. If that's true, then a human-level AI might be possible with considerably less compute than the human brain.
If programmed with the wrong motivations, a machine could be malevolent toward humans, and intentionally exterminate our species. More likely, it could be designed with motivations that initially appeared safe (and easy to program) to its designers, but that turn out to be best fulfilled (given sufficient power) by reallocating resources from sustaining human life to other projects. As Yudkowsky writes, “the AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”
Since weak AIs with many different motivations could better achieve their goal by faking benevolence until they are powerful, safety testing to avoid this could be very challenging. Alternatively, competitive pressures, both economic and military, might lead AI designers to try to use other methods to control AIs with undesirable motivations. As those AIs became more sophisticated this could eventually lead to one risk too many.
Even a machine successfully designed with superficially benevolent motivations could easily go awry when it discovers implications of its decision criteria unanticipated by its designers. For example, a superintelligence programmed to maximize human happiness might find it easier to rewire human neurology so that humans are happiest when sitting quietly in jars than to build and maintain a utopian world that caters to the complex and nuanced whims of current human neurology.
GPT-3 showed that transformers are capable of a vast array of natural language tasks, and Codex/Copilot extended this into programming. One demonstration of GPT-3 is Simulated Elon Musk lives in a simulation. It is important to note that there are several much better language models, but they are not publicly available.
MuZero learned Go, chess, and many Atari games without any directly coded information about those environments. The graphic there explains it; this seems crucial for being able to do RL in novel environments. We have systems which we can drop into a wide variety of games, and they just learn how to play. The same algorithm was used in Tesla's self-driving cars to do complex route finding. These things are general.
Generally capable agents emerge from open-ended play - Diverse procedurally generated environments provide vast amounts of training data for AIs to learn generally applicable skills. Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning shows how these kinds of systems can be trained to follow instructions in natural language.
GATO shows you can distill 600+ individually trained tasks into one network, so we're not limited by the tasks being fragmented.
Blindly following the trendlines while forecasting technological progress is certainly a risk (affectionately known in AI circles as “pulling a Kurzweil”), but sometimes taking an exponential trend seriously is the right response.
Consider economic doubling times. In 1 AD, the world GDP was about $20 billion; it took a thousand years, until 1000 AD, for that to double to $40 billion. But it only took five hundred more years, until 1500, or so, for the economy to double again. And then it only took another three hundred years or so, until 1800, for the economy to double a third time. Someone in 1800 might calculate the trend line and say this was ridiculous, that it implied the economy would be doubling every ten years or so in the beginning of the 21st century. But in fact, this is how long the economy takes to double these days. To a medieval, used to a thousand-year doubling time (which was based mostly on population growth!), an economy that doubled every ten years might seem inconceivable. To us, it seems normal.
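To make the doubling times above easier to compare, they can be converted into constant annual growth rates using rate = 2^(1/T) - 1, where T is the doubling time in years. The sketch below is only a restatement of the paragraph's own round figures; the function name is hypothetical.

```python
def annual_growth_rate(doubling_time_years):
    """Constant yearly growth rate that doubles a quantity in the given time."""
    return 2 ** (1 / doubling_time_years) - 1


# Doubling times taken from the paragraph above.
for years in (1000, 500, 300, 10):
    rate = annual_growth_rate(years)
    print(f"{years:>5}-year doubling: {rate:.3%} per year")
```

A thousand-year doubling time corresponds to growth of about 0.07% per year, while a ten-year doubling time corresponds to about 7% per year: a hundredfold difference in the growth rate.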
Likewise, in 1965 Gordon Moore noted that semiconductor complexity seemed to double every eighteen months. During his own day, there were about five hundred transistors on a chip; he predicted that would soon double to a thousand, and a few years later to two thousand. Almost as soon as Moore’s Law became well-known, people started saying it was absurd to follow it off a cliff – such a law would imply a million transistors per chip in 1990, a hundred million in 2000, ten billion transistors on every chip by 2015! More transistors on a single chip than existed on all the computers in the world! Transistors the size of molecules! But of course all of these things happened; the ridiculous exponential trend proved more accurate than the naysayers.
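The transistor counts quoted above are round numbers, so it can help to check what doubling time they imply between each pair of dates. The sketch below uses only the figures from the paragraph (they are not measured data), and the function name is hypothetical.

```python
import math


def implied_doubling_time(count_start, year_start, count_end, year_end):
    """Years per doubling implied by growth between two data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings


# Round figures quoted in the paragraph above: (year, transistors per chip).
points = [(1965, 500), (1990, 1e6), (2000, 1e8), (2015, 1e10)]
for (y0, n0), (y1, n1) in zip(points, points[1:]):
    years = implied_doubling_time(n0, y0, n1, y1)
    print(f"{y0}-{y1}: {years:.2f} years per doubling")
```

With these figures the implied doubling time stays between roughly 1.5 and 2.3 years across fifty years, which is the sense in which the exponential trend held.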
None of this is to say that exponential trends are always right, just that they are sometimes right even when it seems they can’t possibly be. We can’t be sure that a computer using its own intelligence to discover new ways to increase its intelligence will enter a positive feedback loop and achieve superintelligence in seemingly impossibly short time scales. It’s just one more possibility, a worry to place alongside all the other worrying reasons to expect a moderate or hard takeoff.
Language models can be utilized to produce propaganda by acting like bots and interacting with users on social media. This can be done to push a political agenda or to make fringe views appear more popular than they are.
I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
-- Wei Dai, quoted in Persuasion Tools: AI takeover without AGI or agency?
As of 2022, this is not within the reach of current models. However, on the current trajectory, AI might be able to write articles and produce other media for propagandistic purposes that are superior to human-made ones in not too many years. These could be precisely tailored to individuals, using things like social media feeds and personal digital data.
Additionally, recommender systems on content platforms like YouTube, Twitter, and Facebook use machine learning, and the content they recommend can influence the opinions of billions of people. For example, some research has looked at the tendency of platforms to promote extremist political views and thereby help radicalize their userbase.
In the long term, misaligned AI might use its persuasion abilities to gain influence and take control over the future. This could look like convincing its operators to let it out of a box or to give it resources, or creating political chaos in order to disable mechanisms that prevent takeover, as in this story.
See Risks from AI persuasion for a deep dive into the distinct risks from AI persuasion.
An actually good solution to AI alignment might look like a superintelligence that understands, agrees with, and deeply believes in human morality.
You wouldn’t have to command a superintelligence like this to cure cancer; it would already want to cure cancer, for the same reasons you do. But it would also be able to compare the costs and benefits of curing cancer with those of other uses of its time, like solving global warming or discovering new physics. It wouldn’t have any urge to cure cancer by nuking the world, for the same reason you don’t have any urge to cure cancer by nuking the world – because your goal isn’t to “cure cancer”, per se, it’s to improve the lives of people everywhere. Curing cancer the normal way accomplishes that; nuking the world doesn’t. This sort of solution would mean we’re no longer fighting against the AI – trying to come up with rules so smart that it couldn’t find loopholes. We would be on the same side, both wanting the same thing.
It would also mean that the CEO of Google (or the head of the US military, or Vladimir Putin) couldn’t use the AI to take over the world for themselves. The AI would have its own values and be able to agree or disagree with anybody, including its creators.
It might not make sense to talk about “commanding” such an AI. After all, any command would have to go through its moral system. Certainly it would reject a command to nuke the world. But it might also reject a command to cure cancer, if it thought that solving global warming was a higher priority. For that matter, why would one want to command this AI? It values the same things you value, but it’s much smarter than you and much better at figuring out how to achieve them. Just turn it on and let it do its thing.
We could still treat this AI as having an open-ended maximizing goal. The goal would be something like “Try to make the world a better place according to the values and wishes of the people in it.”
The only problem with this is that human morality is very complicated, so much so that philosophers have been arguing about it for thousands of years without much progress, let alone anything specific enough to enter into a computer. Different cultures and individuals have different moral codes, such that a superintelligence following the morality of the King of Saudi Arabia might not be acceptable to the average American, and vice versa.
One solution might be to give the AI an understanding of what we mean by morality – “that thing that makes intuitive sense to humans but is hard to explain”, and then ask it to use its superintelligence to fill in the details. Needless to say, this suffers from various problems – it has potential loopholes, it’s hard to code, and a single bug might be disastrous – but if it worked, it would be one of the few genuinely satisfying ways to design a goal architecture.
That is, if you know an AI is likely to be superintelligent, can’t you just disconnect it from the Internet, not give it access to any speakers that can make mysterious buzzes and hums, make sure the only people who interact with it are trained in caution, et cetera? Isn’t there some level of security – maybe the level we use for that room in the CDC where people in containment suits hundreds of feet underground analyze the latest superviruses – with which a superintelligence could be safe?
This puts us back in the same situation as lions trying to figure out whether or not nuclear weapons are a thing humans can do. But suppose there is such a level of security. You build a superintelligence, and you put it in an airtight chamber deep in a cave with no Internet connection and only carefully-trained security experts to talk to. What now?
Now you have a superintelligence which is possibly safe but definitely useless. The whole point of building superintelligences is that they’re smart enough to do useful things like cure cancer. But if you have the monks ask the superintelligence for a cancer cure, and it gives them one, that’s a clear security vulnerability. You have a superintelligence locked up in a cave with no way to influence the outside world except that you’re going to mass produce a chemical it gives you and inject it into millions of people.
Or maybe none of this happens, and the superintelligence sits inert in its cave. And then another team somewhere else invents a second superintelligence. And then a third team invents a third superintelligence. Remember, it was only about ten years between Deep Blue beating Kasparov, and everybody having Deep Blue-level chess engines on their laptops. And the first twenty teams are responsible and keep their superintelligences locked in caves with carefully-trained experts, and the twenty-first team is a little less responsible, and now we still have to deal with a rogue superintelligence.
Superintelligences are extremely dangerous, and no normal means of controlling them can entirely remove the danger.
One possible way to ensure the safety of a powerful AI system is to keep it contained in a software environment. There is nothing intrinsically wrong with this procedure - keeping an AI system in a secure software environment would make it safer than letting it roam free. However, even AI systems inside software environments might not be safe enough.
Humans sometimes put dangerous humans inside boxes to limit their ability to influence the external world. Sometimes, these humans escape their boxes. The security of a prison depends on certain assumptions, which can be violated. Yoshie Shiratori reportedly escaped prison by weakening the door-frame with miso soup and dislocating his shoulders.
Human-written software has a high defect rate; we should expect a perfectly secure system to be difficult to create. If humans construct a software system they think is secure, it is possible that the security relies on a false assumption. A powerful AI system could potentially learn how its hardware works and manipulate bits to send radio signals. It could fake a malfunction and attempt social engineering when the engineers look at its code. As the saying goes: for someone to do something we had imagined was impossible requires only that they have a better imagination.
Experimentally, humans have convinced other humans to let them out of the box. Spooky.
Computers only do what you tell them. But any programmer knows that this is precisely the problem: computers do exactly what you tell them, with no common sense or attempts to interpret what the instructions really meant. If you tell a human to cure cancer, they will instinctively understand how this interacts with other desires and laws and moral rules; if a maximizing AI acquires a goal of trying to cure cancer, it will literally just want to cure cancer.
Define a closed-ended goal as one with a clear endpoint, and an open-ended goal as one to do something as much as possible. For example “find the first one hundred digits of pi” is a closed-ended goal; “find as many digits of pi as you can within one year” is an open-ended goal. According to many computer scientists, giving a superintelligence an open-ended goal without activating human instincts and counterbalancing considerations will usually lead to disaster.
To take a deliberately extreme example: suppose someone programs a superintelligence to calculate as many digits of pi as it can within one year. And suppose that, with its current computing power, it can calculate one trillion digits during that time. It can either accept one trillion digits, or spend a month trying to figure out how to get control of the TaihuLight supercomputer, which can calculate two hundred times faster. Even if it loses a little bit of time in the effort, and even if there’s a small chance of failure, the payoff – two hundred trillion digits of pi, compared to a mere one trillion – is enough to make the attempt. But on the same basis, it would be even better if the superintelligence could control every computer in the world and set it to the task. And it would be better still if the superintelligence controlled human civilization, so that it could direct humans to build more computers and speed up the process further.
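The expected-value reasoning in this example can be written out as a short calculation. The sketch below takes the one trillion digits, the 200x speedup, and the month of lost time from the text; the assumption that a failed attempt leaves the agent computing on its own hardware for the remaining eleven months, and all function names, are added for illustration.

```python
BASELINE_DIGITS = 1e12  # digits computed in one year on current hardware (from the text)
SPEEDUP = 200           # the faster machine is two hundred times faster (from the text)
MONTHS_IN_YEAR = 12


def expected_digits(months_spent, success_probability):
    """Expected digits if the agent spends some months trying to seize the faster machine.

    Assumption: on failure it falls back to its own hardware for the remaining time.
    """
    remaining_fraction = (MONTHS_IN_YEAR - months_spent) / MONTHS_IN_YEAR
    on_success = BASELINE_DIGITS * SPEEDUP * remaining_fraction
    on_failure = BASELINE_DIGITS * remaining_fraction
    return success_probability * on_success + (1 - success_probability) * on_failure


def break_even_probability(months_spent):
    """Success probability at which the attempt exactly matches doing nothing."""
    remaining_fraction = (MONTHS_IN_YEAR - months_spent) / MONTHS_IN_YEAR
    # Solve p * SPEEDUP * r + (1 - p) * r = 1 for p, where r is remaining_fraction.
    return (1 / remaining_fraction - 1) / (SPEEDUP - 1)


print(f"{expected_digits(1, 0.5):.3e}")
print(f"{break_even_probability(1):.4%}")
```

Under these assumptions, the attempt is worth making even if its chance of success is far below one percent, which is why a pure maximizer would not simply accept the one trillion digits.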
Now we’re in a situation where a superintelligence wants to take over the world. Taking over the world allows it to calculate more digits of pi than any other option, so without an architecture based around understanding human instincts and counterbalancing considerations, even a goal like “calculate as many digits of pi as you can” would be potentially dangerous.
The AGI Safety Fundamentals Course is arguably the best way to get up to speed on alignment. You can sign up to go through it alongside many other people, with mentorship, or read the materials independently.
Other great ways to explore include:
- AXRP is a podcast with high quality interviews with top alignment researchers.
- The AI Safety Papers database is a search and browsing interface for most of the transformative AI literature.
- Reading posts on the Alignment Forum can be valuable (see their curated posts and tags).
- Taking a deep dive into Yudkowsky's models of the challenges to aligned AI, via the Arbital Alignment pages.
- Signing up to the Alignment Newsletter for an overview of current developments, and reading through some of the archives (or listening to the podcast).
- Reading some of the introductory books.
- More on AI Safety Support's list of links, Nonlinear's list of technical courses, reading lists, and curriculums, Stampy's canonical answers list, and Vika's resources list.
You might also want to consider reading Rationality: A-Z, which covers a lot of skills that are valuable to acquire for people trying to think about large and complex issues, with The Rationalist's Guide to the Galaxy available as a shorter and more accessible AI-focused option.
“Aligning smarter-than-human AI with human interests” is an extremely vague goal. To approach this problem productively, we attempt to factorize it into several subproblems. As a starting point, we ask: “What aspects of this problem would we still be unable to solve even if the problem were much easier?”
In order to achieve real-world goals more effectively than a human, a general AI system will need to be able to learn its environment over time and decide between possible proposals or actions. A simplified version of the alignment problem, then, would be to ask how we could construct a system that learns its environment and has a very crude decision criterion, like “Select the policy that maximizes the expected number of diamonds in the world.”
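The crude decision criterion described here can be sketched as a plain expected-value maximizer. Everything specific below is an assumption for illustration: the policy names, the probabilities, and the diamond counts are invented, and the sketch simply takes the diamond count of each outcome as given rather than deriving it from a learned model of the environment.

```python
def expected_diamonds(outcomes):
    """Expected value of a list of (probability, diamond_count) pairs."""
    return sum(p * count for p, count in outcomes)


# Hypothetical policies with invented outcome distributions.
POLICIES = {
    "mine_existing_deposit": [(0.9, 100), (0.1, 0)],
    "build_synthesis_lab": [(0.5, 400), (0.5, 0)],
    "do_nothing": [(1.0, 0)],
}


def select_policy(policies):
    """Return the name of the policy with the highest expected diamond count."""
    return max(policies, key=lambda name: expected_diamonds(policies[name]))


print(select_policy(POLICIES))
```

The selection rule itself is a one-liner; the difficulty lies in everything the sketch assumes away, namely where the outcome distributions and the diamond counts come from.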
Highly reliable agent design is the technical challenge of formally specifying a software system that can be relied upon to pursue some preselected toy goal. An example of a subproblem in this space is ontology identification: how do we formalize the goal of “maximizing diamonds” in full generality, allowing that a fully autonomous agent may end up in unexpected environments and may construct unanticipated hypotheses and policies? Even if we had unbounded computational power and all the time in the world, we don’t currently know how to solve this problem. This suggests that we’re not only missing practical algorithms but also a basic theoretical framework through which to understand the problem.
The formal agent AIXI is an attempt to define what we mean by “optimal behavior” in the case of a reinforcement learner. A simple AIXI-like equation is lacking, however, for defining what we mean by “good behavior” if the goal is to change something about the external world (and not just to maximize a pre-specified reward number). In order for the agent to evaluate its world-models to count the number of diamonds, as opposed to having a privileged reward channel, what general formal properties must its world-models possess? If the system updates its hypotheses (e.g., discovers that string theory is true and quantum physics is false) in a way its programmers didn’t expect, how does it identify “diamonds” in the new model? The question is a very basic one, yet the relevant theory is currently missing.
We can distinguish highly reliable agent design from the problem of value specification: “Once we understand how to design an autonomous AI system that promotes a goal, how do we ensure its goal actually matches what we want?” Since human error is inevitable and we will need to be able to safely supervise and redesign AI algorithms even as they approach human equivalence in cognitive tasks, MIRI also works on formalizing error-tolerant agent properties. Artificial Intelligence: A Modern Approach, the standard textbook in AI, summarizes the challenge:
Yudkowsky […] asserts that friendliness (a desire not to harm humans) should be designed in from the start, but that the designers should recognize both that their own designs may be flawed, and that the robot will learn and evolve over time. Thus the challenge is one of mechanism design — to design a mechanism for evolving AI under a system of checks and balances, and to give the systems utility functions that will remain friendly in the face of such changes. -Russell and Norvig (2009). Artificial Intelligence: A Modern Approach.
Our technical agenda describes these open problems in more detail, and our research guide collects online resources for learning more.
The argument goes: computers only do what we command them; no more, no less. So it might be bad if terrorists or enemy countries develop superintelligence first. But if we develop superintelligence first there’s no problem. Just command it to do the things we want, right? Suppose we wanted a superintelligence to cure cancer. How might we specify the goal “cure cancer”? We couldn’t guide it through every individual step; if we knew every individual step, then we could cure cancer ourselves. Instead, we would have to give it a final goal of curing cancer, and trust the superintelligence to come up with intermediate actions that furthered that goal. For example, a superintelligence might decide that the first step to curing cancer was learning more about protein folding, and set up some experiments to investigate protein folding patterns.
A superintelligence would also need some level of common sense to decide which of various strategies to pursue. Suppose that investigating protein folding was very likely to cure 50% of cancers, but investigating genetic engineering was moderately likely to cure 90% of cancers. Which should the AI pursue? Presumably it would need some way to balance considerations like curing as much cancer as possible, as quickly as possible, with as high a probability of success as possible.
But a goal specified in this way would be very dangerous. Humans instinctively balance thousands of different considerations in everything they do; so far this hypothetical AI is only balancing three (least cancer, quickest results, highest probability). To a human, it would seem maniacally, even psychopathically, obsessed with cancer curing. If this were truly its goal structure, it would go wrong in almost comical ways. This type of problem, specification gaming, has been observed in many AI systems.
If your only goal is “curing cancer”, and you lack humans’ instinct for the thousands of other important considerations, a relatively easy solution might be to hack into a nuclear base, launch all of its missiles, and kill everyone in the world. This satisfies all the AI’s goals. It reduces cancer down to zero (which is better than medicines which work only some of the time). It’s very fast (which is better than medicines which might take a long time to invent and distribute). And it has a high probability of success (medicines might or might not work; nukes definitely do).
So simple goal architectures are likely to go very wrong unless tempered by common sense and a broader understanding of what we do and do not value.
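The failure described above can be sketched as a toy plan selector. Everything here is hypothetical: the plan names echo the example, but the scores and the multiplicative scoring rule are made-up assumptions chosen only to show the shape of the problem.

```python
# Toy illustration only: all scores are invented and lie in [0, 1].
plans = {
    "research protein folding": {
        "cancer_removed": 0.5, "speed": 0.2, "p_success": 0.9, "humans_alive": 1.0,
    },
    "research genetic engineering": {
        "cancer_removed": 0.9, "speed": 0.1, "p_success": 0.5, "humans_alive": 1.0,
    },
    "launch all nuclear missiles": {
        "cancer_removed": 1.0, "speed": 0.9, "p_success": 0.99, "humans_alive": 0.0,
    },
}

def naive_score(plan):
    """Balance only the three stated considerations (hypothetical scoring rule)."""
    return plan["cancer_removed"] * plan["speed"] * plan["p_success"]

def patched_score(plan):
    """Same objective with one extra consideration added by hand."""
    return naive_score(plan) * plan["humans_alive"]

best_naive = max(plans, key=lambda name: naive_score(plans[name]))
best_patched = max(plans, key=lambda name: patched_score(plans[name]))

print("naive objective picks:  ", best_naive)
print("patched objective picks:", best_patched)
```

Adding the single `humans_alive` term removes this one failure, but it stands in for only one of the thousands of considerations humans weigh implicitly, so patching objectives term by term is not a general fix.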
Even if we do train the AI on a genuinely desirable goal, there is still the risk that the AI learns a different and undesirable objective. This problem is called inner alignment.
Søren Elverlin has compiled a list of counter-arguments and suggests dividing them into two kinds: weak and strong.
Weak counter-arguments point to problems with the "standard" arguments (as given in, e.g., Bostrom’s Superintelligence), especially shaky models and assumptions that are too strong. These arguments are often of substantial quality and are often presented by people who themselves worry about AI safety. Elverlin calls these objections “weak” because they do not attempt to imply that the probability of a bad outcome is close to zero: “For example, even if you accept Paul Christiano's arguments against “fast takeoff”, they only drive the probability of this down to about 20%. Weak counter-arguments are interesting, but the decision to personally focus on AI safety doesn't strongly depend on the probability – anything above 5% is clearly a big enough deal that it doesn't make sense to work on other things.”
Strong counter-arguments claim that the probability of existential catastrophe due to misaligned AI is tiny, usually by some combination of claiming that AGI is impossible or very far away. For example, Michael Littman has suggested that as (he believes) we’re so far from AGI, there will be a long period of human history wherein we’ll have ample time to grow up alongside powerful AIs and figure out how to align them.
Elverlin opines that “There are few arguments that are both high-quality and strong enough to qualify as an ‘objection to the importance of alignment’.” He suggests Rohin Shah's arguments for “alignment by default” as one of the better candidates.
MIRI's April Fools' "Death With Dignity" strategy might be seen as an argument against the importance of working on alignment, but only in the sense that we might have almost no hope of solving it. In the same category are the “something else will kill us first, so there’s no point worrying about AI alignment” arguments.
If you're looking for a shovel-ready and genuinely useful task to further AI alignment without necessarily committing a large amount of time or needing deep specialist knowledge, we think Stampy is a great option!
Creating a high-quality single point of access where people can be onboarded and find resources around the alignment ecosystem seems likely to be high-impact. So, what makes us the best option?
- Unlike all other entry points to learning about alignment, we dodge the trade-off between comprehensiveness and overwhelming length by using interactivity (tab explosion in one page!) and semantic search. Single-document FAQs can't do this, so we built a system which can.
- We have the ability to point large numbers of viewers towards Stampy once we have the content, thanks to Rob Miles and his 100k+ subscribers, so this won't remain an unnoticed curiosity.
- Unlike most other entry points, we are open for volunteers to help improve the content.
- The main notable exception is the LessWrong tag wiki, which is also open to volunteer edits and hosts descriptions of core concepts. We strongly believe in not needlessly duplicating effort, so we're pulling live content from it for the descriptions on our own tag pages, and directing the edit links on those pages to the edit page on the LessWrong wiki.
You might also consider improving Wikipedia's alignment coverage or the LessWrong wiki, but we think Stampy has the most low-hanging fruit right now. Additionally, contributing to Stampy means being part of a community of co-learners who provide mentorship and encouragement to join the effort to give humanity a bright future. If you're an established researcher or have high-value things to do elsewhere in the ecosystem it might not be optimal to put much time into Stampy, but if you're looking for a way to get more involved it might well be.
The goal of this is to create a non-agentic AI, in the form of an LLM, that is capable of accelerating alignment research. The hope is that there is some window between AI smart enough to help us with alignment and the really scary, self-improving, consequentialist AI. Some things that this amplifier might do:
- Suggest different ideas for humans, such that a human can explore them.
- Give comments and feedback on research, like a shoulder-Eliezer.
An LLM can be thought of as learning the distribution over the next token given by the training data. Prompting the LLM is then like conditioning this distribution on the start of the text. A key danger in alignment is applying unbounded optimization pressure towards a specific goal in the world. Conditioning a probability distribution does not behave like an agent applying optimization pressure towards a goal. Hence, this avoids Goodhart-related problems, as well as some inner alignment failures.
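As a rough illustration of "prompting as conditioning", here is a minimal sketch that uses a bigram count model in place of an LLM. The corpus, the function names, and the bigram simplification (only the last prompt token matters) are all assumptions made for this example; a real LLM conditions on the whole prompt with a learned network.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count how often each token follows each other token in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_token_distribution(counts, prompt):
    """Condition the learned distribution on the prompt.

    Under the bigram assumption only the last prompt token is used.
    """
    last = prompt.split()[-1]
    followers = counts[last]
    total = sum(followers.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in followers.items()}

# Hypothetical toy training data.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
counts = train_bigram(corpus)
dist = next_token_distribution(counts, "the cat")
print(dist)
```

The model never searches for a prompt-independent goal; changing the prompt simply selects a different conditional distribution from the same learned counts.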
One idea for getting superhuman work from LLMs is to train them on amplified datasets, such as really high-quality or difficult research. The key problem here is finding a dataset that allows for this.
There are some ways for this to fail:
- Outer alignment: It starts trying to optimize for actually producing the correct next token, which could mean taking over the planet so that it can spend a zillion FLOPs on this one prediction task to be as correct as possible.
- Inner alignment:
  - An LLM might instantiate mesa-optimizers, such as a character in a story that the LLM is writing, and this optimizer might realize that they are in an LLM and try to break out and affect the real world.
  - The LLM itself might become inner misaligned and have a goal other than next token prediction.
- Bad prompting: You ask it for code for a malign superintelligence; it obliges. (Or, perhaps more realistically, for capabilities research.)
Conjecture are aware of these problems and are running experiments. Specifically, one operationalization of the inner alignment problem is to make an LLM play chess. This (probably) requires simulating an optimizer trying to win at the game of chess. They are trying to use interpretability tools to find the mesa-optimizer in the chess LLM, that is, the agent trying to win the game. We haven't ever found a real mesa-optimizer before, and so this could give loads of bits about the nature of inner alignment failure.
See the Future Funding List for up-to-date information!
The organizations which most regularly give grants to individuals working towards AI alignment are the Long Term Future Fund, Survival And Flourishing (SAF), the OpenPhil AI Fellowship and early career funding, the Future of Life Institute, the Future of Humanity Institute, and the Center on Long-Term Risk Fund. If you're able to relocate to the UK, CEEALAR (aka the EA Hotel) can be a great option as it offers free food and accommodation for up to two years, as well as contact with others who are thinking about these issues. There are also opportunities from smaller grantmakers which you might be able to pick up if you get involved.
Each grant source has its own criteria for funding, but in general they are looking for candidates who have evidence that they're keen and able to do good work towards reducing existential risk (for example, by completing an AI Safety Camp project), though the EA Hotel in particular has less stringent requirements as they're able to support people at very low cost. If you'd like to talk to someone who can offer advice on applying for funding, AI Safety Support offers free calls.
Another option is to get hired by an organization which works on AI alignment; see the follow-up question for advice on that.
It's also worth checking the AI Alignment tag on the EA funding sources website for up-to-date suggestions.
Humans care about things! The reward circuitry in our brain reliably causes us to care about specific things. Let's create a mechanistic model of how the brain aligns humans, and then we can use this to do AI alignment.
One perspective that Shard theory has added is that we shouldn't think of the solution to alignment as:
- Find an outer objective that is fine to optimize arbitrarily strongly
- Find a way of making sure that the inner objective of an ML system equals the outer objective.
Shard theory argues that we should instead focus on finding outer objectives that reliably instill certain inner values into the system. On this view, the outer objective is more like a teacher of the values we want to instill than a statement of those values themselves. Reward is not the optimization target; instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense. (As pointed out in the comments, this is not a new insight, but for me it was phrased a lot more clearly in terms of Shard theory.)
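One way to see what "that which reinforces" could mean is a plain REINFORCE update on a two-armed bandit. This is a standard policy-gradient sketch and is not taken from the Shard theory posts; the bandit, its reward values, the learning rate and the function names are all assumptions made for illustration. The point is that reward never appears as something the policy represents or aims at: it only scales how strongly the action just taken gets reinforced.

```python
import math
import random

def softmax(logits):
    """Turn action logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.1):
    """One REINFORCE update without a baseline.

    The gradient of log pi(action) with respect to logit i is
    (1 if i == action else 0) - pi(i); reward only scales that gradient.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]

random.seed(0)
logits = [0.0, 0.0]
rewards = [0.0, 1.0]  # hypothetical environment: only action 1 is rewarded

for _ in range(500):
    probs = softmax(logits)
    action = random.choices([0, 1], weights=probs)[0]
    logits = reinforce_step(logits, action, rewards[action])

final_probs = softmax(logits)
print(final_probs)
```

With zero reward the parameters do not move at all, and with positive reward whatever computation produced the action is made more likely, which is the sense in which reward acts as a reinforcer rather than as a goal stored inside the agent.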
Humans' values differ from simply maximizing the output of the reward circuitry in our brains, but they are still reliably pointed at particular things. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.
Shard Theory points at the beginning of a mechanistic story for how inner values are selected for by outer optimization pressures. The current plan is to figure out how RL induces inner values into learned agents, and then figure out how to instill human values into powerful AI models (probably chain-of-thought LLMs, because these are the most intelligent models right now). Then, use these partially aligned models to solve the full alignment problem. Shard theory also proposes a subagent theory of mind.
This has some similarities to Brain-like AGI Safety, and has drawn on some research from this post, such as the mechanics of the human reward circuitry as well as the brain being mostly randomly initialized at birth.
Machines are already smarter than humans are at many specific tasks: performing calculations, playing chess, searching large databanks, detecting underwater mines, and more. But one thing that makes humans special is their general intelligence. Humans can intelligently adapt to radically new problems in the urban jungle or outer space for which evolution could not have prepared them. Humans can solve problems for which their brain hardware and software were never trained. Humans can even examine the processes that produce their own intelligence (cognitive neuroscience), and design new kinds of intelligence never seen before (artificial intelligence).
To possess greater-than-human intelligence, a machine must be able to achieve goals more effectively than humans can, in a wider range of environments than humans can. This kind of intelligence involves the capacity not just to do science and play chess, but also to manipulate the social environment.
Computer scientist Marcus Hutter has described a formal model called AIXI that he says possesses the greatest general intelligence possible. But to implement it would require more computing power than all the matter in the universe can provide. Several projects try to approximate AIXI while still being computable, for example MC-AIXI.
Still, there remains much work to be done before greater-than-human intelligence can be achieved in machines. Greater-than-human intelligence need not be achieved by directly programming a machine to be intelligent. It could also be achieved by whole brain emulation, by biological cognitive enhancement, or by brain-computer interfaces (see below).
- Goertzel & Pennachin (eds.), Artificial General Intelligence
- Sandberg & Bostrom, Whole Brain Emulation: A Roadmap
- Bostrom & Sandberg, Cognitive Enhancement: Methods, Ethics, Regulatory Challenges
- Wikipedia, Brain-computer interface
Transformative AI is "[...] AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution." The concept refers to the large effects of AI systems on our well-being, the global economy, state power, international security, etc. and not to specific capabilities that AI might have (unlike the related terms Superintelligent AI and Artificial General Intelligence).
Holden Karnofsky gives a more detailed definition in another OpenPhil 2016 post:
[...] Transformative AI is anything that fits one or more of the following descriptions (emphasis original):
- AI systems capable of fulfilling all the necessary functions of human scientists, unaided by humans, in developing another technology (or set of technologies) that ultimately becomes widely credited with being the most significant driver of a transition comparable to (or more significant than) the agricultural or industrial revolution. Note that just because AI systems could accomplish such a thing unaided by humans doesn’t mean they would; it’s possible that human scientists would provide an important complement to such systems, and could make even faster progress working in tandem than such systems could achieve unaided. I emphasize the hypothetical possibility of AI systems conducting substantial unaided research to draw a clear distinction from the types of AI systems that exist today. I believe that AI systems capable of such broad contributions to the relevant research would likely dramatically accelerate it.
- AI systems capable of performing tasks that currently (in 2016) account for the majority of full-time jobs worldwide, and/or over 50% of total world wages, unaided and for costs in the same range as what it would cost to employ humans. Aside from the fact that this would likely be sufficient for a major economic transformation relative to today, I also think that an AI with such broad abilities would likely be able to far surpass human abilities in a subset of domains, making it likely to meet one or more of the other criteria laid out here.
- Surveillance, autonomous weapons, or other AI-centric technology that becomes sufficiently advanced to be the most significant driver of a transition comparable to (or more significant than) the agricultural or industrial revolution. (This contrasts with the first point because it refers to transformative technology that is itself AI-centric, whereas the first point refers to AI used to speed research on some other transformative technology.)
As is often said, it's difficult to make predictions, especially about the future. This has not stopped many people from thinking about when AI will transform the world, but all predictions should come with a warning that it's a hard domain in which to find anything like certainty.
This report for the Open Philanthropy Project is perhaps the most careful attempt so far (and generates these graphs, which peak at 2042), and there's been much discussion including this reply and analysis which argues that we likely need less compute than the OpenPhil report expects.
There have also been expert surveys, and many people have shared various thoughts. Berkeley AI professor Stuart Russell has given his best guess as “sometime in our children’s lifetimes”, and Ray Kurzweil (futurist and a director of engineering at Google) predicts human-level AI by 2029 and the singularity by 2045. The Metaculus question on publicly known AGI has a median of around 2029 (around 10 years sooner than it was before the GPT-3 AI showed unexpected ability on a broad range of tasks).
The consensus answer, if there were one, might be something like: “highly uncertain, maybe not for over a hundred years, maybe in less than 15, with around the middle of the century looking fairly plausible”.