Initial questions

What is AI safety?

AI safety is a research field whose goal is to avoid bad outcomes from AI systems.

Work on AI safety can be divided into near-term AI safety and AI existential safety, the latter of which is closely related to AI alignment:

  • Near-term AI safety is about preventing bad outcomes from current systems. Examples of work on near-term AI safety include:
    • getting content recommender systems to not radicalize their users
    • ensuring autonomous cars don’t kill people
    • advocating strict regulations for lethal autonomous weapons
  • AI existential safety, or AGI safety, is about reducing the existential risk from artificial general intelligence (AGI). Artificial general intelligence is AI that is at least as competent as humans in all skills that are relevant for making a difference in the world. AGI has not been developed yet, but it will likely be developed this century. A central part of AGI safety is ensuring that what AIs do is actually what we want; this is called AI alignment (often just “alignment”), because it is about aligning an AI with human values. Alignment is difficult, and building AGI is probably very dangerous, so it is important to mitigate the risks as much as possible. Examples of work on AI existential safety include:
    • trying to get a foundational understanding of what intelligence is, e.g. agent foundations
    • outer and inner alignment: ensuring that the objective we train the system on is actually what we want (outer alignment), and that the objective the trained system ends up pursuing matches the objective it was trained on (inner alignment)
    • AI policy/strategy: e.g. researching the best way to set up institutions and mechanisms that help with safe AGI development, making sure AI isn’t used by bad actors

There are also areas of research which are useful both for near-term safety and for existential safety. For example, robustness to distribution shift and interpretability both help with making current systems safer, and are likely to help with AGI safety.
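
To make "robustness to distribution shift" concrete, here is a minimal, hypothetical sketch in plain Python (the features, data, and "learner" are all invented for illustration): a lazy learner latches onto a proxy feature that happens to track the true label in the training data, and its accuracy collapses once that correlation disappears after deployment.

```python
# Toy illustration of distribution shift via a spurious correlation.
import random

random.seed(0)

def make_data(n, proxy_tracks_label):
    data = []
    for _ in range(n):
        cause = random.random() < 0.5          # the true cause of the label
        # A proxy feature: matches the label in the training distribution,
        # but becomes unrelated to it after the distribution shifts.
        proxy = cause if proxy_tracks_label else (random.random() < 0.5)
        data.append(((int(cause), int(proxy)), int(cause)))
    return data

train = make_data(1000, proxy_tracks_label=True)
deployed = make_data(1000, proxy_tracks_label=False)

# A "model" that learned to rely on the proxy, because it looked perfect in training.
def predict(features):
    return features[1]

def accuracy(data):
    return sum(predict(f) == label for f, label in data) / len(data)

print("accuracy on the training distribution:", accuracy(train))     # ~1.0
print("accuracy after distribution shift:    ", accuracy(deployed))  # ~0.5
```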

Is this about AI systems becoming malevolent or conscious and turning on us?

“The problem isn’t consciousness, but competence. You make machines that are incredibly competent at achieving objectives and they will cause accidents in trying to achieve those objectives.” - Stuart Russell

Work on AI alignment is not concerned with the question of whether “consciousness”, “sentience” or “self-awareness” could arise in a machine or an algorithm. Unlike the frequently-referenced plotline in the Terminator movies, the standard catastrophic misalignment scenarios under discussion do not require computers to become conscious; they only require conventional computer systems (although usually faster and more powerful ones than those available today) blindly and deterministically following logical steps, in the same way that they currently do.

The primary concern (“AI misalignment”) is that powerful systems could inadvertently be programmed with goals that do not fully capture what the programmers actually want. The AI would then harm humanity in pursuit of goals which seemed benign or neutral. Nothing like malevolence or consciousness would need to be involved. A number of researchers studying the problem have concluded that it is surprisingly difficult to guard against this effect, and that it is likely to get much harder as the systems become more capable. Sufficiently capable AI systems are likely to be goal-directed and could, for example, treat our efforts to control them (or switch them off) as impediments to attaining their goals.
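
As a minimal, hypothetical sketch of that failure mode (in Python, with every strategy and number invented): an optimizer that maximizes a proxy reward can settle on behaviour quite different from what its designers had in mind.

```python
# Toy illustration of a misspecified objective: the proxy reward the programmers
# wrote down rewards something different from what they actually wanted.
strategies = {
    "do the task as intended": {"task_completed": 1, "reward_signals_triggered": 3},
    "game the reward signal":  {"task_completed": 0, "reward_signals_triggered": 50},
}

def proxy_reward(outcome):
    # What was written down: count triggered reward signals.
    return 10 * outcome["reward_signals_triggered"]

def intended_value(outcome):
    # What was actually wanted: get the task done.
    return 1000 * outcome["task_completed"]

best_by_proxy = max(strategies, key=lambda s: proxy_reward(strategies[s]))
best_by_intent = max(strategies, key=lambda s: intended_value(strategies[s]))

print("strategy the optimizer picks: ", best_by_proxy)   # "game the reward signal"
print("strategy the designers wanted:", best_by_intent)  # "do the task as intended"
```

The gap between the two objectives is obvious in a toy example like this; the worry is that for powerful systems acting in the real world, writing down a proxy with no such gap appears to be very hard.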

Why can’t we just…

There are many approaches that initially look like they can eliminate these problems, but then turn out to have hidden difficulties. It’s surprisingly easy to come up with “solutions” which don’t actually solve the problem. This can be because…

  • …they require you to be smarter than the system. Many solutions only work while the system is relatively weak, but break once it reaches a certain level of capability (for multiple reasons, e.g. deceptive alignment).
  • …they rely on appearing to make sense in natural language, but when properly unpacked they’re not philosophically clear enough to be usable.
  • … despite being philosophically coherent, we have no idea how to turn them into computer code (or if that’s even possible).
  • …they’re things which we can’t do.
  • …although we can do them, they don’t solve the problem.
  • …they solve a relatively easy subcomponent of the problem but leave the hard problem untouched.
  • …they solve the problem but only as long as we stay “in distribution” with respect to the original training data (distributional shift will break them).

Why might a superintelligent AI be dangerous?

A commonly heard argument goes: yes, a superintelligent AI might be far smarter than Einstein, but it’s still just one program, sitting in a supercomputer somewhere. That could be bad if an enemy government controls it and asks it to help invent superweapons – but then the problem is the enemy government, not the AI per se. Is there any reason to be afraid of the AI itself? Suppose the AI did appear to be hostile, suppose it even wanted to take over the world: why should we think it has any chance of doing so?

There are numerous carefully thought-out AGI-related scenarios which could result in the accidental extinction of humanity. But rather than focussing on any of these individually, it might be more helpful to think in general terms.

"Transistors can fire about 10 million times faster than human brain cells, so it's possible we'll eventually have digital minds operating 10 million times faster than us, meaning from a decision-making perspective we'd look to them like stationary objects, like plants or rocks... To give you a sense, here's what humans look like when slowed down by only around 100x."


Watch that, and now try to imagine advanced AI technology running for a single year around the world, making decisions and taking actions 10 million times faster than we can. That year for us becomes 10 million subjective years for the AI, in which "...there are these nearly-stationary plant-like or rock-like "human" objects around that could easily be taken apart for, say, biofuel or carbon atoms, if you could just get started building a human-disassembler. Visualizing things this way, you can start to see all the ways that a digital civilization can develop very quickly into a situation where there are no humans left alive, just as human civilization doesn't show much regard for plants or wildlife or insects."

Andrew Critch - Slow Motion Videos as AI Risk Intuition Pumps
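
The arithmetic in the quote is simple but worth making explicit (the 10-million-fold figure is the quote's illustration, not a forecast):

```python
# Subjective time at the speed ratio quoted above.
speed_ratio = 10_000_000          # transistors vs. neurons, per the quote
calendar_years = 1
subjective_years = calendar_years * speed_ratio
print(f"{calendar_years} calendar year ≈ {subjective_years:,} subjective years")

# For comparison, the linked video only slows humans down by about 100x.
print(f"that is {speed_ratio // 100:,} times more extreme than the 100x video")
```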

And even putting aside these issues of speed and subjective time, the difference in (intelligence-based) power-to-manipulate-the-world between a self-improving superintelligent AGI and humanity could be far more extreme than the difference in such power between humanity and insects.

“AI Could Defeat All Of Us Combined” is a more in-depth argument by the CEO of Open Philanthropy.

Why is AI alignment a hard problem?

The problem of AI alignment can be compared in difficulty to a combination of rocket science (extreme stresses on the components of the system, very narrow safety margins), launching space probes (once something goes wrong, it may be too late to go back in and fix your code), and developing totally secure cryptography (your code may become a superintelligent adversary that seeks out and exploits even the tiniest flaws in your system).

One sense in which alignment is a hard problem is analogous to the reason rocket science is a hard problem. Relative to other engineering endeavors, rocketry has had so many disasters because of the extreme stresses placed on various mechanical components and the narrow margins of safety required by stringent weight limits. A superintelligence would put vastly more “stress” on the software and hardware stack it is running on, which could cause many classes of failure which don’t occur when you’re working with subhuman systems.

Alignment is also hard like space probes are hard. With recursively self-improving systems, you won’t be able to go back and edit the code later if there is a catastrophic failure because it will competently deceive and resist you.

"You may have only one shot. If something goes wrong, the system might be too 'high' for you to reach up and suddenly fix it. You can build error recovery mechanisms into it; space probes are supposed to accept software updates. If something goes wrong in a way that precludes getting future updates, though, you’re screwed. You have lost the space probe."

Additionally, alignment is hard like cryptographic security. Cryptographers attempt to safeguard against “intelligent adversaries” who search for flaws in a system which they can exploit to break it. “Your code is not an intelligent adversary if everything goes right. If something goes wrong, it might try to defeat your safeguards…” And at the stage where it’s trying to defeat your safeguards, your code may have achieved the capabilities of a vast and perfectly coordinated team of superhuman-level hackers! So if there is even the tiniest flaw in your design, you can be certain that it will be found and exploited. As with standard cybersecurity, "good under normal circumstances" is just not good enough – your system needs to be unbreakably robust.

"AI alignment: treat it like a cryptographic rocket probe. This is about how difficult you would expect it to be to build something smarter than you that was nice – given that basic agent theory says they’re not automatically nice – and not die. You would expect that intuitively to be hard." Eliezer Yudkowsky

Another immense challenge is the fact that we currently have no idea how to reliably instill AIs with human-friendly goals. Even if a consensus could be reached on a system of human values and morality, it’s entirely unclear how this could be fully and faithfully captured in code.

For a more in-depth view of this argument, see Yudkowsky's talk "AI Alignment: Why It’s Hard, and Where to Start" below (full transcript here). For alternative views, see Paul Christiano's “AI alignment landscape” talk, Daniel Kokotajlo and Wei Dai’s “The Main Sources of AI Risk?” list, and Rohin Shah’s much more optimistic position.

Wouldn't a superintelligence be smart enough to know right from wrong?

The issue isn’t that a superintelligence would be unable to understand what humans value; rather, it could understand human values perfectly well and nonetheless value something else itself. There’s a difference between knowing how humans want the world to be and wanting that yourself.

This is a separate matter from the complexity of defining what “the” moral way to behave is (or even what “a” moral way to behave is). Even if that were possible, an AI could potentially figure out what it was but still not be configured in such a way as to follow it. This is related to the so-called “orthogonality thesis”: roughly, the claim that an agent’s level of intelligence and its final goals can vary independently of one another.

Superintelligence sounds like science fiction. Do people think about this in the real world?

Many of the people with the deepest understanding of artificial intelligence are concerned about the risks of unaligned superintelligence. In 2014, Google bought world-leading artificial intelligence startup DeepMind for $400 million; DeepMind added the condition that Google promise to set up an AI Ethics Board. DeepMind cofounder Shane Legg has said in interviews that he believes superintelligent AI will be “something approaching absolute power” and “the number one risk for this century”.

Stuart Russell, Professor of Computer Science at Berkeley, author of the standard AI textbook, and world-famous AI expert, warns of “species-ending problems” and wants his field to pivot to make superintelligence-related risks a central concern. He went so far as to write Human Compatible, a book focused on bringing attention to the dangers of artificial intelligence and the need for more work to address them.

Many other science and technology leaders agree. The late physicist Stephen Hawking said that superintelligence “could spell the end of the human race.” Tech billionaire Bill Gates describes himself as “in the camp that is concerned about superintelligence… I don’t understand why some people are not concerned”. Oxford Professor Nick Bostrom, who has been studying AI risks for over 20 years, has said: “Superintelligence is a challenge for which we are not ready now and will not be ready for a long time.”

Holden Karnofsky, the CEO of Open Philanthropy, has written a carefully reasoned account of why transformative artificial intelligence means that this might be the most important century.

Where can I learn about AI alignment?

If you like interactive FAQs, you're in the right place already! Joking aside, some great entry points are the AI alignment playlist on YouTube, “The Road to Superintelligence” and “Our Immortality or Extinction” posts on Wait But Why for a fun, accessible introduction, and Vox's “The case for taking AI seriously as a threat to humanity” as a high-quality mainstream explainer piece.

The free online Cambridge course on AGI Safety Fundamentals provides a strong grounding in much of the field, plus a cohort and mentor to learn with. There's even an Anki deck for people who like spaced repetition!

There are many resources in this post on Levelling Up in AI Safety Research Engineering, with a list of other guides at the bottom. There is also a Twitter thread here with some programs for upskilling and some for safety-specific learning.

The Alignment Newsletter (also available as a podcast), the Alignment Forum, and the AGI Control Problem subreddit are great for keeping up with the latest developments.

How close do AI experts think we are to creating superintelligence?

Nobody knows for sure when we will have AGI, or if we’ll ever get there. Open Philanthropy CEO Holden Karnofsky has analyzed a selection of recent expert surveys on the matter, as well as taking into account findings from computational neuroscience, economic history, probabilistic methods, and failures of previous AI timeline estimates. This all led him to estimate that "there is more than a 10% chance we'll see transformative AI within 15 years (by 2036); a ~50% chance we'll see it within 40 years (by 2060); and a ~2/3 chance we'll see it this century (by 2100)." Karnofsky bemoans the lack of robust expert consensus on the matter and invites rebuttals to his claims in order to further the conversation. He compares AI forecasting to election forecasting (as opposed to academic political science) and to market forecasting (as opposed to theoretical academic work), arguing that AI researchers may not be the “experts” we should trust to predict AI timelines.

Opinions proliferate, but given experts’ (and non-experts’) poor track record at predicting progress in AI, many researchers tend to be fairly agnostic about when superintelligent AI will be invented.

UC Berkeley AI professor Stuart Russell has given his best guess as “sometime in our children’s lifetimes”, while Ray Kurzweil (a Director of Engineering at Google) predicts human-level AI by 2029 and an intelligence explosion by 2045. Eliezer Yudkowsky expects the end of the world, and Elon Musk expects AGI, before 2030.

If there’s anything like a consensus answer at this stage, it would be something like: “highly uncertain, maybe not for over a hundred years, maybe in less than fifteen, with around the middle of the century looking fairly plausible”.

What are some objections to the importance of AI alignment?

Søren Elverlin has compiled a list of counter-arguments and suggests dividing them into two kinds: weak and strong.

Weak counter-arguments point to problems with the "standard" arguments (as given in, e.g., Bostrom’s Superintelligence), especially shaky models and assumptions that are too strong. These arguments are often substantive and are often presented by people who themselves worry about AI safety. Elverlin calls these objections “weak” because they do not attempt to imply that the probability of a bad outcome is close to zero: “For example, even if you accept Paul Christiano's arguments against “fast takeoff”, they only drive the probability of this down to about 20%. Weak counter-arguments are interesting, but the decision to personally focus on AI safety doesn't strongly depend on the probability – anything above 5% is clearly a big enough deal that it doesn't make sense to work on other things.”

Strong counter-arguments claim that the probability of existential catastrophe due to misaligned AI is tiny, usually via some combination of arguing that AGI is impossible or very far away. For example, Michael Littman has suggested that since (he believes) we’re so far from AGI, there will be a long period of human history wherein we’ll have ample time to grow up alongside powerful AIs and figure out how to align them.

Elverlin opines that “There are few arguments that are both high-quality and strong enough to qualify as an ‘objection to the importance of alignment’.” He suggests Rohin Shah's arguments for “alignment by default” as one of the better candidates.

MIRI's April Fools' "Death With Dignity" strategy might be seen as an argument against the importance of working on alignment, but only in the sense that we might have almost no hope of solving it. In the same category are the “something else will kill us first, so there’s no point worrying about AI alignment” arguments.

What are some AI alignment research agendas currently being pursued?

Research at the Alignment Research Center is led by Paul Christiano, best known for introducing the “Iterated Distillation and Amplification” and “Humans Consulting HCH” approaches. He and his team are now “trying to figure out how to train ML systems to answer questions by straightforwardly ‘translating’ their beliefs into natural language rather than by reasoning about what a human wants to hear.”

Chris Olah (after work at Google Brain and OpenAI) recently co-founded Anthropic, an AI lab focussed on the safety of large models. While his previous work was concerned with the “transparency” and “interpretability” of large neural networks, especially vision models, Anthropic is focussing more on large language models, among other things working towards a "general-purpose, text-based assistant that is aligned with human values, meaning that it is helpful, honest, and harmless".

Stuart Russell and his team at the Center for Human-Compatible Artificial Intelligence (CHAI) have been working on inverse reinforcement learning (where the AI infers human values from observing human behavior) and corrigibility, as well as attempts to disaggregate neural networks into “meaningful” subcomponents (see Filan et al.’s “Clusterability in neural networks” and Hod et al.'s “Detecting modularity in deep neural networks”).
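
As a rough, hypothetical sketch of the core idea behind inverse reinforcement learning (not CHAI's actual algorithms; every option, feature, and number below is invented): instead of being handed a reward function, the system observes choices and asks which candidate reward function best explains them.

```python
# Toy inverse reinforcement learning: score candidate reward functions by how
# well they explain observed human choices between options.
# Each option is a (speed, risk) feature pair; each observation records the
# options that were available and the index of the option the human picked.
observations = [
    ([(5, 1), (2, 0), (9, 3)], 1),   # picked the slower, zero-risk option
    ([(7, 2), (3, 0)], 1),
    ([(4, 1), (4, 0), (8, 4)], 1),
]

candidate_rewards = {
    "values speed, ignores risk": lambda speed, risk: speed,
    "values speed, hates risk":   lambda speed, risk: speed - 10 * risk,
}

def explains(reward, options, chosen):
    # A candidate reward "explains" a choice if the chosen option maximizes it.
    scores = [reward(*option) for option in options]
    return scores[chosen] == max(scores)

for name, reward in candidate_rewards.items():
    fit = sum(explains(reward, options, chosen) for options, chosen in observations)
    print(f"{name}: explains {fit}/{len(observations)} observed choices")
```

Real IRL methods work with full trajectories and probabilistic models of choice, but the direction of inference – from observed behaviour back to the values that plausibly produced it – is the same.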

Alongside the more abstract “agent foundations” work they have become known for, MIRI recently announced their “Visible Thoughts Project” to test the hypothesis that “Language models can be made more understandable (and perhaps also more capable, though this is not the goal) by training them to produce visible thoughts.”

OpenAI have recently been doing work on iteratively summarizing books (summarizing sections, then summarizing the summaries, and so on) as a method for scaling human oversight.
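
Here is a minimal, hypothetical sketch of the recursive structure behind that kind of approach (not OpenAI's actual code); summarize_chunk is a stand-in for a language-model call and here just truncates its input.

```python
# Recursive summarization: split, summarize each chunk, join, repeat until short.
def summarize_chunk(text: str, target_len: int = 200) -> str:
    # Placeholder for a model call; a real system would produce an abstractive summary.
    return text[:target_len]

def split_into_chunks(text: str, chunk_len: int = 1000) -> list[str]:
    return [text[i:i + chunk_len] for i in range(0, len(text), chunk_len)]

def summarize_recursively(text: str, chunk_len: int = 1000) -> str:
    if len(text) <= chunk_len:
        return summarize_chunk(text)
    summaries = [summarize_chunk(c) for c in split_into_chunks(text, chunk_len)]
    return summarize_recursively(" ".join(summaries), chunk_len)

book = "lorem ipsum " * 50_000
print(len(summarize_recursively(book)))  # a single short summary of the whole text
```

The relevance to oversight is that each individual step only asks for a summary of a short piece of text, which a human can feasibly check, even though no human could directly check a one-pass summary of an entire book.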

Stuart Armstrong’s recently launched AlignedAI are mainly working on concept extrapolation from familiar to novel contexts, something he believes is “necessary and almost sufficient” for AI alignment.

Redwood Research (Buck Shlegeris et al.) are trying to “handicap” GPT-3 to only produce non-violent completions of text prompts. “The idea is that there are many reasons we might ultimately want to apply some oversight function to an AI model, like ‘don't be deceitful’, and if we want to get AI teams to apply this we need to be able to incorporate these oversight predicates into the original model in an efficient manner.”

Ought is an independent AI safety research organization led by Andreas Stuhlmüller and Jungwon Byun. They are researching methods for breaking up complex, hard-to-verify tasks into simpler, easier-to-verify tasks, with the aim of allowing us to maintain effective oversight over AIs.

Where can I find people to talk to about AI alignment?

You can join:

Or book free calls with AI Safety Support.

OK, I’m convinced. How can I help?

Great! I’ll ask you a few follow-up questions to help figure out how you can best contribute, give you some advice, and link you to resources which should help you on whichever path you choose. Feel free to scroll up and explore multiple branches of the FAQ if you want answers to more than one of the questions offered :)

Note: We’re still building out and improving this tree of questions and answers; any feedback is appreciated.

At what level of involvement were you thinking of helping?

Please view and suggest improvements to this Google Doc: https://docs.google.com/document/d/1S-CUcoX63uiFdW-GIFC8wJyVwo4VIl60IJHodcRfXJA/edit#

See also: Full list of canonical questions.
