All Articles

430 articles

Basic sections

Intro to AI safety

I: AI progress is leading to superintelligence

1: AI is advancing fast

2: AI may attain human level soon

3: Human-level is not the limit

4: The road from human-level to superintelligent AI may be short

II: AI may end up opposed to us

5: AI may pursue goals

6: AI’s goals may not match ours

7: Different goals may bring AI into conflict with us

III: Consequences could be major, including human extinction

8: AI can win a conflict against us

9: Defeat may be irreversibly catastrophic

10: Advanced AI is a big deal even if we don’t lose control

11: If we get things right, AI could have huge benefits

IV: We need to get our act together

12: Experts are highly concerned

13: We are not reliably on track to solving these problems

14: You can learn more and perhaps help

Understanding AI systems

Fundamentals

What is artificial intelligence (AI)?

What are large language models?

Notable AI systems

What are OpenAI Codex and GitHub Copilot?

What is GPT-3?

What is ChatGPT?

What is GPT-4?

What is Deep Research?

Capabilities and limitations

What are AI "capabilities"?

What is AI slop?

What does it mean when an AI is "hallucinating"?

What is a world model?

What are "reasoning" AI models?

Learning mechanisms

What is reinforcement learning (RL)?

What is gradient descent?

Prompting techniques

What is "jailbreaking" a large language model (LLM)?

How does "chain-of-thought" prompting work?

What is least-to-most prompting?

What is zero-shot prompting?

Alignment techniques

What alignment techniques are used on LLMs?

What is reinforcement learning from human feedback (RLHF)?

What is "Constitutional AI"?

What is chain-of-thought monitoring?

Future AI

General intelligence

What is artificial general intelligence (AGI)?

What is "transformative AI"?

What is "narrow AI"?

What is intelligence?

Superintelligence

What is "superintelligence"?

What are the differences between AGI, transformative AI, and superintelligence?

Time scales

What is "AI takeoff"?

What are the different possible AI takeoff speeds?

When do experts think human-level AI will be created?

Agency and autonomy

What is an agent?

What is an optimizer?

What is tool AI?

The alignment problem

Foundations of alignment

What is AI alignment?

What is AI safety?

What are the differences between AI safety, AI alignment, AI control, Friendly AI, AI ethics, AI existential safety, and AGI safety?

What is corrigibility?

Core challenges

What is the orthogonality thesis?

What is Goodhart's law?

What is instrumental convergence?

What is reward hacking?

What is the "sharp left turn"?

Common misconceptions

Aren't there easy solutions to AI alignment?

Is the worry that AI will become malevolent or conscious?

Difficulty of alignment

At a high level, what is the challenge of AI alignment?

Why is AI alignment a hard problem?

How do fictional stories illustrate AI misalignment?

Inner and outer alignment

What is the difference between inner and outer alignment?

What is inner alignment?

What is outer alignment?

Deception

What is deceptive alignment?

Do current AI models show deception?

Implications of superintelligence

Is this serious?

Do people seriously worry about existential risk from AI?

Why would misaligned AI pose a threat that we can’t deal with?

What is an "AI doomer"?

What is "p(doom)"?

Extreme outcomes

What are existential risks (x-risks)?

What are astronomical suffering risks (s-risks)?

Misuse

What are accident and misuse risks?

Might someone use AI to destroy human civilization?

AI takeover scenarios

What is an "AI takeover"?

Could a superintelligent AI use the internet to take over the physical world?

What is a "decisive strategic advantage"?

Pathways to risk

What are the main sources of AI existential risk?

Why would a misaligned superintelligence kill us?

Wouldn’t AI takeover leave survivors?

Positive futures

What are the potential benefits of advanced AI?

What would a good future with AGI look like?

Objections and responses

Is smarter-than-human AI unrealistic?

Surely an AI can’t be smarter than all humanity?

Wouldn't a superintelligence be slowed down by the need to do physical experiments?

Is AI alignment easy?

Won't we merge with the machines?

Isn’t AI just a tool like any other? Won’t it just do what we tell it to?

Wouldn't a superintelligence be smart enough to know right from wrong?

Wouldn't a superintelligence be smart enough to avoid misunderstanding our instructions?

Why not just set AI goals?

Aren't there easy solutions to AI alignment?

Can we list the ways a task could go disastrously wrong and tell an AI to avoid them?

Could we tell the AI to do what's morally right?

Can you give an AI a goal which involves “minimally impacting the world”?

Can we constrain a goal-directed AI using specified rules?

Why can’t we just use Asimov’s Three Laws of Robotics?

Why can't we just make a "child AI" and raise it?

What is "Do what I mean"?

Why not just control AI?

Is it possible to limit an AI's interactions with the Internet?

Why can't we just turn the AI off if it starts to misbehave?

Can you stop an advanced AI from upgrading itself?

Can we test an AI to make sure it won't misbehave if it becomes superintelligent?

Why don't we just not build AGI if it's so dangerous?

Why can’t we just “put the AI in a box” so that it can’t influence the outside world?

Dealing with misaligned AGI after deployment

How can AI cause harm if it can't manipulate the physical world?

Wouldn't humans triumph over a rogue AI because there are more of us?

Are corporations superintelligent?

Can't we limit damage from AI systems in the same ways we limit damage from companies?

Will AI be able to think faster than humans?

Why would misaligned AI pose a threat that we can’t deal with?

Other issues from AI

Isn’t the real concern with AI something else?

What about people misusing AI?

What about technological unemployment from AI?

What about autonomous weapons?

What about AI-enabled surveillance?

What about automated AI persuasion and propaganda?

What about AI that is biased?

What about deepfakes?

What about the environmental impacts of AI?

What is AI psychosis?

What about AI companions?

Morality

Might an aligned superintelligence force people to change?

Isn’t it immoral to control and impose our values on AI?

If I only care about helping people alive today, does AI safety still matter?

Does the importance of AI risk depend on caring about the long-term future?

Wouldn't it be a good thing for humanity to die out?

Objections to AI safety research

Why should we prepare for human-level AI now rather than when it’s closer?

Could AI alignment research be bad? How?

What are some arguments why AI safety might be less important?

Miscellaneous arguments

Isn't capitalism the real unaligned superintelligence?

Aren't AI existential risk concerns just an example of Pascal's mugging?

Wouldn't AIs need to have a power-seeking drive to pose a serious risk?

Is the AI safety movement about stopping all technology?

Other resources

About Us

How can I contribute to AISafety.info as a volunteer?

What is Stampy's AI Safety Info?

What is this site about?

How does the Stampy chatbot work?

Resources elsewhere

What are some other introductions to AI safety?

What are some good podcasts about AI safety?

What are some good books about AI safety?

Where can I find videos about AI safety?

Research resources

What are some exercises and projects I can try?

Advanced sections

Beyond the basics

Interpreting language models

How can LLMs be understood as “simulators”?

What is cyborgism?

What is a shoggoth?

Mesa-optimizers and subagents

What are "mesa-optimizers"?

What is a subagent?

What are tiling agents?

What are the differences between subagents and mesa-optimizers?

Decision theory

What should I read to learn about decision theory?

What are the different versions of decision theory?

What is "causal decision theory"?

What is "evidential decision theory"?

What is "functional decision theory"?

What is a "value handshake"?

What is Newcomb’s paradox?

What is timeless decision theory?

What is updateless decision theory?

Mathematics of agents

What is AIXI?

What are the power-seeking theorems?

What is the Von Neumann-Morgenstern (VNM) utility theorem?

What is Savage's subjective expected utility model?

What is a representation theorem?

Strategy and outcomes

What are "pivotal acts"?

What is the "long reflection"?

What are astronomical suffering risks (s-risks)?

What is an alignment tax?

What is a singleton?

What is a multipolar scenario?

What are infohazards?

Brain emulation

What is "whole brain emulation"?

Will whole brain emulation arrive before other forms of AGI?

What safety problems are associated with whole brain emulation?

What are the ethical challenges related to whole brain emulation?

Human intelligence enhancement

What are brain-computer interfaces?

What is "biological cognitive enhancement"?

Computer science

What is the Church-Turing thesis?

What are the "no free lunch" theorems?

What is mutual information?

What is meta-RL?

What are inductive biases?

Values

What are "human values"?

Which moral theories would be easiest to encode into an AI?

What is "coherent extrapolated volition (CEV)"?

What is perverse instantiation?

AI consciousness

Could AI have emotions?

Making LLMs useful

What is Retrieval-Augmented Generation (RAG)?

What is scaffolding?

Predictions about future AI

Timelines

When do experts think human-level AI will be created?

What evidence do experts usually base their timeline predictions on?

Are AI self-improvement projections extrapolating an exponential trend too far?

What are AI timelines?

What is an AI's "time horizon length"?

Compute and scaling

Can we get AGI by scaling up architectures similar to current ones, or are we missing key insights?

What is compute?

What are scaling laws?

What is the "Bitter Lesson"?

How much computing power did evolution use to create the human brain?

Nature of AI

Why might people build AGI rather than better narrow AIs?

How can progress in non-agentic LLMs lead to capable AI agents?

How might things go wrong even without an agentic AI?

Can we think of AIs as human-like?

What is Moravec’s paradox?

What are some tasks where AI exceeded human expectations?

Takeoff

What is "AI takeoff"?

What are the different possible AI takeoff speeds?

What are the differences between a singularity, an intelligence explosion, and a hard takeoff?

How quickly could an AI go from harmless to existentially dangerous?

How might we get from artificial general intelligence to a superintelligent system?

How long will it take to go from human-level AI to superintelligence?

Will there be a discontinuity in AI capabilities?

Why does AI takeoff speed matter?

Takeover

What is a “treacherous turn”?

Could a superintelligent AI use the internet to take over the physical world?

How might AI socially manipulate humans?

How might AGI kill people?

How likely is it that an AI would pretend to be a human to further its goals?

Why would we only get one chance to align a superintelligence?

What is an intelligence explosion?

Relative capabilities

How powerful could a superintelligence become?

Why would intelligence lead to power?

What could a superintelligent AI do, and what would be physically impossible even for it?

What is Vinge’s principle?

What is Vingean uncertainty?

Would AI technology favor offense or defense in a conflict?

Good outcomes

What are the potential benefits of advanced AI?

If we solve alignment, are we sure of a good future?

What would a good future with AGI look like?

Catastrophic outcomes

What is a "warning shot"?

What is an “AGI fire alarm”?

How likely is extinction from superintelligent AI?

If we go extinct due to misaligned AI, at least nature will continue, right?

Are there any detailed example stories of what unaligned AGI would look like?

Alignment research

Current techniques

What alignment techniques are used on LLMs?

What is reinforcement learning from human feedback (RLHF)?

What is "Constitutional AI"?

What is "jailbreaking" a large language model (LLM)?

What is adversarial training?

How does Redwood Research do adversarial training?

How does DeepMind do adversarial training?

What is behavioral cloning?

What is imitation learning?

Benchmarks and evals

How is red teaming used in AI alignment?

What is the Abstraction and Reasoning Corpus (ARC)?

Prosaic alignment

What is prosaic alignment?

What is "externalized reasoning oversight"?

What is Iterated Distillation and Amplification (IDA)?

What is AI Safety via Debate?

What is "HCH"?

What is Eliciting Latent Knowledge (ELK)?

What is discovering latent knowledge (DLK)?

How is the Alignment Research Center (ARC) trying to solve Eliciting Latent Knowledge (ELK)?

What is superalignment?

What is scalable oversight?

Interpretability

What is interpretability and what approaches are there?

What is the difference between verifiability, interpretability, transparency, and explainability?

How might interpretability be helpful?

What is neural network modularity?

What is feature visualization?

What are polysemantic neurons?

What is a "polytope" in a neural network?

What is mechanistic interpretability?

Agent foundations

What is "agent foundations"?

What is a "quantilizer"?

Would it help to cut off the top few percent of a quantilizer's distribution?

What are “type signatures”?

What are "true names" in the context of AI alignment?

What is Infra-Bayesianism?

Why might a maximizing AI cause bad outcomes?

What is the "natural abstraction hypothesis"?

What is "wireheading"?

What are satisficers?

Other alignment approaches

What is "metaphilosophy" and how does it relate to AI safety?

What is shard theory?

What is "Safeguarded AI"?

Organizations and agendas

What is everyone working on in AI alignment?

What is FAR AI's research agenda?

What is Anthropic's alignment research agenda?

What technical problems is MIRI working on?

What is OpenAI's alignment research agenda?

What is DeepMind's safety team working on?

What projects is CAIS working on?

What is the Alignment Research Center (ARC)'s research agenda?

What is the Center for AI Safety (CAIS)'s research agenda?

What is Conjecture's research agenda?

What is Ought's research agenda?

What is the Center for Human Compatible AI (CHAI)?

What is the Center on Long-Term Risk (CLR)'s research agenda?

What is Redwood Research's agenda?

What is Obelisk's research agenda?

What is Encultured's research agenda?

What is the UK's AI Security Institute?

What is the AI control research agenda?

Researchers

What is Sam Bowman researching?

What is David Krueger working on?

What are Scott Garrabrant and Abram Demski working on?

What is John Wentworth's research agenda?

What is Aligned AI / Stuart Armstrong working on?

AI governance

Governance research

What are the key problems in AI governance?

What technical research would be helpful for governance?

What are open-weights AI models?

Compute governance

What is compute governance?

What makes compute usage regulation promising for AI governance?

What are the limitations of compute governance?

Major labs

What are some of the leading AI capabilities organizations?

Are Google, OpenAI, etc. aware of the risk?

International politics

Is the UN concerned about existential risk from AI?

What might an international treaty on the development of AGI look like?

What is the EU AI Act?

What is the UN AI Advisory Body?

What open letters have been written about AI safety?

Policies

What are Responsible Scaling Policies (RSPs)?

Would a slowdown in AI capabilities development decrease existential risk?

Could governmental investments help with AI alignment?

What is the "windfall clause"?

What is the White House executive order on AI?

What is the "Open Global Investment" model?

What criticisms have been made of Bostrom's Open Global Investment model?

Policy resources

What are some helpful AI policy resources?

Governance research organizations

What is everyone working on in AI governance?

What does the Future of Life Institute (FLI) work on?
