From Stampy's Wiki
Questions Asked: 57
Answers written: 78

I'm the resident MediaWiki guy and general organizer of all things stampy. Talk to me if you have questions or need access to something.

Also, I'm practicing to be a life coach specializing in people with an interest in longtermism. I've been working with sudonym and chriscanal for a few months now, and I'm ready to take on a few more people for free calls to discuss your goals and try to disentangle things holding you back from them. While I have spare capacity I can do weekly calls, but may have to move to less regular if other things in my life pick up or a lot of people sign up.

Questions by plex which have been answered

How can I collect questions for Stampy?

As well as simply adding your own questions over at ask question, you could also message your friends with something like:


I'm working on a project to create a comprehensive FAQ about AI alignment (you can read about it here if interested). We're looking for questions and I thought you may have some good ones. If you'd be willing to write up a google doc with you top 5-10ish questions we'd be happy to write a personalized FAQ for you. explains the kinds of questions we're looking for.


and maybe bring the google doc to a Stampy editing session so we can collaborate on answering them or improving your answers to them.

How can I contact the Stampy team?

Talk to us on Discord! Try #suggestions or #general, depending on what you want to talk about.

How can I contribute to Stampy?

If you're not already there, join the public Discord and ask for an invite to the semi-private one where contributors generally hang out.

The main ways you can help are to answer questions or add questions, or help to review questions, review answers, or improve answers (instructions for helping out with each of these tasks are on the linked pages). You could also join the dev team if you have programming skills.

One great thing you can do is write up a Google Doc with your top ~10 questions and post it to the Discord, or ask you friends to do the same (see follow-up question on collecting questions for a template message).

The organizations which most regularly give grants to individuals working towards AI alignment are the Long Term Future Fund, Survival And Flourishing (SAF), the OpenPhil AI Fellowship and early career funding, the Future of Life Institute, the Future of Humanity Institute, and the Center on Long-Term Risk Fund. If you're able to relocate to the UK, CEEALAR (aka the EA Hotel) can be a great option as it offers free food and accommodation for up to two years, as well as contact with others who are thinking about these issues. There are also opportunities from smaller grantmakers which you might be able to pick up if you get involved.

If you want to work on support or infrastructure rather than directly on research, the EA Infrastructure Fund may be able to help. In general, you can talk to EA funds before applying.

Each grant source has their own criteria for funding, but in general they are looking for candidates who have evidence that they're keen and able to do good work towards reducing existential risk (for example, by completing an AI Safety Camp project), though the EA Hotel in particular has less stringent requirements as they're able to support people at very low cost. If you'd like to talk to someone who can offer advice on applying for funding, AI Safety Support offers free calls.

Another option is to get hired by an organization which works on AI alignment, see the follow-up question for advice on that.

Each major organization has a different approach. The research agendas are detailed and complex (see also AI Watch). Getting more brains working on any of them (and more money to fund them) may pay off in a big way, but it’s very hard to be confident which (if any) of them will actually work.

The following is a massive oversimplification, each organization actually pursues many different avenues of research, read the 2020 AI Alignment Literature Review and Charity Comparison for much more detail. That being said:

  • The Machine Intelligence Research Institute focuses on foundational mathematical research to understand reliable reasoning, which they think is necessary to provide anything like an assurance that a seed AI built will do good things if activated.
  • The Center for Human-Compatible AI focuses on Cooperative Inverse Reinforcement Learning and Assistance Games, a new paradigm for AI where they try to optimize for doing the kinds of things humans want rather than for a pre-specified utility function
  • Paul Christano's Alignment Research Center focuses is on prosaic alignment, particularly on creating tools that empower humans to understand and guide systems much smarter than ourselves. His methodology is explained on his blog.
  • The Future of Humanity Institute does work on crucial considerations and other x-risks, as well as AI safety research and outreach.
  • Anthropic is a new organization exploring natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
  • OpenAI is in a state of flux after major changes to their safety team.
  • DeepMind’s safety team is working on various approaches designed to work with modern machine learning, and does some communication via the Alignment Newsletter.
  • EleutherAI is a Machine Learning collective aiming to build large open source language models to allow more alignment research to take place.
  • Ought is a research lab that develops mechanisms for delegating open-ended thinking to advanced machine learning systems.

There are many other projects around AI Safety, such as the Windfall clause, Rob Miles’s YouTube channel, AI Safety Support, etc.

What are alternate phrasings for?

Alternate phrasings are used to improve the semantic search which Stampy uses to serve people questions, by giving alternate ways to say a question which might trigger a match when the main wording won't. They should generally only be used when there is a significantly different wording, rather than for only very minor changes.

There's the "we never figure out how to reliably instill AIs with human friendly goals" filter, which seems pretty challenging, especially with inner alignment, solving morality in a way which is possible to code up, interpretability, etc.

There's the "race dynamics mean that even though we know how to build the thing safely the first group to cross the recursive self-improvement line ends up not implementing it safely" which is potentially made worse by the twin issues of "maybe robustly aligned AIs are much harder to build" and "maybe robustly aligned AIs are much less compute efficient".

There's the "we solved the previous problems but writing perfectly reliably code in a whole new domain is hard and there is some fatal bug which we don't find until too late" filter. The paper The Pursuit of Exploitable Bugs in Machine Learning explores this.

For a much more in depth analysis, see Paul Christiano's AI Alignment Landscape talk and The Main Sources of AI Risk?.

Try to avoid directly referencing the wording of the question in the answer, in order to make the answer more robust to alternate phrasings. For example, that question might be "Can we do X" and the reply is "Yes, if we can manage Y", but then the question might be "why can't we do X" or "What would happen if we tried to do X" so the answer should be like "We might be able to do X, if we can do Y", which works for all of those.

Superintelligence has an advantage that an early human didn’t – the entire context of human civilization and technology, there for it to manipulate socially or technologically.

Canonical questions are the questions which we've checked are in scope and not duplicates, so we want answers to them. They may be edited to represent a class of question more broadly, rather than keeping all their idosyncracies. Once they're answered canonically Stampy will serve them to readers.

An existing question is a duplicate of a new one if it is reasonable to expect whoever asked the new question to be satisfied if they received an answer to the existing question instead.

Follow-up questions are responses to an answer which reader might have, either because they want more information or are providing information to Stampy about what they're looking for. We don't expect to have great coverage of the former for a long time because there will be so many, but hopefully we'll be able to handle some of the most common ones.

What is the Stampy project?

The Stampy Project is an open effort to build a comprehensive FAQ about artificial intelligence existential safety—the field trying to make sure that when we build superintelligent artificial systems they are aligned with human values so that they do the kinds of things we would like them to do.

We're also building Stampy the AI Safety Bot, who will soon be capable of using the FAQ and other sources to educate people about AI alignment via an interactive natural language interface.
The Stampy project is an open effort to build a comprehensive FAQ about artificial intelligence existential safety—the field trying to make sure that when we build superintelligent artificial systems they are aligned with human values so that they do the kinds of things we would like them to do.

We're also building Stampy the AI Safety Bot, who will soon be capable of using the FAQ and other sources to educate people about AI alignment via an interactive natural language interface.

The goals of the project are to:

  • Offer a one-stop-shop for high-quality answers to common questions about AI alignment.
    • Let people answer questions in a way which scales, freeing up researcher time while allowing more people to learn from a reliable source.
    • Make external resources more easy to find by having links to them connected to a search engine which gets smarter the more it's used.
  • Provide a form of legitimate peripheral participation for the AI Safety community, as an on-boarding path with a flexible level of commitment.
    • Encourage people to think, read, and talk about AI alignment while answering questions, creating a community of co-learners who can give each other feedback and social reinforcement.
    • Provide a way for budding researchers to prove their understanding of the topic and ability to produce good work.
  • Collect data about the kinds of questions people actually ask and how they respond, so we can better focus resources on answering them.
Stampy is focused specifically on AI existential safety (both introductory and technical questions), but does not aim to cover general AI questions or other topics which don't interact strongly with the effects of AI on humanity's long-term future.
Stampy is focused on answering common questions people have which are specifically about AI existential safety. More technical questions are also in our scope, though replying to all possible proposals is not feasible and this is not a great place to submit detailed ideas for evaluation.

We are interested in:

  • Introductory questions closely related to the field e.g.
    • "How long will it be until transformative AI arrives?"
    • "Why might advanced AI harm humans?"
  • Technical questions related to the field e.g.
    • "What is Cooperative Inverse Reinforcement Learning?"
    • "What is Logical Induction useful for?"
  • Questions about how to contribute to the field e.g.
    • "Should I get a PhD?"
    • "Where can I find relevant job opportunities?"

More good examples can be found at canonical questions.

We do not aim to cover:

  • Aspects of AI Safety or fairness which are not strongly relevant to existential safety e.g.
    • "How should self-driving cars weigh up moral dilemmas"
    • "How can we minimize the risk of privacy problems caused by machine learning algorithms?"
  • Extremely specific and detailed questions the answering of which is unlikely to be of value to more than a single person e.g.
    • "What if we did <multiple paragraphs of dense text>? Would that result in safe AI?"
We will generally not delete out-of-scope content, but it will be reviewed as low priority to answer, not be marked as a canonical question, and not be served to readers by stampy.

Canonical answers may be served to readers by Stampy, so only answers which have a reasonably high stamp score should be marked as canonical. All canonical answers are open to be collaboratively edited and updated, and they should represent a consensus response (written from the Stampy Point Of View) to a question which is within Stampy's scope.

Answers to YouTube questions should not be marked as canonical, and will generally remain as they were when originally written since they have details which are specific to an idiosyncratic question. YouTube answers may be forked into wiki answers, in order to better respond to a particular question, in which case the YouTube question should have its canonical version field set to the new more widely useful question.

As well as pulling human written answers to AI alignment questions from Stampy's Wiki, Stampy can:

  • Search for AI safety papers e.g. "stampy, what's that paper about corrigibility?"
  • Search for videos e.g. "what's that video where Rob talks about mesa optimizers, stampy?"
  • Calculate with Wolfram Alpha e.g. "s, what's the square root of 345?"
  • Search DuckDuckGo and return snippets
  • And (at least in the patron Discord) falls back to polling GPT-3 to answer uncaught questions

Where can I learn about interpretability?

Christoph Molnar's online book and distill are great sources.

It certainly would be very unwise to purposefully create an artificial general intelligence now, before we have found a way to be certain it will act purely in our interests. But "general intelligence" is more of a description of a system's capabilities, and a vague one at that. We don't know what it takes to build such a system. This leads to the worrying possibility that our existing, narrow AI systems require only minor tweaks, or even just more computer power, to achieve general intelligence.

The pace of research in the field suggests that there's a lot of low-hanging fruit left to pick, after all, and the results of this research produce better, more effective AI in a landscape of strong competitive pressure to produce as highly competitive systems as we can. "Just" not building an AGI means ensuring that every organization in the world with lots of computer hardware doesn't build an AGI, either accidentally or mistakenly thinking they have a solution to the alignment problem, forever. It's simply far safer to also work on solving the alignment problem.

This is a really interesting question! Because, yeah it certainly seems to me that doing something like this would at least help, but it's not mentioned in the paper the video is based on. So I asked the author of the paper, and she said "It wouldn't improve the security guarantee in the paper, so it wasn't discussed. Like, there's a plausible case that it's helpful, but nothing like a proof that it is". To explain this I need to talk about something I gloss over in the video, which is that the quantilizer isn't really something you can actually build. The systems we study in AI Safety tend to fall somewhere on a spectrum from "real, practical AI system that is so messy and complex that it's hard to really think about or draw any solid conclusions from" on one end, to "mathematical formalism that we can prove beautiful theorems about but not actually build" on the other, and quantilizers are pretty far towards the 'mathematical' end. It's not practical to run an expected utility calculation on every possible action like that, for one thing. But, proving things about quantilizers gives us insight into how more practical AI systems may behave, or we may be able to build approximations of quantilizers, etc. So it's like, if we built something that was quantilizer-like, using a sensible human utility function and a good choice of safe distribution, this idea would probably help make it safer. BUT you can't prove that mathematically, without making probably a lot of extra assumptions about the utility function and/or the action distribution. So it's a potentially good idea that's nonetheless hard to express within the framework in which the quantilizer exists. TL;DR: This is likely a good idea! But can we prove it?