Main Page

From Stampy's Wiki
Welcome to Stampy's Wiki, a volunteer effort to build an interactive FAQ about Artificial Intelligence existential safety—which is the field trying to make sure that when we build superintelligent artificial systems they are aligned with human values so that they do the kinds of things we would like them to do.

We're still in alpha, so you can't yet talk to Stampy to interact with the FAQ, but if you want to help out we could always use more contributors! If you're interested in guiding or contributing to the project you're welcome to drop by the Discord and talk with us about it (let plex#1874 or the wiki team know if you want to help and need an invite).

Answers which have lots of stamps are displayed on this tab. Browse tags or good questions

I'm interested in working on AI Safety, what should I do?

AI Safety Support offers free calls to advise people interested in a career in AI Safety, so that's a great place to start. We're working on creating a bunch of detailed information for Stampy to use, but in the meantime check out these resources:

This is actually an active area of AI alignment research, called "Impact Measures"! It's not trivial to formalize in a way which won't predictably go wrong (entropy minimization likely leads to an AI which tries really hard to put out all the stars ASAP since they produce so much entropy, for example), but progress is being made. You can read about it on the Alignment Forum tag, or watch Rob's videos Avoiding Negative Side Effects and Avoiding Positive Side Effects

If the AI system was deceptively aligned (i.e. pretending to be nice until it was in control of the situation) or had been in stealth mode while getting things in place for a takeover, quite possibly within hours. We may get more warning with weaker systems, if the AGI does not feel at all threatened by us, or if a complex ecosystem of AI systems is built over time and we gradually lose control.

Paul Christiano writes an story of alignment failure which shows a relatively fast transition.

In previous decades, AI research had proceeded more slowly than some experts predicted. According to experts in the field, however, this trend has reversed in the past 5 years or so. AI researchers have been repeatedly surprised by, for example, the effectiveness of new visual and speech recognition systems. AI systems can solve CAPTCHAs that were specifically devised to foil AIs, translate spoken text on-the-fly, and teach themselves how to play games they have neither seen before nor been programmed to play. Moreover, the real-world value of this effectiveness has prompted massive investment by large tech firms such as Google, Facebook, and IBM, creating a positive feedback cycle that could dramatically speed progress.

The organizations which most regularly give grants to individuals working towards AI alignment are the Long Term Future Fund, Survival And Flourishing (SAF), the OpenPhil AI Fellowship, and the Center on Long-Term Risk Fund though there are opportunities from smaller grantmakers which you might be able to pick up if you're able to prove you do good work. If you're able to relocate to the UK, CEEALAR (aka the EA Hotel) can be a great option as it offers free food and accommodation for up to two years, as well as contact with others who are thinking about these issues.

Each grant source has their own criteria for funding, but in general they are looking for candidates who have evidence that they're keen and able to do good work towards reducing existential risk, though the EA Hotel in particular has less stringent requirements as they're able to support people at very low cost. If you'd like to talk to someone who can offer advice on applying for funding, AI Safety Support offers free calls.

Another option is to get hired by an organization which works on AI alignment, see the follow-up question for advice on that.

See more...

Here are the questions which have been added to the wiki but not yet sorted through. To process a question:

  1. Check whether the question is in Stampy's scope (cheatsheet: is it likely to be helpful to
    somei.e. more than just one person with an extremely specific question, such as five paragraphs describing a suggested method for aligning AGI.
    people learning about about AI existential safety). If it's not in scope, mark as 'Out of scope' and move on to the next question.
  2. Check whether there are any existing canonical questions which are duplicates of it by looking at likely tags or running ctrl-f with some plausible words over the list of all canonical questions. If you find a
    duplicateAn existing question is a duplicate of a new one if it is reasonable to expect whoever asked the new question to be satisfied if they received an answer to the existing question instead.
    , click 'Mark as duplicate' and add the name of duplicate to that page, else continue.
  3. Add some relevant tags to the page using the add/edit tags link.
  4. Consider whether the question could be rephrased to be clearer, then if you think of a better name change it by clicking edit then opening the advanced options and writing a new title in the top field.
  5. Mark the question as the appropriate review level (tip: hover over the buttons for descriptions of what each level means).
  6. (optional) Add the question as a related question to some existing questions (either from tag pages, the full list, or memory) and vice versa, to make it more discoverable. Click edit on the question to find the related questions field.
  7. Finally, mark the question as canonical, and bask in the joy of a slightly more organized wiki and a tiny expected increase in the chances of humanity's future going well (hopefully).

The question will now show up on the list of questions we want answered, and the main page.

For sorting incoming YouTube questions, see Prioritize YouTube questions.

There are 49 incoming questions to sort!

See more...

If you think an answer is good (i.e. accurate, helpful, and replies well to the question) then please give it a stamp of approval. If you're a more experienced editor and have a good feel for when there is consensus around an answer being accepted, feel free to mark answers as canonical, so that Stampy will start serving them to users.

The below list puts most recent answers at the top, with alternate sortings available from the link directly below.

Alternate orderings

Many AI designs that would generate an intelligence explosion would not have a ‘slot’ in which a goal (such as ‘be friendly to human interests’) could be placed. For example, if AI is made via whole brain emulation, or evolutionary algorithms, or neural nets, or reinforcement learning, the AI will end up with some goal as it self-improves, but that stable eventual goal may be very difficult to predict in advance.

Thus, in order to design a friendly AI, it is not sufficient to determine what ‘friendliness’ is (and to specify it clearly enough that even a superintelligence will interpret it the way we want it to). We must also figure out how to build a general intelligence that satisfies a goal at all, and that stably retains that goal as it edits its own code to make itself smarter. This task is perhaps the primary difficulty in designing friendly AI.

Stamps: None

Tags: friendly ai (create tag) (edit tags)

Eliezer Yudkowsky has proposed Coherent Extrapolated Volition as a solution to at least two problems facing Friendly AI design:

  1. The fragility of human values: Yudkowsky writes that “any future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals will contain almost nothing of worth.” The problem is that what humans value is complex and subtle, and difficult to specify. Consider the seemingly minor value of novelty. If a human-like value of novelty is not programmed into a superintelligent machine, it might explore the universe for valuable things up to a certain point, and then maximize the most valuable thing it finds (the exploration-exploitation tradeoff[58]) — tiling the solar system with brains in vats wired into happiness machines, for example. When a superintelligence is in charge, you have to get its motivational system exactly right in order to not make the future undesirable.
  2. The locality of human values: Imagine if the Friendly AI problem had faced the ancient Greeks, and they had programmed it with the most progressive moral values of their time. That would have led the world to a rather horrifying fate. But why should we think that humans have, in the 21st century, arrived at the apex of human morality? We can’t risk programming a superintelligent machine with the moral values we happen to hold today. But then, which moral values do we give it?

Yudkowsky suggests that we build a ‘seed AI’ to discover and then extrapolate the ‘coherent extrapolated volition’ of humanity:

> In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.

The seed AI would use the results of this examination and extrapolation of human values to program the motivational system of the superintelligence that would determine the fate of the galaxy.

However, some worry that the collective will of humanity won’t converge on a coherent set of goals. Others believe that guaranteed Friendliness is not possible, even by such elaborate and careful means.

Some have proposed[49][50][51][52] that we teach machines a moral code with case-based machine learning. The basic idea is this: Human judges would rate thousands of actions, character traits, desires, laws, or institutions as having varying degrees of moral acceptability. The machine would then find the connections between these cases and learn the principles behind morality, such that it could apply those principles to determine the morality of new cases not encountered during its training. This kind of machine learning has already been used to design machines that can, for example, detect underwater mines[53] after feeding the machine hundreds of cases of mines and not-mines.

There are several reasons machine learning does not present an easy solution for Friendly AI. The first is that, of course, humans themselves hold deep disagreements about what is moral and immoral. But even if humans could be made to agree on all the training cases, at least two problems remain.

The first problem is that training on cases from our present reality may not result in a machine that will make correct ethical decisions in a world radically reshaped by superintelligence.

The second problem is that a superintelligence may generalize the wrong principles due to coincidental patterns in the training data.[54] Consider the parable of the machine trained to recognize camouflaged tanks in a forest. Researchers take 100 photos of camouflaged tanks and 100 photos of trees. They then train the machine on 50 photos of each, so that it learns to distinguish camouflaged tanks from trees. As a test, they show the machine the remaining 50 photos of each, and it classifies each one correctly. Success! However, later tests show that the machine classifies additional photos of camouflaged tanks and trees poorly. The problem turns out to be that the researchers’ photos of camouflaged tanks had been taken on cloudy days, while their photos of trees had been taken on sunny days. The machine had learned to distinguish cloudy days from sunny days, not camouflaged tanks from trees.

Thus, it seems that trustworthy Friendly AI design must involve detailed models of the underlying processes generating human moral judgments, not only surface similarities of cases.

See also:

Stamps: None

Tags: machine learning, value learning (create tag) (edit tags)

The organizations which most regularly give grants to individuals working towards AI alignment are the Long Term Future Fund, Survival And Flourishing (SAF), the OpenPhil AI Fellowship, and the Center on Long-Term Risk Fund though there are opportunities from smaller grantmakers which you might be able to pick up if you're able to prove you do good work. If you're able to relocate to the UK, CEEALAR (aka the EA Hotel) can be a great option as it offers free food and accommodation for up to two years, as well as contact with others who are thinking about these issues.

Each grant source has their own criteria for funding, but in general they are looking for candidates who have evidence that they're keen and able to do good work towards reducing existential risk, though the EA Hotel in particular has less stringent requirements as they're able to support people at very low cost. If you'd like to talk to someone who can offer advice on applying for funding, AI Safety Support offers free calls.

Another option is to get hired by an organization which works on AI alignment, see the follow-up question for advice on that.

Let’s consider the likely consequences of some utilitarian designs for Friendly AI.

An AI designed to minimize human suffering might simply kill all humans: no humans, no human suffering.[44][45]

Or, consider an AI designed to maximize human pleasure. Rather than build an ambitious utopia that caters to the complex and demanding wants of humanity for billions of years, it could achieve its goal more efficiently by wiring humans into Nozick’s experience machines. Or, it could rewire the ‘liking’ component of the brain’s reward system so that whichever hedonic hotspot paints sensations with a ‘pleasure gloss’[46][47] is wired to maximize pleasure when humans sit in jars. That would be an easier world for the AI to build than one that caters to the complex and nuanced set of world states currently painted with the pleasure gloss by most human brains.

Likewise, an AI motivated to maximize objective desire satisfaction or reported subjective well-being could rewire human neurology so that both ends are realized whenever humans sit in jars. Or it could kill all humans (and animals) and replace them with beings made from scratch to attain objective desire satisfaction or subjective well-being when sitting in jars. Either option might be easier for the AI to achieve than maintaining a utopian society catering to the complexity of human (and animal) desires. Similar problems afflict other utilitarian AI designs.

It’s not just a problem of specifying goals, either. It is hard to predict how goals will change in a self-modifying agent. No current mathematical decision theory can process the decisions of a self-modifying agent.

So, while it may be possible to design a superintelligence that would do what we want, it’s harder than one might initially think.

Stamps: None


See more...