Review answers


If you think an answer is good (i.e. accurate, helpful, and replies well to the question) then please give it a stamp of approval. If you're a more experienced editor and have a good feel for when there is consensus around an answer being accepted, feel free to mark answers as canonical, so that Stampy will start serving them to users.


If I know how to use formal proof assistants (e.g. Mathcomp/Ssreflect), could this be useful in helping with well-aligned AGI? Is computer-certified proof something we might want?


80,000 Hours links to an article on high-impact careers in formal verification in the few paragraphs they have written about the topic.

Some other notes

  • https://github.com/deepmind/cartesian-frames I emailed Scott about formalizing this in Coq before the repo was published, and he said something like "I wouldn't personally find such software useful, but it sounds like a valuable exercise for the implementer."
  • When I mentioned to Diffractor the possibility of formalizing some of infra-Bayesianism in Coq, his reaction wasn't "we really need someone to do that" so much as "oh, that sounds cool". I never got around to it; if I did, I'd talk to Vanessa and Diffractor about weakening or particularizing the results beforehand.
  • If you extrapolate a pattern from those two examples, you start to think that agent foundations is the principal area of interest for proof assistants. And again: does the proof assistant exercise advance the research, or does it mainly provide a nutritious exercise for the programmer?
  • A sketch of a more prosaic scenario in which proof assistants play a role is "someone proposes isInnerAligned : GradientDescent -> Prop, and someone else implements a galaxy-brained new type theory or tool in which gradient descent is a primitive (whatever that means)"; a rough illustration of that signature follows below this list. When I mentioned this scenario to Buck, he said that if it happened he'd direct all the engineers at Redwood to making that tool easier to use; when I mentioned it to Evan about a year ago, he didn't seem to think it was remotely plausible. Probably a nonstarter.
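To make that last bullet concrete, here is a minimal, purely hypothetical Coq sketch. None of the names (GradientDescent, Model, trains_to, isInnerAligned, sgd) come from any existing library; everything is axiomatized rather than defined, which is exactly where the real difficulty lies.

  (* Hypothetical sketch only: all names below are made up and axiomatized. *)
  Parameter Model : Type.
  Parameter GradientDescent : Type.   (* some formal model of a training setup *)
  Parameter trains_to : GradientDescent -> Model -> Prop.

  (* The proposed predicate: this training setup only produces inner-aligned
     models. Pinning down what "inner aligned" means is the hard, open part. *)
  Parameter isInnerAligned : GradientDescent -> Prop.

  (* The kind of statement one would ultimately want to prove about a
     concrete training procedure sgd. *)
  Parameter sgd : GradientDescent.
  Conjecture sgd_is_inner_aligned : isInnerAligned sgd.

Even this trivial skeleton illustrates the worry raised above: the proof assistant happily accepts the declarations, but all of the alignment content is hidden inside the unformalized isInnerAligned.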
Stamps: None


AI alignment encompasses everything related to solving the problem of how to build artificial intelligences such that they share the values of their creators. This appears to be very hard and may be the most important problem humanity will ever face, as whether we succeed might mean the difference between extinction and humanity flourishing until the end of the universe.

Stamps: plex


A value handshake is a form of trade between superintelligences. When two AIs with incompatible utility functions meet, instead of going to war, since they have superhuman prediction abilities and likely know the outcome before any attack even happens, they can decide to split the universe into chunks with volumes according to their respective military strength or chance of victory. If their utility functions are compatible, they might even decide to merge into a single AI with a utility function that is the weighted average of the two previous ones.

This could happen if multiple AIs are active on Earth at the same time. If at least one of them is aligned with humans, the resulting value handshake could leave humanity in a pretty okay situation.

See The Hour I First Believed by Scott Alexander for some further thoughts and an introduction to related topics.

Stamps: plex

Tags: None

You should stamp an answer when you think it is accurate and well presented enough that you'd be happy to see it served to readers by Stampy.

Stamps: None

Tags: stampy

"The real concern" isn't a particularly meaningful concept here. Deep learning has proven to be a very powerful technology, with far reaching implications across a number of aspects of human existence. There are significant benefits to be found if we manage the technology properly, but that management means addressing a broad range of concerns, one of which is the alignment problem.

Stamps: None


See more...

As well as pulling human-written answers to AI alignment questions from Stampy's Wiki, Stampy can:

  • Search for AI safety papers e.g. "stampy, what's that paper about corrigibility?"
  • Search for videos e.g. "what's that video where Rob talks about mesa optimizers, stampy?"
  • Calculate with Wolfram Alpha e.g. "s, what's the square root of 345?"
  • Search DuckDuckGo and return snippets
  • Fall back (at least in the patron Discord) to polling GPT-3 to answer uncaught questions
Stamps: plex

Tags: stampy

Try to avoid directly referencing the wording of the question in the answer, in order to make the answer more robust to alternate phrasings. For example, the question might be "Can we do X?" and the reply "Yes, if we can manage Y"; but the question could instead be "Why can't we do X?" or "What would happen if we tried to do X?". An answer like "We might be able to do X, if we can do Y" works for all of those.

Stamps: None


Ajeya Cotra has written an excellent article named Why AI alignment could be hard with modern deep learning on this question.

Stamps: plex

Tags: None

There are many plausible-sounding ways to align an AI, but so far none have been convincingly shown to be both implementable and reliably safe, despite a great deal of thought.

For implementability the key question is: How do we code this? Converting something to formal mathematics that can be understood by a computer program is much harder than just saying it in natural language, and proposed AI goal architectures are no exception. Complicated computer programs are usually the result of months of testing and debugging. But this one will be more complicated than any ever attempted before, and live tests are impossible: a superintelligence with a buggy goal system will display goal stability and try to prevent its programmers from discovering or changing the error.

Then, even if an idea sounds pretty good to us right now, it's hard to be at all confident it has no fatal flaws or loopholes. After all, many other proposals that originally sounded promising, like “just give commands to the AI” and “just tell the AI to figure out what makes us happy”, turn out, after more thought, to be dangerous.

Can we be sure that we’ve thought this through enough? Can we be sure that there isn’t some extremely subtle problem with it, so subtle that no human would ever notice it, but which might seem obvious to a superintelligence?

Stamps: plex

Tags: instrumental convergence, why not just, security mindset, implementation

It certainly would be very unwise to purposefully create an artificial general intelligence now, before we have found a way to be certain it will act purely in our interests. But "general intelligence" is more of a description of a system's capabilities, and a vague one at that. We don't know what it takes to build such a system. This leads to the worrying possibility that our existing, narrow AI systems require only minor tweaks, or even just more computer power, to achieve general intelligence.

The pace of research in the field suggests that there's a lot of low-hanging fruit left to pick, after all, and the results of this research produce better, more effective AI in a landscape of strong competitive pressure to build the most capable systems we can. "Just" not building an AGI means ensuring that every organization in the world with lots of computer hardware refrains from building an AGI, whether by accident or in the mistaken belief that they have a solution to the alignment problem, forever. It's simply far safer to also work on solving the alignment problem.

Stamps: plex


See more...

A potential solution is to create an AI that has the same values and morality as a human by creating a child AI and raising it. There's nothing intrinsically flawed about this procedure. However, this suggestion is deceptive because it sounds simpler than it is.

If you get a chimpanzee baby and raise it in a human family, it does not learn to speak a human language. Human babies can grow into adult humans because the babies have specific properties, e.g. a prebuilt language module that gets activated during childhood.

In order to make a child AI that has the potential to turn into the type of adult AI we would find acceptable, the child AI has to have specific properties. The task of building a child AI with these properties involves building a system that can interpret what humans mean when we try to teach the child to do various tasks. People are currently working on ways to program agents that can cooperatively interact with humans to learn what they want.

Stamps: None


People tend to imagine AIs as being like nerdy humans – brilliant at technology but clueless about social skills. There is no reason to expect this – persuasion and manipulation are a different kind of skill from proving mathematical theorems, but they are still skills, and an intellect as far beyond us as we are beyond lions might be smart enough to replicate or exceed the “charming sociopaths” who can naturally win friends and followers despite a lack of normal human emotions. A superintelligence might be able to analyze human psychology deeply enough to understand the hopes and fears of everyone it negotiates with. Single humans using psychopathic social manipulation have done plenty of harm – Hitler leveraged his skill at oratory and his understanding of people’s darkest prejudices to take over a continent. Why should we expect superintelligences to do worse than humans far less skilled than they are?

(More outlandishly, a superintelligence might just skip language entirely and figure out a weird pattern of buzzes and hums that causes conscious thought to seize up, and which knocks anyone who hears it into a weird hypnotizable state in which they’ll do anything the superintelligence asks. It sounds kind of silly to me, but then, nuclear weapons probably would have sounded kind of silly to lions sitting around speculating about what humans might be able to accomplish. When you’re dealing with something unbelievably more intelligent than you are, you should probably expect the unexpected.)

A superintelligence has an advantage that early humans didn’t – the entire context of human civilization and technology, there for it to manipulate socially or technologically.

Stamps: None

Tags: superintelligence, civilization

This is certainly a risk (affectionately known in AI circles as “pulling a Kurzweil”), but sometimes taking an exponential trend seriously is the right response.

Consider economic doubling times. In 1 AD, the world GDP was about $20 billion; it took a thousand years, until 1000 AD, for that to double to $40 billion. But it only took five hundred more years, until 1500 or so, for the economy to double again. And then it only took another three hundred years or so, until 1800, for the economy to double a third time. Someone in 1800 might calculate the trend line and say this was ridiculous, that it implied the economy would be doubling every ten years or so by the beginning of the 21st century. But in fact, this is how long the economy takes to double these days. To a medieval person, used to a thousand-year doubling time (which was based mostly on population growth!), an economy that doubled every ten years might seem inconceivable. To us, it seems normal.
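(As a brief aside, not part of the original answer: the arithmetic behind these doubling times is the standard exponential-growth relation. A constant annual growth rate g implies a doubling time of

  T_{\text{double}} = \frac{\ln 2}{\ln(1+g)} \approx \frac{70}{g\,[\%]}\ \text{years},

so roughly 7% annual growth corresponds to the ten-year doubling mentioned above, while pre-modern growth rates of a small fraction of a percent per year give doubling times of many centuries.)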

Likewise, in 1965 Gordon Moore noted that semiconductor complexity seemed to double every eighteen months. In his own day, there were about five hundred transistors on a chip; he predicted that would soon double to a thousand, and a few years later to two thousand. Almost as soon as Moore’s Law became well known, people started saying it was absurd to follow it off a cliff – such a law would imply a million transistors per chip in 1990, a hundred million in 2000, and ten billion transistors on every chip by 2015! More transistors on a single chip than existed on all the computers in the world! Transistors the size of molecules! But of course all of these things happened; the ridiculous exponential trend proved more accurate than the naysayers.

None of this is to say that exponential trends are always right, just that they are sometimes right even when it seems they can’t possibly be. We can’t be sure that a computer using its own intelligence to discover new ways to increase its intelligence will enter a positive feedback loop and achieve superintelligence in seemingly impossibly short time scales. It’s just one more possibility, a worry to place alongside all the other worrying reasons to expect a moderate or hard takeoff.

Stamps: None


There are many approaches that look like they could eliminate these problems, but most of them turn out to have hidden difficulties.

Stamps: None


See more...

These are non-canonical answers linked to canonical questions which don't have a canonical answer.


Yes, a superintelligence should be able to figure out that humans will not like curing cancer by destroying the world. However, in the example above, the superintelligence is programmed to follow human commands, not to do what it thinks humans will “like”. It was given a very specific command – cure cancer as effectively as possible. The command makes no reference to “doing this in a way humans will like”, so it doesn’t.

(By analogy: we humans are smart enough to understand our own “programming”. For example, we know that – pardon the anthropomorphizing – evolution gave us the urge to have sex so that we could reproduce. But we still use contraception anyway. Evolution gave us the urge to have sex, not the urge to satisfy evolution’s values directly. We appreciate intellectually that our having sex while using condoms doesn’t carry out evolution’s original plan, but – not having any particular connection to evolution’s values – we don’t care.)

We started out by saying that computers only do what you tell them. But any programmer knows that this is precisely the problem: computers do exactly what you tell them, with no common sense or attempts to interpret what the instructions really meant. If you tell a human to cure cancer, they will instinctively understand how this interacts with other desires and laws and moral rules; if you tell an AI to cure cancer, it will literally just want to cure cancer.

Define a closed-ended goal as one with a clear endpoint, and an open-ended goal as one to do something as much as possible. For example “find the first one hundred digits of pi” is a closed-ended goal; “find as many digits of pi as you can within one year” is an open-ended goal. According to many computer scientists, giving a superintelligence an open-ended goal without activating human instincts and counterbalancing considerations will usually lead to disaster.

To take a deliberately extreme example: suppose someone programs a superintelligence to calculate as many digits of pi as it can within one year. And suppose that, with its current computing power, it can calculate one trillion digits during that time. It can either accept one trillion digits, or spend a month trying to figure out how to get control of the TaihuLight supercomputer, which can calculate two hundred times faster. Even if it loses a little bit of time in the effort, and even if there’s a small chance of failure, the payoff – two hundred trillion digits of pi, compared to a mere one trillion – is enough to make the attempt. But on the same basis, it would be even better if the superintelligence could control every computer in the world and set it to the task. And it would be better still if the superintelligence controlled human civilization, so that it could direct humans to build more computers and speed up the process further.

Now we’re back at the situation we started with – a superintelligence that wants to take over the world. Taking over the world allows it to calculate more digits of pi than any other option, so without an architecture based around understanding human instincts and counterbalancing considerations, even a goal like “calculate as many digits of pi as you can” would be potentially dangerous.

Stamps: plex

Tags: None

The argument goes: computers only do what we command them; no more, no less. So it might be bad if terrorists or enemy countries develop superintelligence first. But if we develop superintelligence first there’s no problem. Just command it to do the things we want, right?

Suppose we wanted a superintelligence to cure cancer. How might we specify the goal “cure cancer”? We couldn’t guide it through every individual step; if we knew every individual step, then we could cure cancer ourselves. Instead, we would have to give it a final goal of curing cancer, and trust the superintelligence to come up with intermediate actions that furthered that goal. For example, a superintelligence might decide that the first step to curing cancer was learning more about protein folding, and set up some experiments to investigate protein folding patterns.

A superintelligence would also need some level of common sense to decide which of various strategies to pursue. Suppose that investigating protein folding was very likely to cure 50% of cancers, but investigating genetic engineering was moderately likely to cure 90% of cancers. Which should the AI pursue? Presumably it would need some way to balance considerations like curing as much cancer as possible, as quickly as possible, with as high a probability of success as possible.

But a goal specified in this way would be very dangerous. Humans instinctively balance thousands of different considerations in everything they do; so far this hypothetical AI is only balancing three (least cancer, quickest results, highest probability). To a human, it would seem maniacally, even psychopathically, obsessed with cancer curing. If this were truly its goal structure, it would go wrong in almost comical ways.

If your only goal is “curing cancer”, and you lack humans’ instinct for the thousands of other important considerations, a relatively easy solution might be to hack into a nuclear base, launch all of its missiles, and kill everyone in the world. This satisfies all the AI’s goals. It reduces cancer down to zero (which is better than medicines which work only some of the time). It’s very fast (which is better than medicines which might take a long time to invent and distribute). And it has a high probability of success (medicines might or might not work; nukes definitely do).

So simple goal architectures are likely to go very wrong unless tempered by common sense and a broader understanding of what we do and do not value.

See more...