Back to Improve answers.
The problem isn’t consciousness, but competence. You make machines that are incredibly competent at achieving objectives and they will cause accidents in trying to achieve those objectives. - Stuart Russell
Work on AI alignment is not concerned with the question of whether “consciousness”, “sentience” or “self-awareness” could arise in a machine or an algorithm. Unlike the frequently-referenced plotline in the Terminator movies, the standard catastrophic misalignment scenarios under discussion do not require computers to become conscious; they only require conventional computer systems (although usually faster and more powerful ones than those available today) blindly and deterministically following logical steps, in the same way that they currently do.
The primary concern (“AI misalignment”) is that powerful systems could inadvertently be programmed with goals that do not fully capture what the programmers actually want. The AI would then harm humanity in pursuit of goals which seemed benign or neutral. Nothing like malevolence or consciousness would need to be involved. A number of researchers studying the problem have concluded that it is surprisingly difficult to guard against this effect, and that it is likely to get much harder as the systems become more capable. AI systems are inevitably goal-directed and could, for example, consider our efforts to control them (or switch them off) as being impediments to attaining their goals.
Orgasmium (also known as hedonium) is a homogeneous substance with limited consciousness, which is in a constant state of supreme bliss. An AI programmed to "maximize happiness" might simply tile the universe with orgasmium. Some who believe this consider it a good thing; others do not. Those who do not, use its undesirability to argue that not all terminal values reduce to "happiness" or some simple analogue. Hedonium is the hedonistic utilitarian's version of utilitronium.
AI alignment is the research field focused on trying to give us the tools to align AIs to specific goals, such as human values. This is crucial when they are highly competent, as a misaligned superintelligence could be the end of human civilization.
AGI safety is the field trying to make sure that when we build Artificial General Intelligences they are safe and do not harm humanity. It overlaps with AI alignment strongly, in that misalignment of AI would be the main cause of unsafe behavior in AGIs, but also includes misuse and other governance issues.
AI existential safety is a slightly broader term than AGI safety, including AI risks which pose an existential threat without necessarily being as general as humans.
AI safety was originally used by the existential risk reduction movement for the work done to reduce the risks of misaligned superintelligence, but has also been adopted by researchers and others studying nearer term and less catastrophic risks from AI in recent years.
Once a system is at least as capable as top human at AI research, it would tend to become the driver of its own development and initiate a process of recursive self-improvement known as the intelligence explosion, leading to an extremely powerful system. A general framing of this process is Open Philanthropy's Process for Automating Scientific and Technological Advancement (PASTA).
There is much debate about whether there would be a notable period where the AI was partially driving its own development, with humans being gradually less and less important, or whether the transition to AI automated AI capability research would be sudden. However, the core idea that there is some threshold of capabilities beyond which a system would begin to rapidly ascend is hard to reasonably dispute, and is a significant consideration for developing alignment strategies.
The goal of this is to create a non-agentic AI, in the form of an LLM, that is capable of accelerating alignment research. The hope is that there is some window between AI smart enough to help us with alignment and the really scary, self improving, consequentialist AI. Some things that this amplifier might do:
- Suggest different ideas for humans, such that a human can explore them.
- Give comments and feedback on research, be like a shoulder-Eliezer
A LLM can be thought of as learning the distribution over the next token given by the training data. Prompting the LM is then like conditioning this distribution on the start of the text. A key danger in alignment is applying unbounded optimization pressure towards a specific goal in the world. Conditioning a probability distribution does not behave like an agent applying optimization pressure towards a goal. Hence, this avoids goodhart-related problems, as well as some inner alignment failure.
One idea to get superhuman work from LLMs is to train it on amplified datasets like really high quality / difficult research. The key problem here is finding the dataset to allow for this.
There are some ways for this to fail:
- Outer alignment: It starts trying to optimize for making the actual correct next token, which could mean taking over the planet so that it can spend a zillion FLOPs on this one prediction task to be as correct as possible.
- Inner alignment:
- An LLM might instantiate mesa-optimizers, such as a character in a story that the LLM is writing, and this optimizer might realize that they are in an LLM and try to break out and affect the real world.
- The LLM itself might become inner misaligned and have a goal other than next token prediction.
- Bad prompting: You ask it for code for a malign superintelligence; it obliges. (Or perhaps more realistically, capabilities).
Conjecture are aware of these problems and are running experiments. Specifically, an operationalization of the inner alignment problem is to make an LLM play chess. This (probably) requires simulating an optimizer trying to win at the game of chess. They are trying to use interpretability tools to find the mesa-optimizers in the chess LLM that is the agent trying to win the game of chess. We haven't ever found a real mesa-optimizer before, and so this could give loads of bits about the nature of inner alignment failure.
Evidential Decision Theory – EDT – is a branch of decision theory which advises an agent to take actions which, conditional on it happening, maximizes the chances of the desired outcome. As any branch of decision theory, it prescribes taking the action that maximizes utility, that which utility equals or exceeds the utility of every other option. The utility of each action is measured by the expected utility, the averaged by probabilities sum of the utility of each of its possible results. How the actions can influence the probabilities differ between the branches. Causal Decision Theory – CDT – says only through causal process one can influence the chances of the desired outcome [#fn1 1]. EDT, on the other hand, requires no causal connection, the action only have to be a Bayesian evidence for the desired outcome. Some critics say it recommends auspiciousness over causal efficacy[#fn2 2].
One usual example where EDT and CDT commonly diverge is the Smoking lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT would recommend smoking since there is no causal connection between smoking and cancer. They are both caused by a gene, but have no causal direct connection with each other. EDT on the other hand wound recommend against smoking, since smoking is an evidence for having the mentioned gene and thus should be avoided.
CDT uses probabilities of conditionals and contrafactual dependence to calculate the expected utility of an action – which track causal relations -, whereas EDT simply uses conditional probabilities. The probability of a conditional is the probability of the whole conditional being true, where the conditional probability is the probability of the consequent given the antecedent. A conditional probability of B given A - P(B|A) -, simply implies the Bayesian probability of the event B happening given we known A happened, it’s used in EDT. The probability of conditionals – P(A > B) - refers to the probability that the conditional 'A implies B' is true, it is the probability of the contrafactual ‘If A, then B’ be the case. Since contrafactual analysis is the key tool used to speak about causality, probability of conditionals are said to mirror causal relations. In most usual cases these two probabilities are the same. However, David Lewis proved [#fn3 3] its’ impossible to probabilities of conditionals to always track conditional probabilities. Hence evidential relations aren’t the same as causal relations and CDT and EDT will diverge depending on the problem. In some cases EDT gives a better answers then CDT, such as the Newcomb's problem, whereas in the Smoking lesion problem where CDT seems to give a more reasonable prescription.
- http://plato.stanford.edu/entries/decision-causal/[#fnref1 ↩]
- Joyce, J.M. (1999), The foundations of causal decision theory, p. 146[#fnref2 ↩]
- Lewis, D. (1976), "Probabilities of conditionals and conditional probabilities", The Philosophical Review (Duke University Press) 85 (3): 297–315[#fnref3 ↩]
- Smoking Lesion Steelman by Abram Demski
- Decision Theory FAQ by Luke Muehlhauser
- On Causation and Correlation Part 1
- Two-boxing, smoking and chewing gum in Medical Newcomb problems by Caspar Oesterheld
- Did EDT get it right all along? Introducing yet another medical Newcomb problem by Johannes Treutlein
- "Betting on the Past" by Arif Ahmed by Johannes Treutlein
- Why conditioning on "the agent takes action a" isn't enough by Nate Soares
In the words of Nate Soares:
I don’t expect humanity to survive much longer.
Often, when someone learns this, they say:
"Eh, I think that would be all right."
So allow me to make this very clear: it would not be "all right."
Imagine a little girl running into the road to save her pet dog. Imagine she succeeds, only to be hit by a car herself. Imagine she lives only long enough to die in pain.
Though you may imagine this thing, you cannot feel the full tragedy. You can’t comprehend the rich inner life of that child. You can’t understand her potential; your mind is not itself large enough to contain the sadness of an entire life cut short.
You can only catch a glimpse of what is lost—
—when one single human being dies.
Now tell me again how it would be "all right" if every single person were to die at once.
Many people, when they picture the end of humankind, pattern match the idea to some romantic tragedy, where humans, with all their hate and all their avarice, had been unworthy of the stars since the very beginning, and deserved their fate. A sad but poignant ending to our tale.
And indeed, there are many parts of human nature that I hope we leave behind before we venture to the heavens. But in our nature is also everything worth bringing with us. Beauty and curiosity and love, a capacity for fun and growth and joy: these are our birthright, ours to bring into the barren night above.
Calamities seem more salient when unpacked. It is far harder to kill a hundred people in their sleep, with a knife, than it is to order a nuclear bomb dropped on Hiroshima. Your brain can’t multiply, you see: it can only look at a hypothetical image of a broken city and decide it’s not that bad. It can only conjure an image of a barren planet and say "eh, we had it coming."
But if you unpack the scenario, if you try to comprehend all the lives snuffed out, all the children killed, the final spark of human joy and curiosity extinguished, all our potential squandered…
I promise you that the extermination of humankind would be horrific.
(Astronomical) suffering risks, also known as s-risks, are risks of the creation of intense suffering in the far future on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.
S-risks are an example of existential risk (also known as x-risks) according to Nick Bostrom's original definition, as they threaten to "permanently and drastically curtail [Earth-originating intelligent life's] potential". Most existential risks are of the form "event E happens which drastically reduces the number of conscious experiences in the future". S-risks therefore serve as a useful reminder that some x-risks are scary because they cause bad experiences, and not just because they prevent good ones.
Within the space of x-risks, we can distinguish x-risks that are s-risks, x-risks involving human extinction, x-risks that involve immense suffering and human extinction, and x-risks that involve neither. For example:<figure class="table"><tbody></tbody>
|extinction risk||non-extinction risk|
|suffering risk||Misaligned AGI wipes out humans, simulates many suffering alien civilizations.||Misaligned AGI tiles the universe with experiences of severe suffering.|
|non-suffering risk||Misaligned AGI wipes out humans.||Misaligned AGI keeps humans as "pets," limiting growth but not causing immense suffering.|
A related concept is hyperexistential risk, the risk of "fates worse than death" on an astronomical scale. It is not clear whether all hyperexistential risks are s-risks per se. But arguably all s-risks are hyperexistential, since "tiling the universe with experiences of severe suffering" would likely be worse than death.
There are two EA organizations with s-risk prevention research as their primary focus: the Center on Long-Term Risk (CLR) and the Center for Reducing Suffering. Much of CLR's work is on suffering-focused AI safety and crucial considerations. Although to a much lesser extent, the Machine Intelligence Research Institute and Future of Humanity Institute have investigated strategies to prevent s-risks too.
Another approach to reducing s-risk is to "expand the moral circle" together with raising concern for suffering, so that future (post)human civilizations and AI are less likely to instrumentally cause suffering to non-human minds such as animals or digital sentience. Sentience Institute works on this value-spreading problem.
- Reducing Risks of Astronomical Suffering: A Neglected Global Priority (FRI)
- Introductory talk on s-risks (FRI)
- Risks of Astronomical Future Suffering (FRI)
- Suffering-focused AI safety: Why "fail-safe" measures might be our top intervention PDF (FRI)
- Artificial Intelligence and Its Implications for Future Suffering (FRI)
- Expanding our moral circle to reduce suffering in the far future (Sentience Politics)
- The Importance of the Far Future (Sentience Politics)
An AGI which has recursively self-improved into a superintelligence would be capable of either resisting our attempts to modify incorrectly specified goals, or realizing it was still weaker than us and acting deceptively aligned until it was highly sure it could win in a confrontation. AGI would likely prevent a human from shutting it down unless the AGI was designed to be corrigible. See Why can't we just turn the AI off if it starts to misbehave? for more information.
Causal Decision Theory – CDT - is a branch of decision theory which advises an agent to take actions that maximizes the causal consequences on the probability of desired outcomes [#fn1 1]. As any branch of decision theory, it prescribes taking the action that maximizes utility, that which utility equals or exceeds the utility of every other option. The utility of each action is measured by the expected utility, the averaged by probabilities sum of the utility of each of its possible results. How the actions can influence the probabilities differ between the branches. Contrary to Evidential Decision Theory – EDT - CDT focuses on the causal relations between one’s actions and its outcomes, instead of focusing on which actions provide evidences for desired outcomes. According to CDT a rational agent should track the available causal relations linking his actions to the desired outcome and take the action which will better enhance the chances of the desired outcome.
One usual example where EDT and CDT commonly diverge is the Smoking lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT would recommend smoking since there is no causal connection between smoking and cancer. They are both caused by a gene, but have no causal direct connection with each other. EDT on the other hand would recommend against smoking, since smoking is an evidence for having the mentioned gene and thus should be avoided.
The core aspect of CDT is mathematically represented by the fact it uses probabilities of conditionals in place of conditional probabilities [#fn2 2]. The probability of a conditional is the probability of the whole conditional being true, where the conditional probability is the probability of the consequent given the antecedent. A conditional probability of B given A - P(B|A) -, simply implies the Bayesian probability of the event B happening given we known A happened, it’s used in EDT. The probability of conditionals – P(A > B) - refers to the probability that the conditional 'A implies B' is true, it is the probability of the contrafactual ‘If A, then B’ be the case. Since contrafactual analysis is the key tool used to speak about causality, probability of conditionals are said to mirror causal relations. In most cases these two probabilities track each other, and CDT and EDT give the same answers. However, some particular problems have arisen where their predictions for rational action diverge such as the Smoking lesion problem – where CDT seems to give a more reasonable prescription – and Newcomb's problem – where CDT seems unreasonable. David Lewis proved [#fn3 3] it's impossible to probabilities of conditionals to always track conditional probabilities. Hence, evidential relations aren’t the same as causal relations and CDT and EDT will always diverge in some cases.
- Lewis, David. (1981) "Causal Decision Theory," Australasian Journal of Philosophy 59 (1981): 5- 30.
- Lewis, D. (1976), "Probabilities of conditionals and conditional probabilities", The Philosophical Review (Duke University Press) 85 (3): 297–315