Here we ask about the additional cost of building an aligned powerful system, compared to its unaligned counterpart. We generally assume this cost is nonzero, in the same way that it is easier and cheaper to build an elevator without emergency brakes. This is referred to as the alignment tax, and much AI alignment research is geared toward reducing it.
One operational guess by Eliezer Yudkowsky about its magnitude is "[an aligned project will take] at least 50% longer serial time to complete than [its unaligned version], or two years longer, whichever is less". This holds for agents with enough capability that their behavior is qualitatively different from a safety engineering perspective (for instance, an agent that is not corrigible by default).
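Yudkowsky's guess can be read as a simple formula. Below is a minimal sketch of that reading (the function and its name are our illustration, not anything Yudkowsky wrote):

```python
def aligned_timeline(unaligned_years: float) -> float:
    """Rough estimate of serial time for an aligned project, per Yudkowsky's
    operational guess: 50% longer than the unaligned version, or two years
    longer, whichever is less."""
    return unaligned_years + min(0.5 * unaligned_years, 2.0)

# A 2-year unaligned project would take ~3 years aligned (tax: 1 year);
# for a 10-year project the 2-year cap binds, giving ~12 years.
print(aligned_timeline(2.0))   # 3.0
print(aligned_timeline(10.0))  # 12.0
```

Note that this is a lower-bound guess ("at least"), so the function gives the minimum of the tax under that view, not a precise prediction.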
An essay by John Wentworth argues for a small chance of alignment happening "by default", with an alignment tax of effectively zero.
As with most things, the best way to form your views on AI safety is to read up on the various ideas and opinions of knowledgeable people in the field, compare them, and form your own perspective. There are several good places to start. One of them is the Machine Intelligence Research Institute's "Why AI safety?" info page, which contains links to relevant research. The Effective Altruism Forum has an article called "How I formed my own views on AI safety", which could also be pretty helpful. Here is a Robert Miles YouTube video that can be a good place to start as well. Otherwise, there are various articles on the topic, like this one from Vox.
There is significant controversy over how quickly AI will grow into a superintelligence. The relevant Alignment Forum tag collects many views on how things might unfold, including discussion of the probabilities of a soft takeoff (happening over years or decades) and a hard takeoff (happening in months or less).
OK, it’s great that you want to help, here are some ideas for ways you could do so without making a huge commitment:
- Learning more about AI alignment will provide you with good foundations for any path towards helping. You could start by absorbing content (e.g. books, videos, posts), and thinking about challenges or possible solutions.
- Getting involved with the movement by joining a local Effective Altruism or LessWrong group, Rob Miles’s Discord, and/or the AI Safety Slack is a great way to find friends who are interested and will help you stay motivated.
- Donating to organizations or individuals working on AI alignment, possibly via a donor lottery or the Long Term Future Fund, can be a great way to provide support.
- Writing or improving answers on my wiki so that other people can learn about AI alignment more easily is a great way to dip your toe into contributing. You can always ask on the Discord for feedback on things you write.
- Getting good at giving an AI alignment elevator pitch, and sharing it with people who may be valuable to have working on the problem, can make a big difference. However, you should avoid putting them off the topic by presenting it in a way which causes them to dismiss it as sci-fi (dos and don’ts are covered in the elevator pitch follow-up question).
- Writing thoughtful comments on AI posts on LessWrong.
- Participating in the AGI Safety Fundamentals program – either the AI alignment or governance track – and then facilitating discussions for it in the following round. The program involves nine weeks of content, with about two hours of readings + exercises per week and 1.5 hours of discussion, followed by four weeks to work on an independent project. As a facilitator, you'll be helping others learn about AI safety in-depth, many of whom are considering a career in AI safety. In the early 2022 round, facilitators were offered a stipend, and this seems likely to be the case for future rounds as well! You can learn more about facilitating in this post from December 2021.
This largely depends on when you think AI will be advanced enough to constitute an immediate threat to humanity. This is difficult to estimate, but the field is surveyed at How long will it be until transformative AI is created?, which comes to the conclusion that it is relatively widely believed that AI will transform the world in our lifetimes.
We probably shouldn't rely too strongly on these opinions as predicting the future is hard. But, due to the enormous damage a misaligned AGI could do, it's worth putting a great deal of effort towards AI alignment even if you just care about currently existing humans (such as yourself).
Language models can be utilized to produce propaganda by acting like bots and interacting with users on social media. This can be done to push a political agenda or to make fringe views appear more popular than they are.
I'm envisioning that in the future there will also be systems where you can input any conclusion that you want to argue (including moral conclusions) and the target audience, and the system will give you the most convincing arguments for it. At that point people won't be able to participate in any online (or offline for that matter) discussions without risking their object-level values being hijacked.
-- Wei Dai, quoted in Persuasion Tools: AI takeover without AGI or agency?
As of 2022, this is not within the reach of current models. However, on the current trajectory, AI might be able to write articles and produce other media for propagandistic purposes that are superior to human-made ones in not too many years. These could be precisely tailored to individuals, using things like social media feeds and personal digital data.
Additionally, recommender systems on content platforms like YouTube, Twitter, and Facebook use machine learning, and the content they recommend can influence the opinions of billions of people. Some research has looked at the tendency of platforms to promote extremist political views, thereby helping to radicalize their user base, for example.
In the long term, misaligned AI might use its persuasion abilities to gain influence and take control of the future. This could look like convincing its operators to let it out of a box or to give it resources, or creating political chaos in order to disable mechanisms that would prevent takeover, as in this story.
See Risks from AI persuasion for a deep dive into the distinct risks from AI persuasion.
All the content below is in English:
- The AI technical safety section of the 80,000 Hours Podcast;
- The AI X-risk Research Podcast, hosted by Daniel Filan;
- The AI Alignment Podcast hosted by Lucas Perry from the Future of Life Institute (ran ~monthly from April 2018 to March 2021);
- The Alignment Newsletter Podcast by Rob Miles (an audio version of the weekly newsletter).
Elon Musk has expressed his concerns about AI safety many times and founded OpenAI in an attempt to make safe AI more widely distributed (as opposed to allowing a singleton, which he fears would be misused or dangerously unaligned). In a YouTube video from November 2019 Musk stated that there's a lack of investment in AI safety and that there should be a government agency to reduce risk to the public from AI.
Evidential Decision Theory (EDT) is a branch of decision theory which advises an agent to take the action which, conditional on its being taken, maximizes the chances of the desired outcome. Like any branch of decision theory, it prescribes taking the action that maximizes expected utility: the probability-weighted sum of the utilities of each of the action's possible results. The branches differ in how actions are taken to influence those probabilities. Causal Decision Theory (CDT) says one can influence the chances of the desired outcome only through causal processes [#fn1 1]. EDT, on the other hand, requires no causal connection: the action only has to be Bayesian evidence for the desired outcome. Some critics say it recommends auspiciousness over causal efficacy [#fn2 2].
One common example where EDT and CDT diverge is the Smoking Lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT recommends smoking, since there is no causal connection between smoking and cancer: both are caused by a gene, but have no direct causal connection with each other. EDT, on the other hand, recommends against smoking, since smoking is evidence of having the lesion and should thus be avoided.
CDT calculates the expected utility of an action using probabilities of conditionals and counterfactual dependence, which track causal relations, whereas EDT simply uses conditional probabilities. The conditional probability of B given A, written P(B|A), is the Bayesian probability that event B happens given that we know A happened; this is what EDT uses. The probability of a conditional, written P(A > B), is the probability that the conditional 'A implies B' is true, i.e. the probability that the counterfactual ‘If A, then B’ is the case. Since counterfactual analysis is the key tool for talking about causality, probabilities of conditionals are said to mirror causal relations. In most ordinary cases these two probabilities coincide. However, David Lewis proved [#fn3 3] that it is impossible for probabilities of conditionals to always track conditional probabilities. Evidential relations are therefore not the same as causal relations, and CDT and EDT diverge on some problems. In some cases EDT gives a better answer than CDT, such as Newcomb's problem, whereas in the Smoking Lesion problem CDT seems to give the more reasonable prescription.
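To make the divergence concrete, here is a minimal Python sketch of the Smoking Lesion. The probabilities and utilities are made-up illustrative assumptions, not figures from the literature; only the structure matters:

```python
P_LESION = 0.5                        # prior probability of the genetic lesion
P_SMOKE = {True: 0.9, False: 0.1}     # P(smoke | lesion present / absent)
P_CANCER = {True: 0.9, False: 0.01}   # P(cancer | lesion); smoking itself is causally inert
U_SMOKE, U_CANCER = 10.0, -100.0      # utility of smoking enjoyment, disutility of cancer

PRIOR = {True: P_LESION, False: 1 - P_LESION}

def edt_value(smoke: bool) -> float:
    """EDT conditions on the act: choosing to smoke is evidence of the lesion."""
    # Joint probability of (lesion state, this act), then Bayes-update on the act.
    joint = {l: (P_SMOKE[l] if smoke else 1 - P_SMOKE[l]) * PRIOR[l] for l in (True, False)}
    total = sum(joint.values())
    p_cancer = sum(P_CANCER[l] * joint[l] / total for l in (True, False))
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer

def cdt_value(smoke: bool) -> float:
    """CDT intervenes on the act: smoking gives no information about the lesion."""
    p_cancer = sum(P_CANCER[l] * PRIOR[l] for l in (True, False))
    return (U_SMOKE if smoke else 0.0) + U_CANCER * p_cancer

# EDT recommends not smoking; CDT recommends smoking.
assert edt_value(True) < edt_value(False)
assert cdt_value(True) > cdt_value(False)
```

The asymmetry is entirely in how the lesion probability responds to the choice: EDT updates it (smoking raises P(lesion) to 0.9 here), while CDT holds it at the prior because the act does not cause the lesion.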
- http://plato.stanford.edu/entries/decision-causal/[#fnref1 ↩]
- Joyce, J.M. (1999), The foundations of causal decision theory, p. 146[#fnref2 ↩]
- Lewis, D. (1976), "Probabilities of conditionals and conditional probabilities", The Philosophical Review (Duke University Press) 85 (3): 297–315[#fnref3 ↩]
- Smoking Lesion Steelman by Abram Demski
- Decision Theory FAQ by Luke Muehlhauser
- On Causation and Correlation Part 1
- Two-boxing, smoking and chewing gum in Medical Newcomb problems by Caspar Oesterheld
- Did EDT get it right all along? Introducing yet another medical Newcomb problem by Johannes Treutlein
- "Betting on the Past" by Arif Ahmed by Johannes Treutlein
- Why conditioning on "the agent takes action a" isn't enough by Nate Soares
Functional Decision Theory (FDT) is a decision theory, described by Eliezer Yudkowsky and Nate Soares, which says that agents should treat their decision as the output of a fixed mathematical function that answers the question, “Which output of this very function would yield the best outcome?”. It is a successor to Timeless Decision Theory, and it outperforms other decision theories such as Causal Decision Theory (CDT) and Evidential Decision Theory (EDT). For example, it does better than CDT on Newcomb's problem, better than EDT on the smoking lesion problem, and better than both in Parfit’s hitchhiker problem.
In Newcomb's Problem, an FDT agent reasons that Omega must have used some kind of model of her decision procedure in order to make an accurate prediction of her behavior. Omega's model and the agent are therefore both calculating the same function (the agent's decision procedure): they are subjunctively dependent on that function. Given perfect prediction by Omega, there are therefore only two possible outcomes in Newcomb's Problem: either the agent one-boxes and Omega predicted it (because its model also one-boxed), or the agent two-boxes and Omega predicted that. Because one-boxing then results in a million dollars and two-boxing only in a thousand dollars, the FDT agent one-boxes.
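The reasoning above can be sketched numerically. This toy model assumes a perfect predictor, so the prediction always equals the actual choice (the subjunctive dependence); the dollar amounts are the standard ones from the thought experiment:

```python
# Newcomb's Problem as an FDT agent sees it. Box A is transparent and holds
# $1,000; opaque Box B holds $1,000,000 iff Omega's model of the agent's
# decision function outputs "one-box".

def payoff(one_box: bool) -> int:
    # Perfect prediction: Omega's model and the agent compute the same
    # function, so the predicted choice equals the actual choice.
    predicted_one_box = one_box
    box_b = 1_000_000 if predicted_one_box else 0
    box_a = 1_000
    return box_b if one_box else box_b + box_a

# One-boxing yields $1,000,000; two-boxing yields only $1,000.
best_choice = max([True, False], key=payoff)
assert best_choice is True
```

A CDT agent would instead hold the prediction fixed while varying the choice, making two-boxing dominant; FDT's insistence that the prediction co-varies with the decision function is exactly what changes the answer.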
- Functional decision theory: A new theory of instrumental rationality
- Cheating Death in Damascus
- Decisions are for making bad outcomes inconsistent
Orgasmium (also known as hedonium) is a homogeneous substance with limited consciousness, which is in a constant state of supreme bliss. An AI programmed to "maximize happiness" might simply tile the universe with orgasmium. Some who believe this consider it a good thing; others do not. Those who do not, use its undesirability to argue that not all terminal values reduce to "happiness" or some simple analogue. Hedonium is the hedonistic utilitarian's version of utilitronium.
A rational agent is an entity which has a utility function, forms beliefs about its environment, evaluates the consequences of possible actions, and then takes the action which maximizes its utility. They are also referred to as goal-seeking. The concept of a rational agent is used in economics, game theory, decision theory, and artificial intelligence.
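This belief-utility-action loop can be sketched in a few lines. The states, actions, and numbers below are invented purely for illustration:

```python
# A minimal rational agent: beliefs are a probability distribution over world
# states, and the agent takes the action with the highest expected utility.

beliefs = {"rain": 0.3, "sun": 0.7}  # P(state), formed from observations

def utility(action: str, state: str) -> float:
    """The agent's utility function over (action, outcome) pairs."""
    table = {
        ("take_umbrella", "rain"): 5.0,   ("take_umbrella", "sun"): -1.0,
        ("leave_umbrella", "rain"): -10.0, ("leave_umbrella", "sun"): 2.0,
    }
    return table[(action, state)]

def expected_utility(action: str) -> float:
    # Probability-weighted average of utility over possible states.
    return sum(p * utility(action, state) for state, p in beliefs.items())

# EU(take) = 0.3*5 - 0.7*1 = 0.8; EU(leave) = -0.3*10 + 0.7*2 = -1.6
best_action = max(["take_umbrella", "leave_umbrella"], key=expected_utility)
assert best_action == "take_umbrella"
```

Real agents differ in how beliefs are formed and how the action space is searched, but this argmax-over-expected-utility structure is the common core of the formalism.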
Editor note: there is work to be done reconciling this page, Agency page, and Robust Agents. Currently they overlap and I'm not sure they're consistent. - Ruby, 2020-09-15
More generally, an agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators.[#fn1 1]
There has been much discussion as to whether certain AGI designs can be made into mere tools or whether they will necessarily be agents which will attempt to actively carry out their goals. Any minds that actively engage in goal-directed behavior are potentially dangerous, due to considerations such as basic AI drives possibly causing behavior which is in conflict with humanity's values.
In Dreams of Friendliness and in Reply to Holden on Tool AI, Eliezer Yudkowsky argues that, since all intelligences select correct beliefs from the much larger space of incorrect beliefs, they are necessarily agents.
- Russel, S. & Norvig, P. (2003) Artificial Intelligence: A Modern Approach. Second Edition. Page 32.[#fnref1 ↩]
(Astronomical) suffering risks, also known as s-risks, are risks of the creation of intense suffering in the far future on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.
S-risks are an example of existential risk (also known as x-risks) according to Nick Bostrom's original definition, as they threaten to "permanently and drastically curtail [Earth-originating intelligent life's] potential". Most existential risks are of the form "event E happens which drastically reduces the number of conscious experiences in the future". S-risks therefore serve as a useful reminder that some x-risks are scary because they cause bad experiences, and not just because they prevent good ones.
Within the space of x-risks, we can distinguish x-risks that are also s-risks, x-risks involving human extinction, x-risks involving both immense suffering and human extinction, and x-risks that involve neither. For example:

|  | extinction risk | non-extinction risk |
|---|---|---|
| suffering risk | Misaligned AGI wipes out humans, simulates many suffering alien civilizations. | Misaligned AGI tiles the universe with experiences of severe suffering. |
| non-suffering risk | Misaligned AGI wipes out humans. | Misaligned AGI keeps humans as "pets," limiting growth but not causing immense suffering. |
A related concept is hyperexistential risk, the risk of "fates worse than death" on an astronomical scale. It is not clear whether all hyperexistential risks are s-risks per se. But arguably all s-risks are hyperexistential, since "tiling the universe with experiences of severe suffering" would likely be worse than death.
There are two EA organizations whose primary focus is s-risk prevention research: the Center on Long-Term Risk (CLR) and the Center for Reducing Suffering. Much of CLR's work is on suffering-focused AI safety and crucial considerations. The Machine Intelligence Research Institute and the Future of Humanity Institute have also investigated strategies to prevent s-risks, though to a much lesser extent.
Another approach to reducing s-risk is to "expand the moral circle" together with raising concern for suffering, so that future (post)human civilizations and AI are less likely to instrumentally cause suffering to non-human minds such as animals or digital sentience. Sentience Institute works on this value-spreading problem.
- Reducing Risks of Astronomical Suffering: A Neglected Global Priority (FRI)
- Introductory talk on s-risks (FRI)
- Risks of Astronomical Future Suffering (FRI)
- Suffering-focused AI safety: Why "fail-safe" measures might be our top intervention PDF (FRI)
- Artificial Intelligence and Its Implications for Future Suffering (FRI)
- Expanding our moral circle to reduce suffering in the far future (Sentience Politics)
- The Importance of the Far Future (Sentience Politics)
Causal Decision Theory (CDT) is a branch of decision theory which advises an agent to take the action that maximizes the causal consequences for the probability of desired outcomes [#fn1 1]. Like any branch of decision theory, it prescribes taking the action that maximizes expected utility: the probability-weighted sum of the utilities of each of the action's possible results. The branches differ in how actions are taken to influence those probabilities. Contrary to Evidential Decision Theory (EDT), CDT focuses on the causal relations between one’s actions and their outcomes, rather than on which actions provide evidence for desired outcomes. According to CDT, a rational agent should track the available causal relations linking its actions to the desired outcome and take the action that best enhances the chances of that outcome.
One common example where EDT and CDT diverge is the Smoking Lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT recommends smoking, since there is no causal connection between smoking and cancer: both are caused by a gene, but have no direct causal connection with each other. EDT, on the other hand, recommends against smoking, since smoking is evidence of having the lesion and should thus be avoided.
The core aspect of CDT is mathematically represented by the fact that it uses probabilities of conditionals in place of conditional probabilities [#fn2 2]. The conditional probability of B given A, written P(B|A), is the Bayesian probability that event B happens given that we know A happened; this is what EDT uses. The probability of a conditional, written P(A > B), is the probability that the conditional 'A implies B' is true, i.e. the probability that the counterfactual ‘If A, then B’ is the case. Since counterfactual analysis is the key tool for talking about causality, probabilities of conditionals are said to mirror causal relations. In most cases these two probabilities track each other, and CDT and EDT give the same answers. However, in some particular problems their prescriptions for rational action diverge, such as the Smoking Lesion problem, where CDT seems to give the more reasonable prescription, and Newcomb's problem, where CDT seems unreasonable. David Lewis proved [#fn3 3] that it is impossible for probabilities of conditionals to always track conditional probabilities. Hence, evidential relations are not the same as causal relations, and CDT and EDT will diverge in some cases.
- Lewis, David. (1981) "Causal Decision Theory," Australasian Journal of Philosophy 59 (1981): 5- 30.
- Lewis, D. (1976), "Probabilities of conditionals and conditional probabilities", The Philosophical Review (Duke University Press) 85 (3): 297–315
The long reflection is a hypothesized period of time during which humanity works out how best to realize its long-term potential.
Some effective altruists, including Toby Ord and William MacAskill, have argued that, if humanity succeeds in eliminating existential risk or reducing it to acceptable levels, it should not immediately embark on an ambitious and potentially irreversible project of arranging the universe's resources in accordance with its values, but ought instead to spend considerable time ("centuries (or more)"; "perhaps tens of thousands of years"; "thousands or millions of years"; "[p]erhaps... a million years") figuring out what is in fact of value. The long reflection may thus be seen as an intermediate stage in a rational long-term human developmental trajectory, following an initial stage of existential security, when existential risk is drastically reduced, and followed by a final stage when humanity's potential is fully realized.
The idea of a long reflection has been criticized on the grounds that virtually eliminating all existential risk will almost certainly require taking a variety of large-scale, irreversible decisions—related to space colonization, global governance, cognitive enhancement, and so on—which are precisely the decisions meant to be discussed during the long reflection. Since there are pervasive and inescapable tradeoffs between reducing existential risk and retaining moral option value, it may be argued that it does not make sense to frame humanity's long-term strategic picture as one consisting of two distinct stages, with one taking precedence over the other.
Aird, Michael (2020) Collection of sources that are highly relevant to the idea of the Long Reflection, Effective Altruism Forum, June 20.
Many additional resources on this topic.
Wiblin, Robert & Keiran Harris (2018) Our descendants will probably see us as moral monsters. What should we do about that?, 80,000 Hours, January 19.
Interview with William MacAskill about the long reflection and other topics.
Ord, Toby (2020) The Precipice: Existential Risk and the Future of Humanity, London: Bloomsbury Publishing.
Greaves, Hilary et al. (2019) A research agenda for the Global Priorities Institute, Oxford.
Dai, Wei (2019) The argument from philosophical difficulty, LessWrong, February 9.
William MacAskill, in Perry, Lucas (2018) AI alignment podcast: moral uncertainty and the path to AI alignment with William MacAskill, AI Alignment podcast, September 17.
Stocker, Felix (2020) Reflecting on the long reflection, Felix Stocker’s Blog, August 14.
Hanson, Robin (2021) ‘Long reflection’ is crazy bad idea, Overcoming Bias, October 20.
The windfall clause is pretty well explained on the Future of Humanity Institute site.
Here's a quick summary:
It is an agreement between AI firms to donate significant amounts of any profits made as a consequence of economically transformative breakthroughs in AI capabilities. The donations are intended to help benefit humanity.
An aligned superintelligence will have a set of human values. As mentioned in What are "human values"?, this set of values is complex, which means that the implementation of these values will determine whether the superintelligence cares about nonhuman animals. In AI Ethics and Value Alignment for Nonhuman Animals, Soenke Ziesche argues that alignment should include the values of nonhuman animals.
Ajeya Cotra has written an excellent article named Why AI alignment could be hard with modern deep learning on this question.
Many parts of the AI alignment ecosystem are already well-funded, but a savvy donor can still make a difference by picking up grantmaking opportunities which are too small to catch the attention of the major funding bodies or are based on personal knowledge of the recipient.
One way to leverage a small amount of money to the potential of a large amount is to enter a donor lottery, where you donate to win a chance to direct a much larger amount of money (with probability proportional to donation size). This means that the person directing the money will be allocating enough that it's worth their time to do more in-depth research.
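The expectation argument behind donor lotteries can be sketched with toy numbers (illustrative amounts, not real fund sizes):

```python
# Donor lottery: your chance of directing the whole pot is proportional to
# your contribution, so the expected amount you direct equals your donation.

my_donation = 5_000
pot = 100_000  # total contributed by all participants, including you

p_win = my_donation / pot            # probability you direct the pot
expected_directed = p_win * pot      # equals my_donation in expectation
assert abs(expected_directed - my_donation) < 1e-6

# The upside: with 5% probability you direct the full $100,000, which
# justifies spending serious time on in-depth research into where it goes.
```

In expectation nothing is lost by entering, but the research effort is concentrated on the winner, who allocates a sum large enough to make careful investigation worthwhile.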
For an overview of the work the major organizations are doing, see the 2021 AI Alignment Literature Review and Charity Comparison. The Long-Term Future Fund seems to be an outstanding place to donate based on that, as they are the organization which most other organizations are most excited to see funded.
In the words of Nate Soares:
I don’t expect humanity to survive much longer.
Often, when someone learns this, they say:
"Eh, I think that would be all right."
So allow me to make this very clear: it would not be "all right."
Imagine a little girl running into the road to save her pet dog. Imagine she succeeds, only to be hit by a car herself. Imagine she lives only long enough to die in pain.
Though you may imagine this thing, you cannot feel the full tragedy. You can’t comprehend the rich inner life of that child. You can’t understand her potential; your mind is not itself large enough to contain the sadness of an entire life cut short.
You can only catch a glimpse of what is lost—
—when one single human being dies.
Now tell me again how it would be "all right" if every single person were to die at once.
Many people, when they picture the end of humankind, pattern match the idea to some romantic tragedy, where humans, with all their hate and all their avarice, had been unworthy of the stars since the very beginning, and deserved their fate. A sad but poignant ending to our tale.
And indeed, there are many parts of human nature that I hope we leave behind before we venture to the heavens. But in our nature is also everything worth bringing with us. Beauty and curiosity and love, a capacity for fun and growth and joy: these are our birthright, ours to bring into the barren night above.
Calamities seem more salient when unpacked. It is far harder to kill a hundred people in their sleep, with a knife, than it is to order a nuclear bomb dropped on Hiroshima. Your brain can’t multiply, you see: it can only look at a hypothetical image of a broken city and decide it’s not that bad. It can only conjure an image of a barren planet and say "eh, we had it coming."
But if you unpack the scenario, if you try to comprehend all the lives snuffed out, all the children killed, the final spark of human joy and curiosity extinguished, all our potential squandered…
I promise you that the extermination of humankind would be horrific.