Untagged answers

From Stampy's Wiki

These 23 canonical answers don't have any tags, please add some!

Humans care about things! The reward circuitry in our brain reliably causes us to care about specific things. Let's create a mechanistic model of how the brain aligns humans, and then we can use this to do AI alignment.

One perspective that Shard theory has added is that we shouldn't think of the solution to alignment as:

  1. Find an outer objective that is fine to optimize arbitrarily strongly
  2. Find a way of making sure that the inner objective of an ML system equals the outer objective.

Shard theory argues that instead we should focus on finding outer objectives that reliably instill certain inner values into the system; the outer objective should be thought of as more of a teacher of the values we want to instill than as the values themselves. Reward is not the optimization target; instead, it is more like that which reinforces. People sometimes refer to inner-aligning an RL agent with respect to the reward signal, but this doesn't actually make sense. (As pointed out in the comments this is not a new insight, but Shard theory phrased it a lot more clearly for me.)
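
As a minimal illustration of "reward as that which reinforces" (my own sketch, not from the shard theory posts), here is a REINFORCE-style bandit update: the reward only scales how strongly the computation that produced the chosen action gets strengthened, and nothing in the learned policy has to represent reward as an explicit goal.

  import numpy as np

  rng = np.random.default_rng(0)
  logits = np.zeros(3)                           # the policy's action preferences
  reward_per_action = np.array([0.1, 0.5, 1.0])  # toy environment
  lr = 0.1

  def softmax(x):
      e = np.exp(x - x.max())
      return e / e.sum()

  for step in range(2000):
      probs = softmax(logits)
      action = rng.choice(3, p=probs)
      reward = reward_per_action[action]
      # Reward acts as a reinforcer: it multiplies the update that makes the
      # chosen action more likely. The policy is shaped by reward without
      # "wanting" reward as an explicit target.
      grad_log_pi = -probs
      grad_log_pi[action] += 1.0
      logits += lr * reward * grad_log_pi

  print(softmax(logits))   # probability mass shifts toward the higher-reward actions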

Humans end up with values other than "maximize the reward circuitry in my brain", yet those values are still pointed at reliably. These underlying values cause us not to wirehead with respect to the outer optimizer of reward.

Shard Theory points at the beginning of a mechanistic story for how inner values are selected for by outer optimization pressures. The current plan is to figure out how RL induces inner values into learned agents, and then figure out how to instill human values into powerful AI models (probably chain of thought LLMs, because these are the most intelligent models right now). Then, use these partially aligned models to solve the full alignment problem. Shard theory also proposes a subagent theory of mind.

This has some similarities to Brain-like AGI Safety, and draws on some research from that agenda, such as the mechanics of the human reward circuitry, as well as the brain being mostly randomly initialized at birth.

Stamps: None

Tags: None (add tags)

AI safety is a research field whose goal is to avoid bad outcomes from AI systems.

Work on AI safety can be divided into near-term AI safety, and AI existential safety, which is strongly related to AI alignment:

  • Near-term AI safety is about preventing bad outcomes from current systems. Examples of work on near-term AI safety include:
    • getting content recommender systems to not radicalize their users
    • ensuring autonomous cars don’t kill people
    • advocating strict regulations for lethal autonomous weapons
  • AI existential safety, or AGI safety, is about reducing the existential risk from artificial general intelligence (AGI). Artificial general intelligence is AI that is at least as competent as humans in all skills that are relevant for making a difference in the world. AGI has not been developed yet, but will likely be developed this century. A central part of AGI safety is ensuring that what AIs do is actually what we want. This is called AI alignment (also often just called alignment), because it’s about aligning an AI with human values. Alignment is difficult, and building AGI is probably very dangerous, so it is important to mitigate the risks as much as possible. Examples of work on AI existential safety include:
    • trying to get a foundational understanding of what intelligence is, e.g. agent foundations
    • Outer and inner alignment: Ensure the objective of the training process is actually what we want, and also ensure the objective of the resulting system is actually what we want.
    • AI policy/strategy: e.g. researching the best way to set up institutions and mechanisms that help with safe AGI development, making sure AI isn’t used by bad actors

There are also areas of research which are useful for both near-term and existential safety. For example, robustness to distribution shift and interpretability both help make current systems safer, and are likely to help with AGI safety.

Stamps: None

Tags: None (add tags)

Once a system is at least as capable as top humans at AI research, it would tend to become the driver of its own development and initiate a process of recursive self-improvement known as the intelligence explosion, leading to an extremely powerful system. A general framing of this process is Open Philanthropy's Process for Automating Scientific and Technological Advancement (PASTA).

There is much debate about whether there would be a notable period where the AI was partially driving its own development, with humans gradually becoming less and less important, or whether the transition to AI-automated AI capabilities research would be sudden. However, the core idea that there is some threshold of capabilities beyond which a system would begin to rapidly ascend is hard to reasonably dispute, and is a significant consideration for developing alignment strategies.
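
As a toy numerical illustration only (my own made-up dynamics, not a forecast), a simple feedback model shows why such a threshold produces a sharp ascent: once the AI's own contribution to AI research outweighs the human contribution, growth compounds on itself.

  # Toy model: total research output = fixed human output + AI output,
  # where the AI's output grows with its current capability (assumed quadratic here).
  human_output = 1.0
  capability = 0.0
  for year in range(25):
      ai_output = capability ** 2
      capability += 0.1 * (human_output + ai_output)
      print(year, round(capability, 3))
  # Growth is roughly linear while ai_output << human_output,
  # then explodes once the AI's contribution takes over.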

Stamps: Aprillion, plex

Tags: None (add tags)

See Encultured AI: Building a Video Game.

Encultured are making a multiplayer online video game as a test environment for AI: an aligned AI should be able to play the game without ruining the fun or doing something obviously destructive like completely taking over the world, even if it has the capability to do so. This seems roughly analogous to setting an AGI loose on the real world.

Motivation: Andrew Critch is primarily concerned about a multipolar AI scenario: there are multiple actors with comparably powerful AI, on the cusp of recursive self-improvement. The worst case is a race: even though each actor would want to take more time checking their AGI for safety, the worry that another actor will deploy first pushes each actor to take shortcuts and try to pull off a world-saving act. Instead of working directly on AI, which can accelerate timelines and encourage racing, creating a standardized test environment where alignment failures are observable is one component of a good global outcome.

Stamps: None

Tags: None (add tags)

Humans Consulting HCH (HCH) is a recursive acronym describing a setup where humans can consult simulations of themselves to help answer questions. It is a concept used in discussion of the iterated amplification proposal to solve the alignment problem.

It was first described by Paul Christiano in his post Humans Consulting HCH:

Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine.

That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh…

Let’s call this process HCH, for “Humans Consulting HCH.”
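
The recursion is easy to state in code. A minimal sketch (my own illustration, with a hypothetical ask_human callable standing in for the human or a model imitating them, and an artificial depth limit; HCH proper is the idealized unbounded version):

  def hch(question, depth, ask_human):
      """Answer `question` the way the human would, given the ability to
      consult further copies of this same process on sub-questions."""
      if depth == 0:
          return ask_human(question, consult=None)   # cutoff: answer unaided

      def consult(sub_question):
          return hch(sub_question, depth - 1, ask_human)

      # The human answers, optionally consulting copies of themselves, each of
      # which can consult further copies, and so on.
      return ask_human(question, consult=consult)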

Stamps: plex

Tags: None (add tags)

ARC is trying to solve Eliciting Latent Knowledge (ELK). Suppose that you are training an AI agent, called a predictor, that predicts the state of the world and then performs some actions. This predictor is the AGI that will be acting to accomplish goals in the world. How can you create another model, called a reporter, that tells you what the predictor believes about the world? A key challenge is that training the reporter on human-labeled data, by default, incentivizes it to report what the human labeler thinks is true rather than what the predictor knows, because simulating the human can be simpler than translating the predictor's knowledge.
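
A schematic of the setup (my own sketch in PyTorch; the module shapes, the GRU world model, and the "diamond in the vault" label are all made up for illustration) makes the failure mode concrete:

  import torch
  import torch.nn as nn

  predictor = nn.GRU(input_size=32, hidden_size=256, batch_first=True)  # frozen world model
  reporter = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 1))
  opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
  loss_fn = nn.BCEWithLogitsLoss()

  def train_step(observations, human_labels):
      # observations: (batch, time, 32); human_labels: (batch, 1) in {0, 1},
      # e.g. "does the human think the diamond is still in the vault?"
      with torch.no_grad():
          _, latent = predictor(observations)     # the predictor's internal state
      logits = reporter(latent.squeeze(0))        # the reporter reads that state
      loss = loss_fn(logits, human_labels.float())
      # Nothing here forces the reporter to translate what the predictor knows;
      # a "human simulator" that just reproduces the labeler's judgment fits
      # this objective equally well, and may be simpler to learn.
      opt.zero_grad()
      loss.backward()
      opt.step()
      return loss.item()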

Motivation: At a high level, Paul's plan seems to be to produce a minimal AI that can help do AI safety research. To do this, preventing deception and inner alignment failure is on the critical path, and the only known solution paths require interpretability (this is how all of Evan's 11 proposals plan to get around this problem).

If ARC can solve ELK, this would be a very strong form of interpretability: our reporter is able to tell us what the predictor believes about the world. Some ways this could end up being useful for aligning the predictor include:

  • Using the reporter to find deceptive/misaligned thoughts in the predictor, and then optimizing against those interpreted thoughts. At any given point in time, SGD only updates the weights a small amount. If an AI becomes misaligned, it won't be very misaligned, and the interpretability tools will be able to figure this out and do a gradient step to make it aligned again. In this way, we can prevent deception at any point in training.
  • Stopping training if the AI is misaligned.
Stamps: None

Tags: None (add tags)

An AGI which has recursively self-improved into a superintelligence would be capable of either resisting our attempts to modify incorrectly specified goals, or realizing it was still weaker than us and acting deceptively aligned until it was highly sure it could win in a confrontation. AGI would likely prevent a human from shutting it down unless the AGI was designed to be corrigible. See Why can't we just turn the AI off if it starts to misbehave? for more information.

Stamps: tayler6000, plex

Tags: None (add tags)

See Vanessa's research agenda for more detail.

If we don't know how to do something given unbounded compute, we are just confused about the thing. Going from thinking that chess was impossible for machines to understanding minimax was a really good step forward for designing chess AIs, even though minimax is completely intractable.
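
For reference, minimax itself is only a few lines; here is a self-contained toy version on a trivial take-1-or-2-stones game (my own illustration). Conceptually the same exhaustive search solves chess, it is just astronomically intractable there.

  def minimax(state, maximizing):
      """Exact game value by exhaustive search. Toy game used purely for
      illustration: players alternately remove 1 or 2 stones; whoever takes
      the last stone wins (+1 for the maximizer, -1 for the minimizer)."""
      if state == 0:
          # The previous player took the last stone, so the player to move has lost.
          return -1 if maximizing else +1
      moves = [minimax(state - take, not maximizing) for take in (1, 2) if take <= state]
      return max(moves) if maximizing else min(moves)

  print(minimax(5, True))   # +1: the player to move can force a win from 5 stones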

Thus, we should seek to figure out how alignment might look in theory, and then try to bridge the theory-practice gap by making our proposal ever more efficient. The first step along this path is to figure out a universal reinforcement learning setting in which we can place our formal agents and prove regret bounds.
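
For concreteness, the standard notion of regret that such bounds concern (which the infra-Bayesian setting generalizes) compares the agent's accumulated reward against the best policy in hindsight:

  \mathrm{Regret}(T) \;=\; \max_{\pi}\, \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, \pi\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t \,\middle|\, \text{agent}\right]

A regret bound then shows that this gap grows sublinearly in T, so the agent's average performance approaches that of the best policy.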

A key problem in doing this is embeddedness. AIs can't have a perfect self model — this would be like imagining your ENTIRE brain, inside your brain. There are finite memory constraints. Infra-Bayesianism (IB) is essentially a theory of imprecise probability that lets you specify local / fuzzy things. IB allows agents to have abstract models of themselves, and thus works in an embedded setting.

Infra-Bayesian Physicalism (IBP) is an extension of this to RL. IBP allows us to

  • Figure out which agents are running [by evaluating the counterfactual where the agent's computation outputs something different, and seeing whether the physical universe differs].
  • Given a program, classify it as an agent or a non-agent, and then find its utility function.

Vanessa uses this formalism to describe PreDCA, an alignment proposal based on IBP. This proposal assumes that an agent is an IBP agent, meaning that it is an RL agent with fuzzy probability distributions (along with some other things). The general outline of this proposal is as follows:

  1. Find all of the agents that preceded the AI
  2. Discard all of these agents that are powerful / non-human like
  3. Find all of the utility functions in the remaining agents
  4. Use a combination of all of these utilities as the agent's utility function

Vanessa models an AI as a model-based RL system with a world model (WM), a reward function, and a policy derived from the WM plus reward. She claims that this avoids the sharp left turn. The generalization problems come from the world model, but this is dealt with by having an epistemology that doesn't contain bridge rules, so that the true world is the simplest explanation for the observed data.

It remains an open problem to show that this proposal also solves inner alignment, but there is some chance that it does.

This approach deviates from MIRI's plan, which is to focus on a narrow task to perform the pivotal act, and then add corrigibility. Vanessa instead tries to directly learn the user's preferences, and optimize those.

Stamps: None

Tags: None (add tags)

Dylan's PhD thesis argues three main claims (paraphrased):

  1. Outer alignment failures are a problem.
  2. We can mitigate this problem by adding in uncertainty.
  3. We can model this as Cooperative Inverse Reinforcement Learning (CIRL).

Thus, his motivating picture seems to be AGI arriving in some multi-agent form, heavily connected with human operators.
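
For reference, the CIRL setup from claim 3 is a two-player Markov game with a shared reward; paraphrasing the usual presentation, it is a tuple

  \big\langle S,\; \{A^{H}, A^{R}\},\; T(s' \mid s, a^{H}, a^{R}),\; \Theta,\; R(s, a^{H}, a^{R}; \theta),\; P_0(\theta),\; \gamma \big\rangle

where both the human H and the robot R act to maximize the same discounted reward, but only the human observes the reward parameter θ drawn from the prior P_0(θ). The robot's uncertainty over θ is what makes deferring to the human and accepting correction rational for it.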

We're not certain what he is currently working on, but some recent alignment-relevant papers that he has published include:

Dylan has also published a number of articles that seem less directly relevant for alignment.

Stamps: None

Tags: None (add tags)

DeepMind has both a ML safety team focused on near-term risks, and an alignment team that is working on risks from AGI. The alignment team is pursuing many different research avenues, and is not best described by a single agenda.

Some of the work they are doing is:

See Rohin's comment for more research that they are doing, including descriptions of some work that is currently unpublished.

Stamps: None

Tags: None (add tags)

One of the key problems in AI safety is that there are many ways for an AI to generalize off-distribution, so it is very likely that an arbitrary generalization will be unaligned. See the model splintering post for more detail. Stuart's plan to solve this problem is as follows:

  1. Maintain a set of all possible extrapolations of reward data that are consistent with the training process.
  2. Pick among these for a safe reward extrapolation.

They are currently working on algorithms to accomplish step 1: see Value Extrapolation.

Their initial operationalization of this problem is the lion and husky problem. Basically: you train an image model on a dataset of images of lions and huskies, where the lions are always in the desert and the huskies are always in the snow. The problem of learning a classifier is then under-defined: should the classifier be classifying based on the background environment (e.g. snow vs sand), or based on the animal in the image?

A good extrapolation algorithm, on this problem, would generate classifiers that extrapolate in all the different ways, so the 'correct' extrapolation must be in this generated set of classifiers. They have also introduced a new dataset with a similar idea: Happy Faces.
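
Step 1 could look something like the following sketch (my own illustration in PyTorch, closer to generic "diversify"-style methods than to Aligned AI's actual algorithm; the 64×64 image size, four heads, and diversity weight are made up): train several classifiers that all fit the labelled data while being penalized for agreeing with each other on unlabelled images where the spurious correlation is broken.

  import torch
  import torch.nn as nn

  def make_head():
      return nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256),
                           nn.ReLU(), nn.Linear(256, 2))

  heads = [make_head() for _ in range(4)]
  params = [p for h in heads for p in h.parameters()]
  opt = torch.optim.Adam(params, lr=1e-3)
  ce = nn.CrossEntropyLoss()

  def train_step(x_lab, y_lab, x_unlab, diversity_weight=0.1):
      # Fit the labelled (lion-in-desert / husky-in-snow) data with every head.
      fit_loss = sum(ce(h(x_lab), y_lab) for h in heads)

      # Push heads apart on unlabelled images where the spurious cue is broken
      # (e.g. lions in snow), by penalizing agreement between their predictions.
      probs = [torch.softmax(h(x_unlab), dim=-1) for h in heads]
      agreement = 0.0
      for i in range(len(probs)):
          for j in range(i + 1, len(probs)):
              agreement = agreement + (probs[i] * probs[j]).sum(dim=-1).mean()

      loss = fit_loss + diversity_weight * agreement
      opt.zero_grad()
      loss.backward()
      opt.step()
      return loss.item()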

Step 2 could be done in different ways. Possibilities include conservatism, generalized deference to humans, or an automated process for removing some goals, like wireheading/deception/killing everyone.

Stamps: None

Tags: None (add tags)

CLR is focused primarily on reducing suffering-risk (s-risk), where the future has a large negative value. They do foundational research in game theory / decision theory, primarily aimed at multipolar AI scenarios. One result relevant to this work is that transparency can increase cooperation.

Update after Jesse Clifton commented: CLR also works on improving coordination for prosaic AI scenarios, risks from malevolent actors and AI forecasting. The Cooperative AI Foundation (CAIF) shares personnel with CLR, but is not formally affiliated with CLR, and does not focus just on s-risks.

Stamps: None

Tags: None (add tags)

The goal of this agenda is to create a non-agentic AI, in the form of an LLM, that is capable of accelerating alignment research. The hope is that there is some window between AI smart enough to help us with alignment and the really scary, self-improving, consequentialist AI. Some things that this amplifier might do:

  • Suggest different ideas for humans, such that a human can explore them.
  • Give comments and feedback on research, be like a shoulder-Eliezer

An LLM can be thought of as learning the distribution over the next token given by the training data. Prompting the LM is then like conditioning this distribution on the start of the text. A key danger in alignment is applying unbounded optimization pressure towards a specific goal in the world. Conditioning a probability distribution does not behave like an agent applying optimization pressure towards a goal. Hence, this avoids Goodhart-related problems, as well as some inner alignment failures.
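
Spelled out in standard notation (with c denoting the prompt): the model is trained to approximate the next-token distribution, and generation just samples from that distribution conditioned on the prompt,

  p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad x_t \;\sim\; p_\theta(\,\cdot\, \mid c,\, x_{<t})

so generation re-weights what the learned distribution already contains, rather than searching over actions for whichever one best achieves a goal in the world.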

One idea for getting superhuman work out of LLMs is to train them on amplified datasets of really high-quality / difficult research. The key problem here is finding the dataset that allows for this.

There are some ways for this to fail:

  • Outer alignment: It starts trying to optimize for making the actual correct next token, which could mean taking over the planet so that it can spend a zillion FLOPs on this one prediction task to be as correct as possible.
  • Inner alignment:
    • An LLM might instantiate mesa-optimizers, such as a character in a story that the LLM is writing, and this optimizer might realize that it is in an LLM and try to break out and affect the real world.
    • The LLM itself might become inner misaligned and have a goal other than next token prediction.
  • Bad prompting: You ask it for code for a malign superintelligence; it obliges. (Or perhaps more realistically, capabilities).

Conjecture are aware of these problems and are running experiments. Specifically, one operationalization of the inner alignment problem is to make an LLM play chess. This (probably) requires simulating an optimizer trying to win the game, and they are trying to use interpretability tools to find that mesa-optimizer inside the chess-playing LLM. We haven't ever found a real mesa-optimizer before, so this could give loads of bits about the nature of inner alignment failure.

Stamps: None

Tags: None (add tags)

AI alignment is the research field focused on trying to give us the tools to align AIs to specific goals, such as human values. This is crucial when they are highly competent, as a misaligned superintelligence could be the end of human civilization.

AGI safety is the field trying to make sure that when we build Artificial General Intelligences they are safe and do not harm humanity. It overlaps with AI alignment strongly, in that misalignment of AI would be the main cause of unsafe behavior in AGIs, but also includes misuse and other governance issues.

AI existential safety is a slightly broader term than AGI safety, including AI risks which pose an existential threat without necessarily being as general as humans.

AI safety was originally used by the existential risk reduction movement for work done to reduce the risks of misaligned superintelligence, but in recent years it has also been adopted by researchers and others studying nearer-term and less catastrophic risks from AI.

Stamps: Damaged, plex

Tags: None (add tags)

MIRI thinks technical alignment is really hard, and that we are very far from a solution. However, they think that policy solutions have even less hope. Generally, I think of their approach as supporting a bunch of independent researchers following their own directions, hoping that one of them will find some promise. They mostly buy into the security mindset: we need to know exactly (probably mathematically formally) what we are doing, or the massive optimization pressure will result in ruin by default.

How does MIRI communicate their view on alignment?

Recently they've been trying to communicate their worldview, in particular, how incredibly doomy they are, perhaps in order to move other research efforts towards what they see as the hard problems.

Stamps: None

Tags: None (add tags)

John's plan is:

Step 1: sort out our fundamental confusions about agency

Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)

Step 3: …

Step 4: profit!

… and do all that before AGI kills us all.

He is working on step 1: figuring out what the heck is going on with agency. His current approach is based on selection theorems: try to figure out what types of agents are selected for in a broad range of environments. Examples of selection pressures include evolution, SGD, and markets. This is an approach to agent foundations that comes from the opposite direction to MIRI's: it's more about observing existing structures (whether they be mathematical, or real things in the world like markets or E. coli), whereas MIRI is trying to write out some desiderata and then find mathematical notions that satisfy those desiderata.

Two key properties that might be selected for are modularity and abstractions.

Abstractions are higher-level concepts that people use to describe things, like "Tree" and "Chair" and "Person". These are all vague categories that contain lots of different things, but are really useful for narrowing things down. Humans tend to use really similar abstractions, even across different cultures / societies. The Natural Abstraction Hypothesis (NAH) states that a wide variety of cognitive architectures will tend to use similar abstractions to reason about the world. This might be helpful for alignment because we could say things like "person" without having to rigorously and precisely say exactly what we mean by person.

The NAH seems very plausibly true for physical objects in the world, and so it might be true for the inputs to human values. If so, it would be really helpful for AI alignment, because understanding this would amount to a solution to the ontology identification problem: we could understand when environments induce certain abstractions, and design the training setup so that the network has the same abstractions as humans.

Modularity: In pretty much any selection environment, we see lots of obvious modularity. Biological species have cells and organs and limbs. Companies have departments. We might expect neural networks to be similar, but it is really hard to find modules in neural networks. We need to find the right lens to look through to find this modularity in neural networks. Aiming at this can lead us to really good interpretability.
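
One example of such a lens, as an illustrative sketch only (in the spirit of published work on clusterability of neural networks, not a specific proposal of John's): treat neurons as nodes of a weighted graph whose edge weights are the absolute connection strengths, and ask whether the graph splits into weakly-interacting clusters.

  import numpy as np
  from sklearn.cluster import SpectralClustering

  def neuron_adjacency(weight_matrices):
      # weight_matrices[k] maps layer k -> layer k+1, with shape (out_dim, in_dim).
      sizes = [w.shape[1] for w in weight_matrices] + [weight_matrices[-1].shape[0]]
      n = sum(sizes)
      adj = np.zeros((n, n))
      offsets = np.cumsum([0] + sizes)
      for k, w in enumerate(weight_matrices):
          a, b = offsets[k], offsets[k + 1]
          block = np.abs(w).T                    # (in_dim, out_dim) connection strengths
          adj[a:a + w.shape[1], b:b + w.shape[0]] = block
          adj[b:b + w.shape[0], a:a + w.shape[1]] = block.T
      return adj

  weights = [np.random.randn(20, 10), np.random.randn(5, 20)]   # toy 10-20-5 MLP
  labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                              random_state=0).fit_predict(neuron_adjacency(weights))
  print(labels)   # cluster assignment per neuron; little cross-cluster weight would indicate modularity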

Stamps: None

Tags: None (add tags)

The problem isn’t consciousness, but competence. You make machines that are incredibly competent at achieving objectives and they will cause accidents in trying to achieve those objectives. - Stuart Russell

Work on AI alignment is not concerned with the question of whether “consciousness”, “sentience” or “self-awareness” could arise in a machine or an algorithm. Unlike the frequently-referenced plotline in the Terminator movies, the standard catastrophic misalignment scenarios under discussion do not require computers to become conscious; they only require conventional computer systems (although usually faster and more powerful ones than those available today) blindly and deterministically following logical steps, in the same way that they currently do.

The primary concern (“AI misalignment”) is that powerful systems could inadvertently be programmed with goals that do not fully capture what the programmers actually want. The AI would then harm humanity in pursuit of goals which seemed benign or neutral. Nothing like malevolence or consciousness would need to be involved. A number of researchers studying the problem have concluded that it is surprisingly difficult to guard against this effect, and that it is likely to get much harder as the systems become more capable. AI systems are inevitably goal-directed and could, for example, consider our efforts to control them (or switch them off) as being impediments to attaining their goals.

Stamps: None

Tags: None (add tags)

I don't know much about their research here, other than that they train their own models, which allows them to work on models that are bigger than the biggest publicly available models; this seems like a difference from Redwood.

Current interpretability methods are very low level (e.g., "what does x neuron do"), which does not help us answer high level questions like "is this AI trying to kill us".

They are trying a bunch of weird approaches, with the goal of scalable mechanistic interpretability, but I do not know what these approaches actually are.

Motivation: Conjecture wants to build towards a better paradigm that will give us a lot more information, primarily from the empirical direction (as distinct from ARC, which is working on interpretability with a theoretical focus).

Stamps: None

Tags: None (add tags)

In the words of Nate Soares:

I don’t expect humanity to survive much longer.

Often, when someone learns this, they say:
"Eh, I think that would be all right."

So allow me to make this very clear: it would not be "all right."

Imagine a little girl running into the road to save her pet dog. Imagine she succeeds, only to be hit by a car herself. Imagine she lives only long enough to die in pain.

Though you may imagine this thing, you cannot feel the full tragedy. You can’t comprehend the rich inner life of that child. You can’t understand her potential; your mind is not itself large enough to contain the sadness of an entire life cut short.

You can only catch a glimpse of what is lost—
—when one single human being dies.

Now tell me again how it would be "all right" if every single person were to die at once.

Many people, when they picture the end of humankind, pattern match the idea to some romantic tragedy, where humans, with all their hate and all their avarice, had been unworthy of the stars since the very beginning, and deserved their fate. A sad but poignant ending to our tale.

And indeed, there are many parts of human nature that I hope we leave behind before we venture to the heavens. But in our nature is also everything worth bringing with us. Beauty and curiosity and love, a capacity for fun and growth and joy: these are our birthright, ours to bring into the barren night above.

Calamities seem more salient when unpacked. It is far harder to kill a hundred people in their sleep, with a knife, than it is to order a nuclear bomb dropped on Hiroshima. Your brain can’t multiply, you see: it can only look at a hypothetical image of a broken city and decide it’s not that bad. It can only conjure an image of a barren planet and say "eh, we had it coming."

But if you unpack the scenario, if you try to comprehend all the lives snuffed out, all the children killed, the final spark of human joy and curiosity extinguished, all our potential squandered…

I promise you that the extermination of humankind would be horrific.

Stamps: None

Tags: None (add tags)

Causal Decision Theory – CDT – is a branch of decision theory which advises an agent to take actions which maximize the causal consequences on the probability of desired outcomes [1]. As any branch of decision theory, it prescribes taking the action that maximizes expected utility, i.e. the action which maximizes the sum of the utility obtained in each outcome weighted by the probability of that outcome occurring, given your action. Different decision theories correspond to different ways of construing this dependence between actions and outcomes. CDT focuses on the causal relations between one’s actions and outcomes, whilst Evidential Decision Theory – EDT – concerns itself with what an action indicates about the world (which is operationalized by the conditional probability). That is, according to CDT, a rational agent should track the available causal relations linking their actions to the desired outcome and take the action which will best enhance the chances of the desired outcome.


One usual example where EDT and CDT diverge is the Smoking lesion: “Smoking is strongly correlated with lung cancer, but in the world of the Smoker's Lesion this correlation is understood to be the result of a common cause: a genetic lesion that tends to cause both smoking and cancer. Once we fix the presence or absence of the lesion, there is no additional correlation between smoking and cancer. Suppose you prefer smoking without cancer to not smoking without cancer, and prefer smoking with cancer to not smoking with cancer. Should you smoke?” CDT would recommend smoking, since there is no causal connection between smoking and cancer: both are caused by the genetic lesion, but have no direct causal connection with each other. EDT, on the other hand, would recommend against smoking, since smoking is evidence of having the lesion and thus should be avoided.

The core aspect of CDT is mathematically represented by the fact that it uses probabilities of conditionals in place of conditional probabilities [2]. The probability of a conditional is the probability of the whole conditional being true, whereas the conditional probability is the probability of the consequent given the antecedent. A conditional probability of B given A, P(B|A), is simply the Bayesian probability of event B happening given that we know A happened; it is used in EDT. The probability of a conditional, P(A > B), refers to the probability that the conditional 'A implies B' is true; it is the probability of the counterfactual 'If A, then B' being the case. Since counterfactual analysis is the key tool used to speak about causality, probabilities of conditionals are said to mirror causal relations. In most cases these two probabilities track each other, and CDT and EDT give the same answers. However, some particular problems have arisen where their prescriptions for rational action diverge, such as the Smoking lesion problem, where CDT seems to give the more reasonable prescription, and Newcomb's problem, where CDT seems unreasonable. David Lewis proved [3] that it is impossible for probabilities of conditionals to always track conditional probabilities. Hence, evidential relations aren't the same as causal relations, and CDT and EDT will diverge in some cases.
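
In the notation above, writing U(O) for the utility of outcome O, the two theories differ only in which probability weights the outcomes when evaluating an action A:

  V_{\mathrm{EDT}}(A) \;=\; \sum_{O} P(O \mid A)\; U(O)
  \qquad\qquad
  V_{\mathrm{CDT}}(A) \;=\; \sum_{O} P(A > O)\; U(O)

In the Smoking lesion, P(cancer | smoke) is raised by the evidential link through the lesion, while P(smoke > cancer) is not, which is why the two recommendations come apart.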

References

  1. http://plato.stanford.edu/entries/decision-causal/
  2. Lewis, David (1981). "Causal Decision Theory", Australasian Journal of Philosophy 59: 5–30.
  3. Lewis, David (1976). "Probabilities of Conditionals and Conditional Probabilities", The Philosophical Review 85 (3): 297–315.

Stamps: None

Tags: None (add tags)

Orgasmium (also known as hedonium) is a homogeneous substance with limited consciousness, which is in a constant state of supreme bliss. An AI programmed to "maximize happiness" might simply tile the universe with orgasmium. Some who believe this consider it a good thing; others do not. Those who do not, use its undesirability to argue that not all terminal values reduce to "happiness" or some simple analogue. Hedonium is the hedonistic utilitarian's version of utilitronium.

Stamps: None

Tags: None (add tags)

(Astronomical) suffering risks, also known as s-risks, are risks of the creation of intense suffering in the far future on an astronomical scale, vastly exceeding all suffering that has existed on Earth so far.

S-risks are an example of existential risk (also known as x-risks) according to Nick Bostrom's original definition, as they threaten to "permanently and drastically curtail [Earth-originating intelligent life's] potential". Most existential risks are of the form "event E happens which drastically reduces the number of conscious experiences in the future". S-risks therefore serve as a useful reminder that some x-risks are scary because they cause bad experiences, and not just because they prevent good ones.

Within the space of x-risks, we can distinguish x-risks that are s-risks, x-risks involving human extinction, x-risks that involve immense suffering and human extinction, and x-risks that involve neither. For example:

  • Suffering risk, extinction risk: Misaligned AGI wipes out humans, simulates many suffering alien civilizations.
  • Suffering risk, non-extinction risk: Misaligned AGI tiles the universe with experiences of severe suffering.
  • Non-suffering risk, extinction risk: Misaligned AGI wipes out humans.
  • Non-suffering risk, non-extinction risk: Misaligned AGI keeps humans as "pets," limiting growth but not causing immense suffering.

A related concept is hyperexistential risk, the risk of "fates worse than death" on an astronomical scale. It is not clear whether all hyperexistential risks are s-risks per se. But arguably all s-risks are hyperexistential, since "tiling the universe with experiences of severe suffering" would likely be worse than death.

There are two EA organizations with s-risk prevention research as their primary focus: the Center on Long-Term Risk (CLR) and the Center for Reducing Suffering. Much of CLR's work is on suffering-focused AI safety and crucial considerations. The Machine Intelligence Research Institute and the Future of Humanity Institute have also investigated strategies to prevent s-risks, although to a much lesser extent.

Another approach to reducing s-risk is to "expand the moral circle" together with raising concern for suffering, so that future (post)human civilizations and AI are less likely to instrumentally cause suffering to non-human minds such as animals or digital sentience. Sentience Institute works on this value-spreading problem.

 

Stamps: None

Tags: None (add tags)

Pivotal acts are acts that substantially change the direction humanity will have taken in 1 billion years. The term is used to denote positive changes, as opposed to existential catastrophe.

An obvious pivotal act would be to create a sovereign AGI aligned with humanity's best interests. An act that would greatly increase the chance of another pivotal act would also count as pivotal.

Pivotal acts often lie outside the Overton window. One such example is stopping or strongly delaying the development of an unaligned (or any) AGI through drastic means such as nanobots which melt all advanced processors, or the disabling of all AI researchers. Eliezer mentions these in AGI Ruin: A List of Lethalities. Andrew Critch argues against such a unilateral pivotal act in “Pivotal Act” Intentions: Negative Consequences and Fallacious Arguments.

For more details, see Arbital.

Stamps: None

Tags: None (add tags)