Canonical answers to non-canonical questions
If there are answers below then they are canonical answers attached to a non-canonical question. Maybe we should do something about that, like marking them as non-canonical (if they should not be served to readers), or marking their question as canonical (check the guidelines on review answers for how to do this properly).
There are 9 of these.
That is, if you know an AI is likely to be superintelligent, can’t you just disconnect it from the Internet, not give it access to any speakers that can make mysterious buzzes and hums, make sure the only people who interact with it are trained in caution, et cetera?. Isn’t there some level of security – maybe the level we use for that room in the CDC where people in containment suits hundreds of feet underground analyze the latest superviruses – with which a superintelligence could be safe?
This puts us back in the same situation as lions trying to figure out whether or not nuclear weapons are a things humans can do. But suppose there is such a level of security. You build a superintelligence, and you put it in an airtight chamber deep in a cave with no Internet connection and only carefully-trained security experts to talk to. What now?
Now you have a superintelligence which is possibly safe but definitely useless. The whole point of building superintelligences is that they’re smart enough to do useful things like cure cancer. But if you have the monks ask the superintelligence for a cancer cure, and it gives them one, that’s a clear security vulnerability. You have a superintelligence locked up in a cave with no way to influence the outside world except that you’re going to mass produce a chemical it gives you and inject it into millions of people.
Or maybe none of this happens, and the superintelligence sits inert in its cave. And then another team somewhere else invents a second superintelligence. And then a third team invents a third superintelligence. Remember, it was only about ten years between Deep Blue beating Kasparov, and everybody having Deep Blue – level chess engines on their laptops. And the first twenty teams are responsible and keep their superintelligences locked in caves with carefully-trained experts, and the twenty-first team is a little less responsible, and now we still have to deal with a rogue superintelligence.
Superintelligences are extremely dangerous, and no normal means of controlling them can entirely remove the danger.
Humans care about things! The reward circuitry in our brain reliably causes us to care about specific things. Let's create a mechanistic model of how the brain aligns humans, and then we can use this to do AI alignment.
One perspective that Shard theory has added is that we shouldn't think of the solution to alignment as:
- Find an outer objective that is fine to optimize arbitrarily strongly
- Find a way of making sure that the inner objective of an ML system equals the outer objective.
Shard theory argues that instead we should focus on finding outer objectives that reliably give certain inner values into system and should be thought of as more of a teacher of the values we want to instill as opposed to the values themselves. Reward is not the optimization target — instead, it is more like that which reinforces. People sometimes refer to inner aligning an RL agent with respect to the reward signal, but this doesn't actually make sense. (As pointed out in the comments this is not a new insight, but it was for me phrased a lot more clearly in terms of Shard theory).
Humans have different values than the reward circuitry in our brain being maximized, but they are still pointed reliably. These underlying values cause us to not wirehead with respect to the outer optimizer of reward.
Shard Theory points at the beginning of a mechanistic story for how inner values are selected for by outer optimization pressures. The current plan is to figure out how RL induces inner values into learned agents, and then figure out how to instill human values into powerful AI models (probably chain of thought LLMs, because these are the most intelligent models right now). Then, use these partially aligned models to solve the full alignment problem. Shard theory also proposes a subagent theory of mind.
This has some similarities to Brain-like AGI Safety, and has drawn on some research from this post, such as the mechanics of the human reward circuitry as well as the brain being mostly randomly initialized at birth.
There is a general consensus that any AGI would be very dangerous because not necessarilly aligned. But if the AGI does not have any reward function and is a pattern matcher like GPT, how would it go about to leading to X-risks/not being able to be put into a box/shut down?
I can definitely imagine it being dangerous, or it having continuity in its answer which might be problematic, but the whole going exponential and valuing its own survival does not seem to necessarilly apply?
One threat model which includes a GPT component is Misaligned Model-Based RL Agent. It suggests that a reinforcement learner attached to a GPT-style world model could lead to an existential risk, with the RL agent being the optimizer which uses the world model to be much more effective at achieving its goals.
Another possibility is that a sufficiently powerful world model may develop mesa optimizers which could influence the world via the outputs of the model to achieve the mesa objective (perhaps by causing an optimizer to be created with goals aligned to it), though this is somewhat speculative.
Unless there was a way to cryptographically ensure otherwise, whoever runs the emulation has basically perfect control over their environment and can reset them to any state they were previously in. This opens up the possibility of powerful interrogation and torture of digital people.
Imperfect uploading might lead to damage that causes the EM to suffer while still remaining useful enough to be run for example as a test subject for research. We would also have greater ability to modify digital brains. Edits done for research or economic purposes might cause suffering. See this fictional piece for an exploration of how a world with a lot of EM suffering might look like.
These problems are exacerbated by the likely outcome that digital people can be run much faster than biological humans, so it would be plausibly possible to have an EM run for hundreds of subjective years in minutes or hours without having checks on the wellbeing of the EM in question.
I don't know much about their research here, other than that they train their own models, which allow them to work on models that are bigger than the biggest publicly available models, which seems like a difference from Redwood.
Current interpretability methods are very low level (e.g., "what does x neuron do"), which does not help us answer high level questions like "is this AI trying to kill us".
They are trying a bunch of weird approaches, with the goal of scalable mechanistic interpretability, but I do not know what these approaches actually are.
Motivation: Conjecture wants to build towards a better paradigm that will give us a lot more information, primarily from the empirical direction (as distinct from ARC, which is working on interpretability with a theoretical focus).
John's plan is:
Step 1: sort out our fundamental confusions about agency
Step 2: ambitious value learning (i.e. build an AI which correctly learns human values and optimizes for them)
Step 3: …
Step 4: profit!
… and do all that before AGI kills us all.
He is working on step 1: figuring out what the heck is going on with agency. His current approach is based on selection theorems: try to figure out what types of agents are selected for in a broad range of environments. Examples of selection pressures include: evolution, SGD, and markets. This is an approach to agent foundations that comes from the opposite direction as MIRI: it's more about observing existing structures (whether they be mathematical or real things in the world like markets or e coli), whereas MIRI is trying to write out some desiderata and then finding mathematical notions that satisfy those desiderata.
Two key properties that might be selected for are modularity and abstractions.
Abstractions are higher level things that people tend to use to describe things. Like "Tree" and "Chair" and "Person". These are all vague categories that contain lots of different things, but are really useful for narrowing down things. Humans tend to use really similar abstractions, even across different cultures / societies. The Natural Abstraction Hypothesis (NAH) states that a wide variety of cognitive architectures will tend to use similar abstractions to reason about the world. This might be helpful for alignment because we could say things like "person" without having to rigorously and precisely say exactly what we mean by person.
The NAH seems very plausibly true for physical objects in the world, and so it might be true for the inputs to human values. If so, it would be really helpful for AI alignment because understanding this would amount to a solution to the ontology identification problem: we can understand when environments induce certain abstractions, and so we can design this so that the network has the same abstractions as humans.
Modularity: In pretty much any selection environment, we see lots of obvious modularity. Biological species have cells and organs and limbs. Companies have departments. We might expect neural networks to be similar, but it is really hard to find modules in neural networks. We need to find the right lens to look through to find this modularity in neural networks. Aiming at this can lead us to really good interpretability.
DeepMind has both a ML safety team focused on near-term risks, and an alignment team that is working on risks from AGI. The alignment team is pursuing many different research avenues, and is not best described by a single agenda.
Some of the work they are doing is:
- Engaging with recent MIRI arguments.
- Rohin Shah produces the alignment newsletter.
- Publishing interesting research like the Goal Misgeneralization paper.
- Geoffrey Irving is working on debate as an alignment strategy: more detail here.
- Discovering agents, which introduces a causal definition of agents, then introduces an algorithm for finding agents from empirical data.
See Rohin's comment for more research that they are doing, including description of some that is currently unpublished so far.
The goal of this is to create a non-agentic AI, in the form of an LLM, that is capable of accelerating alignment research. The hope is that there is some window between AI smart enough to help us with alignment and the really scary, self improving, consequentialist AI. Some things that this amplifier might do:
- Suggest different ideas for humans, such that a human can explore them.
- Give comments and feedback on research, be like a shoulder-Eliezer
A LLM can be thought of as learning the distribution over the next token given by the training data. Prompting the LM is then like conditioning this distribution on the start of the text. A key danger in alignment is applying unbounded optimization pressure towards a specific goal in the world. Conditioning a probability distribution does not behave like an agent applying optimization pressure towards a goal. Hence, this avoids goodhart-related problems, as well as some inner alignment failure.
One idea to get superhuman work from LLMs is to train it on amplified datasets like really high quality / difficult research. The key problem here is finding the dataset to allow for this.
There are some ways for this to fail:
- Outer alignment: It starts trying to optimize for making the actual correct next token, which could mean taking over the planet so that it can spend a zillion FLOPs on this one prediction task to be as correct as possible.
- Inner alignment:
- An LLM might instantiate mesa-optimizers, such as a character in a story that the LLM is writing, and this optimizer might realize that they are in an LLM and try to break out and affect the real world.
- The LLM itself might become inner misaligned and have a goal other than next token prediction.
- Bad prompting: You ask it for code for a malign superintelligence; it obliges. (Or perhaps more realistically, capabilities).
Conjecture are aware of these problems and are running experiments. Specifically, an operationalization of the inner alignment problem is to make an LLM play chess. This (probably) requires simulating an optimizer trying to win at the game of chess. They are trying to use interpretability tools to find the mesa-optimizers in the chess LLM that is the agent trying to win the game of chess. We haven't ever found a real mesa-optimizer before, and so this could give loads of bits about the nature of inner alignment failure.
- AGI safety fundamentals (technical and governance) - Is the canonical AGI safety 101 course. 3.5 hours reading, 1.5 hours talking a week w/ facilitator for 8 weeks.
- Refine - A 3-month incubator for conceptual AI alignment research in London, hosted by Conjecture.
- AI safety camp - Actually do some AI research. More about output than learning.
- SERI ML Alignment Theory Scholars Program SERI MATS - Four weeks developing an understanding of a research agenda at the forefront of AI alignment through online readings and cohort discussions, averaging 10 h/week. After this initial upskilling period, the scholars will be paired with an established AI alignment researcher for a two-week ‘research sprint’ to test fit. Assuming all goes well, scholars will be accepted into an eight-week intensive scholars program in Berkeley, California.
- Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) - Brings together young researchers studying complex and intelligent behavior in natural and social systems.
- Safety and Control for Artificial General Intelligence - An actual AI Safety university course (UC Berkeley). Touches multiple domains including cognitive science, utility theory, cybersecurity, human-machine interaction, and political science.
See also, this spreadsheet of learning resources.