|Alignment Forum Tag|
Human Values are the things we care about, and would want an aligned superintelligence to look after and support. It is suspected that true human values are highly complex, and could be extrapolated into a wide variety of forms.
An actually good solution to AI alignment might look like a superintelligence that understands, agrees with, and deeply believes in human morality.
You wouldn’t have to command a superintelligence like this to cure cancer; it would already want to cure cancer, for the same reasons you do. But it would also be able to compare the costs and benefits of curing cancer with those of other uses of its time, like solving global warming or discovering new physics. It wouldn’t have any urge to cure cancer by nuking the world, for the same reason you don’t have any urge to cure cancer by nuking the world – because your goal isn’t to “cure cancer”, per se, it’s to improve the lives of people everywhere. Curing cancer the normal way accomplishes that; nuking the world doesn’t. This sort of solution would mean we’re no longer fighting against the AI – trying to come up with rules so smart that it couldn’t find loopholes. We would be on the same side, both wanting the same thing.
It would also mean that the CEO of Google (or the head of the US military, or Vladimir Putin) couldn’t use the AI to take over the world for themselves. The AI would have its own values and be able to agree or disagree with anybody, including its creators.
It might not make sense to talk about “commanding” such an AI. After all, any command would have to go through its moral system. Certainly it would reject a command to nuke the world. But it might also reject a command to cure cancer, if it thought that solving global warming was a higher priority. For that matter, why would one want to command this AI? It values the same things you value, but it’s much smarter than you and much better at figuring out how to achieve them. Just turn it on and let it do its thing.
We could still treat this AI as having an open-ended maximizing goal. The goal would be something like “Try to make the world a better place according to the values and wishes of the people in it.”
The only problem with this is that human morality is very complicated, so much so that philosophers have been arguing about it for thousands of years without much progress, let alone anything specific enough to enter into a computer. Different cultures and individuals have different moral codes, such that a superintelligence following the morality of the King of Saudi Arabia might not be acceptable to the average American, and vice versa.
One solution might be to give the AI an understanding of what we mean by morality – “that thing that makes intuitive sense to humans but is hard to explain”, and then ask it to use its superintelligence to fill in the details. Needless to say, this suffers from various problems – it has potential loopholes, it’s hard to code, and a single bug might be disastrous – but if it worked, it would be one of the few genuinely satisfying ways to design a goal architecture.
Until AI doesn't exceed human capabilities, we could do that.
But there is no reason why AI capabilities would stop at the human level. Systems more intelligent than us, could think of several ways to outsmart us, so our best bet is to have them as closely aligned to our values as possible.
Ideally, it would be aligned to everyone's shared values. This is captured in the "coherent extrapolated volition" idea, which is meant to be the holy grail of alignment. The problem is that it's extremely hard to implement it.
We could divide the alignment problem into two subproblems: aligning AI to its creators, and aligning those creators to the general population. Let's assume optimistically that the first one is solved. Now, we still can have a situation where the creators want something that's harmful for the rest, for example when they are a for-profit company whose objective is to maximize those profits, regardless of the externalities.
One approach is to crowdsource the values for AI, like in the moral machineexample, where people are faced with a moral dilemma and have to choose which action to choose. This data could then used to train the AI. One problem with such approach is that people are prone to lots of cognitive biases, and their answers won't be fully rational. The AI would then align to what people say they value, and not to what they actually value, which with a superintelligent system may be catastrophic. The AI should be aware of this fact and don't take what people say at face value, but try to infer their underlying values. This is an active area of study.
For some, the problem of aligning the AI creators with the rest of the people, is just as hard or even harder, than aligning those creators with AI. The solution could require passing some law or building some decentralized system.
Optimistic views might hold that it is possible to coordinate between all AI creators to align their AIs only with a central agreed-upon definition of "human values," which could be determined by traditional human political organizations. Succeeding at this coordination would prevent (or at least, reduce) the weaponization of AIs toward competition between these values.
More pessimistic views hold that this coordination is unlikely to succeed, and that just as today different definitions of "human values" compete with one another (through e.g. political conflicts), AIs will likely be constructed by actors with different values and will compete with one another on the same grounds. The exception being that this competition might end if one group gains enough advantage to carry out a Pivotal Act that can "lock-in" their set of values as winner.
We could imagine a good instance of this might look like a U.N.-sanctioned project constructing the first super-intelligent AI, successfully aligned with the human values roughly defined as "global peace and development". This AI might then perform countermeasures to reduce the influence of bad AIs by e.g. regulating further AI development, or seizing compute power from agencies developing bad AIs.
Bad outcomes might look similar to the above, but with AIs developed by extremists or terrorists taking over. Worse still would be a careless development group accidentally producing a maligned AI, where we don't end up with "bad human values" (like one of the more oppressive human moralities), we just end up with "non-human values" (like where only paperclips matter).
A common concern is that if a friendly AI doesn't carry this out, then an opposition AI is likely to do so. Hence, there is a relatively common view that safe AI not only must be developed, but must be deployed to prevent possibly hostile AIs from arising.There are also arguments against "Pivot Act" mentality which promote political regulation as a better path toward friendly AI than leaving the responsibility to the first firm to finish.
Unanswered canonical questions