|Alignment Forum Tag|
Value learning is a proposed method for incorporating human values in an AGI. It involves the creation of an artificial learner whose actions consider many possible set of values and preferences, weighed by their likelihood. Value learning could prevent an AGI of having goals detrimental to human values, hence helping in the creation of Friendly AI.
Although there are many ways to incorporate human values in an AGI (e.g.: Coherent Extrapolated Volition, Coherent Aggregated Volition and Coherent Blended Volition), this method is directly mentioned and developed in Daniel Dewey’s paper ‘Learning What to Value’. Like most authors, he assumes that human’s goals would not naturally occur in an artificial agent and should be enforced in it. First, Dewey argues against the use of a simple use of reinforcement learning to solve this problem, on the basis that this lead to the maximization of specific rewards that can diverge from value maximization. For example, even if we forcefully engineer the agent to maximize those rewards that also maximize human values, the agent could alter its environment to more easily produce those same rewards without the trouble of also maximizing human values (i.e.: if the reward was human happiness it could alter the human mind so it became happy with anything).
To solve all these problems, Dewey proposes a utility function maximizer, who considers all possible utility functions weighted by their probabilities: "[W]e propose uncertainty over utility functions. Instead of providing an agent one utility function up front, we provide an agent with a pool of possible utility functions and a probability distribution P such that each utility function can be assigned probability P(Ujyxm) given a particular interaction history [yxm]. An agent can then calculate an expected value over possible utility functions given a particular interaction history" He concludes saying that although it solves many of the mentioned problems, this method still leaves many open questions. However it should provide a direction for future work.
Some have proposed that we teach machines a moral code with case-based machine learning. The basic idea is this: Human judges would rate thousands of actions, character traits, desires, laws, or institutions as having varying degrees of moral acceptability. The machine would then find the connections between these cases and learn the principles behind morality, such that it could apply those principles to determine the morality of new cases not encountered during its training. This kind of machine learning has already been used to design machines that can, for example, detect underwater mines after feeding the machine hundreds of cases of mines and not-mines.
There are several reasons machine learning does not present an easy solution for Friendly AI. The first is that, of course, humans themselves hold deep disagreements about what is moral and immoral. But even if humans could be made to agree on all the training cases, at least two problems remain.
The first problem is that training on cases from our present reality may not result in a machine that will make correct ethical decisions in a world radically reshaped by superintelligence.
The second problem is that a superintelligence may generalize the wrong principles due to coincidental patterns in the training data. Consider the parable of the machine trained to recognize camouflaged tanks in a forest. Researchers take 100 photos of camouflaged tanks and 100 photos of trees. They then train the machine on 50 photos of each, so that it learns to distinguish camouflaged tanks from trees. As a test, they show the machine the remaining 50 photos of each, and it classifies each one correctly. Success! However, later tests show that the machine classifies additional photos of camouflaged tanks and trees poorly. The problem turns out to be that the researchers’ photos of camouflaged tanks had been taken on cloudy days, while their photos of trees had been taken on sunny days. The machine had learned to distinguish cloudy days from sunny days, not camouflaged tanks from trees.
Thus, it seems that trustworthy Friendly AI design must involve detailed models of the underlying processes generating human moral judgments, not only surface similarities of cases.
Suppose we tell the AI: “Cure cancer – and look, we know there are lots of ways this could go wrong, but you’re smart, so instead of looking for loopholes, cure cancer the way that I, your programmer, want it to be cured”.
Remember that the superintelligence has extraordinary powers of social manipulation and may be able to hack human brains directly. With that in mind, which of these two strategies cures cancer most quickly? One, develop medications and cure it the old-fashioned way? Or two, manipulate its programmer into wanting the world to be nuked, then nuke the world, all while doing what the programmer wants?
19th century philosopher Jeremy Bentham once postulated that morality was about maximizing human pleasure. Later philosophers found a flaw in his theory: it implied that the most moral action was to kidnap people, do brain surgery on them, and electrically stimulate their reward system directly, giving them maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily, humans have common sense, so most of Bentham’s philosophical descendants have abandoned this formulation.
Superintelligences do not have common sense unless we give it to them. Given Bentham’s formulation, they would absolutely take over the world and force all humans to receive constant brain stimulation. Any command based on “do what we want” or “do what makes us happy” is practically guaranteed to fail in this way; it’s almost always easier to convince someone of something – or if all else fails to do brain surgery on them – than it is to solve some kind of big problem like curing cancer.