Redshift's Answer to In "aligning AI with human values", which humans' values are we talking about?

Alignment is very broadly concerned with how to align an AI with any given set of values. For the purposes of research, it doesn't matter what those values are: so long as we can get the AI to sincerely hold them, we have succeeded at alignment. Once the problem of alignment is solved, the "human values" a given AI holds are those given to it by its creators (as there is no reason for anyone to create an AI that works against their own interests).

Optimistic views hold that it may be possible for all AI creators to coordinate, aligning their AIs only with a single, centrally agreed-upon definition of "human values," which could be determined by traditional human political organizations. Succeeding at this coordination would prevent, or at least reduce, the weaponization of AIs for competition between differing values.

More pessimistic views hold that this coordination is unlikely to succeed, and that just as different definitions of "human values" compete with one another today (through e.g. political conflict), AIs will likely be built by actors with differing values and will compete with one another on the same grounds. The exception is that this competition might end if one group gains a large enough advantage to carry out a pivotal act that "locks in" its set of values as the winner.

A good instance of this might look like a U.N.-sanctioned project constructing the first superintelligent AI, successfully aligned with human values roughly defined as "global peace and development". This AI might then take countermeasures to reduce the influence of bad AIs, e.g. by regulating further AI development or seizing compute from agencies developing bad AIs.

Bad outcomes might look similar to the above, but with AIs developed by extremists or terrorists taking over. Worse still would be a careless development group accidentally producing a misaligned AI, where we don't end up with "bad human values" (like one of the more oppressive human moralities) but with "non-human values" (like a world where only paperclips matter).

A common concern is that if a friendly AI doesn't carry out such a pivotal act, an opposing AI is likely to do so. Hence, there is a relatively common view that safe AI must not only be developed, but also deployed, in order to prevent possibly hostile AIs from arising.

There are also arguments against the "pivotal act" mentality, which instead promote political regulation as a better path toward friendly AI than leaving the responsibility to whichever firm finishes first.


Answer to: People talk about "aligning AI with human values", but humans don't all agree on one set of values. Whose values are we aligning the AI with?

Answer Info
Original by: redshift
