A utility function assigns numerical values ("utilities") to outcomes, in such a way that outcomes with higher utilities are always preferred to outcomes with lower utilities.
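As a minimal sketch (a hypothetical toy, not from the tag itself): a utility function is just a map from outcomes to numbers, and an agent acting on it prefers whichever outcome scores higher and picks the highest-scoring option available.

```python
# Hypothetical toy example: a utility function as a map from outcomes to numbers.
utilities = {"apple": 2.0, "banana": 1.0, "cherry": 3.0}

def prefers(a, b):
    """Outcome a is preferred to outcome b iff it has higher utility."""
    return utilities[a] > utilities[b]

def choose(available):
    """An agent acting on this utility function picks the highest-utility option."""
    return max(available, key=utilities.get)

assert prefers("cherry", "apple")
assert choose(["apple", "banana"]) == "apple"
```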
Utility functions do not work very well in practice for individual humans. Human drives are not coherent, nor is there any reason to expect them to be (Thou Art Godshatter), and even people with a strong interest in the concept have trouble working out even a rough approximation of their own utility function (Post Your Utility Function). Furthermore, humans appear to evaluate utility and disutility separately rather than summing them, so adding one to the other does not predict their behavior accurately. This makes humans highly exploitable.
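A hedged toy illustration of that last point, with made-up numbers: if gains and losses are booked separately, with losses weighted more heavily, the same net outcome can be valued very differently depending on how it is split up, which is exactly the kind of lever an exploiter can pull.

```python
# Hypothetical toy: gains and losses evaluated separately instead of summed first.
LOSS_WEIGHT = 2.0  # made-up loss-aversion factor

def felt_value(changes):
    """Sum gains and losses separately, weighting losses more heavily."""
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    return gains - LOSS_WEIGHT * losses

# The same net +5 outcome, framed two different ways:
print(felt_value([+5]))       # 5.0 -> looks clearly worth taking
print(felt_value([+10, -5]))  # 0.0 -> feels like a wash, may be rejected
# A single utility over final outcomes would value both identically, so an agent
# like this can be steered purely by how trades are framed.
```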
However, utility functions can be a useful model of human behavior in aggregate, e.g. in economics.
The VNM Theorem tag is likely a strict subtag of the Utility Functions tag: the VNM theorem establishes when preferences can be represented by a utility function, so any post discussing it also concerns utility functions, but a post discussing utility functions may or may not discuss the VNM theorem or its axioms.
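For reference, a compact statement of the theorem in its standard form (not quoted from either tag):

```latex
% Von Neumann-Morgenstern representation theorem (standard statement):
% a preference relation \succeq over lotteries satisfies completeness,
% transitivity, continuity, and independence if and only if there exists
% a utility function u such that preferences go by expected utility:
\[
  L_1 \succeq L_2 \iff \mathbb{E}_{L_1}[u(x)] \ge \mathbb{E}_{L_2}[u(x)],
\]
% with u unique up to a positive affine transformation u' = a u + b, a > 0.
```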
I think an AI inner-aligned to optimize a utility function of "maximize happiness minus suffering" is likely to do something like this.
By inner-aligned I mean the AI is trying to do the thing we trained it to do, whether or not that is what we actually want.
"Aligned to what" is the outer alignment problem which is where the failure in this example is. There is a lot of debate on what utility functions are safe or desirable to maximize, and if human values can even be described by a utility function.