Answers written by this user
"Cause the least change to the environment beyond what is needed to accomplish the goal" still doesn't stop an AGI from killing every lifeform on Earth if it is sure that is required to accomplish its goal. It might, however, prevent accidentally wiping out humanity in some specific cases. That's probably the showstopper problem with the idea. There is a good article on limiting the impact of all actions, which is a much safer way to operate, over here: https://www.alignmentforum.org/tag/impact-measures It explores the trade-offs between useful efficiency and harmful actions. Ultimately, impact is a hard thing to measure, in the same way that "only do things we want you to" would be a hard command to codify for an AGI. Different people would place different values on different levels of impact.
The wonderful thing about humanity is how dang many of us there are. While it is easy to say "Why are you worrying about X when there is Y?", it must also be remembered that humanity is populous enough to work on two (or more) problems at the same time. The dangerous part, whether it's a human mind uploaded to a computer, a brain wired to a computer, or an AGI, is when a single actor becomes a super-intelligence. Humans, as we all know, are no better aligned to humanity's goals than the most worrying AGI designs. Making anything into a super-intelligence gives it ultimate power, and there's a very well-known saying regarding that (power corrupts; absolute power corrupts absolutely). A "lazy" AGI (one that tries to do only things a reasonable human might try to implement) has been discussed in this video: https://www.youtube.com/watch?v=gdKMG6kTl6Y Side note: humans are a general intelligence, but we are not (to the best of our knowledge and evidence) artificial. Ultimately, we shouldn't "fear" either. We should attempt to take that emotion out of the equation and examine both general intelligence (humans) and artificial general intelligence (AGI) with caution and care.
This seems like a possible solution, but even barely scratching the surface reveals why it would be a very bad idea. For example, what if the first AGI was perfectly aligned, but you released a second one to turn it off, and the second one (realizing humans would just remake the first) decided that killing all humans first was the best way ahead? At its core, the alignment problem could end up being a dice roll anyway, with a (hopefully) small chance of going bad; we don't want to keep rolling those dice again and again. In your example, releasing a second AGI would literally double the chance of ending up with a bad one.
This typically leads to a huge game of whack-a-mole. Say you make the objective appear in different sections of the map each time; then deployment shows the objective above the agent 10% of the time, and the AI has no notion of how to perceive that. Every time you adjust your test data and rerun your training, you find another slight problem. What needs to be done is to teach the AI to recognize the objective no matter where it is, effectively decoupling the reward from the location.
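The coupled-versus-decoupled distinction can be sketched as a toy reward function. This is only an illustration; the coordinates and function names are invented, not from any real training setup:

```python
# Toy sketch: reward tied to a location vs. tied to the objective itself.
# During training the objective always happens to sit at grid cell (9, 9).

def coupled_reward(agent_pos, objective_pos):
    # What the AI effectively learned: "go to (9, 9)".
    return 1.0 if agent_pos == (9, 9) else 0.0

def decoupled_reward(agent_pos, objective_pos):
    # What we intended: "go to the objective, wherever it is".
    return 1.0 if agent_pos == objective_pos else 0.0

# Training: objective at (9, 9), so both rewards agree and the bug is invisible.
print(coupled_reward((9, 9), (9, 9)))    # 1.0
print(decoupled_reward((9, 9), (9, 9)))  # 1.0

# Deployment: objective moved to (2, 7) and the coupled proxy silently fails.
print(coupled_reward((9, 9), (2, 7)))    # 1.0: rewarded for the wrong thing
print(decoupled_reward((2, 7), (2, 7)))  # 1.0: rewarded for the right thing
```

Because the two functions are indistinguishable on the training distribution, no amount of rerunning training exposes the difference; only varying the objective's position (or rewriting the reward) does.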
"what to do" according to Google Translate.
Acting stupid could be entirely part of an actor's psychological fight with another actor, for example feigning a tell in a game of poker. The answer to your first question is, therefore, that you cannot tell without gathering more information, but you should definitely plan for both eventualities, particularly if you're an actor they're interacting with. Occam's Razor only hints at what is more likely given a limited set of information.
Our users have found 12 differences between the two images at 4:17 and an additional 3 at 4:19. I believe this puzzle was far easier than the Inner Alignment problem.
Seems very much like "raising AI like kids". https://www.youtube.com/watch?v=eaYIU6YXr3w The other issue involved here is the AI could—instead of committing to learning human values by osmosis—emulate the desired values long enough to be let off the leash.
I think there is a chance that it is impossible to align AGI. I would personally say it's a small one, but I can't predict that with any certainty. Trying to find a solution is still very important, in order to at least reduce the chances that an unaligned AGI gets developed.
This (the Stop Button Problem) has been covered in the excellent series by Rob over on the Computerphile channel: https://youtu.be/3TYT1QfdfsM
Animals tend to be far easier to anticipate. The structure of their brains and physiology tends to ensure they think and act along certain lines in the same way that humans will think and act. An AI is a complete unknown until you try throwing different scenarios at it to ensure it isn't taking shortcuts to game its way through testing/training. You would have many of the same problems brought up in a previous video: Why Not Just: Raise AI Like Kids? This explores the differences between an artificial intelligence and an evolved intelligence (like animals have).
If you're lucky, you spend the rest of your life drawing picturesque scenes on little tiny pieces of paper.
For the first question, yes—an easy example to reach for would be an AI to play the game "Cheat": https://en.wikipedia.org/wiki/Cheat_(game) . For the second question—you can, but the AI could lie about being honest.
You would start to run into the whack-a-mole problem. Basically, whenever you make a hard "don't ever do X" rule, you will absolutely wind up having to make dozens of exceptions each time the AI works around said rule. For example:

- Make a medical research AI and program it to not harm living creatures.
- The AI halts, since any action it takes will cause harm to at least one single-celled organism.
- You make an exception for anything under a few hundred cells.
- The AI creates a new medication that has a side effect of killing gut flora/fauna; anyone who takes it dies of malnutrition.
- You make an exception to the exception for things living inside humans.
- The AI halts when trying to make a de-worming drug, because it cannot harm things living in humans.
- Etc.
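The exception cascade above can be caricatured in code. Everything here (the cell-count threshold, the field names) is invented for illustration; the point is only that each patch is another bolted-on special case, and the rule never converges on what we actually meant:

```python
# Toy sketch of a hard "don't harm living creatures" rule accumulating patches.

def forbidden_v1(victim):
    # v1: "Never harm living creatures." Halts all medical research.
    return victim["alive"]

def forbidden_v2(victim):
    # v2: exception for organisms under a few hundred cells.
    # Now gut flora are fair game, so the malnutrition drug slips through.
    return victim["alive"] and victim["cells"] >= 300

def forbidden_v3(victim):
    # v3: exception to the exception, protecting anything living inside humans.
    # Now the de-worming drug is blocked, because worms live inside humans.
    return victim["alive"] and (victim["cells"] >= 300 or victim["inside_human"])

gut_bacterium = {"alive": True, "cells": 1, "inside_human": True}
tapeworm = {"alive": True, "cells": 10_000, "inside_human": True}

print(forbidden_v1(gut_bacterium))  # True:  v1 forbids everything
print(forbidden_v2(gut_bacterium))  # False: v2 lets the gut-flora killer through
print(forbidden_v3(tapeworm))       # True:  v3 blocks de-worming drugs
```

Each version fixes the last failure and creates the next one, which is exactly the whack-a-mole dynamic: the rule is a list of patches, not an understanding of harm.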
These problems arise precisely because an AI, at least a current one, cannot adapt its behavior to even slightly changed circumstances. A human wouldn't stand at the end point where a coin would normally be if a coin were visible elsewhere on the screen. Ultimately, though, this shows that while we intend to teach an AI something, and believe we have, it can quite often be pursuing a different goal: one that solved the problems in testing but is revealed in production to be not what we intended to teach. A real-world example of this misalignment is when YouTube created an AI to identify animal-fighting videos. They thought they had trained it to find animals fighting each other in an arena; what they had was an AI that matched any fighting in an arena, even robot-combat contests like Robot Wars and BattleBots. A human would be extremely unlikely to make this mistake.
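The arena mistake is a spurious-correlation shortcut, and it can be shown with a deliberately tiny toy dataset (all of the clips and features below are invented; this is not YouTube's actual system):

```python
# Toy sketch: a label that is perfectly predicted by an accidental feature.
# In every training clip, animal fighting happened to be filmed in an arena.

training = [
    {"arena": True,  "animals": True,  "label": "animal_fighting"},
    {"arena": True,  "animals": True,  "label": "animal_fighting"},
    {"arena": False, "animals": True,  "label": "ok"},  # wildlife documentary
    {"arena": False, "animals": False, "label": "ok"},  # cooking video
]

def shortcut_classifier(clip):
    # The simplest rule consistent with the training set: flag any arena.
    return "animal_fighting" if clip["arena"] else "ok"

# 100% training accuracy, so the shortcut looks like success.
print(all(shortcut_classifier(c) == c["label"] for c in training))  # True

# Deployment: Robot Wars footage is an arena with no animals, flagged anyway.
robot_wars = {"arena": True, "animals": False}
print(shortcut_classifier(robot_wars))  # "animal_fighting"
```

Nothing in the training signal distinguishes "arena" from "animal fighting", so the learner has no reason to prefer the intended concept over the cheaper proxy; only counterexamples (arenas without animals, animal fights without arenas) force the distinction.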
The video in question, and the time period within it, is https://youtu.be/XOxxPcy5Gr4?t=267 in the lower left corner. The AI did try to make text, but looking at the transforms, it was unable to produce actual characters, let alone words. I suspect a larger training set of lolcats would be required to teach it both what a cat looks like and how to write captions in valid English.
Yes, people are thinking about biologically inspired alignment strategies, for example steve2152 on LessWrong: https://www.lesswrong.com/posts/DWFx2Cmsvd4uCKkZ4/inner-alignment-in-the-brain There is also the idea of whole brain emulation: https://www.lesswrong.com/tag/whole-brain-emulation It's not clear that these avenues will go anywhere good, since evolution did not succeed in aligning us with its 'goals' despite massive resources, but it is worth investigating.
Worst-case scenarios are many, and all of them are terrible for humanity (of course, for the AI they wouldn't be terrible or worst-case). Wiping out all humans, enslaving all humans, realigning humanity's goals to its own: these barely scratch the surface of the flavors of "worst-case scenario" a superintelligent AI agent could execute.
Trying to hide information from an AGI is almost certainly not an avenue towards safety - if the agent is better at reasoning than us, it is likely to derive information relevant to safety considerations that we wouldn't think to hide. It is entirely appropriate, then, to use thought experiments like these where the AGI has such a large depth of information, because our goal should be to design systems that behave safely even in such permissive environments.
The deceptive behavior is an emergent result of the hypothetical mesa-optimiser's well-developed but misaligned long-term planning capabilities acting under the incentive structure presented by the test environment. The premise is that it's a general intelligence: it doesn't have to 'learn' deception specifically; it can figure out how to do it on its own.
It certainly would be more complicated than making a utility maximizer, yes, and likely beyond our current machine learning techniques (as humans are generally intelligent, a machine capable of predicting human responses to stimuli would itself be an AGI). The goal isn't to make AGI easier to produce, but safer to operate.
Alignment failure, at its core, is any time an AI's output deviates from what we intended. We have already witnessed alignment failures in simple AIs; mostly, these amount to correlation being mistaken for causation. A good example was an AI built by YouTube to recognize animals being forced to fight for sport. The videos given to the AI were always set in some kind of arena, so the AI drew the simplest conclusion and matched videos containing similar arenas, such as robot combat tournaments.
None of Damaged's questions have been answered yet.