Reward Hacking: Concrete Problems in AI Safety Part 3

From Stampy's Wiki

Reward Hacking: Concrete Problems in AI Safety Part 3
Channel: Robert Miles
Published: 2017-08-12T19:24:08Z
Views: 76080
Likes: 3642
43 questions on this video!
QuestionYouTubeLikesAsked Discord?AnsweredBy
Karpata's question on Reward Hacking230false
Benjamin Brady's question on Reward Hacking24true
Bibbedibob's question on Reward Hacking7false
Kataquax's question on Reward Hacking3false
Benjamin Brady's question on Reward Hacking3false
Misium's question on Reward Hacking3true
Doublebrass's question on Reward Hacking2false
LordCAR's question on Reward Hacking2false
ShineOn1337's question on Reward Hacking1false
Kekscoreunlimited's question on Reward Hacking1false
Arinco3817's question on Reward Hacking1false
Alexito's World's question on Reward Hacking1false
Dukereg's question on Reward Hacking1false
Chad Smith's question on Reward Hacking0false
Alexander Korsunsky's question on Reward Hacking0false
4729 Zex's question on Reward Hacking0false
Tay Ko's question on Reward Hacking0false
Knightshousegames's question on Reward Hacking0false
Knightshousegames's question on Reward Hacking0false
Bombolo Bombazzo's question on Reward Hacking0false
Krmpfpks's question on Reward Hacking0false
Roberto Gallotta's question on Reward Hacking0false
TiagoTiago's question on Reward Hacking0false
Daniel Houck's question on Reward Hacking0false
Luke Mooney's question on Reward Hacking0true
Mason Hidari's question on Reward Hacking0false
Faustin Gashakamba's question on Reward Hacking0false
Steven Victor Neiman's question on Reward Hacking0false
Iwer Sonsch's question on Reward Hacking0false
Danielo515's question on Reward Hacking0false
Jiggy Potamus's question on Reward Hacking0false
NeatNit's question on Reward Hacking0false
RobertsMrtn's question on Reward Hacking0false
SirCutRy's question on Reward Hacking0false
Daniele Scotece's question on Reward Hacking0false
Sharad Richardet's question on Reward Hacking0false
Klaus Gartenstiel's question on Reward Hacking0false
Tanmay Khattar's question on Reward Hacking0false
Bleach Martini's question on Reward Hacking0false
400cc MIRUKU's question on Reward Hacking0false
Kmden Rt's question on Reward Hacking0true
Atizs's question on Reward Hacking0false


Sometimes AI can find ways to 'cheat' and get more reward than we intended by doing something unexpected.

The Concrete Problems in AI Safety Playlist:
The Computerphile video:
The paper 'Concrete Problems in AI Safety':

SethBling's channel:

With thanks to my excellent Patreon supporters:

Jordan Medina
FHI's own Kyle Scott
Jason Hise
David Rasmussen
James McCuen
Richárd Nagyfi
Ammar Mousali
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
The Guru Of Vision
Fabrizio Pisani
Alexander Hartvig Nielsen
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Peggy Youell
Konstantin Shabashov
Almighty Dodd
Matthias Meger
Scott Stevens
Emilio Alvarez
Benjamin Aaron Degenhart
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabilan Kabilan Kabilan Kabilan
Philip Alexander
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham


hi this video is part of a series
looking at the paper concrete problems
in AI safety I recommend you check out
the other videos in the series linked
but hopefully this video should still
make sense even if you haven't seen the
others also in case anyone is subscribed
to me and not to computer file for some
reason I recently did a video on the
computer file channel that's effectively
part of the same series that video
pretty much covers what I would say in a
video about the multi agent approaches
part of the paper so check that video
out if you haven't seen it linked in the
doobly-doo right today we're talking
about reward hacking so in the computer
file video I was talking about
reinforcement learning and the basics of
how that works you have an agent
interacting with an environment and
trying to maximize a reward so like your
agent might be pac-man and then the
environment would be the maze the
pac-man is in and the reward would be
the score of the game systems based on
this framework have demonstrated an
impressive ability to learn how to
tackle various problems and some people
are hopeful about this work eventually
developing into full general
intelligence but there are some
potential problems with using this kind
of approach to build powerful AI systems
and one of them is reward hacking so
let's say you've built a very powerful
reinforcement learning AI system and
you're testing it in Super Mario World
it can see the screen and take actions
by pressing buttons on the controller
and you've told it the address in memory
where the score is and set that as the
reward and what it does instead of
really playing the game is something
like this this is a video by sethbling
who is a legitimate wizard check out his
channel link in the doobly-doo
it's doing all of this weird looking
glitchy stuff and spin jumping all over
the place and also Mario is black now
anyway this doesn't look like the kind
of game playing behavior you're
expecting and you just kind of assume
that your AI is broken and then suddenly
the score part of memory is set to the
maximum possible value in the AI stops
entering inputs and just sits there
hang on ai what are you doing am i rock
no you're not wrong them are on because
it turns out that there are sequences of
inputs you can put into Super Mario
World that just totally breaks a game
and give you the ability to directly
edit basically any address in memory you
can use it to turn the game into flappy
bird if you want and your AI has just
used it to set its reward to maximum
this is reward hacking so what really
went wrong here
well you told it to maximize a value in
memory when what you really wanted was
for it to play the game the assumption
was that the only way to increase the
score value was to play the game well
but that assumption turned out to not
really be true any way in which your
reward function doesn't perfectly line
up with what you actually want the
system to do can cause problems and
things only get more difficult in the
real world where there's no handy score
variable we can use as a reward whatever
we want the agent to do we need some way
of translating that real-world thing
into a number and how do you do that
without the AI messing with it these
days the usual way to convert complex
messy real-world things that we can't
easily specify into machine
understandable values is to use a deep
neural network of some kind train it
with a lot of examples and then it can
tell if something is a cat or a dog or
whatever so it's tempting to feed
whatever real-world data we've got into
a neural net and train it on a bunch of
human supplied rewards for different
situations and then use the output of
that as the reward for the AI now as
long as your situation and your AI are
very limited that can work but as I'm
sure many of you have seen these neural
networks are vulnerable to adversarial
examples it's possible to find
unexpected inputs that cause the system
to give dramatically the wrong answer
the image on the right is clearly a
panda but it's misidentified as a given
this kind of thing could obviously cause
problems for any reinforcement learning
agent that's using the output of a
neural network as its reward function
and there's kind of an analogy to Super
Mario World here as well if we view the
game as a system designed to take a
sequence of inputs the players button
presses and output a number representing
how well that player is playing ie the
score then this glitchy hacking stuff
sethbling did can be thought of as an
adversarial example it's a carefully
chosen input that makes the system
output the wrong answer note that as far
as I know no AI system has independently
discovered how to do this to Super Mario
World yet because figuring out the
details of how all the bugs work just by
messing around with the game requires a
level of intelligence that's probably
beyond current AI systems but similar
stuff happens all the time and the more
powerful an AI system is the better it
is at figuring things out and the more
likely it is to
expected ways to cheat to get large
rewards and as AI systems are used to
tackle more complex problems with more
possible actions the agent can take and
more complexity in the environment it
becomes more and more likely that there
will be high reward strategies that we
didn't think of okay so there are going
to be unexpected sets of actions that
result in high rewards and maybe AI
systems will stumble across them but
it's worse than that because we should
expect powerful AI systems to actively
seek out these hacks even if they don't
know they exist why well they're the
best strategies with hacking you can set
your super mario world score to
literally the highest value that that
memory address is capable of holding is
that even possible in non cheating play
even if it is it'll take much longer
and look at the numbers on these images
again the left image of a panda is
recognized as a panda with less than 60
percent confidence and real photos of
Givens would probably have similar
confidence levels but the adversarial
example on the right is recognized as a
given with ninety nine point three
percent confidence if an AI system were
being rewarded for seeing given it would
love that right image to the neural net
that panda looks more like a given than
any actual photo of a given similarly
reward hacking can give much much higher
rewards than even the best possible
legitimate actions so seeking out reward
hacks would be the default behavior of
some kinds of powerful AI systems so
this is an AI design problem but is it a
safety problem our super mario world a I
just gave itself the highest possible
score and then sat there doing nothing
it's not what we wanted but it's not too
big of a deal right I mean we can just
turn it off and try to fix it but a
powerful general intelligence might
realize that the best strategy is an
actually hack your reward function and
then sit there forever with maximum
reward because you'll quickly get turned
off the best strategy is do whatever you
need to do to make sure that you won't
ever be turned off and then act your
reward function and that doesn't work
out well for us thanks to my excellent
patreon supporters these these people
you've now made this channel pretty much
financially sustainable I'm so grateful
to all of you and in this video I
particularly want to thank Bo and
who's been supporting me since May
you know now that the channel was hit
10,000 subscribers maybe it's actually
12,000 now I'm apparently now allowed to
use the YouTube space in London so I
hope to have some behind-the-scenes
stuff about that so patreon supporters
before too long
thanks again and I'll see you next time