What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4

From Stampy's Wiki

What Can We Do About Reward Hacking?: Concrete Problems in AI Safety Part 4
Channel: Robert Miles
Published: 2017-09-24T12:09:54Z
Views: 82269
Likes: 3771
36 questions on this video!
QuestionYouTubeLikesAsked Discord?AnsweredBy
CD hugo's question on What Can We Do About Reward Hacking?9false
Tetraedri's question on What Can We Do About Reward Hacking?7true
Daniel Parks's question on What Can We Do About Reward Hacking?4true
Lemon Party's question on What Can We Do About Reward Hacking?4true
Bscutajar's question on What Can We Do About Reward Hacking?1false
Richard Collins's question on What Can We Do About Reward Hacking?1false
Thomas Waclav's question on What Can We Do About Reward Hacking?1false
Vavanade's question on What Can We Do About Reward Hacking?1false
Chris Daley's question on What Can We Do About Reward Hacking?0false
Red's question on What Can We Do About Reward Hacking?0false
Greg's question on What Can We Do About Reward Hacking?0false
D Whitehouse's question on What Can We Do About Reward Hacking?0false
Herp Derpingson's question on What Can We Do About Reward Hacking?0false
tomahzo's question on What Can We Do About Reward Hacking?0true
Desirae Pelletier's question on What Can We Do About Reward Hacking?0true
Yaakov grunsfeld's question on What Can We Do About Reward Hacking?0true
Kirby Armstrong's question on What Can We Do About Reward Hacking?0false
Nick Hounsome's question on What Can We Do About Reward Hacking?0trueAprillion's Answer to Nick Hounsome's question on What Can We Do About Reward Hacking?
Paul A's question on What Can We Do About Reward Hacking?0false
Volodymyr Frolov's question on What Can We Do About Reward Hacking?0true
RipleySawzen's question on What Can We Do About Reward Hacking?0true
Clayton Harting's question on What Can We Do About Reward Hacking?0true
Milorad Menjic's question on What Can We Do About Reward Hacking?0true
Davesoft's question on What Can We Do About Reward Hacking?0true
Manny Manito's question on What Can We Do About Reward Hacking?0false
Владимир Кузнецов Vovacat17's question on What Can We Do About Reward Hacking?0false
Thundermikeee's question on What Can We Do About Reward Hacking?0false
Ouroborus Seven's question on What Can We Do About Reward Hacking?0false
Brabham Freaman's question on What Can We Do About Reward Hacking?0false
-Lochy. P-'s question on What Can We Do About Reward Hacking?0false
Aditya Shankarling's question on What Can We Do About Reward Hacking?0false
James Donaghy's question on What Can We Do About Reward Hacking?0false
Some Guy's question on What Can We Do About Reward Hacking?0false
Pouty MacPotatohead's question on What Can We Do About Reward Hacking?0false
Nuada the silver hand's question on What Can We Do About Reward Hacking?0false

Description

Three different approaches that might help to prevent reward hacking.

New Side Channel with no content yet!: https://www.youtube.com/channel/UC4qH2AHly_RSRze1bUqSSNw
Where do we go now?: https://www.youtube.com/watch?v=vYhErnZdnso
Previous Video in the series: https://youtu.be/46nsTFfsBuc

The Concrete Problems in AI Safety Playlist: https://www.youtube.com/playlist?list=PLqL14ZxTTA4fEp5ltiNinNHdkPuLK4778
The Computerphile video: https://www.youtube.com/watch?v=4l7Is6vOAOA
The paper 'Concrete Problems in AI Safety': https://arxiv.org/pdf/1606.06565.pdf

With thanks to my excellent Patreon supporters:
https://www.patreon.com/robertskmiles

Steef
Sara Tjäder
Jason Strack
Chad Jones
Stefan Skiles
Katie Byrne
Ziyang Liu
Jordan Medina
Kyle Scott
Jason Hise
David Rasmussen
Heavy Empty
James McCuen
Richárd Nagyfi
Ammar Mousali
Scott Zockoll
Charles Miller
Joshua Richardson
Fabian Consiglio
Jonatan R
Øystein Flygt
Björn Mosten
Michael Greve
robertvanduursen
The Guru Of Vision
Fabrizio Pisani
A Hartvig Nielsen
Volodymyr
David Tjäder
Paul Mason
Ben Scanlon
Julius Brash
Mike Bird
Taylor Winning
Roman Nekhoroshev
Peggy Youell
Konstantin Shabashov
Dodd Almighty
DGJono
Matthias Meger
Scott Stevens
Emilio Alvarez
Michael Ore
Robert Bridges
Dmitri Afanasjev
Brian Sandberg
Einar Ueland
Lo Rez
C3POehne
Stephen Paul
Marcel Ward
Andrew Weir
Pontus Carlsson
Taylor Smith
Ben Archer
Ivan Pochesnev
Scott McCarthy
Kabs Kabs Kabs
Phil
Philip Alexander
Christopher
Tendayi Mawushe
Gabriel Behm
Anne Kohlbrenner
Jake Fish
Jennifer Autumn Latham
Filip
Bjorn Nyblad
Stefan Laurie
Tom O'Connor
Krethys
PiotrekM
Jussi Männistö
Matanya Loewenthal
Wr4thon

Transcript

hi welcome back to this video series on
the paper concrete problems in AI safety
in the previous video I talked about
reward hacking some of the ways that can
happen and some related ideas like
adversarial examples where it's possible
to find unexpected inputs to the reward
system that falsely result in very high
reward partially observed goals where
the fact that the reward system has to
work with imperfect information about
the environment incentivizes an agent to
deceive itself by compromising its own
sensors to maximize its reward wire
heading where the fact that the reward
system is a physical object in the
environment means that the agent can get
very high reward by directly physically
modifying the reward system itself and
good hearts law the observation that
when a measure becomes a target it
ceases to be a good measure there's a
link to that video in the doobly-doo and
I'm going to start this video with a bit
about responses and reactions to that
one there are a lot of comments about my
school exams example of good hearts law
some people sent me this story that was
in the news when that video came out a
school kicked out some students because
they weren't getting high enough marks
it's a classic example in order to
increase the school's average marks ie
the numbers that are meant to measure
how well the school teaches students the
school refused to teach some students
this is extra funny because that's
actually my school that's the school I
went to anyway some people pointed out
that the school exams example of good
hearts law has been separately
documented as Campbell's law people
pointed out other related ideas like the
Cobra effect perverse incentives the law
of unintended consequences there are a
lot of people holding different parts of
this elephant as it were all of these
concepts are sort of related and that's
true of the examples I used as well you
know I used a hypothetical super mario
bot that exploited glitches in the game
to give itself maximum score as an
example of reward hacking but you could
call it wire heading since this score
calculating part of the reward system
sort of exists in the game environment
and it's sort of being replaced the
cleaning robot with a bucket on its head
was an example of partially observed
goals but you could call it an instance
of good hearts law since amount of mess
observed stops being a good measure of
amount of mess there is once it's used
as a target so all of these examples
might seem quite different on the
surface but once you
look at them through the framework of
reward hacking you can see that there's
a lot of similarities and overlap the
paper proposes ten different approaches
for preventing reward hacking some will
work only for certain specific types for
example very soon after the problem of
adversarial examples in neural networks
was discovered people started working on
better ways to train their nets to make
them resistant to this kind of thing
that's only really useful for
adversarial examples but given the
similarities between different reward
hacking issues maybe there are some
things we could do that would prevent
lots of these possible problems at once
careful engineering is one you're AI
can't hack its reward function by
exploiting bugs in your code if there
are no bugs in your code there's a lot
of work out there on ways to build
extremely reliable software like there
are ways you can construct your programs
so that you're able to formally verify
that their behavior will have certain
properties you can prove your software's
behavior with absolute logical certainty
but only given certain assumptions about
for example the hardware that the
software will run on those assumptions
might not be totally true especially if
there's a powerful AGI doing its best to
make them not true of course careful
engineering isn't just about formal
verification there are lots of different
software testing and quality assurance
systems and approaches out there and I
expect there's a lot a AI safety can
learn from people working in aerospace
computer security anywhere that's very
focused on writing extremely reliable
software it's something we can work on
for AI in general but I wouldn't rely on
this as the main line of defense against
reward hacking another approach is
adversarial reward functions so part of
the problem is that the agent and the
reward system are in this kind of
adversarial relationship it's like
they're competing the agent is trying to
trick the reward system into giving it
as much reward as possible when you have
a powerful intelligent agent up against
a reward system that's a simple passive
piece of software or hardware you can
expect the agent to reliably find ways
to tricks subvert or destroy the reward
system so maybe if the reward system
were more powerful more of an agent in
its own right it would be harder to
trick and more able to defend itself if
we can make the reward agent in some
sense smart
or more powerful than the original agent
it could be able to keep it from remote
hacking though then you have the problem
of ensuring that the reward agent is
safe as well the paper also mentions the
possibility of having more than two
agents so that they can all watch each
other and keep each other in check
there's kind of an analogy here to the
way that the legislative the executive
and the judiciary branches of government
keep one another in check ensuring that
the government as a whole always serves
the interests of the citizens but
seriously I'm not that hopeful about
this approach firstly it's not clear how
it handles self improvement you can't
have any of the agents being
significantly more powerful than the
others and that gets much harder to make
sure of if the system is modifying and
improving itself and in general I don't
feel that comfortable with this kind of
internal conflict between powerful
systems it kind of feels like you have a
problem with your toaster incinerating
the toast with a flamethrower so you add
another system that blasts the bread
with a powerful jet of liquid nitrogen
as well so that the two opposing systems
can keep each other in check
instead of systems that want to hack
their award functions but figure they
can't get away with it I'd rather a
system that didn't want to mess with its
reward function in the first place this
next approach has a chance of providing
that and approach the paper calls model
look ahead you might remember this from
a while back on computer file you have
kids right suppose I were to offer you a
pill or something you could take and
this pill will like completely rewire
your brain so that you would just
absolutely love to like kill your kids
right whereas right now what you want is
like very complicated and quite
difficult to achieve and it's hard work
for you and you probably never gonna be
done you're never gonna be truly happy
right in life nobody is you can't
achieve everything you want I said this
case it just changes what you want what
you want is to call you kids and if you
do that you will be just perfectly happy
and satisfied with life right okay you
want to take this pill I don't want to
do it and so not only will you not take
that pill you will probably fight pretty
hard to avoid having that pill
administered to you yeah because it
doesn't matter how that future version
of you would feel you know that right
now you love your kids
and you're not gonna take any action
right now which leads to them coming to
harm so it's the same thing if you have
an AI that for example values stamps
values collecting stamps and you go oh
wait hang on a second I didn't quite do
that right let me just go in and change
this so that you don't like stamps quite
so much it's gonna say but the only
important thing is stamps if you change
me I'm not gonna collect as many stamps
which is something I don't want there's
a general tendency for AGI to try and
prevent you from modifying it once it's
running in almost any situation being
given a new utility function is gonna
write very low on your current utility
function so there's an interesting
contrast here the reinforcement learning
agents we're talking about in this video
might fight you in order to change their
reward function but the utility
maximizes we were talking about in that
video might fight you in order to keep
their utility function the same the
utility maximizers reasons that changing
its utility function will result in low
utility outcomes according to its
current utility function so it doesn't
allow it to change but the reinforcement
learners utility function is effectively
just to maximize the output of the
reward system so it has no problem with
modifying that to get high reward model
look EDD tries to give a reinforcement
learning agent some of that
forward-thinking ability by having the
reward system give rewards not just for
the agents actual actions and it's
observations of actual world states but
for the agents planned actions and
anticipated future world states so the
agent receives negative reward for
planning to modify its reward system
when the robot considers the possibility
of putting a bucket on its head it
predicts that this would result in the
messes staying there and not being
cleaned up and it receives negative
reward teaching it not to implement that
kind of plan there are several other
approaches in the paper but that's all
for now
[Music]
I want to end with a quick thanks to all
my wonderful Patriots these people and
in this video I'm especially thanking
the Guru of vision I don't know who you
are or why you call yourself that
but you've supported the channel since
May and I really appreciate it anyways
thanks to my supporters and that I'm
starting up a second channel for things
that aren't related to AI safety I'm
still going to produce just as much AI
safety content on this channel and I'll
use the second channel for quicker fun
stuff that doesn't need as much research
I want to get into the habit of making
lots of videos quickly which should
improve my ability to make quality AI
safety content quickly as well if you
want an idea of what kind of stuff I'm
gonna put on the second channel check
out a video I made at the beginning of
this channel titled where do we go now
there's a link to that video in the
description along with the link to the
new second channel so head on over and
subscribe if you're interested and of
course my patrons will get access to
those videos before the rest of the
world just like they do with this
channel thanks again and I'll see you
next time the interests of Pacific right
take 17