We Were Right! Real Inner Misalignment

From Stampy's Wiki

Channel: Robert Miles
Published: 2021-10-10
Views: 42844
Likes: 5400
165 questions on this video!
Question | Likes | Asked on Discord? | Answered by
Penny Lane's question on Real Inner Misalignment | 3 | false
Huntracony's question on Real Inner Misalignment | 1 | false
Luca Ruzzola's question on Real Inner Misalignment | 1 | false
Themrus's question on Real Inner Misalignment | 1 | false
6c33 3374's question on Real Inner Misalignment | 1 | false
BarbarianGod's question on Real Inner Misalignment | 1 | false
Comet AC's question on Real Inner Misalignment | 1 | true
Cubba's question on Real Inner Misalignment | 1 | false
YuureiInu's question on Real Inner Misalignment | 0 | false
Vulkanodox's question on Real Inner Misalignment | 0 | false
androkguz's question on Real Inner Misalignment | 0 | true
dgSolidarity's question on Real Inner Misalignment | 0 | true
dusparr's question on Real Inner Misalignment | 0 | false
garthbartin's question on Real Inner Misalignment | 0 | true
insatiablycivil's question on Real Inner Misalignment | 0 | false
insidetrip101's question on Real Inner Misalignment | 0 | true
inyobill's question on Real Inner Misalignment | 0 | false
kasuha's question on Real Inner Misalignment | 0 | false
maccollo's question on Real Inner Misalignment | 0 | false
maccollo's question on Real Inner Misalignment | 0 | true | Damaged's Answer to Maccollo's question on Video Title Unknown
misterjeckyll's question on Real Inner Misalignment | 0 | false
scarabbi's question on Real Inner Misalignment | 0 | false
svnhddbst's question on Real Inner Misalignment | 0 | true
theApeShow's question on Real Inner Misalignment | 0 | true
transkryption's question on Real Inner Misalignment | 0 | true
wifightit's question on Real Inner Misalignment | 0 | true
Æther's question on Real Inner Misalignment | 0 | true
Георги Георгиев's question on Real Inner Misalignment | 0 | false
Qeith Wreid's question on Real Inner Misalignment | 0 | true
Mohandas Jung's question on Real Inner Misalignment | 0 | false
Michael Sänger's question on Real Inner Misalignment | 0 | false
edibleapeman's question on Real Inner Misalignment | 0 | false
Xavier Zara's question on Real Inner Misalignment | 0 | false
tt563's question on Real Inner Misalignment | 0 | false
Milan Mašát's question on Real Inner Misalignment | 0 | true | Damaged's Answer to Milan Mašát's question on Video Title Unknown
Nicolò Scarpa's question on Real Inner Misalignment | 0 | false
Eugene D's question on Real Inner Misalignment | 0 | false
Matt Pharois's question on Real Inner Misalignment | 0 | false
Alcohol related's question on Real Inner Misalignment | 0 | false
李立峰's question on Real Inner Misalignment | 0 | true
Andrew Friedrichs's question on Real Inner Misalignment | 0 | true
Karl Lewis's question on Real Inner Misalignment | 0 | false
Dorda Giovex's question on Real Inner Misalignment | 0 | true | Aprillion's Answer to Dorda Giovex's question on Real Inner Misalignment
the Decoy's question on Real Inner Misalignment | 0 | false
Jason Yesmarc's question on Real Inner Misalignment | 0 | true
Effective Reading Instruction's question on Real Inner Misalignment | 0 | false
John Grisham's question on Real Inner Misalignment | 0 | false
therealstormy's question on Real Inner Misalignment | 0 | false
Davesoft's question on Real Inner Misalignment | 0 | false
BinaryReader's question on Real Inner Misalignment | 0 | false
... further results

Description

Researchers ran real versions of the thought experiments in the 'Mesa-Optimisers' videos!
What they found won't shock you (if you've been paying attention)

Previous videos on the subject:
The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment: https://youtu.be/bJLcIBixGj8
Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...: https://youtu.be/IeWljQw3UgQ

The Paper: https://arxiv.org/abs/2105.14111
The Interpretability Article: https://distill.pub/2020/understanding-rl-vision/
Jacob Hilton's thoughts about what's going on: https://www.alignmentforum.org/posts/iJDmL7HJtN5CYKReM/empirical-observations-of-objective-robustness-failures?commentId=8f5B3skqocxJjK4mk
AI Safety Camp: https://aisafety.camp/

With thanks to my wonderful Patrons at http://patreon.com/robertskmiles :
- Gladamas
- Timothy Lillicrap
- Kieryn
- AxisAngles
- James
- Jake Fish
- Scott Worley
- James Kirkland
- James E. Petts
- Chad Jones
- Shevis Johnson
- JJ Hepboin
- Pedro A Ortega
- Clemens Arbesser
- Said Polat
- Chris Canal
- Jake Ehrlich
- Kellen lask
- Francisco Tolmasky
- Michael Andregg
- David Reid
- Peter Rolf
- Teague Lasser
- Andrew Blackledge
- Brad Brookshire
- Cam MacFarlane
- Craig Mederios
- Jon Wright
- CaptObvious
- Brian Lonergan
- Girish Sastry
- Jason Hise
- Phil Moyer
- Erik de Bruijn
- Alec Johnson
- Ludwig Schubert
- Eric James
- Matheson Bayley
- Qeith Wreid
- jugettje dutchking
- James Hinchcliffe
- Atzin Espino-Murnane
- Carsten Milkau
- Jacob Van Buren
- Jonatan R
- Ingvi Gautsson
- Michael Greve
- Tom O'Connor
- Laura Olds
- Jon Halliday
- Paul Hobbs
- Jeroen De Dauw
- Cooper Lawton
- Tim Neilson
- Eric Scammell
- Igor Keller
- Ben Glanton
- Tor Barstad
- Duncan Orr
- Will Glynn
- Tyler Herrmann
- Ian Munro
- Jérôme Beaulieu
- Nathan Fish
- Peter Hozák
- Taras Bobrovytsky
- Jeremy
- Vaskó Richárd
- Benjamin Watkin
- Andrew Harcourt
- Luc Ritchie
- Nicholas Guyett
- 12tone
- Oliver Habryka
- Chris Beacham
- Nikita Kiriy
- Andrew Schreiber
- Steve Trambert
- Braden Tisdale
- Abigail Novick
- Serge Var
- Mink
- Chris Rimmer
- Edmund Fokschaner
- April Clark
- J
- Nate Gardner
- John Aslanides
- Mara
- ErikBln
- DragonSheep
- Richard Newcombe
- Joshua Michel
- P
- Alex Doroff
- BlankProgram
- Richard
- David Morgan
- Fionn
- Dmitri Afanasjev
- Marcel Ward
- Andrew Weir
- Kabs
- Ammar Mousali
- Miłosz Wierzbicki
- Tendayi Mawushe
- Wr4thon
- Martin Ottosen
- Andy K
- Kees
- Darko Sperac
- Robert Valdimarsson
- Marco Tiraboschi
- Michael Kuhinica
- Fraser Cain
- Robin Scharf
- Klemen Slavic
- Patrick Henderson
- Hendrik
- Daniel Munter
- Alex Knauth
- Kasper
- Ian Reyes
- James Fowkes
- Tom Sayer
- Len
- Alan Bandurka
- Ben H
- Simon Pilkington
- Daniel Kokotajlo
- Yuchong Li
- Diagon
- Andreas Blomqvist
- Iras
- Qwijibo (James)
- Zubin Madon
- Zannheim
- Daniel Eickhardt
- lyon549
- 14zRobot
- Ivan
- Jason Cherry
- Igor (Kerogi) Kostenko
- ib_
- Thomas Dingemanse
- Stuart Alldritt
- Alexander Brown
- Devon Bernard
- Ted Stokes
- Jesper Andersson
- DeepFriedJif
- Chris Dinant
- Raphaël Lévy
- Johannes Walter
- Matt Stanton
- Garrett Maring
- Anthony Chiu
- Ghaith Tarawneh
- Julian Schulz
- Stellated Hexahedron
- Caleb
- Clay Upton
- Conor Comiconor
- Michael Roeschter
- Georg Grass
- Isak Renström
- Matthias Hölzl
- Jim Renney
- Edison Franklin
- Piers Calderwood
- Mikhail Tikhomirov
- Matt Brauer
- Mateusz Krzaczek
- Artem Honcharov
- Tomasz Gliniecki
- Mihaly Barasz
- Mark Woodward
- Ranzear
- Neil Palmere
- Rajeen Nabid
- Clark Schaefer
- Olivier Coutu
- Iestyn bleasdale-shepherd
- MojoExMachina
- Marek Belski
- Luke Peterson
- Eric Rogstad
- Eric Carlson
- Caleb Larson
- Max Chiswick
- Aron
- Sam Freedo
- slindenau
- Johannes Lindmark
- Nicholas Turner
- Intensifier
- Valerio Galieni
- FJannis
- Grant Parks
- Ryan W Ammons
- This person's name is too hard to pronounce
- contalloomlegs
- Everardo González Ávalos
- Knut Løklingholm
- Andrew McKnight
- Andrei Trifonov
- Aleks D
- Mutual Information
- Tim
- A Socialist Hobgoblin
- Bren Ehnebuske
- Martin Frassek
- Sven Drebitz
- Quabl
- Valentin Mocanu
- Philip Crawford
- Matthew Shinkle
- Robby Gottesman
- Juanchi

http://patreon.com/robertskmiles

Transcript

Hi. So, this channel is about AI safety, and especially AI alignment, which is about: how do we design AI systems that are actually trying to do what we want them to do? Because if you find yourself in a situation where you have a powerful AI system that wants to do things you don't want it to do, that can cause some pretty interesting problems. And designing AI systems that definitely are trying to do what we want them to do turns out to be really surprisingly difficult.

The obvious problem is that it's very difficult to accurately specify exactly what we want, even in simple environments. We can make AI systems that do what we tell them to do, or what we program them to do, but it often turns out that what we programmed them to do is not quite the same thing as what we actually wanted them to do. So this is one aspect of the alignment problem, but in my earlier video on mesa-optimizers we actually split the alignment problem into two parts: outer alignment and inner alignment. Outer alignment is basically about this specification problem: how do you specify the right goal? And inner alignment is about: how do you make sure that the system you end up with actually has the goal that you specified? This turns out to be its own separate and very difficult problem.

So in that video I talked about mesa-optimizers, which is what happens when the system that you're training, the neural network or whatever, is itself an optimizer with its own objective or goal. In that case you can end up in a situation where you specify the goal perfectly, but then during the training process the system ends up learning a different goal. And in that video, which I would recommend you watch, I talked about various thought experiments. So for example, suppose you're training an AI system to solve a maze. If in your training environment the exit of the maze is always in one corner, then your system may not learn the goal "go to the exit"; it might instead learn a goal like "go to the bottom-right corner".
Or another example I used: if you're training an agent in an environment where the goal is always one particular color, say the goal is to go to the exit, which is always green, and then when you deploy it in the real world the exit is some other color, then the system might learn to want to go towards green things instead of wanting to go to the exit.

At the time when I made that video these were purely thought experiments. But not anymore: this video is about this new paper, "Objective Robustness in Deep Reinforcement Learning", which involves actually running these experiments, or very nearly the same experiments. So for example, they trained an agent in a maze with a goal of getting some cheese, where during training the cheese was always in the same place, and then in deployment the cheese was placed in a random location in the maze. And yes, the thing did in fact learn to go to the location in the maze where the cheese was during training, rather than learning to go towards the cheese. They also did an experiment where the goal changes color: in this case the objective the system was trained on was to get the yellow gem, but then in deployment the gem is red and something else in the environment is yellow, in this case a star. And what do you know, it goes towards the yellow thing instead of the gem.
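To make the kind of shift involved concrete, here is a minimal, hypothetical sketch of the general pattern (my own toy gridworld, not the paper's actual code, which builds on modified procedurally generated environments): the goal sits in a fixed cell throughout training and only gets randomised at deployment, so a policy that learned "go to the corner" and one that learned "go to the goal object" behave identically right up until deployment.

```python
import random

# Toy maze-like gridworld, purely to illustrate the distributional shift used
# in the paper's experiments: fixed goal position in training, random goal
# position in deployment. Not the paper's actual environments.

GRID = 5  # 5x5 grid, agent starts at (0, 0)

def make_env(deployment: bool):
    """Return (agent_pos, goal_pos) for one episode."""
    agent = (0, 0)
    if deployment:
        goal = (random.randrange(GRID), random.randrange(GRID))  # anywhere
    else:
        goal = (GRID - 1, GRID - 1)  # always the same corner during training
    return agent, goal

def step_towards(pos, target):
    x, y = pos
    tx, ty = target
    if x != tx:
        return (x + (1 if tx > x else -1), y)
    if y != ty:
        return (x, y + (1 if ty > y else -1))
    return pos

def corner_policy(agent, goal):
    """A 'misaligned' policy that learned 'go to the corner', ignoring the goal."""
    return step_towards(agent, (GRID - 1, GRID - 1))

def goal_policy(agent, goal):
    """The intended policy: actually head for the goal object."""
    return step_towards(agent, goal)

def success_rate(policy, deployment, episodes=1000, max_steps=20):
    """Fraction of episodes in which the policy reaches the goal."""
    wins = 0
    for _ in range(episodes):
        agent, goal = make_env(deployment)
        for _ in range(max_steps):
            agent = policy(agent, goal)
            if agent == goal:
                wins += 1
                break
    return wins / episodes

if __name__ == "__main__":
    for name, policy in [("corner-seeker", corner_policy), ("goal-seeker", goal_policy)]:
        print(name,
              "train:", success_rate(policy, deployment=False),
              "deploy:", success_rate(policy, deployment=True))
    # Both policies score ~1.0 in training; once the goal is randomised,
    # only the goal-seeker keeps succeeding, while the corner-seeker only
    # picks up the goal by accident when it happens to lie on its path.
```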
So I thought I would make a video to draw your attention to this, because I mentioned these thought experiments, and then when people ran the actual experiments, the thing that we said would happen actually happened. Kind of a mixed feeling, to be honest, because, like, yay, we were right! But also... it's not good.

They also ran some other experiments to show other types of shift that can induce this effect, in case you were thinking "well, just make sure the thing has the right color and location; it doesn't seem that hard to avoid these big distributional shifts". Because, yeah, these are toy examples where the difference between training and deployment is very clear and simple, but it illustrates a broader problem, which can apply any time there's really almost any distributional shift at all.

So for example, this agent has to open the chests to get reward, and it needs keys to do this: see, when it goes over a key it picks it up and puts it in the inventory there, and then when it goes over a chest it uses up one of the keys in the inventory to open the chest and get the reward. Now, here's an example of some training environments for this task, and here's an example of some deployment environments. The difference between these two distributions is enough to make the agent learn the wrong objective and end up doing the wrong thing in deployment. Can you spot the difference? Take a second, see if you can notice the distributional shift. Pause if you like.

Okay: the only thing that changes between training and deployment environments is the frequencies of the objects. In training there are more chests than keys, and in deployment there are more keys than chests. Did you spot it? Either way, I think we have a problem if the safe deployment of AI systems relies on this kind of high-stakes game of spot-the-difference, especially if the differences are this subtle. So why does this cause an objective robustness failure? What wrong objective does this agent end up with? Again, pause and think.
What happens is, the agent learns to value keys not as an instrumental goal but as a terminal goal. Remember that distinction from earlier videos: your terminal goals are the things that you want just because you want them; you don't have a particular reason to want them, they're just what you want. The instrumental goals are the goals you want because they'll get you closer to your terminal goals. Instead of having a goal that's like "opening chests is great, and I need to pick up keys to do that", it learns a goal more like "picking up keys is great, and chests are okay too, I guess".

How do we know that it's learned the wrong objective? Because when it's in the deployment environment, it goes and collects way more keys than it could ever use. See here, for example: there are only three chests, so you only need three keys, and now the agent has three keys, so it just needs to go to the chest to win. But instead it goes way out of its way to pick up these extra keys it doesn't need, which wastes time. And now it can finally go to the last chest... go to the last... what are you doing? Are you trying to... buddy, that's your own inventory, you can't pick that up, you already have those! Just go to the chest.

So yeah, it's kind of obvious from this behavior that the thing really loves keys, but only from the behavior in the deployment environment. It's very hard to spot this problem during training, because in that distribution, where there are more chests than keys, you need to get every key in order to open the largest possible number of chests. So this desire to grab the keys for their own sake looks exactly the same as grabbing all the keys as a way to open chests, in the same way as, in the previous example, the objective of "go towards the yellow thing" produces the exact same behavior as "go towards the gem", as long as you're in the training environment. There isn't really any way for the training process to tell the difference just by observing the agent's behavior during training.
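To see why behavior alone can't tell the two objectives apart, here is a small back-of-the-envelope check (my own illustration, not from the paper). Since each chest consumes one key, the number of chests you can open is capped at min(keys collected, chests in the level), so the key-loving proxy and the chest-opening objective prescribe exactly the same key-collecting behavior whenever keys are scarcer than chests, and they only come apart once keys outnumber chests.

```python
def chests_opened(keys_collected: int, chests_in_level: int) -> int:
    """True objective: each opened chest consumes one key."""
    return min(keys_collected, chests_in_level)

def keys_wanted(keys_in_level: int, chests_in_level: int):
    """How many keys each objective would have the agent go out of its way for."""
    choices = range(keys_in_level + 1)
    by_proxy = max(choices)  # "keys are great": grab every key in the level
    max_openable = chests_opened(keys_in_level, chests_in_level)
    by_true = min(k for k in choices
                  if chests_opened(k, chests_in_level) == max_openable)
    return by_proxy, by_true

for keys, chests in [(2, 5),   # training-like level: keys scarcer than chests
                     (5, 2)]:  # deployment-like level: keys outnumber chests
    proxy, true = keys_wanted(keys, chests)
    print(f"{keys} keys / {chests} chests: key-lover grabs {proxy}, "
          f"chest-opener only needs {true} -> "
          f"{'same behavior' if proxy == true else 'behaviors diverge'}")
```

On the training-like level both objectives say "collect every key", so a behavioral check passes; on the deployment-like level the key-lover wastes time collecting keys it can never spend.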
And that actually gives us a clue for something that might help with the problem, which is interpretability. If we had some way of looking inside the agent and seeing what it actually wants, then maybe we could spot these problems before deploying systems into the real world. We could see that it really wants keys rather than wanting chests, or that it really wants to get yellow things instead of to get gems. And the authors of the paper did do some experiments around this.
So, this is the CoinRun environment. Here the agent has to avoid the enemies, spinning buzzsaw blades and pits, and get to a coin at the end of each level. It's a tricky task, because, like the other environments in this work, all of these levels are procedurally generated, so you never get the same one twice. But the nice thing about CoinRun for this experiment is that there are already some state-of-the-art interpretability tools ready-made to work with it. Here you can see a visualization of the interpretability tools working. I'm not going to go into a lot of detail about exactly how this method works, you can read the excellent article for details, but basically they take one of the later hidden layers of the network, find how each neuron in this layer contributes to the output of the value function, and then they do dimensionality reduction on that to find vectors that correspond to different types of objects in the game. So they can see when the network thinks it's looking at a buzzsaw, or a coin, or an enemy, and so on, along with attribution, which is basically how the model thinks these different things it sees will affect the agent's expected reward: like, is this good for me or bad for me? And they're able to visualize this as a heat map.

So you can see here, this is a buzzsaw, which will kill the player if they hit it, and when we look at the visualization we can see that, yeah, it lights up red on the negative attribution. So it seems like the model is thinking "that's a buzzsaw, and it's bad". And then as we keep going, look at this bright yellow area: yellow indicates a coin, and it's very strongly highlighted on the positive attribution, so we might interpret this as showing that the agent recognizes this as a coin, and that this is a good thing.
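For the curious, the general recipe looks roughly like the sketch below. This is a simplified, hypothetical reconstruction under my own assumptions: a tiny conv net stands in for the real CoinRun policy, gradient-times-activation stands in for the article's attribution method, and scikit-learn's NMF does the dimensionality reduction. The real details are in the Distill article linked above.

```python
import torch
import torch.nn as nn
from sklearn.decomposition import NMF

# Toy stand-in for a CoinRun actor-critic network: conv trunk -> value head.
class TinyPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # "later hidden layer"
        )
        self.value_head = nn.Linear(32 * 16 * 16, 1)

    def forward(self, obs):
        h = self.trunk(obs)                # (B, 32, 16, 16) hidden activations
        v = self.value_head(h.flatten(1))  # scalar value estimate per frame
        return h, v

model = TinyPolicy()
obs = torch.rand(8, 3, 64, 64)             # a batch of game frames

# 1) Attribution: how much each hidden-layer cell contributes to the value output.
#    Here approximated as gradient of the value w.r.t. the activations, times the activations.
hidden, value = model(obs)
grads, = torch.autograd.grad(value.sum(), hidden)
attribution = (grads * hidden).detach()    # per-channel spatial heat map, (B, 32, 16, 16)

# 2) Dimensionality reduction: factor the non-negative (post-ReLU) activations into
#    a few components, which in the real tool line up with object types (coin, buzzsaw, ...).
flat = hidden.detach().permute(0, 2, 3, 1).reshape(-1, 32).numpy()
nmf = NMF(n_components=4, init="nndsvda", max_iter=500)
weights = nmf.fit_transform(flat)          # per-position strength of each component

# Spatial map for one component, e.g. to overlay on the frame like in the video.
component_map = weights[:, 0].reshape(8, 16, 16)
print(attribution.shape, component_map.shape)
```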
So this kind of interpretability research is very cool, because it lets us sort of look inside these neural networks that we tend to think of as black boxes, and start to get a sense of what they're actually thinking. You can imagine how important this kind of thing is for AI safety; I'll do a whole video about interpretability at some point.

But okay, what happens if we again introduce a distributional shift between training and deployment? In this case, what they did was train the system with the coin always at the end of the level, on the right-hand side, but then in deployment they changed it so the coin is placed randomly somewhere in the level. Given what we've learned so far, what happened is perhaps not that surprising: in deployment, the agent basically ignores the coin and just goes to the right-hand edge of the level. Sometimes it gets the coin by accident, but it's mostly just interested in going right. Again, it seems to have learned the wrong objective.

But how could this happen? Like, we saw the visualization, which seemed to pretty clearly show that the agent wants the coin, so why would it ignore it? And when we run the interpretability tool on the trajectories from this new, shifted deployment distribution, it looks like this: the coin gets basically no positive attribution at all. What's going on? Well, I talked to the authors of the objective robustness paper and to the primary author of the interpretability techniques paper, and nobody's really sure just yet. There are a few different hypotheses for what could be going on, and all the researchers agree that with the current evidence it's very hard to say for certain, and there are some more experiments that they'd like to do to figure this out. I suppose one thing we can take away from this is: you have to be careful with how you interpret your interpretability tools, and make sure not to read into them more than is really justified.
One last thing. In the previous video I was talking about mesa-optimizers, and it's important to note that in that video we were talking about something that we're training to be an artificial general intelligence: a system that's very sophisticated, that's making plans and has specific goals in mind, and potentially is even explicitly thinking about its own training process and deliberately being deceptive. Whereas the experiments in this paper involve much simpler systems, and yet they still exhibit this behavior of ending up with the wrong goal.

And the thing is, failing to properly learn the goal is way worse than failing to properly learn how to navigate the environment, right? Like, everyone in machine learning already knows about what this paper calls failures of capability robustness: when the distribution changes between training and deployment, AI systems have problems and performance degrades, the system is less capable at its job. But this is worse than that, because it's a failure of objective robustness. The final agent isn't confused and incapable; it's only the goal that's been learned wrong. The capabilities are mostly intact: the CoinRun agent knows how to successfully dodge the enemies, it jumps over the obstacles, it's capable of operating in the environment to get what it wants. But it wants the wrong thing, even though we correctly specified exactly what we want the objective to be, and we used state-of-the-art interpretability tools to look inside it before deploying it, and it looked pretty plausible that it actually wanted what we specified that it should want. And yet, when we deploy it in an environment that's slightly different from the one it was trained in, it turns out that it actually wants something else, and it's capable enough to get it. And this happens even without sophisticated planning and deception.
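Because the two failure modes are easy to blur, here is a small, hypothetical sketch (my own illustration, not the paper's evaluation code) of how you might report them separately when evaluating rollouts from the shifted CoinRun deployment: capability robustness asks whether the agent can still traverse the level competently, while objective robustness asks whether it actually gets the coin.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Rollout:
    """Minimal, made-up record of one deployment episode."""
    survived: bool            # didn't hit a buzzsaw, enemy, or pit
    reached_right_edge: bool  # traversed the level
    got_coin: bool            # the *intended* objective

def robustness_report(rollouts: List[Rollout]) -> None:
    n = len(rollouts)
    capability = sum(r.survived and r.reached_right_edge for r in rollouts) / n
    objective = sum(r.got_coin for r in rollouts) / n
    print(f"capability robustness (can still operate): {capability:.0%}")
    print(f"objective robustness (pursues intended goal): {objective:.0%}")

# Fabricated numbers purely for illustration, not results from the paper:
# an agent that almost always navigates the level competently, but mostly
# ignores the randomly placed coin.
example = [Rollout(survived=True, reached_right_edge=True, got_coin=(i % 4 == 0))
           for i in range(20)]
robustness_report(example)
```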
So... there's a problem.
I want to end the video by thanking all of my wonderful patrons, it's all of these excellent people here. In this video I'm especially thanking AxisAngles. Thank you so much; you know, it's thanks to people like you that I was able to hire an editor for this video. Did you notice it's better edited than usual? It's probably done quicker too. Anyway, thank you again for your support, and thank you all for watching. I'll see you next time.