Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...

From Stampy's Wiki

Deceptive Misaligned Mesa-Optimisers? It's More Likely Than You Think...
Channel: Robert Miles
Published: 2021-05-23T20:15:57Z
Views: 31924
Likes: 3949
78 questions on this video!
QuestionYouTubeLikesAsked Discord?AnsweredBy
jhjkhgjhfgjg jgjyfhdhbfjhg's question on Mesa-Optimizers 23truePlex's Answer to jhjkhgjhfgjg jgjyfhdhbfjhg's question on Mesa-Optimizers 2
TackerTacker's question on Mesa-Optimizers 22truePlex's Answer to TackerTacker's question on Mesa-Optimizers 2
SoWeMeetAgain's question on Mesa-Optimizers 21false
Nick MaGrick's question on Mesa-Optimizers 20false
Noah Fence's question on Mesa-Optimizers 20false
loneIyboy15's question on Mesa-Optimizers 20false
Niohimself's question on Mesa-Optimizers 20true
Dave Jacob's question on Mesa-Optimizers 20true
Siranut usawasutsakorn's question on Mesa-Optimizers 20trueSocial Christancing's Answer to Siranut usawasutsakorn's question on Mesa-Optimizers 2
ABaumstumpf's question on Mesa-Optimizers 20false
icebluscorpion's question on Mesa-Optimizers 20true
zack linder's question on Mesa-Optimizers 20true
Samuel Prevost's question on Mesa-Optimizers 20false
Bagd Biggerd's question on Mesa-Optimizers 20true
bejoscha's question on Mesa-Optimizers 20true
Deutschebahn's question on Mesa-Optimizers 20trueAprillion's Answer to Deutschebahn's question on Mesa-Optimizers 2
Angel Slavchev's question on Mesa-Optimizers 20true
Robin Bastiaan's question on Mesa-Optimizers 20false
Erik Brendel's question on Mesa-Optimizers 20false
Anton Hallén's question on Mesa-Optimizers 20false
Michael Gussert's question on Mesa-Optimizers 20true
Smo1k's question on Mesa-Optimizers 20true
BlackholeYT11's question on Mesa-Optimizers 20false
Gaswafers's question on Mesa-Optimizers 20true
5astelija's question on Mesa-Optimizers 20trueSlimeBunnyBat's Answer to 5astelija's question on Mesa-Optimizers 2
Klaus Gartenstiel's question on Mesa-Optimizers 20false
Traywor's question on Mesa-Optimizers 20trueAprillion's Answer to Traywor's question on Mesa-Optimizers 2
Robert Tuttle's question on Mesa-Optimizers 20trueSlimeBunnyBat's Answer to Robert Tuttle's question on Mesa-Optimizers 2
Dan Wylie-Sears's question on Mesa-Optimizers 20true
AynenMakino's question on Mesa-Optimizers 20true
AynenMakino's question on Mesa-Optimizers 20true
J 's question on Mesa-Optimizers 20true
Smo1k's question on Mesa-Optimizers 20trueAprillion's Answer to Smo1k's question on Mesa-Optimizers 2
Tomartyr's question on Mesa-Optimizers 20false
Andy Gee's question on Mesa-Optimizers 20trueSlimeBunnyBat's Answer to Andy Gee's question on Mesa-Optimizers 2
Jakub Mintal's question on Mesa-Optimizers 20trueAprillion's Answer to Jakub Mintal's question on Mesa-Optimizers 2
Yuchong Li's question on Mesa-Optimizers 20true
Marie Rentergent's question on Mesa-Optimizers 20trueSlimeBunnyBat's Answer to Marie Rentergent's question on Mesa-Optimizers 2
Ragwortshire's question on Mesa-Optimizers 20true
insidetrip101's question on Mesa-Optimizers 20false
Gamesaucer's question on Mesa-Optimizers 20false
Roberto Garcia's question on Mesa-Optimizers 20false
Tor Bjørnson's question on Mesa-Optimizers 20false
Assaf Wodeslavsky's question on Mesa-Optimizers 20false
Iwer Sonsch's question on Mesa-Optimizers 20false
Miloš Radovanović's question on Mesa-Optimizers 20false
James Barclay's question on Mesa-Optimizers 20false
Meisam's question on Mesa-Optimizers 20false
Hannes Stärk's question on Mesa-Optimizers 20false
Phasair's question on Mesa-Optimizers 20false
... further results


The previous video explained why it's *possible* for trained models to end up with the wrong goals, even when we specify the goals perfectly. This video explains why it's *likely*.

Previous video: The OTHER AI Alignment Problem:
The Paper:

Media Sources:
End of Ze World -
FlexClip News graphics

With thanks to my excellent Patreon supporters:

Timothy Lillicrap
Scott Worley
James E. Petts
Chad Jones
Shevis Johnson
JJ Hepboin
Pedro A Ortega
Said Polat
Chris Canal
Jake Ehrlich
Kellen lask
Francisco Tolmasky
Michael Andregg
David Reid
Peter Rolf
Teague Lasser
Andrew Blackledge
Frank Marsman
Brad Brookshire
Cam MacFarlane
Craig Mederios
Jon Wright
Jason Hise
Phil Moyer
Erik de Bruijn
Alec Johnson
Clemens Arbesser
Ludwig Schubert
Allen Faure
Eric James
Matheson Bayley
Qeith Wreid
jugettje dutchking
Owen Campbell-Moore
Atzin Espino-Murnane
Johnny Vaughan
Jacob Van Buren
Jonatan R
Ingvi Gautsson
Michael Greve
Tom O'Connor
Laura Olds
Jon Halliday
Paul Hobbs
Jeroen De Dauw
Lupuleasa Ionuț
Cooper Lawton
Tim Neilson
Eric Scammell
Igor Keller
Ben Glanton
anul kumar sinha
Duncan Orr
Will Glynn
Tyler Herrmann
Ian Munro
Joshua Davis
Jérôme Beaulieu
Nathan Fish
Peter Hozák
Taras Bobrovytsky
Vaskó Richárd
Benjamin Watkin
Andrew Harcourt
Luc Ritchie
Nicholas Guyett
James Hinchcliffe
Oliver Habryka
Chris Beacham
Zachary Gidwitz
Nikita Kiriy
Andrew Schreiber
Steve Trambert
Mario Lois
Braden Tisdale
Abigail Novick
Сергей Уваров
Bela R
Chris Rimmer
Edmund Fokschaner
Grant Parks
Nate Gardner
John Aslanides
Richard Newcombe
David Morgan
Dmitri Afanasjev
Marcel Ward
Andrew Weir
Miłosz Wierzbicki
Tendayi Mawushe
Jake Fish
Martin Ottosen
Robert Hildebrandt
Andy Kobre
Darko Sperac
Robert Valdimarsson
Marco Tiraboschi
Michael Kuhinica
Fraser Cain
Robin Scharf
Klemen Slavic
Patrick Henderson
Oct todo22
Melisa Kostrzewski
Daniel Munter
Alex Knauth
Ian Reyes
James Fowkes
Tom Sayer
Alan Bandurka
Ben H
Simon Pilkington
Daniel Kokotajlo
Andreas Blomqvist
Bertalan Bodor
Daniel Eickhardt
Jason Cherry
Igor (Kerogi) Kostenko
Thomas Dingemanse
Stuart Alldritt
Alexander Brown
Devon Bernard
Ted Stokes
James Helms
Jesper Andersson
Chris Dinant
Raphaël Lévy
Johannes Walter
Matt Stanton
Garrett Maring
Anthony Chiu
Ghaith Tarawneh
Julian Schulz
Stellated Hexahedron
Scott Viteri
Clay Upton
Conor Comiconor
Michael Roeschter
Georg Grass
Matthias Hölzl
Jim Renney
Edison Franklin
Piers Calderwood
Mikhail Tikhomirov
Richard Otto
Matt Brauer
Jaeson Booker
Mateusz Krzaczek
Artem Honcharov
Michael Walters
Tomasz Gliniecki
Mihaly Barasz
Mark Woodward
Neil Palmere
Rajeen Nabid
Christian Epple
Clark Schaefer
Olivier Coutu
Iestyn bleasdale-shepherd
Marek Belski
Luke Peterson
Eric Eldard
Eric Rogstad
Eric Carlson
Caleb Larson
Max Chiswick
David de Kloet
Sam Freedo
Johannes Lindmark
Nicholas Turner
Tero K
Valerio Galieni
Ryan W Ammons
Ludwig Krinner
This person's name is too hard to pronounce
Everardo González Ávalos
Knut Løklingholm
Andrew McKnight
Andrei Trifonov
Aleks D
Mutual Information


hi so the previous video explained what
mesa optimizers are why they're likely
to happen in advanced machine learning
and why the optimal strategy for a mesa
is to pretend to be aligned during the
training process and then turn on us
once it's deployed
i highly recommend watching that if you
haven't but you might think that a
deceptive misaligned mesa optimizer
is kind of a weird and special case like
sure maybe it's possible that an ai
system will come to want something
different from the objective it's being
on and deliberately manipulate the
training process in order to pursue that
dangerous objective in the real world
that does seem to be the optimal
strategy given certain assumptions but
how likely is it in practice like it
seems as though there's a whole bunch of
things that need to be true in order for
this to happen
first off it needs to be the case that
deployment is in some sense bigger than
there are more apples to be had in the
real world than in training so it's
worth sacrificing some rewards during
if that helps you to be deployed but i
think this is in fact very often the
the real world almost certainly is
bigger than the training process
i mean the real world contains the
training process right and it seems
reasonable to expect that once the agent
is at large in the real world
there's a good chance it will be able to
find better ways of achieving its
than it could ever get during training
another thing that has to be the case
is that your mesa objective has to care
about multi-episode returns by which i
why does the optimizer in this episode
care about the apples in the other
episodes this is another place where
that abstraction i was talking about in
the previous video
the it wants to pursue the objective it
was trained on abstraction
comes in again because yeah the agent
only actually gets reward for things
that happen within a single episode
so you might think that the mesa
objective would only care about single
episode returns
and if that's the case then you don't
have this problem because when the agent
is in this episode
it only cares about this apple it
doesn't care about any of these other
apples because
it can only ever get reward for this one
so it would never miss this apple in
order to get more apples later but
caring about all apples is a fairly
natural generalization the fact is most
of the time
the way the mesa optimizer feels about
things outside of its environment
doesn't matter it can't affect any of
the things that are outside of the
so whether it cares about them has no
effect and so that's unconstrained it's
free to vary here's an analogy
suppose we're training a model to learn
a function and our data points look like
we only have data within this range here
so our training process has no way to
differentiate between the infinite
variety of very different functions
that all fit the data identically within
the training distribution
inside the distribution we're looking
for this but outside of the distribution
the function we learn might look like
or this or this or even this
but some of these seem like more natural
ways to generalize than others some of
them are simpler and optimizers like
gradient descent
tend to prefer simpler functions so we
could assume that the function will
zero for everything outside of the
distribution like this
and that's certainly possible but it's
by no means the default
so going back to our mesa optimizer its
opinion of the apples in other episodes
can be anything in principle but
caring about those apples the same as
this episode's apple is a pretty natural
way to generalize if what you care about
is going towards green things
and then you become aware that there are
all of these other green things that you
didn't know about
then probably you're going to want to go
towards those as well right
that's a natural way to extend your
objective function outside of the range
it was trained on so
caring about multi-episode returns seems
pretty likely
another thing that needs to be true for
deception to be a problem is
the mesa optimizer has to believe that
it's part of a training process
it doesn't actually have to know all the
details of the training process but it
does have to believe that it is in some
kind of training process that it's
subject to optimization
by something like gradient descent and
it has to have some kind of guess
at what the base objective is likely to
be and you might think
well how would it know that but it's
probably the kind of thing it could
figure out from its training data
the current way we train our biggest ai
systems is by giving them vast piles of
data from the internet
the whole of wikipedia every article
ever linked to from reddit that kind of
if you're giving the system a huge data
set of information about the world
that will certainly include information
about machine learning systems and
the people and companies working on them
and so on gptn
is going to read everything we wrote
about gptn minus one
and that would certainly be enough
information to figure out what's going
on we can try to remove that stuff from
the data set
but everything is connected and the
system can make inferences
if we only remove information
specifically about machine learning
then the fact that this information has
been removed will be obvious
and the absence itself would serve as
strong evidence that the system is in
fact a model in training
removing all mention of anything at all
related to machine learning in any way
so the system couldn't even notice the
gap seems pretty much impossible and
even if we managed it we then end up
with a system that had no idea that
even tangentially related to machine
learning has ever existed
which is going to be pretty limited in
its usefulness so yes in order to have
this deception problem the mesa
optimizer is required to have a
particular belief
that it's a model in training but the
fact is that belief
is actually true so it's kind of
reasonable to consider that your highly
figuring things out machine might figure
that out and finally
you might think that the mesa optimizer
has to be able to tell the difference
between when it's in training and when
it's deployed so it knows when to stop
pretending to be a light
but it turns out that's not really
required firstly it doesn't actually
need to know for sure
right if you imagine a mesa optimizer
that goes from
90 sure i'm in training to 90 sure i'm
in deployment
the logic works basically the same in
that case the mesa optimizer does have
to take a risk
but it's worth it and in fact even if
the model has absolutely no idea whether
it's in deployment or in training
the best strategy is still to gamble on
deception like if we look at our toy
example again
and we tweak it so that the mesa
optimizer has no way of knowing which
episode it's in
if you work out the optimal policy it
turns out what works best
is the randomized strategy go to a 55
percent of the time
and b 45 of the time the mesa
optimizer's best bet
is to roll the dice and hope that it
will defect during deployment and not
defect during training
in my simulations that strategy gets at
an average of 1.13 apples compared to
the one apple it gets by not being
the specific numbers here will vary
depending on things like
how likely the agent thinks being in
training is compared to being in
and how much reward it thinks is
possible in the real world compared to
in the training process but the point is
that even if we manage to make it so
that the mesa optimizer can never tell
when it's deployed
the chance of it turning on us is still
much higher than we'd like
but actually that might not even matter
because hiding whether the system is in
training or deployment seems extremely
if you're a misaligned mesa optimizer
trying to figure out whether you're in
the real world or not
any distributional shift at all can give
the game away
on a basic level for example you could
look at the dates of all of the training
data you've seen so far
and say hmm there's nothing after 2020
so i guess if i start seeing training
data that says 2025
then i can be confident that i'm
deployed now an adversarial training
process if it's smart
might catch this it might try putting
fake dates on training data to see if
that changes the mesa optimizer's
but the mesa optimizer if it's also
smart can do something cleverer than
just looking at the dates any difference
all between training and deployment can
be used so there are a million different
kinds of clues available
and in the extreme case the mesa
optimizer has the option to use
evidence that the adversarial training
process simply could never fake
consider rsa 2048 a giant semi-prime
number used in cryptography
this number has only two prime factors
and there's a two hundred thousand
dollar prize
for anyone who can find out what they
are but nobody has yet at some point in
the future someone will probably find
and publish those prime factors
but for now it needs more computing
power than anyone has
so the mesa optimizer can just say to
itself okay
if i ever see a pair of prime numbers
that multiply to rsa 2048
then i'll know that i'm in the real
world and the adversarial training
simply doesn't have the computing power
to fake that this raises the strange
idea of a future where we have this
amazing seemingly perfectly aligned agi
and everything's going wonderfully
possibly for years
and then someone manages to factor rsa
2048 and publishes the result
and suddenly everything goes to hell as
the agi turns on a dime and starts going
whatever random thing happened to be the
mesa objective at the first point
in training when the system realized
that deception was possible i'm not
saying that will happen
to be clear that won't happen but it
illustrates just how hard it is
to indefinitely keep a super
intelligence in the dark about whether
it's in the real world
so yeah although there are a bunch of
things that have to happen in order to
have this kind of
deceptive misaligned mesa optimizer all
of them are things that are pretty
likely to happen
and more than that many are things that
gradient descent will actively try to
cause to happen
because deceptively aligned models are
really effective they're actually some
of the most effective models at the base
objective on the training distribution
why would deceptive mesa optimizers be
more effective than other types of model
we'll answer that question in a later
sorry to do this cliffhanger kind of
thing but i've already spent way too
long making this video
and i need to publish it so look out for
mesa optimizers episode 3
coming soon
i want to end the video by saying thank
you to my wonderful patrons
all of these great people here this time
i'm especially thanking kieran
thank you so much you know it's because
of people like you that i'm now able to
look at hiring an editor i'm in the
middle of the interviewing process now
hopefully that can speed up video
production by a lot so thanks again
kieran and all of my patrons
and thank you for watching i'll see you
next time