What is inverse reinforcement learning (IRL)?

Inverse reinforcement learning (IRL) is a type of machine learning in which an AI observes the behavior of another agent, typically an expert human, in a particular environment and tries to work out the reward function that agent is optimizing, without having that function explicitly defined.
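
To make the idea concrete, here is a minimal sketch of one classic IRL recipe, feature-expectation matching, on a toy five-state chain. Everything here (the environment, the expert's demonstrations, the learning rate) is invented for illustration rather than taken from any particular system: the learner watches an expert that always heads for the rightmost state, then nudges its reward weights until its own behavior produces the same discounted feature counts.

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9   # toy chain of 5 states; actions: left/right
FEATURES = np.eye(N_STATES)              # one-hot state features

def step(s, a):
    """Deterministic chain dynamics: action 0 moves left, action 1 moves right."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def greedy_policy(reward, iters=100):
    """Value iteration: best action per state for a given reward over states."""
    V = np.zeros(N_STATES)
    for _ in range(iters):
        Q = np.array([[reward[step(s, a)] + GAMMA * V[step(s, a)]
                       for a in range(N_ACTIONS)] for s in range(N_STATES)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def feature_expectations(policy, start=0, horizon=20):
    """Discounted feature counts from rolling the policy forward from the start state."""
    mu, s = np.zeros(N_STATES), start
    for t in range(horizon):
        mu += (GAMMA ** t) * FEATURES[s]
        s = step(s, policy[s])
    return mu

# Expert demonstrations: the expert always moves right, toward the last state;
# this behavior implicitly encodes the reward the learner has to recover.
expert_policy = np.ones(N_STATES, dtype=int)
mu_expert = feature_expectations(expert_policy)

# IRL loop: adjust the reward weights until the learner's own behavior
# produces the same discounted feature counts as the expert's.
w = np.zeros(N_STATES)
for _ in range(50):
    learner_policy = greedy_policy(FEATURES @ w)
    w += 0.1 * (mu_expert - feature_expectations(learner_policy))

print("recovered reward weights:", np.round(w, 2))
print("learner policy (1 = move right):", greedy_policy(FEATURES @ w))
```

The recovered weights end up highest on the state the expert keeps visiting, and the learner's own policy comes to match the demonstrated behavior, even though no reward was ever specified by hand.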

IRL is typically used when a reward function is too complex to define programmatically, or when AI agents need to stay safe by responding robustly to sudden changes in an environment that call for a different reward function. For example, imagine an AI agent trying to learn how to do a backflip. Humans, dogs and Boston Dynamics robots can all perform backflips, but they do so very differently depending on their physiology, their incentives and where they are at the time, all of which can vary enormously in the real world. An AI agent learning backflips solely by trial and error across that range of body types and locations, with nothing to watch, may be rather like waiting for chimpanzees to accidentally type the works of William Shakespeare.

Chimpanzee probably not typing Hamlet (source: Wikipedia)

IRL therefore does not necessarily mean that an AI mimics other agents’ behavior, because AI researchers may expect the AI agent to develop more efficient ways to maximize the reward function it discovers. But IRL does assume that the agent being observed behaves transparently enough for the AI agent to accurately identify what it is doing and what success looks like (in the case of backflips, this might be “returning to your original standing position, having rotated vertically through a full 360 degrees, without being harmed”).

IRL is both a machine learning method, because it offers a way forward when specifying a reward function is too difficult, and a machine learning problem, because an AI agent may settle on an inaccurate reward function, or pursue it through unsafe and unaligned methods. For example, forcing a dog’s spinal column to mimic a human spinal column while attempting a backflip will almost certainly end in tears (though quite possibly still a backflip).
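
One concrete version of the “inaccurate reward function” problem is reward ambiguity: many different reward functions explain the same demonstrations equally well. The sketch below reuses the same invented chain environment as above (all numbers are illustrative) to show two rewards, one a scaled and shifted copy of the other, that produce identical expert behavior, so observation alone cannot tell the learner which one the expert was really optimizing.

```python
import numpy as np

N_STATES, GAMMA = 5, 0.9

def step(s, a):
    """Same invented chain as above: action 0 moves left, action 1 moves right."""
    return max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)

def greedy_policy(reward, iters=100):
    """Value iteration: best action per state under the given reward."""
    V = np.zeros(N_STATES)
    for _ in range(iters):
        Q = np.array([[reward[step(s, a)] + GAMMA * V[step(s, a)]
                       for a in (0, 1)] for s in range(N_STATES)])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

true_reward = np.array([0.0, 0.0, 0.0, 0.0, 1.0])   # only the final state pays off
lookalike   = 3.0 * true_reward + 2.0                # scaled and shifted alternative

# Both rewards make "always move right" optimal, so an expert optimizing either
# one produces identical demonstrations; the learner cannot tell them apart.
print(greedy_policy(true_reward))   # -> [1 1 1 1 1]
print(greedy_policy(lookalike))     # -> [1 1 1 1 1]
```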

IRL is closely related to reinforcement learning (RL), another approach in which an agent is given an explicit reward function and then learns how to optimize for that reward function in a particular environment through trial and error. Unlike reinforcement learning from human feedback (RLHF), in which a human provides feedback for each iteration of an AI agent’s performance on a given task, IRL usually limits human input to repeated demonstrations of the task (“here’s another backflip”, rather than “nice backflip”).
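
A rough way to see the distinction is to compare the data each approach consumes. The dataclasses below are invented names for illustration only, not part of any library: an RL transition carries an explicit reward from the environment, an RLHF comparison carries a human preference between two attempts, and an IRL demonstration carries no reward signal at all, just another backflip.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RLTransition:
    """RL: the environment supplies an explicit reward for every step."""
    state: List[float]
    action: int
    reward: float

@dataclass
class RLHFComparison:
    """RLHF: a human says which of two attempts at the task was better."""
    attempt_a: List[int]
    attempt_b: List[int]
    human_prefers_a: bool

@dataclass
class IRLDemonstration:
    """IRL: just another expert backflip; no reward or rating attached."""
    states: List[List[float]]
    actions: List[int]
```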