What is the "natural abstraction hypothesis"?

Introduction

The natural abstraction hypothesis (NAH) makes a number of claims: that our physical world abstracts well, that these abstractions are natural, i.e. many different kinds of minds learn them, and that these abstractions are basically the concepts humans usually use. If true, this dramatically simplifies AI alignment, as it implies that any cognition a very powerful AI uses will be in terms of concepts that humans can understand.1

Explanation of the natural abstraction hypothesis

Let's unpack that. First, what do we mean by “our physical world abstracts well”? Just that for most systems, the information that is relevant “far away” from the system is much lower-dimensional (i.e. described by far fewer numbers) than the system itself. (Note that “far away” is not just physical distance relative to the size of the system: it can also be conceptual or causal separation.)

For example, a wheel can be understood without considering the position and velocity of every atom in it. We only need to know a few large-scale properties, like its shape and how it rotates, to know how a wheel interacts with other parts of the world. This is a handful of dimensions compared to an atomically precise description, which would require over 10^26 numbers! Or consider a tree: far from the tree, you don't need to keep track of individual leaves or its root structure for most purposes. E.g. if you’re looking for fresh wood, you just need to know that it is a tree and how broad/tall it is.
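To make the dimensionality gap concrete, here is a toy numerical sketch (our own illustration, not taken from the sources): the gravitational pull of a cluster of many particles, evaluated far away, is captured almost perfectly by just four numbers (total mass plus centre of mass), even though the full microstate needs four numbers per particle.

```python
# Toy illustration: far from a cluster of particles, a 4-number summary
# (total mass + centre of mass) predicts the gravitational potential the
# cluster produces almost exactly, even though the full microstate has
# 4 * N numbers (mass + 3D position per particle).
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
masses = rng.uniform(1.0, 2.0, size=N)          # N numbers
positions = rng.normal(0.0, 1.0, size=(N, 3))   # 3N numbers

def exact_potential(query):
    """Potential at `query` from every individual particle (G = 1)."""
    dists = np.linalg.norm(query - positions, axis=1)
    return -np.sum(masses / dists)

def abstracted_potential(query):
    """Same quantity from the 4-number summary: total mass + centre of mass."""
    total_mass = masses.sum()
    com = (masses[:, None] * positions).sum(axis=0) / total_mass
    return -total_mass / np.linalg.norm(query - com)

far_point = np.array([100.0, 0.0, 0.0])  # "far away" relative to the cluster
exact = exact_potential(far_point)
approx = abstracted_potential(far_point)
print(f"relative error: {abs(exact - approx) / abs(exact):.2e}")  # tiny (~1e-4 or less)
```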

The natural abstraction hypothesis claims that different minds will converge to the same set of abstractions because they are the most efficient representations of all the relevant information that reaches the mind from “far away”. And many parts of the world that are far from a mind will influence things the mind cares about, so a mind will be incentivized to learn these abstractions. For instance, if someone mostly cares about building great cars, then things like “Hertzian zones” may affect their ability to build great cars despite being conceptually far from car design. So that mind would plausibly have to learn what high-pressure phase transitions are.

Moreover, NAH claims that the abstractions humans usually use are approximately natural abstractions. That is, any mind that looks at and uses car wheels successfully will have learned what a car wheel is in approximately the same way as a human. Or if some extremely clever AI were looking at the world, it might develop a new theory of physics that improves on ours; but our theory would still give approximately the same answers as its theory in most scenarios, much like Newton’s physics is a very good approximation to Einstein’s.

Note how strong a claim NAH is making! It holds even for aliens, even for superintelligences, even for alien superintelligences! Nevertheless, we have some reason to believe in it, at least up to human-level intelligence. But first, why does NAH matter for alignment?

Why it is important for alignment

If true, the natural abstraction hypothesis would simplify work on AI, and AI alignment in particular. It would mean that a wide variety of cognitive architectures will reliably learn approximately the same concepts as humans use, no matter how superhumanly intelligent the cognitive architecture is. This could substantially change the strategic picture.

In an especially fortunate world, human values, or other alignment targets like “niceness”, corrigibility, or property rights, are themselves natural abstractions. If these things are represented in a simple way in most advanced AI systems, then alignment, or control, is largely a matter of locating these abstractions within the AI's mind and forming a goal from them, like “be corrigible to your creator”. A crude but remarkably effective technique in this vein is activation steering (see Golden-Gate Claude).
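As a rough sketch of the idea behind activation steering (illustrative only: the toy model, hook placement, and steering vector below are stand-ins; in real systems the hook targets a transformer block's residual stream and the vector is extracted from the model's own activations, e.g. as a mean difference between activations on contrastive prompts):

```python
# Minimal sketch of activation steering on a toy network: add a "concept
# direction" to a hidden layer's activations at inference time and observe
# that the model's outputs shift. All names and shapes are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 16
model = nn.Sequential(nn.Linear(8, hidden), nn.ReLU(), nn.Linear(hidden, 4))

# Pretend this direction encodes a concept we located inside the model.
steering_vector = torch.randn(hidden)
steering_vector /= steering_vector.norm()
strength = 3.0

def add_steering(module, inputs, output):
    # Shift the layer's activations along the concept direction.
    return output + strength * steering_vector

handle = model[0].register_forward_hook(add_steering)

x = torch.randn(1, 8)
steered_logits = model(x)
handle.remove()
unsteered_logits = model(x)
print(steered_logits - unsteered_logits)  # nonzero: behaviour has shifted
```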

Different alignment targets look more or less plausible as natural abstractions. An arbitrary value a particular human has, e.g. a love of wasabi ice-cream, is unlikely to be a natural abstraction. But some parts of human values are plausibly natural abstractions, and if an AI had them, that would prevent the worst-case scenarios from occurring. E.g. respecting the property rights of other agents, being honest, being nice to weaker creatures and respecting stronger ones, being corrigible, etc.

If these targets were themselves natural abstractions, it would be much more plausible that we could set one such target as an objective to chisel into an AI’s cognition using standard learning algorithms, and expect the AI to reliably learn and value this target by default.

Is NAH true?

We don't know. NAH is ultimately an empirical question, and we have few examples of minds we can converse with, or manually inspect, to see if their abstractions are natural. For the cases we can check, i.e. humans and current AI systems, the data are consistent with NAH.

Humans share roughly the same abstractions (note how you've never needed 1TB of data to describe an idea to someone, let alone to convince someone that something is a rock), and our abstractions continue to work even in environments drastically different from the ones where we developed them, e.g. F=ma still roughly works on the moon. As far as we can tell with our crude ability to measure abstractions, very different AIs trained in different ways on different data develop basically the same abstractions, and the more capable the AI, the more this holds.
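One crude way this kind of representational convergence is measured in practice is linear centered kernel alignment (CKA) between two models' activations on the same inputs. The sketch below uses synthetic matrices as stand-ins for real model activations; only the CKA formula itself is standard.

```python
# Linear CKA: a similarity score between two sets of activations on the same
# inputs. Here random matrices stand in for real model activations: two "models"
# that are different linear views of the same underlying structure score much
# higher than a pair with nothing in common.
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between activations X (n, d1) and Y (n, d2) on the same n inputs."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 64))               # shared underlying structure
model_a = base @ rng.normal(size=(64, 128))     # "model A's" view of it
model_b = base @ rng.normal(size=(64, 96))      # "model B's" view of it
unrelated = rng.normal(size=(500, 96))          # a representation with no shared structure

print("models built on shared structure:", round(linear_cka(model_a, model_b), 2))
print("no shared structure:             ", round(linear_cka(model_a, unrelated), 2))
# The first score comes out substantially higher than the second.
```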

But we have no data for generally superhuman systems. This is where a theory of natural abstractions would have to come into play: we could test candidate theories against existing data and use the best of them to predict what will occur in superhuman systems. Alas, the theory of natural abstractions is far from developed enough to do this. We do not even have a good technical definition yet, which is why the hypothesis is framed informally here. The work is ongoing.

Sources:

Natural Abstractions Distillation: https://www.lesswrong.com/posts/gvzW46Z3BsaZsLc25/natural-abstractions-key-claims-theorems-and-critiques-1

The Platonic Representation Hypothesis: https://arxiv.org/abs/2405.07987


  1. “Good representations are all alike; every bad representation is bad in its own way” - if Tolstoy had invented the Natural Abstractions Hypothesis, that is what it would say. ↩︎