Scaling laws

Main Question: What are scaling laws in machine learning?

Description

Scaling laws refer to the observed tendency of some machine learning architectures (notably transformers) to improve their performance according to a predictable power law when given more compute, data, or parameters (model size), assuming they are not bottlenecked on one of the other resources. This trend has been observed to hold consistently over more than six orders of magnitude.


[Figure: Scaling laws graph from Scaling Laws for Neural Language Models]
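As a rough illustration of what such a power law looks like in practice, here is a small Python sketch. The exponent and normalization constant are assumptions chosen to resemble the parameter-count law reported in Scaling Laws for Neural Language Models, not authoritative values.

```python
import numpy as np

# Illustrative power-law loss curve of the form L(N) = (N_c / N) ** alpha,
# where N is the number of non-embedding parameters. The constants below are
# assumptions chosen to roughly match the shape reported in
# "Scaling Laws for Neural Language Models" (Kaplan et al., 2020).
ALPHA_N = 0.076   # assumed exponent for the parameter-count scaling law
N_C = 8.8e13      # assumed normalization constant (parameters)

def loss_from_params(n_params: float) -> float:
    """Predicted loss (nats/token) for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# Each doubling of model size shrinks the loss by the same constant factor,
# which is what "power law" means: a straight line on a log-log plot.
for n in [1e6, 1e8, 1e10, 1e12]:
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```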

Canonically answered

Why might people try to build AGI rather than stronger and stronger narrow AIs?


Making a narrow AI for every task would be extremely costly and time-consuming. By making a more general intelligence, you can apply one system to a broader range of tasks, which is economically and strategically attractive.

Of course, for generality to be a good option, some conditions need to hold. You need an architecture which is straightforward enough to scale up, such as the transformer used for GPT, which follows scaling laws. It is also important that by generalizing you do not lose too much capability at narrow tasks, or require so much extra compute that it is no longer worthwhile.

Whether or not those conditions actually hold, it seems that many important actors (such as DeepMind and OpenAI) believe that they do, and are therefore focusing on trying to build an AGI in order to influence the future. We should therefore take actions to make it more likely that AGI will be developed safely.

Additionally, it is possible that even if we tried to build only narrow AIs, given enough time and compute we might accidentally create a more general AI than we intend by training a system on a task which requires a broad world model.


What are "scaling laws" and how are they relevant to safety?


Scaling laws are observed trends in the performance of large machine learning models.

In the field of ML, better performance is usually achieved through better algorithms, better inputs, or larger amounts of parameters, computing power, or data. Since the 2010s, advances in deep learning have shown experimentally that the easier and faster returns come from scaling, an observation that Richard Sutton has described as the bitter lesson.

While deep learning as a field has long struggled to scale models up while retaining learning capability (running into problems such as catastrophic interference), more recent methods, especially the Transformer model architecture, turned out to just work when fed more data and, as the meme goes, when stacking more layers.
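To make "stacking more layers" concrete, here is a hedged Python sketch using the common rule-of-thumb estimate of roughly 12 · n_layer · d_model² non-embedding parameters for a standard Transformer stack; the formula and the configurations below are illustrative assumptions, not exact published model specifications.

```python
# Minimal sketch of why "stacking more layers" scales parameter count.
# Assumes the rule of thumb N ~ 12 * n_layer * d_model**2 for the
# non-embedding parameters of a standard Transformer block stack.

def approx_transformer_params(n_layer: int, d_model: int) -> int:
    # Each block holds attention projections (~4 * d_model**2) plus a
    # feed-forward network (~8 * d_model**2 with the usual 4x hidden width).
    return 12 * n_layer * d_model ** 2

# Illustrative depth/width configurations (assumed, not exact model specs).
for n_layer, d_model in [(12, 768), (24, 1024), (48, 1600), (96, 12288)]:
    n = approx_transformer_params(n_layer, d_model)
    print(f"{n_layer} layers, d_model={d_model}: ~{n / 1e6:.0f}M parameters")
```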

More surprisingly, performance (in terms of absolute likelihood loss, a standard measure) appeared to improve smoothly with compute, dataset size, or parameter count. This gave rise to scaling laws: the trend lines suggested by those performance gains, from which returns on data/compute/time investment could be extrapolated.
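As an illustration of how such a trend line can be fit and extrapolated, here is a minimal Python sketch; the compute/loss points are synthetic assumptions standing in for real training runs.

```python
import numpy as np

# Synthetic (compute, loss) measurements; in practice these would come
# from training runs of varying size. Values below are assumptions.
compute = np.array([1e15, 1e16, 1e17, 1e18, 1e19])   # e.g. training FLOPs
loss    = np.array([5.2, 4.3, 3.6, 3.0, 2.5])        # loss per token

# A power law L = a * C**(-b) is a straight line in log-log space:
# log L = log a - b * log C, so an ordinary least-squares fit recovers (a, b).
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), deg=1)
b, a = -slope, 10 ** intercept

# Extrapolate the trend to a compute budget 100x beyond the largest run.
c_future = 1e21
predicted = a * c_future ** (-b)
print(f"fit: L = {a:.2f} * C^(-{b:.3f}); predicted loss at C=1e21: {predicted:.2f}")
```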

A companion to this purely descriptive law (no strong theoretical explanation of the phenomenon has been found yet) is the scaling hypothesis, which Gwern Branwen describes as follows:

The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, [...] we can simply train ever larger [neural networks] and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data.

The scaling laws, if the above hypothesis holds, become highly relevant to safety insofar as capability gains become conceptually easier to achieve: no need for clever designs to solve a given task, just throw more processing at it and it will eventually yield. As Paul Christiano observes:

It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works”.

While the scaling laws still hold experimentally at the time of this writing (July 2022), whether they'll continue up to safety-relevant capabilities is still an open problem.

Unanswered canonical questions