|Main Question: What are scaling laws in machine learning?|
|Alignment Forum Tag|
Scaling laws refer to the observed tendency of some machine learning architectures (notably transformers) to improve their performance according to a predictable power law when given more compute, data, or parameters (model size), assuming they are not bottlenecked on one of the other resources. This trend has been observed to hold consistently over more than six orders of magnitude.<figure class="image"><img src=""><figcaption>Scaling laws graph from Scaling Laws for Neural Language Models</figcaption></figure>
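As a concrete sketch of what such a power law looks like, the snippet below encodes a loss-versus-parameter-count law of the form L(N) = (Nc / N)^α. The constants are approximations of values reported in Scaling Laws for Neural Language Models and are used here only for illustration:

```python
# Illustrative sketch of a parameter-count scaling law, L(N) = (Nc / N)**ALPHA.
# The constants approximate those reported in "Scaling Laws for Neural
# Language Models" (Kaplan et al., 2020); treat them as assumptions for
# illustration, not authoritative values.

NC = 8.8e13     # reference (non-embedding) parameter count
ALPHA = 0.076   # power-law exponent for model size

def loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (NC / n_params) ** ALPHA

# Doubling model size shrinks the loss by a constant factor of 2**-ALPHA,
# i.e. the curve is a straight line on a log-log plot -- the signature of
# a power law.
ratio = loss(2e9) / loss(1e9)
```

The key qualitative point is the constant multiplicative improvement per doubling, which is what makes extrapolation across orders of magnitude plausible.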
Making a narrow AI for every task would be extremely costly and time-consuming. By making a more general intelligence, you can apply one system to a broader range of tasks, which is economically and strategically attractive.
Of course, for generality to be a good option, some conditions must hold. You need an architecture which is straightforward to scale up, such as the transformer used in GPT, which follows scaling laws. It is also important that generalizing does not cost too much capability at narrow tasks, or require so much extra compute that it stops being worthwhile.
Whether or not those conditions actually hold, many important actors (such as DeepMind and OpenAI) appear to believe that they do, and are therefore focusing on trying to build an AGI in order to influence the future. We should accordingly take actions to make it more likely that AGI will be developed safely.
Additionally, it is possible that even if we tried to build only narrow AIs, given enough time and compute we might accidentally create a more general AI than we intend by training a system on a task which requires a broad world model.
- Reframing Superintelligence - A model of AI development which proposes that we might mostly build narrow AI systems for some time.
Scaling laws are observed trends on the performance of large machine learning models.
In the field of ML, better performance is usually achieved through better algorithms, better inputs, or more of the basic resources: parameters, computing power, and data. Since the 2010s, advances in deep learning have shown experimentally that the easier and faster returns come from scaling, an observation Richard Sutton has described as the bitter lesson.
While deep learning as a field long struggled to scale models up while retaining learning capability (running into problems such as catastrophic interference), more recent methods, especially the Transformer architecture, turned out to just work when fed more data and, as the meme goes, when you stack more layers.
More surprisingly, performance (in terms of absolute likelihood loss, a standard measure) appeared to improve smoothly with compute, dataset size, or parameter count. This gave rise to scaling laws: trend lines fit to those performance gains, from which the returns on further data, compute, or parameter investment can be extrapolated.
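The extrapolation described above amounts to fitting a straight line in log-log space. The sketch below recovers a power law from (compute, loss) measurements and extrapolates it; the data points are synthetic, generated from an assumed law purely to show the mechanics:

```python
# Sketch: recover a power law L = a * C**(-b) from (compute, loss) points
# by ordinary least squares in log-log space, then extrapolate.
# The data are synthetic, generated from an assumed law L = 10 * C**-0.05.
import math

compute = [1e15, 1e17, 1e19, 1e21]
measured = [10 * c ** -0.05 for c in compute]

xs = [math.log(c) for c in compute]
ys = [math.log(l) for l in measured]

# Least-squares fit of y = m*x + k, so that b = -m and a = e**k.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
k = my - m * mx
b, a = -m, math.exp(k)

# Extrapolate one order of magnitude beyond the largest measured budget.
predicted = a * (1e22) ** -b
```

Real loss curves are noisy and can bend at the extremes, so such extrapolations carry real uncertainty; the fit above is exact only because the synthetic data contain no noise.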
A companion to this purely descriptive law (no strong theoretical explanation of the phenomenon has yet been found) is the scaling hypothesis, which Gwern Branwen describes as follows:
The strong scaling hypothesis is that, once we find a scalable architecture like self-attention or convolutions, [...] we can simply train ever larger [neural networks] and ever more sophisticated behavior will emerge naturally as the easiest way to optimize for all the tasks & data.
If the above hypothesis holds, the scaling laws become highly relevant to safety, insofar as capability gains become conceptually easier to achieve: no need for a clever design to solve a given task; just throw more processing at it and it will eventually yield. As Paul Christiano observes:
It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn’t involve qualitatively new ideas about “how intelligence works”.
While the scaling laws still hold experimentally at the time of this writing (July 2022), whether they'll continue up to safety-relevant capabilities is still an open problem.
We don't know yet. This sentiment has shifted with the ebb and flow of new research, but the wind seems to be blowing in the direction of scaling being sufficient for AGI.
External opinions on this subject
- Gwern on scaling hypothesis
- Daniel Kokotajlo on what we could do with +12 OOMs of compute
- Rohin Shah on the likelihood of getting to AGI with current techniques
- Rich Sutton's "The Bitter Lesson" on why more computation beats leveraging existing human knowledge
- Gary Marcus on how LLMs are limited and how scaling will not help
- Evidence against current methods
The basic idea is to figure out how model performance scales and to use this to understand and predict what future AI models might look like, which can inform timelines and AI safety research. A classic result found that data, parameters, and compute must all be increased together (at roughly the same rate) in order to improve performance. Anthropic extended this research here.
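The "increase everything at roughly the same rate" result can be sketched as a compute-budget allocation. The snippet below uses the common approximation that training cost is about 6 × parameters × tokens, and assumes, for simplicity, equal scaling exponents for parameters and data (ignoring the constant tokens-per-parameter ratio found empirically); both are assumptions for illustration:

```python
# Sketch of compute-optimal allocation under the approximation
# C ~= 6 * N * D (training FLOPs ~ 6 x parameters x tokens), assuming
# parameters and tokens should each grow as sqrt(C). The exponent 0.5
# and the symmetric split are simplifying assumptions for illustration.
import math

def optimal_allocation(c_flops: float) -> tuple[float, float]:
    """Split a compute budget into (parameters, training tokens)."""
    n = math.sqrt(c_flops / 6)  # parameter count
    d = math.sqrt(c_flops / 6)  # training tokens
    return n, d

n, d = optimal_allocation(1e21)
# Under these assumptions, 10x more compute buys ~3.16x more parameters
# and ~3.16x more data, rather than spending it all on one resource.
```

The qualitative takeaway matches the text: scaling only one of the three resources while holding the others fixed eventually hits a bottleneck, so the budget is split between model size and data.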