How would we evaluate if an AI is an AGI?

AGI ("artificial general intelligence") broadly refers to an AI system that can do a wide range of cognitive tasks at a level comparable to humans. Determining whether a specific AI passes this bar to be considered “AGI” is no easy task, and history is littered with tasks which were once thought to be AI-complete, but were later achieved by narrow specialized systems.

Until recently, AI systems at the frontier of progress were specialized, or “narrow AI”: they outperformed humans in specific domains like board games but could not handle a broader variety of tasks. Significant progress was made on many problems, including computer vision, natural language understanding, and autonomous driving, yet in retrospect few people consider the systems most capable at these problems to be generally intelligent.

Since 2022, the development of LLMs and their successor multimodal systems has led some to argue that they constitute “AGI”, because they perform well on a broad variety of tasks, including many that they were not specifically trained for.

In this context, how does one determine whether a specific AI counts as “AGI”? The answer depends on 1) a set of criteria for AGI, and 2) an empirical test of whether a given system meets those criteria. There is no consensus on exactly what kinds of systems should be considered “AGI”. Below are tests that people have proposed to evaluate whether a system meets their conception of AGI, though for each of these tests there may still be disagreement over whether a system that passes it counts as AGI.

The Turing Test

Alan Turing's classic “Imitation Game”, commonly known as the Turing test, assesses a machine's ability to exhibit intelligent behavior by checking whether a human evaluator can tell it apart from a human. What counts as success on the Turing test is not precisely agreed upon, and some AIs, including very early ones like ELIZA in the 1960s as well as modern ones like GPT-4, have sometimes been able to convincingly mimic a human.
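
As an illustration, here is a minimal sketch of how one might score a batch of imitation-game trials, assuming a hypothetical judge function that guesses which of two transcripts came from the human; the data format and the pass criterion are also assumptions, not part of Turing's proposal:

    import random
    from typing import Callable, List, Tuple

    def run_imitation_game(
        judge: Callable[[str, str], int],   # hypothetical: returns 0 or 1 for which transcript it thinks is human
        trials: List[Tuple[str, str]],      # (human_transcript, ai_transcript) pairs from parallel conversations
    ) -> float:
        """Return the fraction of trials in which the judge failed to identify the human."""
        fooled = 0
        for human_text, ai_text in trials:
            # Randomize presentation order so the judge cannot rely on position.
            if random.random() < 0.5:
                guess = judge(human_text, ai_text)   # human shown as option 0
                human_position = 0
            else:
                guess = judge(ai_text, human_text)   # human shown as option 1
                human_position = 1
            if guess != human_position:
                fooled += 1
        return fooled / len(trials)

    # Hypothetical pass criterion: judges do no better than chance.
    # pass_rate = run_imitation_game(my_judge, my_trials)
    # print("passes" if pass_rate >= 0.5 else "fails")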

Forecasting resolution criteria

Metaculus, a forecasting platform, has two questions related to predicting the date of AGI. The “resolution criteria” for both questions require an AI system to succeed at four tests of ability benchmarked against human performance.

This set of forecasting resolution criteria for a ‘weak AGI’ involves four tasks “easily completable by a typical college-educated human”:

  • Passing a Turing test of the type that would win the Loebner Silver Prize, which requires that the AI convince the judges that it is the human in a text-only conversation.

  • Scoring 90% or more on a version of the “Winograd Schema Challenge”, a multiple-choice test of questions that require commonsense knowledge about the world, on which humans also score 90% or more (a scoring sketch follows this list).

  • Scoring in the 75th percentile on the mathematics section of a standard SAT exam, using just images of the exam pages.

  • Exploring all 24 rooms in the Atari game "Montezuma's Revenge", using only visual inputs and standard controls, in less than the human equivalent of 100 hours of play.
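
For concreteness, a criterion like the Winograd Schema one above comes down to checking multiple-choice accuracy against a fixed threshold. Below is a minimal sketch of that check, assuming a hypothetical model_answer function and a list of (question, options, correct answer) triples:

    from typing import Callable, List, Tuple

    def meets_threshold(
        model_answer: Callable[[str, List[str]], int],  # hypothetical: returns the chosen option index
        questions: List[Tuple[str, List[str], int]],    # (question, options, correct_index) triples
        threshold: float = 0.90,                        # the 90% bar in the resolution criteria
    ) -> bool:
        """Check whether multiple-choice accuracy meets the threshold."""
        correct = sum(
            1 for question, options, answer in questions
            if model_answer(question, options) == answer
        )
        accuracy = correct / len(questions)
        print(f"Accuracy: {accuracy:.1%}")
        return accuracy >= threshold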

This set of forecasting resolution criteria for the first general AI system involves four tasks completable by “at least some humans”:

  • Passing a lengthy, adversarial Turing test in which the judges can exchange text, images, and audio with the participants.

  • Demonstrating general robotic capability by assembling a complex scale-model car when given human-readable instructions.

  • Scoring highly across a question-and-answer benchmark that spans many fields of expertise.

  • Solving interview-level programming problems from the APPS benchmark with high accuracy.

ARC

The Abstraction and Reasoning Corpus (ARC) was designed to test generalization in a way that is resistant to memorization. Its creator, François Chollet, argues that scoring as well as humans on these tests is a necessary condition for AGI.
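
In the publicly released ARC data, each task is a JSON file containing a handful of demonstration input/output grids under "train" and one or more held-out grids under "test". Below is a minimal scoring sketch, assuming one prediction per test input and a hypothetical solver function:

    import json
    from typing import Callable, Dict, List

    Grid = List[List[int]]  # grids of small integers (color indices), as in the public ARC data

    def solves_arc_task(path: str, solver: Callable[[List[Dict], Grid], Grid]) -> bool:
        """Return True if the (hypothetical) solver reproduces every test output exactly."""
        with open(path) as f:
            task = json.load(f)
        train_pairs = task["train"]              # demonstration input/output pairs
        for test_pair in task["test"]:
            prediction = solver(train_pairs, test_pair["input"])
            # ARC scoring is all-or-nothing: the predicted grid must match cell for cell.
            if prediction != test_pair["output"]:
                return False
        return True

Each task uses a different transformation, so a solver has to infer the rule from just a handful of demonstrations rather than relying on memorized patterns, which is the sense in which the benchmark resists memorization.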