How would we evaluate if an AI is an AGI?

AGI ("artificial general intelligence") broadly refers to an AI system that can do a wide range of cognitive tasks at a level comparable to humans. Determining whether a specific AI passes this bar to be considered “AGI” is no easy task, and history is littered with tasks which were once thought to be AI-complete, but were later achieved by narrow specialized systems.

Until recently, AI systems at the frontier of progress were specialized, or “narrow AI”: they outperformed humans in specific domains like board games but could not handle a broader variety of tasks. Significant progress was made on many problems, including computer vision, natural language understanding, and autonomous driving, yet in retrospect few people consider the systems most capable at these problems to be generally intelligent.

Since 2022, the development of LLMs and their successor multimodal systems has led some to argue that they constitute “AGI”, because they perform well on a broad variety of tasks, including many that they were not specifically trained for.

In this context, how does one determine whether a specific AI counts as “AGI”? The answer depends on 1) a set of criteria for AGI, and 2) an empirical test of whether a given system meets those criteria. There is no consensus on exactly what kinds of systems should be considered “AGI”. Below are tests that people have proposed to evaluate whether a system meets their conception of AGI, though for each of these tests there may still be disagreement over whether a system that passes it counts as AGI.

The Turing Test

Alan Turing's classic “Imitation Game”, commonly known as the Turing test, assesses a machine's ability to exhibit intelligent behavior by checking whether a human evaluator can tell it apart from a human. What counts as success on the Turing test is not precisely agreed upon, and some AIs, including very early ones like ELIZA in the 1960s as well as modern ones like GPT-4, have sometimes been able to convincingly mimic a human.
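
As an illustration, here is a minimal sketch of how one might score a batch of imitation-game trials, assuming a hypothetical judge function that guesses which of two transcripts came from the human; the data format and the pass criterion are also assumptions, not part of Turing's proposal:

    import random
    from typing import Callable, List, Tuple

    def run_imitation_game(
        judge: Callable[[str, str], int],   # hypothetical: returns 0 or 1 for which transcript it thinks is human
        trials: List[Tuple[str, str]],      # (human_transcript, ai_transcript) pairs from parallel conversations
    ) -> float:
        """Return the fraction of trials in which the judge failed to identify the human."""
        fooled = 0
        for human_text, ai_text in trials:
            # Randomize presentation order so the judge cannot rely on position.
            if random.random() < 0.5:
                guess = judge(human_text, ai_text)   # human shown as option 0
                human_position = 0
            else:
                guess = judge(ai_text, human_text)   # human shown as option 1
                human_position = 1
            if guess != human_position:
                fooled += 1
        return fooled / len(trials)

    # Hypothetical pass criterion: judges do no better than chance.
    # pass_rate = run_imitation_game(my_judge, my_trials)
    # print("passes" if pass_rate >= 0.5 else "fails")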

Forecasting resolution criteria

Metaculus, a forecasting platform, has two questions related to predicting the date of AGI. The “resolution criteria” for both questions require an AI system to succeed at four tests of ability benchmarked against human performance.

This set of forecasting resolution criteria for a ‘weak AGI’ involves four tasks “easily completable by a typical college-educated human”:

  • Passing a Turing test of the type that would win the Loebner Silver Prize, which requires that the AI convince the judges that it is the human in a text-only conversation.

  • Scoring 90% or more on a version of the “Winograd Schema Challenge”, a multiple-choice test of questions that require commonsense knowledge about the world, on which humans also score 90% or more (a scoring sketch follows this list).

  • Scoring in the 75th percentile on the mathematics section of a standard SAT exam, using just images of the exam pages.

  • Exploring all 24 rooms in the Atari game "Montezuma's Revenge", using only visual inputs and standard controls, in less than the human equivalent of 100 hours of play.
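
For concreteness, a criterion like the Winograd Schema one above comes down to checking multiple-choice accuracy against a fixed threshold. Below is a minimal sketch of that check, assuming a hypothetical model_answer function and a list of (question, options, correct answer) triples:

    from typing import Callable, List, Tuple

    def meets_threshold(
        model_answer: Callable[[str, List[str]], int],  # hypothetical: returns the chosen option index
        questions: List[Tuple[str, List[str], int]],    # (question, options, correct_index) triples
        threshold: float = 0.90,                        # the 90% bar in the resolution criteria
    ) -> bool:
        """Check whether multiple-choice accuracy meets the threshold."""
        correct = sum(
            1 for question, options, answer in questions
            if model_answer(question, options) == answer
        )
        accuracy = correct / len(questions)
        print(f"Accuracy: {accuracy:.1%}")
        return accuracy >= threshold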

This set of forecasting resolution criteria for the first general AI system involves four tasks completable by “at least some humans”:

  • Passing a lengthy, adversarial Turing test in which the judges can exchange text, images, and audio with the participants.

  • Demonstrating general robotic capability by assembling a complex scale-model car when given human-readable instructions.

  • Scoring highly across a question-and-answer benchmark that spans many fields of expertise.

  • Solving interview-level programming problems from the APPS benchmark with high accuracy.

ARC

The Abstraction and Reasoning Corpus (ARC) was designed to test generalization in a way that is resistant to memorization. Its creator, François Chollet, argues that scoring as well as humans on these tests is a necessary condition for AGI.
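
In the publicly released ARC data, each task is a JSON file containing a handful of demonstration input/output grids under "train" and one or more held-out grids under "test". Below is a minimal scoring sketch, assuming one prediction per test input and a hypothetical solver function:

    import json
    from typing import Callable, Dict, List

    Grid = List[List[int]]  # grids of small integers (color indices), as in the public ARC data

    def solves_arc_task(path: str, solver: Callable[[List[Dict], Grid], Grid]) -> bool:
        """Return True if the (hypothetical) solver reproduces every test output exactly."""
        with open(path) as f:
            task = json.load(f)
        train_pairs = task["train"]              # demonstration input/output pairs
        for test_pair in task["test"]:
            prediction = solver(train_pairs, test_pair["input"])
            # ARC scoring is all-or-nothing: the predicted grid must match cell for cell.
            if prediction != test_pair["output"]:
                return False
        return True

Each task uses a different transformation, so a solver has to infer the rule from just a handful of demonstrations rather than relying on memorized patterns, which is the sense in which the benchmark resists memorization.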