How would we evaluate if an AI is an AGI?

AGI ("artificial general intelligence") refers to an AI system that can do a wide range of cognitive tasks at a level comparable to humans. Determining whether a specific AI passes this bar to be considered “AGI” is no easy task, and history is littered with tasks which were once thought to be AI-complete.

Until recently, AI systems at the frontier of progress were specialized or “narrow AI” — outperforming humans in specific domains like board games, but unable to handle a broader variety of tasks. Significant progress was made on many problems, including computer vision, natural language understanding, and autonomous driving — but in retrospect, few people consider the systems that are best at these problems to be generally intelligent. Some believed that outperforming humans at tasks such as Go would require general human-level intelligence, but in hindsight the first systems to do so were not considered generally intelligent.1 Problems considered difficult enough to require AGI have been informally known as AI-complete or AI-hard.

Since 2022, the development of LLMs and their multimodal successors has led some to argue that these systems constitute “AGI”, because they perform well on tasks they were not specifically trained for.

In this context, how does one determine whether a specific AI counts as “AGI”? We outline below some proposed approaches to answering this question, but this is by no means an exhaustive list, and for any given test, people disagree on whether passing it is a sufficient bar for AGI.

The Turing Test

Alan Turing's classic "Imitation Game", commonly known as the Turing test, tests a machine’s ability to exhibit intelligent behavior by seeing whether a human evaluator can distinguish between it and a human. What counts as success on the Turing test is not precisely agreed upon, and some AIs — including very early ones like ELIZA in the 1960s, as well as modern ones like GPT-4 — have sometimes been able to convincingly mimic a human.
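
As a rough illustration of why success is underspecified, here is a minimal Python sketch (with made-up session data and an arbitrary threshold, none of it from any standard protocol) that scores a batch of judge verdicts; whether the machine "passes" depends entirely on the pass threshold one chooses:

```python
import random

def simulate_judge_verdicts(n_sessions: int, misidentify_rate: float) -> list[bool]:
    """Simulate n_sessions conversations: True means the judge mistook the
    machine for the human. (Illustrative only; a real evaluation would use
    live conversations rather than a fixed error rate.)"""
    return [random.random() < misidentify_rate for _ in range(n_sessions)]

def passes_turing_test(verdicts: list[bool], threshold: float) -> bool:
    """One possible success criterion: the machine 'passes' if it was judged
    human in at least `threshold` of sessions. The threshold is a free choice,
    which is part of why success on the Turing test is not precisely defined."""
    return sum(verdicts) / len(verdicts) >= threshold

verdicts = simulate_judge_verdicts(n_sessions=100, misidentify_rate=0.4)
print(passes_turing_test(verdicts, threshold=0.3))  # likely True with a lenient threshold
print(passes_turing_test(verdicts, threshold=0.5))  # likely False with a stricter one
```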

Forecasting resolution criteria

Metaculus, a forecasting platform, has two questions related to predicting the date of AGI. The “resolution criteria” for both questions require an AI system to succeed at four tests of ability benchmarked against human performance.

The resolution criteria for the ‘weak AGI’ question involve four tasks “easily completable by a typical college-educated human”:

  • Pass a Turing test of the type that would win the Loebner Silver Prize, which requires that judges cannot reliably distinguish the AI from a real human in text-based conversation.

  • Score 90% or more on a version of the “Winograd Schema Challenge” — a multiple-choice test whose questions require knowledge about the world to resolve an ambiguous pronoun (e.g., deciding what “it” refers to in “The trophy doesn’t fit in the suitcase because it is too big”) — where humans also score 90% or more.

  • Score in the 75th percentile on the mathematics section of a standard SAT exam, using just images of the exam pages.

  • Explore all 24 rooms in the Atari game "Montezuma's Revenge", using only visual inputs and standard controls, in less than the human equivalent of 100 hours of play.

The resolution criteria for the question about the first general AI system involve four tasks completable by “at least some humans”:
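
  • Pass a long, adversarial version of the Turing test in which participants can exchange text, images, and audio.

  • Demonstrate general robotic capabilities by satisfactorily assembling a complex scale-model car, given the parts and human-readable instructions.

  • Score at least 75% on every task, and at least 90% on average, across a broad question-answering benchmark covering many fields of expertise.

  • Score at least 90% on interview-level problems from a benchmark of programming challenges.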

t-AGI

The t-AGI framework, proposed by Richard Ngo, benchmarks the difficulty of a task by how long it would take a human to do it. For instance, an AI that can do tasks like recognizing objects in an image or answering trivia questions would be considered a "1-second AGI", because it can do tasks that take a human about one second, while an AI that can develop new apps or review scientific papers would be considered a "1-month AGI".
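
To make this concrete, here is a minimal Python sketch of one way such a benchmark could be scored; the task data, thresholds, and function names are illustrative assumptions, not part of Ngo's proposal:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    description: str
    human_seconds: float      # roughly how long the task takes a human
    ai_matched_human: bool    # whether the AI performed at a human-comparable level

def largest_t_agi(results: list[TaskResult], horizons: list[float],
                  required_success_rate: float = 0.9) -> float | None:
    """Return the longest horizon t (in seconds) such that the AI matches humans
    on at least `required_success_rate` of the tasks that take humans up to t.
    Illustrative only: the t-AGI framework is an informal benchmark, not a formula."""
    best = None
    for t in sorted(horizons):
        in_scope = [r for r in results if r.human_seconds <= t]
        if not in_scope:
            continue
        rate = sum(r.ai_matched_human for r in in_scope) / len(in_scope)
        if rate >= required_success_rate:
            best = t
    return best

results = [
    TaskResult("recognize an object in an image", 1, True),
    TaskResult("answer a trivia question", 1, True),
    TaskResult("write a working script for a small task", 3600, True),
    TaskResult("review a scientific paper", 30 * 24 * 3600, False),
]
print(largest_t_agi(results, horizons=[1, 3600, 30 * 24 * 3600]))  # 3600, i.e. a "1-hour AGI"
```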


  1. AlphaGo, however, led to the more general AlphaZero, which was able to play multiple board games, and then MuZero, which was able to play both board games and Atari games. ↩︎