What an AGI Framework Leaves Out

In March 2026, Google DeepMind published a paper proposing a framework for measuring progress toward artificial general intelligence. The work is rigorous, drawing on decades of research in cognitive science, psychology, and neuroscience to construct a taxonomy of ten cognitive faculties: perception, generation, attention, learning, memory, reasoning, metacognition, executive functions, problem solving, and social cognition. The proposed evaluation protocol measures system performance on each faculty, builds cognitive profiles against human baselines, and offers a structured way to map system capabilities onto the space of human cognition.

It's a serious contribution. It deserves the engagement it will receive.

It is also, in one specific and consequential way, incomplete.

What the Framework Measures

The cognitive taxonomy measures what a system can do. Across ten faculties, the framework asks how capable a system is, how its performance compares to a representative human sample, and where its strengths and weaknesses lie.

This is a meaningful question. Capability matters. The map of what a system can accomplish is a necessary part of any honest evaluation.

What the Framework Acknowledges but Sets Aside

The paper itself notes that propensities - what a system will tend to do, as opposed to what it can do - are a separate critical dimension. The authors describe propensities as important, deserving of robust evaluation, and beyond the scope of the cognitive framework they propose.

This is the gap.

By treating propensities as adjacent to the cognitive framework rather than as constitutive of intelligence, the proposal implies that AGI can be measured first by capability, with behavior measured separately.

We think that separation is incorrect.

Why Capability Without Judgment Isn't General Intelligence

A system that scores at the 99th percentile across all ten cognitive faculties has demonstrated what it can accomplish under controlled conditions. It has not demonstrated what it will do across real interactions, under conflicting signals, over extended reasoning, in environments where outcomes matter.

Those are not minor caveats. They are the conditions that determine whether intelligence is usable.

A system that can solve problems brilliantly in isolation but behaves inconsistently when the conditions change is not generally intelligent in any meaningful sense. It is powerfully capable and unreliably applied.

The paper acknowledges that the stochasticity of generative AI systems adds noise to evaluation results - that asking a system to complete the same task multiple times can produce wildly different results across repetitions. This is framed as a measurement difficulty. It is also a description of the gap.

If a system produces wildly different results across repetitions of the same task, the question is not only how to measure that variance. It is what that variance tells us about whether the system can be trusted to apply its capabilities consistently when it matters.
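To make the point concrete, here is a minimal sketch of how two systems with the same average score can differ sharply in consistency. The scores, the 0-to-1 scale, and the `consistency_profile` helper are all hypothetical, invented for illustration; nothing here comes from the DeepMind paper's protocol.

```python
from statistics import mean, pstdev


def consistency_profile(scores):
    """Summarize repeated runs of one task: average score and spread."""
    return {"mean": mean(scores), "spread": pstdev(scores)}


# Hypothetical scores (0 to 1) from five repetitions of the same task.
steady = [0.80, 0.80, 0.80, 0.80, 0.80]
erratic = [1.00, 0.60, 1.00, 0.40, 1.00]

steady_profile = consistency_profile(steady)
erratic_profile = consistency_profile(erratic)

# Both systems average the same score, but only one of them
# delivers that score every time it is asked.
print(steady_profile)
print(erratic_profile)
```

A capability-only profile that reports the mean would rate these two systems as equals. The spread is where the difference in trustworthiness shows up.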

A Different Framing

Trust depends on more than intelligence alone. In fact, we argue:

Trust = Judgment × Intelligence.

Capability without judgment is unreliable. Judgment without capability is empty. Neither alone produces something worth calling general intelligence. Trust requires both - operating together, governed in real time, applied consistently as responses are formed.
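A small sketch of why the relationship is written as a product rather than a sum. It assumes, purely for illustration, that judgment and intelligence can each be scored on a 0-to-1 scale; the `trust` function and the numbers are hypothetical, not a measurement method.

```python
def trust(judgment, intelligence):
    """Illustrative only: multiply two scores assumed to lie in [0, 1]."""
    if not (0.0 <= judgment <= 1.0 and 0.0 <= intelligence <= 1.0):
        raise ValueError("scores are assumed to lie in [0, 1]")
    return judgment * intelligence


# Hypothetical profiles: strength on one axis cannot make up
# for weakness on the other.
capable_but_erratic = trust(judgment=0.1, intelligence=0.99)
careful_but_weak = trust(judgment=0.99, intelligence=0.1)
balanced = trust(judgment=0.7, intelligence=0.7)

print(capable_but_erratic, careful_but_weak, balanced)
```

Under a sum, a very high score on one axis could mask a near-zero score on the other. Under a product, a near-zero score on either axis pulls trust toward zero, which is the claim the formula is meant to express.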

A framework that treats intelligence as the cognitive part and judgment as the behavioral part - one measured first, the other measured separately - gets the relationship wrong.

What This Means for AGI

If trust and judgment are essential to general intelligence rather than adjacent to it, then a framework for measuring progress toward AGI cannot defer behavioral consistency to a future companion paper. It has to be built in from the start.

The DeepMind paper is a starting point, as the authors themselves note. We agree, and we think the next step is to recognize that the question isn't only how capable a system is across ten faculties. It's whether that capability can be trusted to behave consistently when it matters.

A Simple Conclusion

Capability defines what a system can do. Behavior determines whether it can be trusted.

AGI requires both.

We agree. So we did something about it.

This perspective is informed by ongoing work at XyloIQ on how AI behavior can be stabilized and governed as responses are formed.

Reference: Burnell et al. (2026). Measuring Progress Toward AGI: A Cognitive Framework. Google DeepMind.