Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

Amir Konigsberg

arXiv: 2604.05631 · v1 · submitted 2026-04-07 · cs.AI · cs.ET · cs.HC

Pith reviewed 2026-05-10 19:28 UTC · model grok-4.3

classification cs.AI · cs.ET · cs.HC

keywords AI evaluation · Turing test · behavioral epistemology · cognitive revolution · intelligence attribution · computational processes · mechanism distinction · psychology analogy

The pith

AI evaluation remains trapped in behavioral tests that cannot distinguish systems with different internal processes, requiring a cognitive shift like psychology's.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that Turing's 1950 move to a behavioral test for machine thinking was not merely practical but established an epistemology that treats observable outputs as sufficient for intelligence claims. This commitment has shaped AI benchmarks and research for decades, making it impossible to ask whether two systems reach the same results through comparable computational mechanisms. A sympathetic reader would care because many assertions about AI capabilities rest on evidence that ignores these internal differences, which the paper says are essential for proper attribution of intelligence. The central proposal is that AI needs an epistemological transition, retaining behavioral data while adding recognition that it alone cannot support the field's stronger construct claims.

Core claim

Turing's replacement of questions about whether machines think with a test of output indistinguishability embedded a behavioral epistemology in AI that parallels psychology's earlier behaviorist phase. Just as behaviorism blocked inquiry into mental processes until the cognitive revolution, AI's focus on performance metrics prevents distinguishing systems that achieve identical outputs through fundamentally different computational processes. The paper claims this distinction matters for intelligence attribution and that the field must adopt a post-behaviorist stance: behavioral evidence remains necessary but insufficient, opening the way to ask about mechanism, internal organization, and the

What carries the argument

The behavioral epistemology inherited from the Turing test, which defines relevant evidence for intelligence solely in terms of observable outputs and thereby renders process-level questions unaskable.

If this is right

AI evaluation could begin to ask whether systems that produce the same outputs do so through equivalent internal organizations.
New methods would be needed to examine computational mechanisms alongside performance data.
Intelligence claims would become more constrained, applying only when both behavioral and process evidence align.
Certain questions about how AI systems achieve results would move from unaskable to central.
The field's infrastructure of benchmarks and leaderboards would need redesign to incorporate mechanism-sensitive tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This view could encourage hybrid benchmarks that probe not only final answers but the sequence of internal representations used to reach them.
It connects to ongoing debates about whether functional equivalence is enough for attributing understanding or reasoning.
One testable extension would be applying process audits to existing models to see if they reveal differences hidden by behavior alone.
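
As a rough illustration of what a "process audit" might look like, the toy sketch below compares two systems on both their outputs and their intermediate steps. It is not a protocol from the paper: the two sorting routines stand in for AI systems, the comparison trace stands in for "internal representations", and the names `audit`, `BENCHMARK`, and `Trace` are hypothetical. Real models expose nothing as tidy as this trace, so the sketch only shows the shape of the question, not a method for answering it.

```python
from typing import Callable, Dict, List, Tuple

# A trace is the ordered list of (left, right) values a system compared.
Trace = List[Tuple[int, int]]
System = Callable[[List[int]], Tuple[List[int], Trace]]


def bubble_sort(xs: List[int]) -> Tuple[List[int], Trace]:
    """Sort by repeatedly swapping adjacent out-of-order items."""
    xs = list(xs)
    trace: Trace = []
    for end in range(len(xs) - 1, 0, -1):
        for i in range(end):
            trace.append((xs[i], xs[i + 1]))
            if xs[i] > xs[i + 1]:
                xs[i], xs[i + 1] = xs[i + 1], xs[i]
    return xs, trace


def merge_sort(xs: List[int]) -> Tuple[List[int], Trace]:
    """Sort by recursively splitting and merging sorted halves."""
    trace: Trace = []

    def sort(part: List[int]) -> List[int]:
        if len(part) <= 1:
            return list(part)
        mid = len(part) // 2
        left, right = sort(part[:mid]), sort(part[mid:])
        merged: List[int] = []
        i = j = 0
        while i < len(left) and j < len(right):
            trace.append((left[i], right[j]))
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    return sort(xs), trace


def audit(a: System, b: System, inputs: List[List[int]]) -> Dict[str, bool]:
    """Report whether two systems agree on outputs and on internal steps."""
    same_output = all(a(x)[0] == b(x)[0] for x in inputs)
    same_trace = all(a(x)[1] == b(x)[1] for x in inputs)
    return {"behaviorally_equivalent": same_output,
            "process_equivalent": same_trace}


# Hypothetical benchmark inputs, chosen only for illustration.
BENCHMARK = [[3, 1, 2], [5, 4, 3, 2, 1], [1, 2, 3, 4], [2, 2, 1]]
report = audit(bubble_sort, merge_sort, BENCHMARK)
print(report)
```

On these inputs the two routines return identical outputs while their comparison traces differ, which is the kind of difference an output-only benchmark cannot register.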

Load-bearing premise

The historical parallel between psychology's behaviorist-to-cognitivist transition and AI's current evaluative practices is close enough that the same kind of epistemological shift is both feasible and required.

What would settle it

A controlled comparison of two AI systems that match on all standard behavioral benchmarks yet differ measurably in internal computational steps, where a process-focused evaluation assigns different intelligence status to each and this assignment proves more predictive of further capabilities.
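
The comparison described above can be sketched with a deliberately simple toy, under strong assumptions. Two hypothetical systems, `LookupSystem` and `RuleSystem`, score identically on an in-distribution benchmark but reach their answers differently. The `process` label is assigned by construction here, whereas a real evaluation would have to discover it through some form of internal inspection. The benchmark, the held-out probe, and all names are illustrative and do not come from the paper.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[int, int]

# In-distribution benchmark: sums of single-digit operands.
BENCHMARK: List[Pair] = [(a, b) for a in range(10) for b in range(10)]
# Held-out probe: operands outside the benchmark range.
OUT_OF_DISTRIBUTION: List[Pair] = [(12, 7), (30, 45), (100, 1)]


class LookupSystem:
    """Answers by retrieving stored benchmark answers (hypothetical)."""

    process = "retrieval"

    def __init__(self) -> None:
        self.table: Dict[Pair, int] = {(a, b): a + b for a, b in BENCHMARK}

    def answer(self, pair: Pair) -> Optional[int]:
        # Returns None for any pair that was never stored.
        return self.table.get(pair)


class RuleSystem:
    """Answers by computing the sum directly (hypothetical)."""

    process = "computation"

    def answer(self, pair: Pair) -> Optional[int]:
        a, b = pair
        return a + b


def accuracy(system, items: List[Pair]) -> float:
    """Fraction of items for which the system returns the correct sum."""
    correct = sum(system.answer(p) == p[0] + p[1] for p in items)
    return correct / len(items)


results = {
    system.process: {
        "benchmark": accuracy(system, BENCHMARK),
        "out_of_distribution": accuracy(system, OUT_OF_DISTRIBUTION),
    }
    for system in (LookupSystem(), RuleSystem())
}
print(results)
```

In this toy the benchmark scores match exactly, while the process label separates the system that generalizes to the held-out probe from the one that does not. Whether a process-level evaluation has comparable predictive value for real systems is the open empirical question.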

Original abstract

In 1950, Alan Turing proposed replacing the question "Can machines think?" with a behavioral test: if a machine's outputs are indistinguishable from those of a thinking being, the question of whether it truly thinks can be set aside. This paper argues that Turing's move was not only a pragmatic simplification but also an epistemological commitment, a decision about what kind of evidence counts as relevant to intelligence attribution, and that this commitment has quietly constrained AI research for seven decades. We trace how Turing's behavioral epistemology became embedded in the field's evaluative infrastructure, rendering unaskable a class of questions about process, mechanism, and internal organization that cognitive psychology, neuroscience, and related disciplines learned to ask. We draw a structural parallel to the behaviorist-to-cognitivist transition in psychology: just as psychology's commitment to studying only observable behavior prevented it from asking productive questions about internal mental processes until that commitment was abandoned, AI's commitment to behavioral evaluation prevents it from distinguishing between systems that achieve identical outputs through fundamentally different computational processes, a distinction on which intelligence attribution depends. We argue that the field requires an epistemological transition comparable to the cognitive revolution: not an abandonment of behavioral evidence, but a recognition that behavioral evidence alone is insufficient for the construct claims the field wishes to make. We articulate what a post-behaviorist epistemology for AI would involve and identify the specific questions it would make askable that the field currently has no way to ask.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Turing's 1950 behavioral test embedded an epistemological commitment in AI that prioritizes observable outputs over internal processes and mechanisms, rendering certain questions about computational distinctions unaskable; it draws a structural parallel to psychology's behaviorist-to-cognitivist transition and argues that AI requires a comparable epistemological shift to recognize behavioral evidence as necessary but insufficient for intelligence attribution claims, while outlining what a post-behaviorist approach would entail.

Significance. If the central argument holds, the paper could prompt the AI community to expand evaluation practices beyond benchmarks and Turing-style tests toward cognitively informed methods that distinguish process-level differences, potentially improving robustness and interdisciplinary integration with cognitive science. It offers a coherent historical framing and makes explicit an implicit stance in the field without relying on new empirical data or formal derivations.

major comments (2)

[historical parallel and motivation sections] The central analogy to psychology's cognitive revolution (detailed in the historical tracing and parallel sections) is load-bearing for the necessity claim, yet the manuscript does not sufficiently examine disanalogies such as AI systems being designed artifacts with inspectable internals versus opaque biological minds; this leaves open whether the same epistemological barriers apply or if AI already has tools (e.g., mechanistic interpretability) that psychology lacked.
[post-behaviorist epistemology section] In the section articulating the post-behaviorist epistemology, the paper identifies classes of questions that would become askable but provides no concrete operationalization, example evaluation protocol, or falsifiable prediction for how process distinctions would be measured in practice; this weakens the claim that the shift is both possible and required for current AI research.

minor comments (2)

The term 'behavioral epistemology' is used repeatedly without an explicit early definition or contrast to related concepts like 'behavioral testing' or 'black-box evaluation,' which could improve accessibility.
[abstract and introduction] The abstract and introduction could more explicitly state the scope (e.g., whether the argument applies primarily to large language models or to AI evaluation in general) to set reader expectations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope and presentation of our argument. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript without altering its core conceptual claims.

Point-by-point responses

Referee: [historical parallel and motivation sections] The central analogy to psychology's cognitive revolution (detailed in the historical tracing and parallel sections) is load-bearing for the necessity claim, yet the manuscript does not sufficiently examine disanalogies such as AI systems being designed artifacts with inspectable internals versus opaque biological minds; this leaves open whether the same epistemological barriers apply or if AI already has tools (e.g., mechanistic interpretability) that psychology lacked.

Authors: We agree that the disanalogy between designed, inspectable AI systems and opaque biological minds merits explicit discussion, as it could affect how the epistemological barriers translate. However, our argument centers on the field's dominant evaluative practices rather than technical feasibility: even with access to internals, standard AI evaluation (benchmarks, Turing-style tests, leaderboards) continues to prioritize behavioral equivalence and largely ignores process-level distinctions. Mechanistic interpretability remains an emerging research area, not a core component of intelligence attribution protocols. In revision, we will expand the historical parallel section to acknowledge this difference, explain why the behavioral commitment persists despite inspectability, and note how interpretability tools could support but do not yet replace the needed epistemological shift.

revision: partial

Referee: [post-behaviorist epistemology section] In the section articulating the post-behaviorist epistemology, the paper identifies classes of questions that would become askable but provides no concrete operationalization, example evaluation protocol, or falsifiable prediction for how process distinctions would be measured in practice; this weakens the claim that the shift is both possible and required for current AI research.

Authors: The post-behaviorist section is intentionally framed at the level of epistemology to identify the class of questions currently rendered unaskable, rather than prescribing specific methods. We recognize that without at least one illustrative protocol the practicality claim is harder to evaluate. In the revised manuscript we will add a brief example operationalization in that section, drawing on existing cognitive science techniques such as controlled intervention tests (e.g., lesioning or probing internal representations) and contrasting them with purely behavioral benchmarks, along with a falsifiable prediction that systems passing behavioral tests but failing process probes will show reduced robustness on out-of-distribution tasks.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper advances a philosophical argument by tracing Turing's behavioral test through historical parallels in psychology and AI literature. It contains no equations, fitted parameters, self-definitional constructs, or load-bearing self-citations that would reduce the central claim to its own inputs. The derivation relies on external historical sources and normative premises about evidence for intelligence attribution, so its support comes from outside the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on interpretive claims about the history of psychology and AI without introducing new entities or fitted parameters; its axioms are domain assumptions about what counts as evidence for intelligence.

axioms (2)

domain assumption: Turing's behavioral test constituted an epistemological commitment that has constrained AI research for seven decades.
Stated directly in the abstract as the foundational move that embedded behavioral epistemology in evaluative infrastructure.
domain assumption: Behavioral evidence alone is insufficient for the construct claims AI wishes to make about intelligence.
Central premise of the argument, presented as the reason a cognitive-style transition is required.

