pith. machine review for the scientific record.

arxiv: 2605.06524 · v2 · submitted 2026-05-07 · 💻 cs.AI

Recognition: no theorem link

Process Matters more than Output for Distinguishing Humans from Machines

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords human-machine discrimination · process-level features · cognitive tasks · behavioral mimicry · AI fine-tuning · LLM agents · CogCAPTCHA30

The pith

Measuring how decisions unfold distinguishes humans from AI agents more reliably than matching their final answers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built a battery of 30 cognitive tasks, called CogCAPTCHA30, that records not only the answers people and machines give but also the steps they take along the way. They show that features such as timing patterns and choice sequences let a simple classifier separate humans from frontier agents and fine-tuned models with a mean AUC of 0.88, even when the final outputs are made to look identical. Training large models on millions of human decisions brings their processes closer to human ones, and adding direct supervision on process targets improves mimicry further within each task. The advantage shrinks sharply when the same models face new tasks, because the learned process rules do not transfer.

Core claim

Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks.

What carries the argument

CogCAPTCHA30, a set of 30 cognitive tasks engineered to record process-level features such as response latencies and decision sequences even when final performance is matched between humans and agents.
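As a concrete illustration of that machinery, here is a minimal sketch of such a pipeline. The specific features (latency mean and variability, trial-to-trial latency change, choice switch rate) and the logistic-regression classifier are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of a process-feature discriminator; the feature
# definitions and model choice are assumptions, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def process_features(latencies, choices):
    """Summarize one session's trace: timing statistics plus a simple
    choice-sequence statistic (how often consecutive choices differ)."""
    latencies = np.asarray(latencies, dtype=float)
    choices = np.asarray(choices)
    switch_rate = float(np.mean(choices[1:] != choices[:-1])) if len(choices) > 1 else 0.0
    return np.array([
        latencies.mean(),                   # average response latency
        latencies.std(),                    # latency variability
        np.abs(np.diff(latencies)).mean(),  # trial-to-trial latency change
        switch_rate,                        # choice-sequence switching rate
    ])

def discrimination_auc(sessions, labels):
    """sessions: list of (latencies, choices) traces; labels: 1 = human, 0 = agent.
    Returns the cross-validated AUC of a logistic-regression discriminator,
    analogous to the per-task AUCs summarized above as a mean of 0.88."""
    X = np.stack([process_features(lat, ch) for lat, ch in sessions])
    y = np.asarray(labels)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```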

Load-bearing premise

The extracted process features reflect genuine differences in how humans and machines generate behavior rather than artifacts of the specific tasks, data collection, or the classifiers chosen.

What would settle it

A new agent whose process-feature classifier AUC drops to chance level (near 0.5) when outputs are matched to human performance on the same 30 tasks would falsify the central claim.
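That criterion is straightforward to operationalize. The sketch below, which assumes held-out classifier scores and human/agent labels per session are available, checks whether a bootstrap confidence interval for the AUC covers chance.

```python
# Hypothetical falsification check: does the process-feature AUC for a new
# agent sit at chance (0.5) once outputs are matched? Assumes held-out
# classifier scores and binary human/agent labels per session.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_at_chance(scores, labels, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample sessions with replacement
        if labels[idx].min() == labels[idx].max():
            continue  # skip degenerate resamples containing one class only
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    # The central claim is falsified if the 95% interval covers chance.
    return lo <= 0.5 <= hi, (lo, hi)
```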

Figures

Figures reproduced from arXiv: 2605.06524 by Mathew D. Hardy, Mayank Agrawal, Milena Rmus, Thomas L. Griffiths.

Figure 1. Process-level behavioral features provide stronger human-agent discriminative signal than task performance alone. A) CAPTCHA provides a motivating real-world example of output-process dissociation. Humans and frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro) achieve comparable CAPTCHA performance, yet differ significantly in process-level interaction features including sequential click patterns, d…
Figure 2. Task-aligned process supervision improves human-like behavior in-task, but its advantage diminishes under transfer. A) Mean absolute Cohen's d between model and human process-feature distributions for action-level fine-tuning (A-SFT), process-level fine-tuning (P-SFT), and Centaur (dashed reference line). P-SFT achieves the closest match to humans when evaluated on the process features explicitly optimized…
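For reference, Figure 2's metric compares feature distributions via Cohen's d. A standard pooled-standard-deviation implementation (the paper's exact pooling convention is an assumption here) is:

```python
# Absolute Cohen's d with pooled standard deviation; the paper's exact
# convention is assumed, not confirmed.
import numpy as np

def abs_cohens_d(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return abs(x.mean() - y.mean()) / np.sqrt(pooled_var)
```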
Original abstract

Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CogCAPTCHA30, a battery of 30 cognitive tasks, and reports that classifiers using process-level behavioral features distinguish humans from AI agents (frontier LLMs, Centaur, A-SFT, P-SFT) with mean AUC 0.88 even under output matching, outperforming performance metrics alone. It further shows that broad fine-tuning on human decisions improves process mimicry relative to off-the-shelf agents, with additional gains from task-specific process-level supervision, though these gains diminish under cross-task transfer.

Significance. If the central result holds after verification, the work offers an empirical alternative to output-focused Turing-style tests by demonstrating measurable process-level distinctions in a controlled battery, with direct relevance to AI detection and cognitive modeling. The inclusion of multiple agent regimes and the cross-task transfer analysis provide comparative breadth that strengthens the contribution beyond a single-task demonstration.

major comments (3)
  1. [Methods] The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.
  2. [Results] AUC reporting: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.
  3. [Discussion] Cross-task transfer: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include a feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.
minor comments (2)
  1. [Abstract] The abstract states a single mean AUC value without variance, a per-task breakdown, or confidence intervals; reporting these would improve interpretability of the aggregate result.
  2. [Methods] Notation for the two fine-tuning regimes (A-SFT and P-SFT) is introduced without an explicit equation or pseudocode showing how the process-level loss differs from the action-level loss.
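For illustration, one plausible formalization of the distinction the referee asks for, under assumed notation (a policy pi_theta being fine-tuned, a process-feature map phi over behavioral traces tau, a divergence d, and a weight lambda), might read as follows; this is a hypothetical reconstruction, not the paper's stated objective.

```latex
% Hypothetical formalization, not taken from the paper.
% A-SFT: cross-entropy on human actions a_t given task state s_t.
\mathcal{L}_{\text{A-SFT}}(\theta)
  = -\,\mathbb{E}_{(s_t,\, a_t) \sim \mathcal{D}_{\text{human}}}
      \bigl[ \log \pi_\theta(a_t \mid s_t) \bigr]

% P-SFT: the same action loss plus a penalty on the mismatch between
% model and human process-feature distributions, weighted by lambda.
\mathcal{L}_{\text{P-SFT}}(\theta)
  = \mathcal{L}_{\text{A-SFT}}(\theta)
  + \lambda\, d\!\bigl( \phi(\tau_\theta),\; \phi(\tau_{\text{human}}) \bigr)
```

Under this reading, A-SFT matches what humans do, while P-SFT additionally penalizes mismatch in how the behavior unfolds.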

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each of the major comments below and have made revisions to the manuscript to incorporate additional methodological details, quantitative evidence for output matching, and feature analysis for cross-task transfer.

Point-by-point responses
  1. Referee: The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.

    Authors: We agree that more explicit detail is necessary for reproducibility and to rule out confounds. In the revised manuscript, we have expanded the Methods section to include comprehensive descriptions of the human and agent sample sizes, precise mathematical definitions and step-by-step extraction procedures for each process-level feature, and the complete protocol for enforcing output matching (including how performance equivalence was achieved and verified). Supplementary materials now include the code used for feature extraction and matching. These revisions should enable independent verification that the reported distinctions arise from process-level differences. revision: yes

  2. Referee: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.

    Authors: We acknowledge the need for explicit quantitative confirmation of output matching. The revised manuscript now includes a dedicated table in the Results section that presents the performance metrics (such as accuracy and latency) for both humans and agents across the tasks, along with statistical comparisons demonstrating that these metrics were successfully equated prior to the process-feature analyses. This addition directly addresses the concern and strengthens the isolation of process features as the source of the discriminative signal. revision: yes

  3. Referee: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.

    Authors: We agree that including a feature-ablation or importance analysis would help address potential concerns about task-specific artifacts. Accordingly, the revised manuscript incorporates such an analysis in the Discussion. This quantifies the extent to which the AUC is attributable to generalizable process features versus task-specific ones, providing greater insight into the stability of the observed distinctions under cross-task transfer. revision: yes
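A minimal version of the requested analysis could use permutation importance on a held-out split. The sketch below is illustrative only; the feature set, classifier, and split are assumptions rather than the paper's pipeline.

```python
# Illustrative permutation-importance sketch for the requested ablation.
# The feature set, classifier, and split are assumed, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_drop_per_feature(X, y, feature_names, seed=0):
    """Fit a discriminator, then measure how much held-out AUC falls when
    each feature column is shuffled, breaking its link to the labels."""
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    base = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    drops = {}
    for j, name in enumerate(feature_names):
        X_perm = X_te.copy()
        rng.shuffle(X_perm[:, j])  # permute one feature across sessions
        auc = roc_auc_score(y_te, clf.predict_proba(X_perm)[:, 1])
        drops[name] = base - auc   # large drop = feature carries signal
    return base, drops
```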

Circularity Check

0 steps flagged

No circularity: empirical AUC from independent feature classifiers

Full rationale

The paper extracts process-level behavioral features from human and agent data on the CogCAPTCHA30 battery, trains standard classifiers, and reports mean AUC = 0.88 under output-matched conditions. This is a direct empirical separability measure with no reduction to self-definition, fitted inputs renamed as predictions, or self-citation chains. Feature definitions and classifier training are independent of the final discrimination claim; cross-task transfer results further demonstrate that the findings are not tautological. The derivation chain is checked against external benchmarks rather than referring back to itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that the 30 tasks elicit process signals separable from output and that the chosen agents and fine-tuning methods adequately test generalization; no new physical entities or first-principles derivations are introduced.

axioms (1)
  • domain assumption: Behavioral traces from cognitive tasks contain measurable process-level features that are diagnostic of human versus machine cognition even when final performance is matched.
    Invoked throughout the abstract as the basis for the CogCAPTCHA30 design and classifier results.

pith-pipeline@v0.9.0 · 5611 in / 1402 out tokens · 48930 ms · 2026-05-12T01:45:20.209359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 3 internal anchors
