Recognition: no theorem link
Process Matters More than Output for Distinguishing Humans from Machines
Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3
The pith
Measuring how decisions unfold distinguishes humans from AI agents more reliably than matching their final answers alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks.
What carries the argument
CogCAPTCHA30, a set of 30 cognitive tasks engineered to record process-level features such as response latencies and decision sequences even when final performance is matched between humans and agents.
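The discrimination pipeline this battery supports can be sketched in miniature: extract a process-level feature from each behavioral trace, score traces with it, and compute a rank-based AUC between groups. The feature choice (latency variance), the synthetic traces, and every number below are illustrative assumptions, not the paper's data or code.

```python
import random
import statistics

random.seed(0)

def latency_variance(trial_latencies):
    """Process-level feature: variance of per-trial response latencies."""
    return statistics.pvariance(trial_latencies)

# Hypothetical synthetic traces: humans and agents share the same mean
# latency, but humans are more variable trial to trial (an assumption).
humans = [[random.gauss(0.8, 0.30) for _ in range(50)] for _ in range(40)]
agents = [[random.gauss(0.8, 0.05) for _ in range(50)] for _ in range(40)]

features = [latency_variance(t) for t in humans] + \
           [latency_variance(t) for t in agents]
labels = [1] * len(humans) + [0] * len(agents)

def auc(scores, labels):
    """Rank-based AUC: P(random human's score > random agent's score)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

process_auc = auc(features, labels)
```

An AUC near 1.0 means the feature separates the groups almost perfectly even though the mean latency (the "output") is matched; an AUC near 0.5 is chance, which is exactly the falsification criterion stated below.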
Load-bearing premise
The extracted process features reflect genuine differences in how humans and machines generate behavior rather than artifacts of the specific tasks, data collection, or the classifiers chosen.
What would settle it
A new agent whose process-feature classifier AUC drops to chance level (near 0.5) when outputs are matched to human performance on the same 30 tasks would falsify the central claim.
original abstract
Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CogCAPTCHA30, a battery of 30 cognitive tasks, and reports that classifiers using process-level behavioral features distinguish humans from AI agents (frontier LLMs, Centaur, A-SFT, P-SFT) with mean AUC 0.88 even under output matching, outperforming performance metrics alone. It further shows that broad fine-tuning on human decisions improves process mimicry relative to off-the-shelf agents, with additional gains from task-specific process-level supervision, though these gains diminish under cross-task transfer.
Significance. If the central result holds after verification, the work offers an empirical alternative to output-focused Turing-style tests by demonstrating measurable process-level distinctions in a controlled battery, with direct relevance to AI detection and cognitive modeling. The inclusion of multiple agent regimes and the cross-task transfer analysis provide comparative breadth that strengthens the contribution beyond a single-task demonstration.
major comments (3)
- [Methods] The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.
- [Results] AUC reporting: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.
- [Discussion] Cross-task transfer: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include a feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.
minor comments (2)
- [Abstract] The abstract states a single mean AUC value without reporting variance, per-task breakdown, or confidence intervals, which would improve interpretability of the aggregate result.
- [Methods] Notation for the two fine-tuning regimes (A-SFT and P-SFT) is introduced without an explicit equation or pseudocode showing how the process-level loss differs from the action-level loss.
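The distinction the second minor comment asks for can be sketched as follows. This is one plausible form of the two objectives, not the paper's exact losses: the choice of a squared-error feature penalty and the weighting `lam` are assumptions.

```python
import math

def action_loss(action_probs, human_action):
    """A-SFT objective (assumed form): negative log-likelihood of the
    human's chosen action under the model's action distribution."""
    return -math.log(action_probs[human_action])

def process_loss(action_probs, human_action,
                 model_features, human_features, lam=1.0):
    """P-SFT objective (assumed form): the action-level term plus a
    squared-error penalty pulling the model's process features (e.g.,
    latency, switch rate) toward the human targets."""
    feature_gap = sum((m - h) ** 2
                      for m, h in zip(model_features, human_features))
    return action_loss(action_probs, human_action) + lam * feature_gap
```

When the model's process features already match the human targets, the two objectives coincide; any mismatch strictly increases the P-SFT loss, which is the sense in which it "directly optimizes process features."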
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address each of the major comments below and have made revisions to the manuscript to incorporate additional methodological details, quantitative evidence for output matching, and feature analysis for cross-task transfer.
point-by-point responses
Referee: The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.
Authors: We agree that more explicit detail is necessary for reproducibility and to rule out confounds. In the revised manuscript, we have expanded the Methods section to include comprehensive descriptions of the human and agent sample sizes, precise mathematical definitions and step-by-step extraction procedures for each process-level feature, and the complete protocol for enforcing output matching (including how performance equivalence was achieved and verified). Supplementary materials now include the code used for feature extraction and matching. These revisions should enable independent verification that the reported distinctions arise from process-level differences. [Revision: yes]
Referee: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.
Authors: We acknowledge the need for explicit quantitative confirmation of output matching. The revised manuscript now includes a dedicated table in the Results section that presents the performance metrics (such as accuracy and latency) for both humans and agents across the tasks, along with statistical comparisons demonstrating that these metrics were successfully equated prior to the process-feature analyses. This addition directly addresses the concern and strengthens the isolation of process features as the source of the discriminative signal. [Revision: yes]
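The kind of quantitative matching check this response promises can be sketched as a permutation test on group mean accuracy, where a large p-value is consistent with successful output matching. The per-participant accuracy values below are invented for illustration, not taken from the paper.

```python
import random
import statistics

random.seed(1)

def permutation_pvalue(a, b, n_perm=2000):
    """Two-sided permutation test on the difference of group means:
    shuffle group labels and count how often the shuffled difference
    is at least as large as the observed one."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-participant accuracies on one task (invented numbers).
human_acc = [0.82, 0.79, 0.85, 0.80, 0.78, 0.83]
agent_acc = [0.81, 0.80, 0.84, 0.79, 0.82, 0.80]
p_value = permutation_pvalue(human_acc, agent_acc)
```

A failure to reject (large p-value) does not prove equivalence, so a stronger version of this table would pair the test with an explicit equivalence bound on the allowed accuracy gap.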
Referee: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.
Authors: We agree that including a feature-ablation or importance analysis would help address potential concerns about task-specific artifacts. Accordingly, the revised manuscript incorporates such an analysis in the Discussion. This quantifies the extent to which the AUC is attributable to generalizable process features versus task-specific ones, providing greater insight into the stability of the observed distinctions under cross-task transfer. [Revision: yes]
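A minimal version of the requested analysis is permutation-style feature ablation: shuffle one feature's column across all traces, re-score, and measure the AUC drop. The two-feature synthetic data and the additive score below are assumptions standing in for the paper's features and trained classifiers.

```python
import random

random.seed(2)

def auc(scores, labels):
    """Rank-based AUC of a score against binary human/agent labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical two-feature dataset: feature 0 carries the human/agent
# signal, feature 1 is noise (pure illustration, not the paper's data).
rows, labels = [], []
for _ in range(60):
    rows.append([random.gauss(1.0, 0.2), random.gauss(0.0, 0.3)]); labels.append(1)
    rows.append([random.gauss(0.0, 0.2), random.gauss(0.0, 0.3)]); labels.append(0)

def score(row):
    return sum(row)  # stand-in for a trained classifier's decision value

baseline_auc = auc([score(r) for r in rows], labels)

def ablated_auc(feature_idx):
    """Shuffle one feature column across rows and re-measure the AUC;
    the size of the drop estimates how much signal that feature carries."""
    col = [r[feature_idx] for r in rows]
    random.shuffle(col)
    perturbed = [r[:feature_idx] + [c] + r[feature_idx + 1:]
                 for r, c in zip(rows, col)]
    return auc([score(r) for r in perturbed], labels)
```

Features whose ablation collapses the AUC toward 0.5 on one task but not on transfer tasks are exactly the candidates for the instrumentation artifacts the referee worries about.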
Circularity Check
No circularity: empirical AUC from independent feature classifiers
full rationale
The paper extracts process-level behavioral features from human and agent data on the CogCAPTCHA30 battery, trains standard classifiers, and reports mean AUC = 0.88 under output-matched conditions. This is a direct empirical separability measure with no reduction to self-definition, fitted inputs renamed as predictions, or self-citation chains. Feature definitions and classifier training are independent of the final discrimination claim; cross-task transfer results further demonstrate that the findings are not tautological. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Behavioral traces from cognitive tasks contain measurable process-level features that are diagnostic of human versus machine cognition even when final performance is matched.