pith. machine review for the scientific record.

arxiv: 2605.06524 · v2 · submitted 2026-05-07 · 💻 cs.AI

Recognition: no theorem link

Process Matters more than Output for Distinguishing Humans from Machines

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:45 UTC · model grok-4.3

classification 💻 cs.AI
keywords human-machine discrimination · process-level features · cognitive tasks · behavioral mimicry · AI fine-tuning · LLM agents · CogCAPTCHA30

The pith

Measuring how decisions unfold distinguishes humans from AI agents more reliably than matching their final answers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors built a battery of 30 cognitive tasks, called CogCAPTCHA30, that records not only the answers people and machines give but also the steps they take along the way. They show that features such as timing patterns and choice sequences let a simple classifier separate humans from frontier agents and fine-tuned models with a mean AUC of 0.88, even when the final outputs are made to look identical. Training large models on millions of human decisions brings their processes closer to human ones, and adding direct supervision on process targets improves mimicry further within each task. The advantage shrinks sharply when the same models face new tasks, because the learned process rules do not transfer.

Core claim

Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks.

What carries the argument

CogCAPTCHA30, a set of 30 cognitive tasks engineered to record process-level features such as response latencies and decision sequences even when final performance is matched between humans and agents.
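As a concrete illustration of that machinery, here is a minimal sketch of such a pipeline. The specific features (latency mean and variability, trial-to-trial latency change, choice switch rate) and the logistic-regression classifier are assumptions for exposition, not the paper's implementation.

```python
# Illustrative sketch of a process-feature discriminator; the feature
# definitions and model choice are assumptions, not the paper's pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def process_features(latencies, choices):
    """Summarize one session's trace: timing statistics plus a simple
    choice-sequence statistic (how often consecutive choices differ)."""
    latencies = np.asarray(latencies, dtype=float)
    choices = np.asarray(choices)
    switch_rate = float(np.mean(choices[1:] != choices[:-1])) if len(choices) > 1 else 0.0
    return np.array([
        latencies.mean(),                   # average response latency
        latencies.std(),                    # latency variability
        np.abs(np.diff(latencies)).mean(),  # trial-to-trial latency change
        switch_rate,                        # choice-sequence switching rate
    ])

def discrimination_auc(sessions, labels):
    """sessions: list of (latencies, choices) traces; labels: 1 = human, 0 = agent.
    Returns the cross-validated AUC of a logistic-regression discriminator,
    analogous to the per-task AUCs summarized above as a mean of 0.88."""
    X = np.stack([process_features(lat, ch) for lat, ch in sessions])
    y = np.asarray(labels)
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
```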

Load-bearing premise

The extracted process features reflect genuine differences in how humans and machines generate behavior rather than artifacts of the specific tasks, data collection, or the classifiers chosen.

What would settle it

A new agent whose process-feature classifier AUC drops to chance level (near 0.5) when outputs are matched to human performance on the same 30 tasks would falsify the central claim.
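That criterion is straightforward to operationalize. The sketch below, which assumes held-out classifier scores and human/agent labels per session are available, checks whether a bootstrap confidence interval for the AUC covers chance.

```python
# Hypothetical falsification check: does the process-feature AUC for a new
# agent sit at chance (0.5) once outputs are matched? Assumes held-out
# classifier scores and binary human/agent labels per session.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_at_chance(scores, labels, n_boot=10_000, seed=0):
    rng = np.random.default_rng(seed)
    scores, labels = np.asarray(scores), np.asarray(labels)
    n = len(scores)
    boot = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample sessions with replacement
        if labels[idx].min() == labels[idx].max():
            continue  # skip degenerate resamples containing one class only
        boot.append(roc_auc_score(labels[idx], scores[idx]))
    lo, hi = np.percentile(boot, [2.5, 97.5])
    # The central claim is falsified if the 95% interval covers chance.
    return lo <= 0.5 <= hi, (lo, hi)
```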

Figures

Figures reproduced from arXiv: 2605.06524 by Mathew D. Hardy, Mayank Agrawal, Milena Rmus, Thomas L. Griffiths.

Figure 1. Process-level behavioral features provide stronger human-agent discriminative signal than task performance alone. A) CAPTCHA provides a motivating real-world example of output-process dissociation. Humans and frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro) achieve comparable CAPTCHA performance, yet differ significantly in process-level interaction features including sequential click patterns, d…
Figure 2. Task-aligned process supervision improves human-like behavior in-task, but its advantage diminishes under transfer. A) Mean absolute Cohen's d between model and human process-feature distributions for action-level fine-tuning (A-SFT), process-level fine-tuning (P-SFT), and Centaur (dashed reference line). P-SFT achieves the closest match to humans when evaluated on the process features explicitly optimized…
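For reference, Figure 2's metric compares feature distributions via Cohen's d. A standard pooled-standard-deviation implementation (the paper's exact pooling convention is an assumption here) is:

```python
# Absolute Cohen's d with pooled standard deviation; the paper's exact
# convention is assumed, not confirmed.
import numpy as np

def abs_cohens_d(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return abs(x.mean() - y.mean()) / np.sqrt(pooled_var)
```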
Original abstract

Reliable human-machine discrimination is becoming increasingly important as large language models and autonomous agents are deployed in online settings. Existing approaches evaluate whether a system can produce behavior or responses indistinguishable from those of a human, following the emphasis on outputs as a criterion for intelligence proposed by Alan Turing. Cognitive science offers an alternative perspective: evaluating the process by which behavior is produced. To test whether cognitive processes can reliably distinguish humans from machines, we introduce CogCAPTCHA30, a battery of 30 cognitive tasks designed to elicit diagnostic process-level features even when task performance is matched. Across the battery, process-level features provide stronger discriminative signal than performance metrics alone, reliably distinguishing humans from agents even under output matching (mean process-feature classifier AUC = 0.88). To evaluate agentic process differences, we compare off-the-shelf frontier agents (Claude Sonnet 4.5, GPT-5, Gemini 2.5 Pro), Centaur (a language model fine-tuned on 10.7M human decisions), and two task-specific fine-tuning approaches applied to Qwen2.5-1.5B-Instruct: action-level supervised fine-tuning (A-SFT) and process-level fine-tuning (P-SFT), which directly optimizes process features. Broad fine-tuning on human decisions improves human-like task processes relative to off-the-shelf agents, while task-specific process-level supervision further improves behavioral mimicry. However, this advantage diminishes under cross-task transfer when supervised process targets do not naturally generalize across tasks. Explicit process-level supervision can improve human behavioral mimicry, but only if appropriate task-specific process representations are available, highlighting process specification as a bottleneck for achieving human-like cognitive processes in machines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CogCAPTCHA30, a battery of 30 cognitive tasks, and reports that classifiers using process-level behavioral features distinguish humans from AI agents (frontier LLMs, Centaur, A-SFT, P-SFT) with mean AUC 0.88 even under output matching, outperforming performance metrics alone. It further shows that broad fine-tuning on human decisions improves process mimicry relative to off-the-shelf agents, with additional gains from task-specific process-level supervision, though these gains diminish under cross-task transfer.

Significance. If the central result holds after verification, the work offers an empirical alternative to output-focused Turing-style tests by demonstrating measurable process-level distinctions in a controlled battery, with direct relevance to AI detection and cognitive modeling. The inclusion of multiple agent regimes and the cross-task transfer analysis provide comparative breadth that strengthens the contribution beyond a single-task demonstration.

major comments (3)
  1. [Methods] The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.
  2. [Results] AUC reporting: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.
  3. [Discussion] Cross-task transfer: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include a feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.
minor comments (2)
  1. [Abstract] The abstract states a single mean AUC value without variance, a per-task breakdown, or confidence intervals; reporting these would improve interpretability of the aggregate result.
  2. [Methods] Notation for the two fine-tuning regimes (A-SFT and P-SFT) is introduced without an explicit equation or pseudocode showing how the process-level loss differs from the action-level loss.
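For illustration, one plausible formalization of the distinction the referee asks for, under assumed notation (a policy pi_theta being fine-tuned, a process-feature map phi over behavioral traces tau, a divergence d, and a weight lambda), might read as follows; this is a hypothetical reconstruction, not the paper's stated objective.

```latex
% Hypothetical formalization, not taken from the paper.
% A-SFT: cross-entropy on human actions a_t given task state s_t.
\mathcal{L}_{\text{A-SFT}}(\theta)
  = -\,\mathbb{E}_{(s_t,\, a_t) \sim \mathcal{D}_{\text{human}}}
      \bigl[ \log \pi_\theta(a_t \mid s_t) \bigr]

% P-SFT: the same action loss plus a penalty on the mismatch between
% model and human process-feature distributions, weighted by lambda.
\mathcal{L}_{\text{P-SFT}}(\theta)
  = \mathcal{L}_{\text{A-SFT}}(\theta)
  + \lambda\, d\!\bigl( \phi(\tau_\theta),\; \phi(\tau_{\text{human}}) \bigr)
```

Under this reading, A-SFT matches what humans do, while P-SFT additionally penalizes mismatch in how the behavior unfolds.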

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each of the major comments below and have made revisions to the manuscript to incorporate additional methodological details, quantitative evidence for output matching, and feature analysis for cross-task transfer.

Point-by-point responses
  1. Referee: The manuscript provides insufficient detail on human and agent sample sizes, the precise definitions and extraction procedures for the process-level features (e.g., action counts, latency variance, sequential patterns), and the exact protocol used to enforce output matching across the 30 tasks. These omissions make it impossible to evaluate whether the reported mean AUC of 0.88 arises from genuine cognitive-process differences or from systematic implementation differences between human participants and token-based agents.

    Authors: We agree that more explicit detail is necessary for reproducibility and to rule out confounds. In the revised manuscript, we have expanded the Methods section to include comprehensive descriptions of the human and agent sample sizes, precise mathematical definitions and step-by-step extraction procedures for each process-level feature, and the complete protocol for enforcing output matching (including how performance equivalence was achieved and verified). Supplementary materials now include the code used for feature extraction and matching. These revisions should enable independent verification that the reported distinctions arise from process-level differences. revision: yes

  2. Referee: The headline claim that process features reliably distinguish humans from agents 'even under output matching' is load-bearing for the title and abstract, yet the text does not include quantitative evidence (tables or statistics) confirming that performance metrics were successfully equated between groups before computing the process-feature AUC. Without this, the superiority of process features over performance metrics cannot be isolated from potential confounds.

    Authors: We acknowledge the need for explicit quantitative confirmation of output matching. The revised manuscript now includes a dedicated table in the Results section that presents the performance metrics (such as accuracy and latency) for both humans and agents across the tasks, along with statistical comparisons demonstrating that these metrics were successfully equated prior to the process-feature analyses. This addition directly addresses the concern and strengthens the isolation of process features as the source of the discriminative signal. revision: yes

  3. Referee: The reported shrinkage of the process advantage under cross-task transfer is consistent with the possibility that many discriminative features are task-specific artifacts of instrumentation or agent generation mechanics rather than stable cognitive signatures. The manuscript should include feature-ablation or importance analysis to quantify how much of the AUC is driven by such potentially non-generalizable features.

    Authors: We agree that including a feature-ablation or importance analysis would help address potential concerns about task-specific artifacts. Accordingly, the revised manuscript incorporates such an analysis in the Discussion. This quantifies the extent to which the AUC is attributable to generalizable process features versus task-specific ones, providing greater insight into the stability of the observed distinctions under cross-task transfer. revision: yes
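A minimal version of the requested analysis could use permutation importance on a held-out split. The sketch below is illustrative only; the feature set, classifier, and split are assumptions rather than the paper's pipeline.

```python
# Illustrative permutation-importance sketch for the requested ablation.
# The feature set, classifier, and split are assumed, not the paper's.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_drop_per_feature(X, y, feature_names, seed=0):
    """Fit a discriminator, then measure how much held-out AUC falls when
    each feature column is shuffled, breaking its link to the labels."""
    rng = np.random.default_rng(seed)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    base = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    drops = {}
    for j, name in enumerate(feature_names):
        X_perm = X_te.copy()
        rng.shuffle(X_perm[:, j])  # permute one feature across sessions
        auc = roc_auc_score(y_te, clf.predict_proba(X_perm)[:, 1])
        drops[name] = base - auc   # large drop = feature carries signal
    return base, drops
```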

Circularity Check

0 steps flagged

No circularity: empirical AUC from independent feature classifiers

Full rationale

The paper extracts process-level behavioral features from human and agent data on the CogCAPTCHA30 battery, trains standard classifiers, and reports mean AUC = 0.88 under output-matched conditions. This is a direct empirical separability measure with no reduction to self-definition, fitted inputs renamed as predictions, or self-citation chains. Feature definitions and classifier training are independent of the final discrimination claim; cross-task transfer results further demonstrate that the findings are not tautological. The derivation chain is checked against external benchmarks rather than referring back to itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical assumption that the 30 tasks elicit process signals separable from output and that the chosen agents and fine-tuning methods adequately test generalization; no new physical entities or first-principles derivations are introduced.

axioms (1)
  • domain assumption: Behavioral traces from cognitive tasks contain measurable process-level features that are diagnostic of human versus machine cognition even when final performance is matched.
    Invoked throughout the abstract as the basis for the CogCAPTCHA30 design and classifier results.

pith-pipeline@v0.9.0 · 5611 in / 1402 out tokens · 48930 ms · 2026-05-12T01:45:20.209359+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 3 internal anchors
