
arxiv: 2605.15473 · v1 · pith:TFPELXQJnew · submitted 2026-05-14 · 💻 cs.CY

Validated Hypotheses as a Lens for Human-Likeness Evaluation in AI Agents

Pith reviewed 2026-05-19 14:18 UTC · model grok-4.3

classification 💻 cs.CY
keywords human-likeness evaluation · LLM agents · social science hypotheses · simulation environments · behavioral alignment · replicated experiments · agent design

The pith

If AI agents are human-like, populations of them should reach the same conclusions as humans on established social science experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes evaluating how human-like AI agents are by running populations of them through the same experiments that established key findings in social science. If the agents are truly human-like, groups of them should reach the same conclusions as groups of humans did in those studies. This approach turns decades of replicated behavioral research into an objective benchmark that can be scaled across many models and agent designs. It matters because current evaluations of AI often rely on subjective judgments or narrow tasks, whereas this method provides decomposable and replicable measures of alignment with human behavior.

Core claim

Validated hypotheses from social science serve as a lens for human-likeness evaluation: if an agent is human-like, then a population of such agents should arrive at the same inferential conclusion as the human population when subjected to the same experimental protocol. The authors implement this through HumanStudy-Bench, which converts published studies into reusable simulation environments and computes a Probability Alignment Score for conclusion matching and an Effect Consistency Score for effect-size matching. Evaluation of multiple models and agent designs on twelve replicated studies shows that performance polarizes between complete success and total failure, with agent design choices influencing alignment more than model scale, though not monotonically.

What carries the argument

HumanStudy-Bench, which converts replicated social science experiments into reusable simulation environments that measure population-level inferential agreement and effect-size consistency.

Load-bearing premise

The experimental protocols from social science can be translated into simulation environments that capture the intended behavioral signals without introducing new artifacts specific to how the agents are implemented.

What would settle it

Finding that even carefully configured agent designs produce low alignment scores on most of the twelve studies would indicate that the lens fails to detect human-likeness reliably.

Figures

Figures reproduced from arXiv: 2605.15473 by Guankai Zhai, Haojian Jin, Haoyang Shang, Xuan Liu, Yiwen Tu, Yuanjun Feng, Yunze Xiao, Zizhang Liu.

Figure 1: Overview of the HUMANSTUDY-BENCH engine. Given published human-subject studies, the engine filters and extracts experimental designs, statistical tests, and human ground-truth results into a reusable simulation environment. Practitioners design agents (base model with specification) and run them through reconstructed experimental protocols. Agent results are then evaluated against human ground truth via i…
Figure 2: Distribution of p-values across humans and agents (A4 context). The blue dashed line marks the standard statistical significance threshold (p = 0.05). The human data (far left) consistently yields highly significant results, with almost all probability mass tightly clustered near p ≈ 0. In contrast, agent simulations fail to consistently reproduce these effects, showing a much wider spread of p-values with …
Figure 3: Distributional analysis of simulation fidelity. (a) Effect-size correspondence for Claude Haiku 4.5 A2 vs. human. The gray diagonal marks perfect replication (y = x); the dashed line is the linear regression. Points are colored by significance (significant, p < 0.05; not significant) and sized by replication power. Marginal densities show agent effect sizes are flatter and wider than human ones. (b) Test-lev…
Figure 4: ECS decomposition into pattern correlation ρ and magnitude calibration C_b. (a) Each point is one agent plotted in the (ρ, C_b) plane; gray contours are ECS isolines. Agents cluster in ρ ≤ 0.31, C_b ∈ [0.35, 1.00]. The leaderboard best (Claude Haiku 4.5 A2) and worst (Grok 4.1 Fast A3) are highlighted. (b, c) ECS plotted against ρ and C_b across all agents. ECS aligns more tightly with ρ (r = +0.73) than with …
Figure 5: Decomposing Variance via Task Difficulty. (Left) The benchmark spans a wide spectrum of difficulty levels, refuting the notion that tasks are binary. (Right) The Hexbin analysis identifies a "Zone of Disagreement" (red center), where variance peaks. This indicates that the benchmark effectively disentangles models: in this zone, architectural choices, rather than task difficulty alone, determine success or f…
Figure 6: The Landscape of Idiosyncratic Capabilities. The scattered distribution of high (green) and low (red) alignment scores illustrates that capabilities are fragmented. No single agent universally dominates; instead, performance is highly specific to the interaction between agent and task type.
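
The Figure 4 caption describes ECS as decomposing into a pattern correlation ρ and a magnitude calibration C_b, but the captions here do not give the formula. One standard statistic with exactly this factorization is Lin's concordance correlation coefficient, ρ_c = ρ · C_b. The sketch below computes that decomposition for paired human and agent effect sizes. Treating ECS as this coefficient is an assumption made for illustration, not something the page confirms, and the function name is hypothetical.

```python
import math


def concordance_decomposition(human, agent):
    """Return (rho, c_b, ccc) for paired effect sizes.

    rho  : Pearson correlation (pattern agreement)
    c_b  : bias-correction factor in (0, 1] (magnitude/location agreement)
    ccc  : Lin's concordance correlation coefficient, rho * c_b
    ASSUMPTION: this is a stand-in for ECS, whose exact definition is not
    given in the excerpt. Population (1/n) moments are used throughout.
    """
    n = len(human)
    if n != len(agent) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_h = sum(human) / n
    mean_a = sum(agent) / n
    var_h = sum((x - mean_h) ** 2 for x in human) / n
    var_a = sum((y - mean_a) ** 2 for y in agent) / n
    if var_h == 0 or var_a == 0:
        raise ValueError("effect sizes must vary within each group")
    cov = sum((x - mean_h) * (y - mean_a) for x, y in zip(human, agent)) / n
    sd_product = math.sqrt(var_h * var_a)
    rho = cov / sd_product
    # C_b penalizes differences in spread and in mean between the two groups.
    c_b = 2 * sd_product / (var_h + var_a + (mean_h - mean_a) ** 2)
    return rho, c_b, rho * c_b


if __name__ == "__main__":
    # Hypothetical effect sizes, not taken from the paper.
    human_d = [0.2, 0.5, 0.8]
    agent_d = [0.5, 0.8, 1.1]  # same ordering, shifted upward
    print(concordance_decomposition(human_d, agent_d))
```

In the hypothetical example the agent reproduces the ordering of effects perfectly (ρ = 1) but overshoots every magnitude, so the penalty falls entirely on C_b. This is the kind of separation the (ρ, C_b) plane in Figure 4 appears to visualize.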
Original abstract

We propose using validated behavioral hypotheses as a lens for evaluating human-likeness in LLM-based agents. Our key idea is simple: If an agent is human-like, a population of such agents should reach the same inferential conclusion as the human population when run through the same experiment. Decades of social science have produced many such validated findings, each anchored to concrete experimental protocols and robustly established through independent replication. This yields an evaluation that is objective, decomposable, and scalable. We operationalize this lens through HumanStudy-Bench, an open platform that turns published human-subject studies into reusable simulation environments and administers the evaluation to configurable agents. It scores agent-human alignment on two metrics: the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score (ECS) for effect-size agreement. We curated an initial suite of 12 studies whose hypotheses are robustly established through independent replication, and evaluated 10 models under 4 agent designs. Results show that agent responses polarize between full replication and complete failure; agent design influences alignment more than model scale, but its effect is non-monotonic.
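
The appendix excerpts listed under the reference graph give PAS in closed form (equations 9 and 12), contrast it with a hard-threshold estimator (equation 10), and describe Fisher-z aggregation across tests. The sketch below puts those pieces together. It treats L_h and L_a as log-odds-style evidence strengths, so that π = σ(L), which is how the excerpts read but is not spelled out on this page. The clipping constant `eps` and the use of tanh to map back to [0, 1] are assumptions of this sketch, since the corresponding sentence in the excerpt is truncated. All function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pas(pi_h, pi_a):
    """Eq. (9): probability that the human and agent latent states agree."""
    return pi_h * pi_a + (1.0 - pi_h) * (1.0 - pi_a)


def pas_soft(l_h, l_a):
    """Eq. (12): PAS computed from evidence strengths through the sigmoid."""
    return pas(sigmoid(l_h), sigmoid(l_a))


def agreement_hard(l_h, l_a):
    """Eq. (10): 1.0 if both evidence values fall on the same side of 0."""
    return 1.0 if (l_h > 0) == (l_a > 0) else 0.0


def aggregate_fisher_z(scores, eps=1e-6):
    """Average PAS values in Fisher-z space and map the mean back to [0, 1].

    Follows the excerpt's r_j = 2 * A_j - 1 mapping. ASSUMPTIONS: r is
    clipped to (-1, 1) because atanh diverges at +/-1, and tanh is used as
    the inverse of the Fisher-z transform.
    """
    if not scores:
        raise ValueError("need at least one score")
    z_values = []
    for a in scores:
        r = 2.0 * a - 1.0
        r = max(-1.0 + eps, min(1.0 - eps, r))
        z_values.append(math.atanh(r))
    mean_z = sum(z_values) / len(z_values)
    return (math.tanh(mean_z) + 1.0) / 2.0


if __name__ == "__main__":
    # Hypothetical evidence values near the decision boundary.
    for l_a in (-0.1, 0.1):
        print(l_a, agreement_hard(2.0, l_a), pas_soft(2.0, l_a))
    print(aggregate_fisher_z([0.9, 0.6, 0.75]))
```

Near the boundary the hard estimator flips between 0 and 1 when the agent's evidence changes sign, while the soft score moves only slightly. This is the variance-reduction argument the excerpts make for PAS.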

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes evaluating human-likeness in LLM-based agents by testing whether populations of agents reach the same inferential conclusions as human subjects when subjected to the same experimental protocols from validated social science hypotheses. The authors introduce HumanStudy-Bench as an open platform to convert 12 replicated studies into reusable simulation environments, define Probability Alignment Score (PAS) for inferential agreement and Effect Consistency Score (ECS) for effect-size agreement, and report results from evaluating 10 models under 4 agent designs. Key findings include polarization of agent responses between full replication and complete failure, with agent design exerting a stronger but non-monotonic influence on alignment compared to model scale.

Significance. If the reported effects prove robust, the framework offers an objective, decomposable, and scalable alternative to existing agent evaluations by anchoring assessments to independently replicated behavioral findings rather than ad-hoc benchmarks. The open platform and focus on population-level inferential matching represent a constructive step toward more cognitively grounded testing. The curation of replicated studies and provision of reusable environments are particular strengths that could enable community extensions.

major comments (3)
  1. [§4 (HumanStudy-Bench and protocol translation)] The manuscript provides no explicit description of how the 12 experimental protocols are converted into prompts or simulation environments for LLM agents, including any scaffolding or output parsing steps. This is load-bearing for the central claim, as the skeptic's concern about LLM-specific artifacts (e.g., training-data leakage on classic high-visibility studies) cannot be evaluated without these details; the assumption that the translation preserves causal structure remains untested.
  2. [§5 (Results and evaluation)] The abstract and results report polarization between full replication and complete failure plus non-monotonic design effects, yet supply no statistical methods, population sizes per study, error analysis, or criteria for determining 'inferential conclusion' from agent outputs. Without these, the robustness of PAS and ECS cannot be verified, directly weakening the claim that matching conclusions demonstrates human-likeness.
  3. [§3 (Evaluation metrics)] The definitions of PAS and ECS are not accompanied by any analysis of sensitivity to prompt variations or implementation choices. Given that agent design is reported to matter more than scale, the absence of ablation on how design parameters interact with the simulation environments leaves the non-monotonic effect claim without mechanistic support.
minor comments (2)
  1. [Table 1 or equivalent listing the 12 studies] Include a column for the original human sample sizes and replication status to allow readers to gauge baseline strength.
  2. [Figure 2 (or results visualization)] Ensure axis labels and legends explicitly distinguish the 4 agent designs and clarify whether error bars represent variability across models or runs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving transparency and robustness, and we will incorporate revisions to address them. We respond to each major comment below.

  1. Referee: [§4 (HumanStudy-Bench and protocol translation)] The manuscript provides no explicit description of how the 12 experimental protocols are converted into prompts or simulation environments for LLM agents, including any scaffolding or output parsing steps. This is load-bearing for the central claim, as the skeptic's concern about LLM-specific artifacts (e.g., training-data leakage on classic high-visibility studies) cannot be evaluated without these details; the assumption that the translation preserves causal structure remains untested.

    Authors: We agree that explicit details on protocol translation are essential for replicability and to address concerns about artifacts such as training-data leakage. In the revised manuscript, we will expand §4 with a dedicated subsection describing the conversion of each of the 12 protocols into prompts and simulation environments. This will cover prompt templates, scaffolding, output parsing logic, and measures taken to preserve the original causal structure. We will also discuss how the focus on independently replicated studies reduces risks of leakage and include concrete examples of translated protocols.
    Revision: yes

  2. Referee: [§5 (Results and evaluation)] The abstract and results report polarization between full replication and complete failure plus non-monotonic design effects, yet supply no statistical methods, population sizes per study, error analysis, or criteria for determining 'inferential conclusion' from agent outputs. Without these, the robustness of PAS and ECS cannot be verified, directly weakening the claim that matching conclusions demonstrates human-likeness.

    Authors: We acknowledge that greater methodological transparency is required to substantiate the reported polarization and non-monotonic effects. The revised manuscript will augment §5 with descriptions of the statistical methods for PAS and ECS, explicit population sizes (agent runs per study), error analysis or variance reporting, and the precise criteria used to determine inferential conclusions from agent outputs. These additions will allow independent verification of the metrics' robustness.
    Revision: yes

  3. Referee: [§3 (Evaluation metrics)] The definitions of PAS and ECS are not accompanied by any analysis of sensitivity to prompt variations or implementation choices. Given that agent design is reported to matter more than scale, the absence of ablation on how design parameters interact with the simulation environments leaves the non-monotonic effect claim without mechanistic support.

    Authors: We recognize the value of sensitivity and ablation analyses to support the claim that agent design exerts a stronger influence than scale. While the main results demonstrate the effects, the revised version will include an analysis of PAS and ECS sensitivity to prompt variations and implementation choices, along with ablations examining interactions between design parameters and the simulation environments. This will provide additional mechanistic insight into the observed non-monotonic patterns.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework grounded in external replicated studies

The paper defines its human-likeness lens directly from independently replicated social-science hypotheses and protocols, operationalized via HumanStudy-Bench as a translation layer that scores PAS and ECS alignment. No equations, fitted parameters, or self-citations are load-bearing; the 12 curated studies are presented as externally validated through independent replication rather than derived from the present work. The central claim therefore remains self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on the domain assumption that replicated social-science findings serve as reliable proxies for human-likeness in AI agents; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Validated behavioral hypotheses from decades of social science are robust indicators of human inferential behavior.
    The entire evaluation lens rests on this premise being true for the selected studies.

pith-pipeline@v0.9.0 · 5753 in / 1163 out tokens · 41975 ms · 2026-05-19T14:18:13.225959+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    PersonaLLM: Investigating the ability of large language models to express personality traits

    URL https://openreview.net/forum?id=I9xE1Jsjfx. Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp. 3605–36...

  2. [2]

    false consensus effect

    URL https://openreview.net/forum?id=u8VOQVzduP. Xuan Liu, Haoyang Shang, and Haojian Jin. Cobra: Programming cognitive bias in social agents using classic social science experiments, 2026. URL https://arxiv.org/abs/2509.13588. Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, and Markus Strohmaier. The prompt makes the person(a): A systematic evaluati...

  3. [3]

    For a binary state $\theta$, the distribution maximizing Shannon Entropy $H(\theta)$ is the uniform distribution, $P(\theta = 1) = P(\theta = 0) = 0.5$

    Priors via Maximum Entropy. To avoid introducing subjective bias, we select priors based on the Principle of Indifference (Jaynes, 2003). For a binary state $\theta$, the distribution maximizing Shannon Entropy $H(\theta)$ is the uniform distribution, $P(\theta = 1) = P(\theta = 0) = 0.5$. This establishes the uninformative prior necessary for objective benchmarking.

  4. [4]

    Minimizing Bayes Risk. We define the loss function as the Squared Error Loss with respect to the true alignment $A^*$: $L(\hat{A}, A^*) = (\hat{A} - A^*)^2$. The optimal estimator that minimizes the Bayes Risk (Expected Posterior Loss) is the conditional expectation (MMSE estimator): $\hat{A}_{\mathrm{Bayes}} = \arg\min_{\hat{A}} \mathbb{E}_{\theta \mid D}\left[(\hat{A} - A^*)^2\right] = \mathbb{E}[A^* \mid D_h, D_a]$ (8) ...

  5. [5]

    Perspective II: The Frequentist View (Variance Reduction). In the Frequentist ontology, the latent truth states $\theta_h, \theta_a \in \{0, 1\}$ are fixed unknown constants

    Derivation. Given the conditional independence of Human and Agent generative processes (Appendix A.2): $\hat{A}_{\mathrm{Bayes}} = P(\theta_h = \theta_a \mid D_h, D_a) = P(\theta_h = 1 \mid D_h)\,P(\theta_a = 1 \mid D_a) + P(\theta_h = 0 \mid D_h)\,P(\theta_a = 0 \mid D_a) = \pi_h \pi_a + (1 - \pi_h)(1 - \pi_a) = \hat{A}_{\mathrm{PAS}}$ (9). This creates a closed loop: the PAS formula presented in the main text is exactly the Minimum Bayes Risk estimator under Maximum Ent...

  6. [6]

    Since $\hat{A}_{\mathrm{MLE}}$ behaves as a Bernoulli variable near the boundary, a marginal perturbation in noise causes a discrete jump, resulting in high instability: $\mathrm{Var}(\hat{A}_{\mathrm{MLE}})\big|_{L \approx 0} = 0.25$ (11)

    The MLE Estimator (Hard Threshold). The Maximum Likelihood Estimator (MLE) for the alignment relies on the indicator function $I(\cdot)$: $\hat{A}_{\mathrm{MLE}} = I(L_h > 0)\,I(L_a > 0) + I(L_h \le 0)\,I(L_a \le 0)$ (10). This estimator is unbiased asymptotically but exhibits maximal variance at the decision boundary ($L \approx 0$). Since $\hat{A}_{\mathrm{MLE}}$ behaves as a Bernoulli variable near the boundary, a marg...

  7. [7]

    Note that $\lim_{k \to \infty} \sigma(kx) = I(x > 0)$; thus, PAS approaches MLE as evidence strength approaches infinity

    The PAS Estimator (Soft Threshold). PAS can be understood as a Shrinkage Estimator using the logistic sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$: $\hat{A}_{\mathrm{PAS}} = \sigma(L_h)\,\sigma(L_a) + (1 - \sigma(L_h))(1 - \sigma(L_a))$ (12). PAS serves as a continuous relaxation of MLE. Note that $\lim_{k \to \infty} \sigma(kx) = I(x > 0)$; thus, PAS approaches MLE as evidence strength approaches infinity

  8. [8]

    The derivative of the sigmoid at the boundary is $\sigma'(0) = 0.25$

    Variance Reduction via Delta Method. We prove PAS reduces variance using the Delta Method approximation $\mathrm{Var}(f(X)) \approx [f'(\mu)]^2 \sigma^2$. The derivative of the sigmoid at the boundary is $\sigma'(0) = 0.25$. Comparing the variance of the decision component: $\mathrm{Var}(\hat{A}_{\mathrm{PAS}})\big|_{L \approx 0} \approx [\sigma'(0)]^2 \sigma^2 = 0.0625\,\sigma^2$ (13). Conclusion: Provided the sampling noise is not catastrophic ($\sigma^2 < 4$), ...

  9. [9]

    Noise: Human data contains variance from unobserved confounders irrelevant to the hypothesis

    Inferential Signal vs. Noise: Human data contains variance from unobserved confounders irrelevant to the hypothesis. Distributional metrics prioritize matching this nuisance noise. PAS isolates the inferential signal (the strength of evidence for the hypothesis), rewarding agents that capture the causal mechanism even if they exhibit less variance than humans

  10. [10]

    Scale Invariance: Effect sizes are scale-dependent (e.g., Cohen's $d$ vs. $\eta^2$). PAS normalizes these into a uniform probability space $[0, 1]$, allowing aggregation across heterogeneous study designs.

    B.3 Aggregation Implementation

    Scores are aggregated across the hierarchy (Test → Finding → Study → Benchmark) using variance-stabilizing transformations to ensure ...

  11. [11]

    Finding-level scores are the average of test z-scores; study-level scores are the average of finding z-scores

    Finding & Study Level (Variance Stabilization): We map probabilities to correlation space ($r_j = 2A_j - 1$) and apply the Fisher-z transformation (Fisher, 1921) to normalize the variance. Finding-level scores are the average of test z-scores; study-level scores are the average of finding z-scores. Both are mapped back to the $[0, 1]$ PAS scale via the inverse hy...

  12. [12]

    Given the heterogeneity of the studies—which span diverse cognitive and social domains—we treat each study as an independent unit of capability

    Benchmark Level (Arithmetic Mean): We compute the unweighted arithmetic mean of study-level PAS. Given the heterogeneity of the studies, which span diverse cognitive and social domains, we treat each study as an independent unit of capability. This approach ensures equal representation across domains and prevents any single study with distinct statistical pr...

  13. [13]

    Extract the paper’s title, authors, and abstract

  14. [14]

    Identify all experiments or studies described in the paper

  15. [15]

    documentation_complete

    For each experiment, determine whether it can be replicated using LLM agents, based on the inclusion criteria below.

    D.2.2 Inclusion Criteria

    Criterion 1: Documentation Completeness
    A study is retained only if full experimental details are documented, including:
    - Materials (e.g,...

  16. [16]

    Finding 1

    Label each finding as "Finding 1", "Finding 2", etc. (or use paper’s notation like "F1", "F2")

  17. [17]

    Extract all statistical tests for each finding (significant, non-significant, marginal, interactions, follow-ups)

  18. [18]

    For EACH study/experiment, extract: D.3.3 Extraction Objectives Objective 1: Study Structure

    Include complete raw data for each test (means, SDs, sample sizes, differences). For EACH study/experiment, extract:

    D.3.3 Extraction Objectives

    Objective 1: Study Structure

  19. [19]

    - Findings: list all findings with IDs (Finding 1, Finding 2, etc.) and their hypotheses

    STUDY STRUCTURE:
    - Study ID, name, phenomenon.
    - Findings: list all findings with IDs (Finding 1, Finding 2, etc.) and their hypotheses.
    - All sub-studies/scenarios/conditions.

    Objective 2: Materials

  20. [20]

    - Item-level details: question text, response options, scales

    MATERIALS:
    - Actual text of questions, scenarios, instructions, stimuli.
    - Item-level details: question text, response options, scales.

    Objective 3: Participants

  21. [21]

    Note: Participant characteristics are extracted when available and serve as optional reference priors for agent specification

    PARTICIPANTS:
    - Sample sizes, demographics, group assignments, exclusion criteria.
    Note: Participant characteristics are extracted when available and serve as optional reference priors for agent specification.

    Objective 4: Statistical Results

  22. [22]

    Finding 1

    STATISTICAL RESULTS:
    - finding_id: Which finding this addresses (e.g., "Finding 1", "F2").
    - test_name: Exact test name (e.g., "t-test", "ANOVA", "correlation").
    - statistic: Complete string (e.g., "t(23) = 4.66", "F(1, 68) = 6.38", "t < 1").
    - p_value: Exact value (e.g., "p < .001"...
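
The extraction fields listed above (finding_id, test_name, statistic, p_value) suggest a simple record type. The sketch below shows one way such a record might be stored and how the example report strings could be parsed into numbers. The class name, the regular expression, and the parser are hypothetical, and the excerpt does not say how HumanStudy-Bench itself stores or parses these values.

```python
import re
from dataclasses import dataclass


@dataclass
class StatResult:
    """One extracted test; field names follow the extraction prompt excerpt."""
    finding_id: str
    test_name: str
    statistic: str
    p_value: str


# Hypothetical pattern: a symbol, optional "(df, ...)", a comparison
# operator, and a number. It covers strings such as "t(23) = 4.66",
# "F(1, 68) = 6.38", "t < 1" and "p < .001".
_REPORT_RE = re.compile(
    r"^\s*(?P<symbol>[A-Za-z]+)\s*"
    r"(?:\((?P<df>[^)]*)\))?\s*"
    r"(?P<op>[=<>])\s*"
    r"(?P<value>-?\d*\.?\d+)\s*$"
)


def parse_report(text):
    """Split a report string into (symbol, degrees_of_freedom, operator, value)."""
    match = _REPORT_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized report string: {text!r}")
    df_text = match.group("df")
    if df_text and df_text.strip():
        df = tuple(float(part) for part in df_text.split(","))
    else:
        df = ()
    return match.group("symbol"), df, match.group("op"), float(match.group("value"))


if __name__ == "__main__":
    # Illustrative record assembled from the example strings in the prompt;
    # it does not correspond to a specific result in the paper.
    record = StatResult("Finding 1", "t-test", "t(23) = 4.66", "p < .001")
    print(parse_report(record.statistic))
    print(parse_report(record.p_value))
```

Keeping the original string alongside the parsed value preserves inequality reports such as "t < 1", which carry a bound rather than an exact statistic.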