Validated Hypotheses as a Lens for Human-Likeness Evaluation in AI Agents
Pith reviewed 2026-05-19 14:18 UTC · model grok-4.3
The pith
If AI agents are human-like, populations of them should reach the same conclusions as humans on established social science experiments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Validated hypotheses from social science serve as a lens for human-likeness evaluation because if an agent is human-like then a population of such agents should arrive at the same inferential conclusion as the human population when subjected to the same experimental protocol. The authors implement this through HumanStudy-Bench, which converts published studies into reusable simulation environments and computes a Probability Alignment Score for conclusion matching and an Effect Consistency Score for effect size matching. Evaluation of multiple models and agent designs on twelve replicated studies shows that performance polarizes between complete success and total failure, with design choices,
What carries the argument
HumanStudy-Bench, which converts replicated social science experiments into reusable simulation environments that measure population-level inferential agreement and effect-size consistency.
Load-bearing premise
The experimental protocols from social science can be translated into simulation environments that capture the intended behavioral signals without introducing new artifacts specific to how the agents are implemented.
What would settle it
Finding that even carefully configured agent designs produce low alignment scores on most of the twelve studies would indicate that the lens fails to detect human-likeness reliably.
Original abstract
We propose using validated behavioral hypotheses as a lens for evaluating human-likeness in LLM-based agents. Our key idea is simple: If an agent is human-like, a population of such agents should reach the same inferential conclusion as the human population when run through the same experiment. Decades of social science have produced many such validated findings, each anchored to concrete experimental protocols and robustly established through independent replication. This yields an evaluation that is objective, decomposable, and scalable. We operationalize this lens through HumanStudy-Bench, an open platform that turns published human-subject studies into reusable simulation environments and administers the evaluation to configurable agents. It scores agent-human alignment on two metrics: the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score (ECS) for effect-size agreement. We curated an initial suite of 12 studies whose hypotheses are robustly established through independent replication, and evaluated 10 models under 4 agent designs. Results show that agent responses polarize between full replication and complete failure; agent design influences alignment more than model scale, but its effect is non-monotonic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes evaluating human-likeness in LLM-based agents by testing whether populations of agents reach the same inferential conclusions as human subjects when subjected to the same experimental protocols from validated social science hypotheses. The authors introduce HumanStudy-Bench as an open platform to convert 12 replicated studies into reusable simulation environments, define Probability Alignment Score (PAS) for inferential agreement and Effect Consistency Score (ECS) for effect-size agreement, and report results from evaluating 10 models under 4 agent designs. Key findings include polarization of agent responses between full replication and complete failure, with agent design exerting a stronger but non-monotonic influence on alignment compared to model scale.
Significance. If the reported effects prove robust, the framework offers an objective, decomposable, and scalable alternative to existing agent evaluations by anchoring assessments to independently replicated behavioral findings rather than ad-hoc benchmarks. The open platform and focus on population-level inferential matching represent a constructive step toward more cognitively grounded testing. The curation of replicated studies and provision of reusable environments are particular strengths that could enable community extensions.
major comments (3)
- [§4 (HumanStudy-Bench and protocol translation)] The manuscript provides no explicit description of how the 12 experimental protocols are converted into prompts or simulation environments for LLM agents, including any scaffolding or output parsing steps. This is load-bearing for the central claim, as the skeptic's concern about LLM-specific artifacts (e.g., training-data leakage on classic high-visibility studies) cannot be evaluated without these details; the assumption that the translation preserves causal structure remains untested.
- [§5 (Results and evaluation)] The abstract and results report polarization between full replication and complete failure plus non-monotonic design effects, yet supply no statistical methods, population sizes per study, error analysis, or criteria for determining 'inferential conclusion' from agent outputs. Without these, the robustness of PAS and ECS cannot be verified, directly weakening the claim that matching conclusions demonstrates human-likeness.
- [§3 (Evaluation metrics)] The definitions of PAS and ECS are not accompanied by any analysis of sensitivity to prompt variations or implementation choices. Given that agent design is reported to matter more than scale, the absence of ablation on how design parameters interact with the simulation environments leaves the non-monotonic effect claim without mechanistic support.
minor comments (2)
- [Table 1 or equivalent listing the 12 studies] Include a column for the original human sample sizes and replication status to allow readers to gauge baseline strength.
- [Figure 2 (or results visualization)] Ensure axis labels and legends explicitly distinguish the 4 agent designs and clarify whether error bars represent variability across models or runs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments highlight important areas for improving transparency and robustness, and we will incorporate revisions to address them. We respond to each major comment below.
Point-by-point responses
-
Referee: [§4 (HumanStudy-Bench and protocol translation)] The manuscript provides no explicit description of how the 12 experimental protocols are converted into prompts or simulation environments for LLM agents, including any scaffolding or output parsing steps. This is load-bearing for the central claim, as the skeptic's concern about LLM-specific artifacts (e.g., training-data leakage on classic high-visibility studies) cannot be evaluated without these details; the assumption that the translation preserves causal structure remains untested.
Authors: We agree that explicit details on protocol translation are essential for replicability and to address concerns about artifacts such as training-data leakage. In the revised manuscript, we will expand §4 with a dedicated subsection describing the conversion of each of the 12 protocols into prompts and simulation environments. This will cover prompt templates, scaffolding, output parsing logic, and measures taken to preserve the original causal structure. We will also discuss how the focus on independently replicated studies reduces risks of leakage and include concrete examples of translated protocols.
Revision: yes
-
Referee: [§5 (Results and evaluation)] The abstract and results report polarization between full replication and complete failure plus non-monotonic design effects, yet supply no statistical methods, population sizes per study, error analysis, or criteria for determining 'inferential conclusion' from agent outputs. Without these, the robustness of PAS and ECS cannot be verified, directly weakening the claim that matching conclusions demonstrates human-likeness.
Authors: We acknowledge that greater methodological transparency is required to substantiate the reported polarization and non-monotonic effects. The revised manuscript will augment §5 with descriptions of the statistical methods for PAS and ECS, explicit population sizes (agent runs per study), error analysis or variance reporting, and the precise criteria used to determine inferential conclusions from agent outputs. These additions will allow independent verification of the metrics' robustness.
Revision: yes
-
Referee: [§3 (Evaluation metrics)] The definitions of PAS and ECS are not accompanied by any analysis of sensitivity to prompt variations or implementation choices. Given that agent design is reported to matter more than scale, the absence of ablation on how design parameters interact with the simulation environments leaves the non-monotonic effect claim without mechanistic support.
Authors: We recognize the value of sensitivity and ablation analyses to support the claim that agent design exerts a stronger influence than scale. While the main results demonstrate the effects, the revised version will include an analysis of PAS and ECS sensitivity to prompt variations and implementation choices, along with ablations examining interactions between design parameters and the simulation environments. This will provide additional mechanistic insight into the observed non-monotonic patterns.
Revision: yes
Circularity Check
No significant circularity; framework grounded in external replicated studies
The paper defines its human-likeness lens directly from independently replicated social-science hypotheses and protocols, operationalized via HumanStudy-Bench as a translation layer that scores PAS and ECS alignment. No equations, fitted parameters, or self-citations are load-bearing; the 12 curated studies are presented as externally validated through independent replication rather than derived from the present work. The central claim therefore remains self-contained against external benchmarks and does not reduce to any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Validated behavioral hypotheses from decades of social science are robust indicators of human inferential behavior.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
HUMANSTUDY-BENCH turns published human-subject studies into reusable simulation environments
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
PersonaLLM: Investigating the ability of large language models to express personality traits
URL https://openreview.net/forum?id=I9xE1Jsjfx. Hang Jiang, Xiajie Zhang, Xubo Cao, Cynthia Breazeal, Deb Roy, and Jad Kabbara. PersonaLLM: Investigating the ability of large language models to express personality traits. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Findings of the Association for Computational Linguistics: NAACL 2024, pp. 3605–36...
-
[2]
URL https://openreview.net/forum?id=u8VOQVzduP. Xuan Liu, Haoyang Shang, and Haojian Jin. Cobra: Programming cognitive bias in social agents using classic social science experiments, 2026. URL https://arxiv.org/abs/2509.13588. Marlene Lutz, Indira Sen, Georg Ahnert, Elisa Rogers, and Markus Strohmaier. The prompt makes the person(a): A systematic evaluati...
-
[3]
Priors via Maximum Entropy. To avoid introducing subjective bias, we select priors based on the Principle of Indifference (Jaynes, 2003). For a binary state $\theta$, the distribution maximizing Shannon entropy $H(\theta)$ is the uniform distribution, $P(\theta = 1) = P(\theta = 0) = 0.5$. This establishes the uninformative prior necessary for objective benchmarking.
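The maximum-entropy claim above can be checked numerically. The sketch below is illustrative only: the function name `bernoulli_entropy` and the 101-point grid are assumptions introduced here, not part of the paper or of HumanStudy-Bench.

```python
import math

def bernoulli_entropy(p: float) -> float:
    """Shannon entropy in bits of a binary state with P(theta = 1) = p."""
    if p in (0.0, 1.0):
        return 0.0  # a deterministic state carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

# Hypothetical grid of candidate priors from 0.00 to 1.00.
grid = [i / 100 for i in range(101)]

# The prior with the largest entropy on the grid.
best_prior = max(grid, key=bernoulli_entropy)
```

On this grid the maximizer is the uniform prior, which is the value the passage adopts for both the human and the agent state.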
-
[4]
Minimizing Bayes Risk. We define the loss function as the squared error loss with respect to the true alignment $A^*$: $L(\hat{A}, A^*) = (\hat{A} - A^*)^2$. The optimal estimator that minimizes the Bayes risk (expected posterior loss) is the conditional expectation (MMSE estimator):
$$\hat{A}_{\text{Bayes}} = \arg\min_{\hat{A}} \mathbb{E}_{\theta \mid D}\big[(\hat{A} - A^*)^2\big] = \mathbb{E}[A^* \mid D_h, D_a] \tag{8}$$
...
-
[5]
Derivation. Given the conditional independence of the human and agent generative processes (Appendix A.2):
$$\hat{A}_{\text{Bayes}} = P(\theta_h = \theta_a \mid D_h, D_a) = P(\theta_h = 1 \mid D_h)\,P(\theta_a = 1 \mid D_a) + P(\theta_h = 0 \mid D_h)\,P(\theta_a = 0 \mid D_a) = \pi_h \pi_a + (1 - \pi_h)(1 - \pi_a) = \hat{A}_{\text{PAS}} \tag{9}$$
This creates a closed loop: the PAS formula presented in the main text is exactly the Minimum Bayes Risk estimator under Maximum Ent...
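Equation (9) is simple enough to evaluate directly. The following is a minimal sketch, assuming the two posterior probabilities are already available as plain floats; the function name `pas` and the input validation are choices made here, and this is not the HumanStudy-Bench implementation.

```python
def pas(pi_h: float, pi_a: float) -> float:
    """Probability that the human and agent binary states agree (Eq. 9).

    pi_h: posterior probability that the effect holds in the human data.
    pi_a: posterior probability that the effect holds in the agent data.
    """
    for p in (pi_h, pi_a):
        if not 0.0 <= p <= 1.0:
            raise ValueError("posterior probabilities must lie in [0, 1]")
    # Agreement means both states are 1 or both are 0.
    return pi_h * pi_a + (1.0 - pi_h) * (1.0 - pi_a)

# Hypothetical inputs: strong human evidence, fairly strong agent evidence.
example = pas(0.9, 0.8)
```

One property worth noting: when either posterior is exactly 0.5, the score is 0.5 regardless of the other, so an uninformative result on one side cannot be rewarded or penalized.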
-
[6]
The MLE Estimator (Hard Threshold). The Maximum Likelihood Estimator (MLE) for the alignment relies on the indicator function $I(\cdot)$:
$$\hat{A}_{\text{MLE}} = I(L_h > 0)\,I(L_a > 0) + I(L_h \le 0)\,I(L_a \le 0) \tag{10}$$
This estimator is unbiased asymptotically but exhibits maximal variance at the decision boundary ($L \approx 0$). Since $\hat{A}_{\text{MLE}}$ behaves as a Bernoulli variable near the boundary, a marg...
-
[7]
The PAS Estimator (Soft Threshold). PAS can be understood as a shrinkage estimator using the logistic sigmoid function $\sigma(x) = (1 + e^{-x})^{-1}$:
$$\hat{A}_{\text{PAS}} = \sigma(L_h)\,\sigma(L_a) + (1 - \sigma(L_h))(1 - \sigma(L_a)) \tag{12}$$
PAS serves as a continuous relaxation of MLE. Note that $\lim_{k \to \infty} \sigma(kx) = I(x > 0)$; thus, PAS approaches MLE as evidence strength approaches infinity.
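To make the contrast between the hard and soft thresholds concrete, the sketch below evaluates Eq. (10) and Eq. (12) for an agent whose evidence sits just on either side of the boundary. It assumes $L_h$ and $L_a$ are real-valued evidence scores that the sigmoid maps to probabilities; the function names and the numeric inputs are hypothetical.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def alignment_mle(l_h: float, l_a: float) -> float:
    """Hard-threshold alignment (Eq. 10): 1.0 if the signs agree, else 0.0."""
    both_positive = l_h > 0 and l_a > 0
    both_non_positive = l_h <= 0 and l_a <= 0
    return 1.0 if (both_positive or both_non_positive) else 0.0

def alignment_pas(l_h: float, l_a: float) -> float:
    """Soft-threshold alignment (Eq. 12)."""
    s_h, s_a = sigmoid(l_h), sigmoid(l_a)
    return s_h * s_a + (1.0 - s_h) * (1.0 - s_a)

# Hypothetical case: clear human evidence, agent evidence near the boundary.
for l_a in (-0.01, 0.01):
    print(l_a, alignment_mle(3.0, l_a), round(alignment_pas(3.0, l_a), 4))
```

The hard threshold jumps from 0 to 1 across the boundary, while the soft score moves only slightly, which is the behavior the passage describes as a continuous relaxation.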
-
[8]
Variance Reduction via Delta Method. We prove PAS reduces variance using the Delta Method approximation $\mathrm{Var}(f(X)) \approx [f'(\mu)]^2 \sigma^2$. The derivative of the sigmoid at the boundary is $\sigma'(0) = 0.25$. Comparing the variance of the decision component:
$$\mathrm{Var}(\hat{A}_{\text{PAS}})\big|_{L \approx 0} \approx [\sigma'(0)]^2 \sigma^2 = 0.0625\,\sigma^2 \tag{13}$$
Conclusion: Provided the sampling noise is not catastrophic ($\sigma^2 < 4$), ...
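The delta-method approximation in Eq. (13) can be sanity-checked by simulation. This sketch draws evidence values centered on the boundary and compares the empirical variance of the soft decision with the prediction $0.0625\,s^2$, and with the hard indicator, whose Bernoulli variance at the boundary is 0.25. Here `s` denotes the noise standard deviation (named differently to avoid clashing with the sigmoid symbol); the Gaussian noise model, sample size, and seed are assumptions made for illustration.

```python
import math
import random

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def variance(values):
    """Population variance computed directly from the definition."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def boundary_variances(s: float, n: int = 50_000, seed: int = 0):
    """Simulate L ~ Normal(0, s^2) and return
    (variance of sigmoid(L), variance of indicator(L > 0), 0.0625 * s^2)."""
    rng = random.Random(seed)
    draws = [rng.gauss(0.0, s) for _ in range(n)]
    soft = [sigmoid(x) for x in draws]
    hard = [1.0 if x > 0 else 0.0 for x in draws]
    return variance(soft), variance(hard), 0.0625 * s * s

var_soft, var_hard, predicted = boundary_variances(s=0.2)
```

For small noise the simulated soft variance tracks the delta-method prediction closely; the approximation is first-order, so the agreement loosens as `s` grows.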
-
[9]
Inferential Signal vs. Noise: Human data contains variance from unobserved confounders irrelevant to the hypothesis. Distributional metrics prioritize matching this nuisance noise. PAS isolates the inferential signal (the strength of evidence for the hypothesis), rewarding agents that capture the causal mechanism even if they exhibit less variance than humans.
-
[10]
Scale Invariance: Effect sizes are scale-dependent (e.g., Cohen's $d$ vs. $\eta^2$). PAS normalizes these into a uniform probability space $[0, 1]$, allowing aggregation across heterogeneous study designs.
B.3 Aggregation Implementation
Scores are aggregated across the hierarchy (Test → Finding → Study → Benchmark) using variance-stabilizing transformations to ensure ...
-
[11]
Finding & Study Level (Variance Stabilization): We map probabilities to correlation space ($r_j = 2A_j - 1$) and apply the Fisher-z transformation (Fisher, 1921) to normalize the variance. Finding-level scores are the average of test z-scores; study-level scores are the average of finding z-scores. Both are mapped back to the $[0, 1]$ PAS scale via the inverse hy...
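A single level of this aggregation might look like the sketch below. Two points are assumptions rather than details from the source: the back-mapping is taken to be `tanh` (the inverse of the Fisher z-transform), since the passage is truncated at that point, and scores are clipped slightly inside (0, 1) so that `atanh` stays finite. The function name and the `eps` parameter are hypothetical.

```python
import math

def aggregate_fisher_z(scores, eps: float = 1e-6) -> float:
    """Average PAS-like scores in Fisher-z space and map back to [0, 1]."""
    if not scores:
        raise ValueError("need at least one score")
    z_values = []
    for a in scores:
        r = 2.0 * a - 1.0                       # probability -> correlation space
        r = max(-1.0 + eps, min(1.0 - eps, r))  # assumed clipping, keeps atanh finite
        z_values.append(math.atanh(r))          # Fisher z-transform
    z_mean = sum(z_values) / len(z_values)
    return (math.tanh(z_mean) + 1.0) / 2.0      # assumed inverse mapping

# Hypothetical test-level scores for one finding.
finding_score = aggregate_fisher_z([0.99, 0.6])
```

Averaging in z-space gives extreme scores more pull than a plain arithmetic mean would, so the result for the example above lands above the arithmetic mean of 0.795.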
-
[12]
Benchmark Level (Arithmetic Mean):We compute the unweighted arithmetic mean of study-level PAS. Given the heterogeneity of the studies—which span diverse cognitive and social domains—we treat each study as an independent unit of capability. This approach ensures equal representation across domains and prevents any single study with distinct statistical pr...
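Putting the levels together, the full Test → Finding → Study → Benchmark roll-up could be sketched as follows. The nested-dictionary layout, the function names, and the toy scores are all hypothetical; the z-space helper again assumes `tanh` as the back-mapping and a small clipping constant.

```python
import math
from statistics import fmean

def fisher_mean(scores, eps: float = 1e-6) -> float:
    """Mean of scores in Fisher-z space, mapped back to [0, 1] (assumed tanh inverse)."""
    z_values = [math.atanh(max(-1 + eps, min(1 - eps, 2 * a - 1))) for a in scores]
    return (math.tanh(fmean(z_values)) + 1) / 2

def benchmark_pas(benchmark) -> float:
    """benchmark: {study_id: {finding_id: [test-level PAS, ...]}}."""
    study_scores = []
    for findings in benchmark.values():
        # Test -> Finding, then Finding -> Study, both in z-space.
        finding_scores = [fisher_mean(tests) for tests in findings.values()]
        study_scores.append(fisher_mean(finding_scores))
    # Study -> Benchmark: unweighted arithmetic mean, one vote per study.
    return fmean(study_scores)

# Toy input with made-up scores, for illustration only.
toy = {
    "study_1": {"finding_1": [0.5, 0.5], "finding_2": [0.5]},
    "study_2": {"finding_1": [0.8]},
}
score = benchmark_pas(toy)
```

Because the top level is an unweighted mean, a study with many tests counts the same as a study with one, which matches the equal-representation rationale given in the passage.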
-
[13]
Extract the paper’s title, authors, and abstract
-
[14]
Identify all experiments or studies described in the paper
-
[15]
For each experiment, determine whether it can be replicated using LLM agents, based on the inclusion criteria below.
D.2.2 Inclusion Criteria
Criterion 1: Documentation Completeness
A study is retained only if full experimental details are documented, including:
- Materials (e.g,...
- [16]
-
[17]
Extract all statistical tests for each finding (significant, non-significant, marginal, interactions, follow-ups)
-
[18]
Include complete raw data for each test (means, SDs, sample sizes, differences).
For EACH study/experiment, extract:
D.3.3 Extraction Objectives
Objective 1: Study Structure
-
[19]
STUDY STRUCTURE:
- Study ID, name, phenomenon.
- Findings: list all findings with IDs (Finding 1, Finding 2, etc.) and their hypotheses.
- All sub-studies/scenarios/conditions.
Objective 2: Materials
-
[20]
MATERIALS:
- Actual text of questions, scenarios, instructions, stimuli.
- Item-level details: question text, response options, scales.
Objective 3: Participants
-
[21]
PARTICIPANTS:
- Sample sizes, demographics, group assignments, exclusion criteria.
Note: Participant characteristics are extracted when available and serve as optional reference priors for agent specification.
Objective 4: Statistical Results
-
[22]
STATISTICAL RESULTS:
- finding_id: Which finding this addresses (e.g., "Finding 1", "F2").
- test_name: Exact test name (e.g., "t-test", "ANOVA", "correlation").
- statistic: Complete string (e.g., "t(23) = 4.66", "F(1, 68) = 6.38", "t < 1").
- p_value: Exact value (e.g., "p < .001"...
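The extraction fields listed above suggest a simple record type. The sketch below is one possible shape, not the platform's actual schema: the class name `StatisticalResult`, the helper `parse_t_statistic`, and the regular expression are hypothetical, and only the four fields visible in the excerpt are modeled. The sample record reuses the format examples quoted above and does not correspond to a result from any study.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class StatisticalResult:
    finding_id: str   # e.g. "Finding 1" or "F2"
    test_name: str    # e.g. "t-test", "ANOVA", "correlation"
    statistic: str    # complete string as reported, e.g. "t(23) = 4.66"
    p_value: str      # as reported, e.g. "p < .001"

# Matches only fully specified t statistics such as "t(23) = 4.66".
_T_PATTERN = re.compile(r"^t\((\d+)\)\s*=\s*(-?\d+(?:\.\d+)?)$")

def parse_t_statistic(statistic: str) -> Optional[Tuple[int, float]]:
    """Return (degrees of freedom, t value), or None if the string is not
    a fully specified t statistic (for example "t < 1")."""
    match = _T_PATTERN.match(statistic.strip())
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

# Illustrative record assembled from the format examples, not real data.
record = StatisticalResult("Finding 1", "t-test", "t(23) = 4.66", "p < .001")
parsed = parse_t_statistic(record.statistic)
```

Keeping `statistic` and `p_value` as the reported strings preserves inexact reports such as "t < 1", which a numeric field would have to discard or guess at.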