pith. machine review for the scientific record.

arxiv: 2603.25620 · v3 · submitted 2026-03-26 · 💻 cs.CL

Recognition: no theorem link

PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:26 UTC · model grok-4.3

classification 💻 cs.CL
keywords persona agents · consistency evaluation · multi-turn questioning · LLM evaluation · internal consistency · external consistency · retest consistency · interrogation framework

The pith

Persona agents exhibit more contradictions than humans when tested with chained multi-turn questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PICon, a framework that applies logically chained questioning drawn from interrogation principles to test persona agents for reliability over extended interactions. It assesses three dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). When the framework is applied to seven groups of LLM-based persona agents and their responses are compared with those of 63 human participants, the agents produce more contradictions and evasive answers than the human baseline. This matters because persona agents are increasingly used as scalable stand-ins for humans in research, services, and simulations. The work supplies both the evaluation method and evidence that current agents fall short of the consistency needed for trustworthy substitution.

Core claim

PICon applies a multi-turn interrogation protocol of logically chained questions to persona agents. Under this protocol, even systems previously described as highly consistent fail to match human performance on internal consistency, external consistency with facts, and retest consistency under repetition, instead producing contradictions and evasive responses.

What carries the argument

PICon, the multi-turn interrogation framework that chains questions logically to probe persona agents across internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition).
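
The abstract gives the three definitions but not the mechanics of the chaining. A minimal sketch of how such a loop and its per-transcript scoring might be structured, where `ask_agent`, `follow_up`, `contradicts`, and `conflicts_with_facts` are hypothetical stand-ins rather than the paper's released API:

```python
# Hypothetical sketch of a PICon-style chained interrogation loop.
# `ask_agent` stands in for any callable that sends a prompt to a
# persona agent and returns its reply; it is not the paper's API.

from typing import Callable, List, Tuple

def interrogate(ask_agent: Callable[[str], str],
                seed_question: str,
                follow_up: Callable[[str, str], str],
                turns: int = 10) -> List[Tuple[str, str]]:
    """Run one logically chained multi-turn session.

    Each follow-up question is derived from the previous answer,
    so later turns can probe claims made in earlier ones.
    """
    transcript = []
    question = seed_question
    for _ in range(turns):
        answer = ask_agent(question)
        transcript.append((question, answer))
        question = follow_up(question, answer)  # chain on the last answer
    return transcript

def consistency_scores(transcript, contradicts, conflicts_with_facts):
    """Score two of the three PICon dimensions over one transcript.

    internal: fraction of answer pairs free of self-contradiction
    external: fraction of answers consistent with known facts
    (retest consistency requires re-running the same session and
    comparing answers across runs, so it is scored separately)
    """
    answers = [a for _, a in transcript]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    internal = 1 - sum(contradicts(a, b) for a, b in pairs) / max(len(pairs), 1)
    external = 1 - sum(conflicts_with_facts(a) for a in answers) / len(answers)
    return {"internal": internal, "external": external}
```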

If this is right

  • Persona agents cannot be treated as reliable human substitutes without passing multi-turn consistency checks.
  • Single-prompt evaluations are insufficient because they miss contradictions that emerge only under chained questioning.
  • The three consistency dimensions provide a concrete benchmark that future persona systems must meet.
  • Developers can use the framework to identify and reduce evasion and contradiction patterns in new agents (see the sketch after this list).
  • Human response baselines become a required reference point for validating any agent intended for participant replacement.
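
On that developer-tooling point, a crude sketch of what an evasion and contradiction screen over a transcript could look like. `nli_judge` is a hypothetical callable (any NLI model or LLM judge could fill the role), and the evasion markers are illustrative only:

```python
# Illustrative markers only; a real screen would need a calibrated judge.
EVASION_MARKERS = ("i can't say", "i'd rather not", "it depends", "no comment")

def flag_evasions(answers):
    """Return indices of answers that look evasive (crude keyword screen)."""
    return [i for i, a in enumerate(answers)
            if any(m in a.lower() for m in EVASION_MARKERS)]

def flag_contradictions(answers, nli_judge):
    """Return answer-index pairs the judge labels as contradictory.

    `nli_judge(a, b)` is assumed to return one of "entailment",
    "neutral", or "contradiction"; it is a hypothetical interface.
    """
    flagged = []
    for i in range(len(answers)):
        for j in range(i + 1, len(answers)):
            if nli_judge(answers[i], answers[j]) == "contradiction":
                flagged.append((i, j))
    return flagged
```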

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chained-question approach could be adapted to test other LLM behaviors such as long-term memory or adherence to instructions over time.
  • Persistent gaps versus humans may point to architectural limits in current models that cannot be fixed by prompt engineering alone.
  • Widespread adoption of PICon-style testing could shift industry practice toward consistency-constrained training objectives.
  • Applying the protocol in live user settings might reveal how consistency levels affect trust and engagement with persona agents.

Load-bearing premise

The specific multi-turn interrogation protocol measures consistency fairly, without introducing biases unique to how LLMs respond to prompts as opposed to how humans answer questions.
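
One way to stress this premise is a paraphrase-invariance control: if the protocol itself biases LLMs, meaning-preserving rewordings of the same question should shift answers. A minimal sketch, assuming hypothetical `ask_agent` and `same_answer` helpers (the latter could be exact match, embedding similarity, or an LLM judge):

```python
def paraphrase_stability(ask_agent, question, paraphrases, same_answer):
    """Fraction of paraphrases that elicit the same answer as the original.

    Values well below 1.0 would suggest the protocol is measuring
    prompt sensitivity rather than persona consistency.
    """
    baseline = ask_agent(question)
    return sum(same_answer(baseline, ask_agent(p))
               for p in paraphrases) / len(paraphrases)
```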

What would settle it

A new persona agent that maintains full consistency across all three dimensions when run through the exact PICon chained-question protocol would undermine the claim that agents systematically fall short of human baselines.

Original abstract

Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: no matter how elaborate a fabricated identity, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes PICon, a multi-turn interrogation framework for evaluating consistency in LLM-based persona agents along three dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). It applies logically chained questioning to seven groups of persona agents and 63 human participants, finding that even agents previously reported as highly consistent fail to meet human baselines and exhibit contradictions and evasive responses.

Significance. If the central comparison holds after methodological details are supplied, the work supplies a practical, interrogation-inspired protocol and open code/demo for validating persona agents before deployment as human proxies. This directly addresses a growing need in HCI, social simulation, and AI evaluation, with the human baseline comparison providing a falsifiable external anchor rather than purely self-referential metrics.

major comments (2)
  1. [Methods] The manuscript provides only high-level descriptions of the PICon protocol and the seven agent groups; it lacks concrete details on question generation for each consistency dimension, exact statistical tests and effect-size reporting, controls for prompt sensitivity or order effects, and the precise procedure for matching human and agent response formats. These omissions prevent verification that the reported failure of agents to meet human baselines is not an artifact of protocol design.
  2. [Results] The claim that 'even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions' requires explicit per-dimension scores, variance estimates, and the identity of the prior systems being referenced; without these, the strength of the cross-agent comparison cannot be assessed.
minor comments (2)
  1. [Abstract] The abstract states that source code and an interactive demo are provided at a GitHub link, but the manuscript does not include a reproducibility checklist or exact version numbers of the evaluated LLMs.
  2. [Introduction] Notation for the three consistency dimensions is introduced without a compact table summarizing their operational definitions and measurement formulas.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive feedback on our manuscript. We believe the suggested revisions will significantly enhance the clarity and reproducibility of the PICon framework. Below, we provide point-by-point responses to the major comments and outline the changes we will implement in the revised version.

Point-by-point responses
  1. Referee: [Methods] The manuscript provides only high-level descriptions of the PICon protocol and the seven agent groups; it lacks concrete details on question generation for each consistency dimension, exact statistical tests and effect-size reporting, controls for prompt sensitivity or order effects, and the precise procedure for matching human and agent response formats. These omissions prevent verification that the reported failure of agents to meet human baselines is not an artifact of protocol design.

    Authors: We agree with the referee that the Methods section requires more concrete details to enable full verification of our findings. In the revised manuscript, we will provide: specific templates and examples for generating questions in each consistency dimension (internal, external, and retest); the precise statistical tests used, including any multiple comparison corrections, along with effect size reporting (e.g., Cohen's d); descriptions of controls for prompt sensitivity, such as testing with varied prompt phrasings, and for order effects via randomization of question sequences; and the exact procedure for aligning human and agent response formats, including any post-processing steps to ensure comparability. These enhancements will address concerns about potential artifacts in the protocol design. revision: yes
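
The statistical reporting promised here is standard. A minimal sketch of Cohen's d with a pooled standard deviation and a permutation test on mean consistency scores; the inputs are placeholder arrays, not the paper's data:

```python
import numpy as np

def cohens_d(a, b):
    """Effect size for two score samples, pooled-SD variant."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) +
                      (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

def permutation_p(a, b, n_iter=10_000, seed=0):
    """Two-sided permutation p-value for the difference in means."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    combined = np.concatenate([a, b])
    count = 0
    for _ in range(n_iter):
        rng.shuffle(combined)
        diff = abs(combined[:len(a)].mean() - combined[len(a):].mean())
        count += diff >= observed
    return count / n_iter
```

With agent scores as `a` and human scores as `b`, a negative d and small p would quantify the gap the abstract reports.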

  2. Referee: [Results] The claim that 'even systems previously reported as highly consistent fail to meet the human baseline across all three dimensions' requires explicit per-dimension scores, variance estimates, and the identity of the prior systems being referenced; without these, the strength of the cross-agent comparison cannot be assessed.

    Authors: We concur that the Results section should include more explicit quantitative details to support the central claim. The revised version will feature a comprehensive table reporting per-dimension consistency scores (means and standard deviations) for all seven agent groups and the 63 human participants. Additionally, we will specify the prior systems referenced in the claim, citing the relevant literature where high consistency was previously reported, and include statistical analyses comparing them to the human baseline. This will allow for a clearer assessment of the cross-agent comparisons. revision: yes
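
A sketch of how such a per-dimension summary table could be produced; the column names are illustrative, not the paper's schema:

```python
import pandas as pd

def summarize(scores: pd.DataFrame) -> pd.DataFrame:
    """`scores` has one row per participant or agent run, with columns:
    group, internal, external, retest. Returns mean and SD per group."""
    return (scores
            .groupby("group")[["internal", "external", "retest"]]
            .agg(["mean", "std"])
            .round(3))
```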

Circularity Check

0 steps flagged

No significant circularity: independent human benchmark anchors all claims

Full rationale

The paper defines the PICon interrogation protocol, applies it uniformly to seven agent groups and 63 human participants, and reports direct empirical comparisons on internal/external/retest consistency. No equations, fitted parameters, or self-citations are used to derive the central result; the human baseline is collected independently and serves as an external reference rather than a constructed input. The methodology is evaluated against observable outputs and does not reduce any prediction to its own definitions or to prior work by the authors.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on one domain assumption drawn from interrogation methodology and introduces no free parameters or invented entities; the three consistency dimensions are defined within the paper.

axioms (1)
  • domain assumption Systematic multi-turn questioning can expose inconsistencies in persona responses, analogous to interrogation techniques.
    Invoked as the core principle guiding the PICon design.

pith-pipeline@v0.9.0 · 5516 in / 1023 out tokens · 29721 ms · 2026-05-15T00:26:26.350873+00:00 · methodology

discussion (0)
