How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.AI (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
roles: background (1)
polarities: background (1)
representative citing papers
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
citing papers explorer
- Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior
LLM self-reports predict behavior selectively: TPB reaches human-level coherence within shared conversations but collapses across sessions for primed behaviors, unlike Big 5, with persona prompting stabilizing reports but not actions.
- Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.
- M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.