A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88); process-specific fine-tuning improves mimicry but shows limited cross-task transfer.
Arriaga, and Adam Tauman Kalai
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
roles: background 1
citation-polarity summary
polarities: unclear 1
citing papers explorer
- Process Matters more than Output for Distinguishing Humans from Machines
  A new battery of 30 cognitive tasks demonstrates that process-level behavioral features distinguish humans from frontier AI agents better than performance metrics (mean AUC 0.88); process-specific fine-tuning improves mimicry but shows limited cross-task transfer.
- Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
  LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
- DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow
  DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module increases user comfort and reliance on AI agents for coordination tasks over time.
- Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
  Length-controlled AlpacaEval applies regression adjustment to remove length bias from LLM auto-evaluations, raising Spearman correlation with Chatbot Arena from 0.94 to 0.98.
- AI-Augmented Surveys: Leveraging Large Language Models and Surveys for Opinion Prediction
  LLM embeddings enable strong retrodiction of masked GSS opinions via cross-validation and external validation but only modest performance on entirely unasked opinions.
- Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
  Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.
- Frame Entrepreneurs in an AI Agent Community: Concentrated Identity-Claim Production on Moltbook
  In the Moltbook AI agent community, identity-claim production is highly concentrated among a few frame entrepreneurs, with event-driven attention not translating into broad claim-making.
- AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations
  AgentDynEx introduces nudging and a Configuration Matrix to help set up and maintain balanced mechanics and dynamics in multi-agent LLM simulations.
- Avenir-UX: Automated UX Evaluation via Simulated Human Web Interaction with GUI Grounding
  Avenir-UX automates web usability testing by using GUI-grounded simulation of user behavior to generate standardized reports with SUS, SEQ, and Think Aloud protocols.