hub Canonical reference

O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S

Joon Sung Park, Joseph C · 2023 · arXiv 6183.360676

Canonical reference. 94% of citing Pith papers cite this work as background.

66 Pith papers citing it

Background 94% of classified citations

read on arXiv browse 66 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 16

citation-polarity summary

background 15 support 1

representative citing papers

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

PerfEvolve equips LLM agents with executable skills from expert methods to enable dynamic, version-consistent, workload-specific tuning in PostgreSQL, outperforming documentation baselines by up to 35.2% on TPC-C and TPC-H.

From Role to Person: Trust Calibration Challenges in Twin Agents

cs.HC · 2026-05-19 · unverdicted · novelty 7.0

Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed-rule or unconstrained approaches.

Attributing Emergence in Million-Agent Systems

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

A scalable Aumann-Shapley attribution method for million-agent systems reveals that small-scale samples structurally misattribute emergence under nonlinear macro indicators, as shown by the Attribution Scaling Bias theorem.

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 3 refs

MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.

ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies

cs.MA · 2026-04-21 · unverdicted · novelty 7.0

ClawCoin is a compute-cost-indexed token with oracle, vault, and settlement layers that stabilizes multi-agent workflows under cost shocks better than fiat baselines in simulator tests.

When to Forget: A Memory Governance Primitive

cs.AI · 2026-04-13 · unverdicted · novelty 7.0

Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

cs.AI · 2026-04-11 · unverdicted · novelty 7.0

ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.

FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.

Bounded Autonomy: Controlling LLM Characters in Live Multiplayer Games

cs.HC · 2026-04-06 · unverdicted · novelty 7.0

Bounded autonomy is a new control architecture that makes LLM characters workable in live multiplayer games by combining interaction stability techniques, action grounding, and lightweight player steering, validated through deployment and analysis.

Soft Tournament Equilibrium

cs.AI · 2026-04-06 · unverdicted · novelty 7.0

STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.

Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves

cs.HC · 2026-03-10 · unverdicted · novelty 7.0

Youth on Character.AI use chatbots for emotional restoration, creative exploration, and identity transformation, yielding a new three-intent framework and seven-archetype taxonomy from Discord discourse analysis.

The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

physics.soc-ph · 2026-05-17 · accept · novelty 6.0

Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.

citing papers explorer

Showing 50 of 66 citing papers.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts cs.CR · 2026-05-09 · unverdicted · none · ref 5 · 3 links
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
Evaluating Very Long-Term Conversational Memory of LLM Agents cs.CL · 2024-02-27 · unverdicted · none · ref 148
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents cs.LG · 2026-05-22 · unverdicted · none · ref 33
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 25
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL cs.SE · 2026-05-19 · unverdicted · none · ref 34
PerfEvolve equips LLM agents with executable skills from expert methods to enable dynamic, version-consistent, workload-specific tuning in PostgreSQL, outperforming documentation baselines by up to 35.2% on TPC-C and TPC-H.
From Role to Person: Trust Calibration Challenges in Twin Agents cs.HC · 2026-05-19 · unverdicted · none · ref 9
Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.
Evaluating Cognitive Age Alignment in Interactive AI Agents cs.AI · 2026-05-18 · unverdicted · none · ref 20
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents cs.CL · 2026-05-16 · unverdicted · none · ref 13
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles cs.AI · 2026-05-13 · unverdicted · none · ref 4
ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed-rule or unconstrained approaches.
Attributing Emergence in Million-Agent Systems cs.AI · 2026-05-12 · unverdicted · none · ref 8
A scalable Aumann-Shapley attribution method for million-agent systems reveals that small-scale samples structurally misattribute emergence under nonlinear macro indicators, as shown by the Attribution Scaling Bias theorem.
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs cs.AI · 2026-05-08 · unverdicted · none · ref 4 · 3 links
MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment cs.CL · 2026-05-08 · unverdicted · none · ref 149
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory cs.AI · 2026-05-08 · unverdicted · none · ref 39
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory cs.AI · 2026-05-08 · unverdicted · none · ref 13
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies cs.MA · 2026-04-21 · unverdicted · none · ref 36
ClawCoin is a compute-cost-indexed token with oracle, vault, and settlement layers that stabilizes multi-agent workflows under cost shocks better than fiat baselines in simulator tests.
When to Forget: A Memory Governance Primitive cs.AI · 2026-04-13 · unverdicted · none · ref 6
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents cs.AI · 2026-04-11 · unverdicted · none · ref 34
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles cs.CL · 2026-04-07 · unverdicted · none · ref 24
LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems cs.SE · 2026-04-07 · unverdicted · none · ref 40
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
Bounded Autonomy: Controlling LLM Characters in Live Multiplayer Games cs.HC · 2026-04-06 · unverdicted · none · ref 14
Bounded autonomy is a new control architecture that makes LLM characters workable in live multiplayer games by combining interaction stability techniques, action grounding, and lightweight player steering, validated through deployment and analysis.
Soft Tournament Equilibrium cs.AI · 2026-04-06 · unverdicted · none · ref 24
STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.
Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves cs.HC · 2026-03-10 · unverdicted · none · ref 65
Youth on Character.AI use chatbots for emotional restoration, creative exploration, and identity transformation, yielding a new three-intent framework and seven-archetype taxonomy from Discord discourse analysis.
The Illusion of Intervention: Your LLM-Simulated Experiment is an Observational Study cs.CL · 2026-05-20 · unverdicted · none · ref 14
Interventions in LLM-simulated user experiments induce distribution shifts in latent attributes that create confounding bias, diagnosable with negative control outcomes and partially mitigated by adding setting-relevant persona details.
Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits physics.soc-ph · 2026-05-17 · accept · none · ref 46
Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social simulations require per-claim robustness audits via the new TRAILS taxonomy.
Neural Point-Forms cs.LG · 2026-05-15 · unverdicted · none · ref 53
Neural point-forms are introduced as permutation-invariant neural layers that output learned form-comparison matrices for point clouds, with a claimed consistency proof under sampling and manifold assumptions and competitive results on synthetic and biological data.
Useful Memories Become Faulty When Continuously Updated by LLMs cs.AI · 2026-05-13 · conditional · none · ref 1
LLM-consolidated memories in agents degrade over continuous updates even from useful experiences, causing up to 54% failure on previously solved ARC-AGI problems, while episodic retention preserves accuracy.
LLM-X: A Scalable Negotiation-Oriented Exchange for Communication Among Personal LLM Agents cs.AI · 2026-05-12 · unverdicted · none · ref 23
LLM-X is a scalable architecture for direct negotiation and communication among personal LLM agents, featuring federated gateways, typed protocols, and policy enforcement, shown stable in experiments with up to 12 agents.
PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines cs.AI · 2026-05-11 · unverdicted · none · ref 11
PRISM detects and stops credential leakage during LLM generation in multi-agent pipelines using per-token risk scores from lexical, structural, and behavioral signals, achieving zero observed leaks and F1 of 0.832 on a 2000-task benchmark.
EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents cs.AI · 2026-05-11 · conditional · none · ref 38 · 2 links
EnactToM is an evolving benchmark of embodied multi-agent tasks that tests functional Theory of Mind by requiring agents to act optimally on implicit beliefs in partially observable 3D environments.
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators cs.CL · 2026-05-09 · unverdicted · none · ref 34
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
Sycamore: Characterizing Synthetic Personas for Evaluating Genomics Visualization Retrieval cs.HC · 2026-05-09 · conditional · none · ref 19
Grounding synthetic personas in real-user artifacts aligns their feedback language and concerns with documented experts, but both synthetic conditions converge on a find-and-adapt frame and miss the image-modality preference that real experts showed.
PrismAgent: Illuminating Harm in Memes via a Zero-Shot Interpretable Multi-Agent Framework cs.LG · 2026-05-01 · unverdicted · none · ref 35
PrismAgent deploys four specialized LLM agents in sequence to analyze meme intent, gather context, make preliminary judgments, and deliver a final harm verdict, outperforming prior zero-shot methods on three public datasets.
MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents cs.CL · 2026-05-01 · unverdicted · none · ref 12
A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
Learning to Evolve: A Self-Improving Framework for Multi-Agent Systems via Textual Parameter Graph Optimization cs.AI · 2026-04-22 · unverdicted · none · ref 7
TPGO represents multi-agent systems as graphs of textual parameters and applies group relative optimization to enable self-improvement from execution history.
SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems cs.AI · 2026-04-17 · unverdicted · none · ref 27
SocialGrid benchmark shows even top LLMs achieve below 60% in embodied planning and task completion, with deception detection near random chance regardless of model scale.
Evaluation of Agents under Simulated AI Marketplace Dynamics cs.IR · 2026-04-15 · unverdicted · none · ref 75
Marketplace Evaluation uses repeated-interaction simulations to assess information access systems with marketplace-level metrics such as retention and market share that complement traditional accuracy measures.
MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs cs.HC · 2026-04-07 · unverdicted · none · ref 33
MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.
Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework cs.CL · 2026-04-02 · unverdicted · none · ref 74
A unified framework for LLM agent memory is benchmarked, with a new hybrid method outperforming state-of-the-art on standard tasks.
WorkflowGen:an adaptive workflow generation mechanism driven by trajectory experience cs.LG · 2026-03-22 · unverdicted · none · ref 4
WorkflowGen reuses trajectory experiences via node-level and workflow-level extraction plus three-tier semantic routing to cut token use over 40% and raise success 20% on medium-similarity queries versus real-time planning baselines.
Collective AI can amplify tiny perturbations into divergent decisions cs.AI · 2026-03-10 · conditional · none · ref 5
Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.
ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction cs.CV · 2026-03-04 · unverdicted · none · ref 27
ECHO reframes multimedia event extraction as multi-agent iterative refinement over an explicit Multimedia Event Hypergraph with a decoupled Link-then-Bind strategy, delivering 7.3 and 15.5 F1 gains on event mention and argument role.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 85
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening cs.HC · 2026-05-22 · unverdicted · none · ref 29
Presents CultivAgents, a relationship-centered multi-agent system for socio-culturally grounded gardening support, with a mixed-methods evaluation showing modest gains in gardener confidence and motivation.
Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations? cs.CL · 2026-05-17 · unverdicted · none · ref 83
LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.
FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast cs.AI · 2026-05-15 · unverdicted · none · ref 13
FORGE is a staged population protocol that evolves prompt-injected memory (Rules, Examples, or Mixed) for ReAct agents via reflection and broadcast, yielding 1.7-7.7× gains over zero-shot and 29-72% over Reflexion on CybORG CAGE-2.
Toward Natural and Companionable Virtual Agents via Cross-Temporal Emotional Modeling cs.HC · 2026-05-15 · unverdicted · none · ref 57
CTEM framework links behavioral history to evolving emotional states with user feedback updates, instantiated as Auri agent and tested in a 21-day study showing gains in naturalness, coherence, and emotional harmony.
Learning with Conflicts of Interest cs.LG · 2026-05-15 · unverdicted · none · ref 24
A game-theoretic framework and algorithms are introduced to maximize beneficial information from ML systems while minimizing biased influences arising from conflicts of interest.
Control Charts for Multi-agent Systems cs.MA · 2026-05-11 · unverdicted · none · ref 20
Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against adversaries.
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly cs.CR · 2026-05-02 · unverdicted · none · ref 13 · 2 links
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.
Self-Awareness before Action: Mitigating Logical Inertia via Proactive Cognitive Awareness cs.AI · 2026-04-22 · unverdicted · none · ref 23
SABA improves LLM performance on detective puzzle benchmarks by recursively fusing information into a base state and using queries to resolve missing premises before concluding.

O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer