super hub Canonical reference
O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S
Canonical reference. 94% of citing Pith papers cite this work as background.
claims ledger
- background MAS decomposes high-level objectives into coordinated subtasks executed by specialized agents under the orchestration of an LLM. This paradigm has been extensively explored in recent literature [16, 42, 59]. To facilitate the development of these complex systems, a growing array of open-source frameworks, such as Microsoft AutoGen [34], CrewAI [19], Camel [2], and Praison [40], has emerged. These frameworks offer high-level modular abstractions, enabling developers to easily integrate custom t
- background marking: we provide controlled experiments (5-12 agents, varying policies) and extended runs (2 h, 12 h) to study negotiation under sustained load. Adaptation and Feedback. Another complementary line of work focuses on improving LLM agents via instruction tuning and feedback alignment. Instruction-tuned models such as Flan-T5 [4], Alpaca [27], and GPT-4-LLM [23] democratized adaptation. RLHF and its extensions (InstructGPT [21], Constitutional AI [1], DPO [24], and related datasets like Self-Instr
- background to mitigate SHADOWMERGE. We have responsibly disclosed our findings to affected graph-memory vendors and open-sourced SHADOWMERGE at https://anonymous.4open.science/status/ShadowMerge-033C. I. INTRODUCTION LLM agents are moving from single-turn chatbots [1], [2] toward long-running systems that remember, adapt, and act across repeated interactions [3], [4], [5]. Persistent memory [6], [7], [8], [9], [10] enables this shift by allowing agents to reuse past tool outcomes, maintain user prefere
- background ing on the human behavior and language in their training data [2], and can be engaged conversationally rather than read as static artifacts [8, 22]. Building on this, generative agents extend LLM personas with memory and reflection, and can serve as believable proxies of individuals and communities [18]. Follow-up work shows that grounding agents in interview and survey data improves their accuracy [19]. Design and UX researchers are also actively exploring AI personas in design workflow
- background Stylette [22] maps styling goals to CSS edits, DynaVis [43] creates manipulable widgets for visualization editing, and DirectGPT [29] supports in-place modification of selected objects. These systems show that natural language can support in-situ GUI changes, but each interaction is largely self-contained. Recent works such as IRF [34] and CARE [33] explored sustained interaction by updating interface content as users refine preferences over time. Still, these systems largely position the agen
- background 3 Provenance-Based Credit Assignment In classical TD(λ) (Sutton & Barto, 2018; Sutton, 1988), the λ-return $G^{\lambda}_t = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G^{(n)}_t$ interpolates between the one-step bootstrap $G^{(1)}_t = r_t + \gamma Q(s_{t+1}, a_{t+1})$ and the Monte Carlo return $G^{(\infty)}_t = \sum_{k=0}^{\infty}\gamma^{k} r_{t+k}$. The λ-return advantage is standardly expressed as a discounted sum of future TD errors: $G^{\lambda}_t - Q(s_t, a_t) = \sum_{k=0}^{T-t-1}(\gamma\lambda)^{k}\delta_{t+k}$ (4), where $\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$. This telescoping decomposition underpins eligibility traces, propagating credit t
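
The λ-return identity quoted in the last ledger entry can be checked numerically. The sketch below is illustrative only and is not taken from the cited paper. It assumes a finite episode of length T, a Q-value of zero after the final step, and made-up rewards and Q-values. The function names (`n_step_return`, `lambda_return`, `td_error_sum`) are hypothetical.

```python
def n_step_return(rewards, q, t, n, gamma):
    """n-step return G^(n)_t; bootstraps from q[t+n] unless the episode has ended."""
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma**k * rewards[t + k] for k in range(steps))
    if t + n < T:  # assumption: Q is zero past the last step
        g += gamma**n * q[t + n]
    return g


def lambda_return(rewards, q, t, gamma, lam):
    """Finite-episode lambda-return: the leftover weight goes to the Monte Carlo return."""
    horizon = len(rewards) - t
    g = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(rewards, q, t, n, gamma)
        for n in range(1, horizon)
    )
    g += lam ** (horizon - 1) * n_step_return(rewards, q, t, horizon, gamma)
    return g


def td_error_sum(rewards, q, t, gamma, lam):
    """Right-hand side of the identity: sum over k of (gamma*lam)^k * delta_{t+k}."""
    T = len(rewards)
    total = 0.0
    for k in range(T - t):
        i = t + k
        q_next = q[i + 1] if i + 1 < T else 0.0
        delta = rewards[i] + gamma * q_next - q[i]
        total += (gamma * lam) ** k * delta
    return total


# Made-up episode: rewards r_t and Q(s_t, a_t) estimates for t = 0..3.
rewards = [1.0, 0.0, 2.0, -1.0]
q = [0.5, 1.2, -0.3, 0.8]
gamma, lam = 0.9, 0.7

for t in range(len(rewards)):
    advantage = lambda_return(rewards, q, t, gamma, lam) - q[t]
    assert abs(advantage - td_error_sum(rewards, q, t, gamma, lam)) < 1e-9
```

Both sides agree at every step of this toy episode. Setting `lam` to 0 reduces the advantage to the single TD error δ_t, and setting it to 1 gives the Monte Carlo return minus Q(s_t, a_t).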
roles
background 15
citing papers explorer
-
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
-
Evaluating Very Long-Term Conversational Memory of LLM Agents
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
-
Not-quite-human tastes: the stylized omnivorousness of LLM survey surrogates
LLM silicon surrogates for arts participation surveys exhibit positive liking bias, lose taste relationality, and fail to preserve known social space alignments.
-
BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks
BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.
-
LegalWorld: A Life-Cycle Interactive Environment for Legal Agents
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
-
ZIPP: Zero-shot Image Personalization from Personas
ZIPP conditions diffusion models on LLM-rewritten prompts derived from graph-mined natural-language personas to achieve zero-shot personalization, reporting 13-20% gains and 79% human preference win rate over generic outputs.
-
UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs
UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.
-
Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
-
TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
-
Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
-
BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces
BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test if personalization from history improves model performance over non-personalized baselines.
-
Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
-
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
-
A Case for Agentic Tuning: From Documentation to Action in PostgreSQL
PerfEvolve equips LLM agents with executable skills from expert methods to enable dynamic, version-consistent, workload-specific tuning in PostgreSQL, outperforming documentation baselines by up to 35.2% on TPC-C and TPC-H.
-
From Role to Person: Trust Calibration Challenges in Twin Agents
Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.
-
Evaluating Cognitive Age Alignment in Interactive AI Agents
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
-
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
-
ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles
ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed-rule or unconstrained approaches.
-
Attributing Emergence in Million-Agent Systems
A scalable Aumann-Shapley attribution method for million-agent systems reveals that small-scale samples structurally misattribute emergence under nonlinear macro indicators, as shown by the Attribution Scaling Bias theorem.
-
Causal state binding predicts action control in language agents
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
-
MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs
MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
-
ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies
ClawCoin is a compute-cost-indexed token with oracle, vault, and settlement layers that stabilizes multi-agent workflows under cost shocks better than fiat baselines in simulator tests.
-
When to Forget: A Memory Governance Primitive
Memory Worth converges almost surely to the conditional probability of task success given memory retrieval and correlates at rho=0.89 with ground-truth utility in controlled experiments.
-
ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
-
Stories of Your Life as Others: A Round-Trip Evaluation of LLM-Generated Life Stories Conditioned on Rich Psychometric Profiles
LLMs conditioned on actual psychometric profiles produce life stories from which independent LLMs recover personality scores at mean r=0.75, 85% of human reliability, with emotional patterns replicating in real human data.
-
FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
FLARE extracts specifications from multi-agent LLM code and applies coverage-guided fuzzing to achieve 96.9% inter-agent and 91.1% intra-agent coverage while uncovering 56 new failures across 16 applications.
-
Bounded Autonomy: Controlling LLM Characters in Live Multiplayer Games
Bounded autonomy is a new control architecture that makes LLM characters workable in live multiplayer games by combining interaction stability techniques, action grounding, and lightweight player steering, validated through deployment and analysis.
-
Soft Tournament Equilibrium
STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.
-
Restoration, Exploration and Transformation: How Youth Engage Character.AI Chatbots for Feels, Fun and Finding themselves
Youth on Character.AI use chatbots for emotional restoration, creative exploration, and identity transformation, yielding a new three-intent framework and seven-archetype taxonomy from Discord discourse analysis.
-
Episodic-to-Semantic Consolidation Without Identity Drift
Presents a deterministic episodic-to-semantic consolidation function with a structural lemma proving identity invariance, demonstrated in synthetic experiments on an embodied service agent.
-
BOUNDARY_SYNC: Measuring Communication-Induced Representational Coupling in Multi-Agent LLM Systems
BOUNDARY_SYNC defines CAF as the ratio of conditional to baseline Jensen-Shannon divergence to quantify communication-induced representational coupling in multi-agent LLMs, reporting homogenization from text communication (CAF=0.803).
-
Emergence of Preferential Attachment and Glass-Ceiling Effects in Autonomous Networks of LLMs
Autonomous LLM agent networks develop preferential attachment and type-dependent centrality gaps that converge to stable equilibria under a mean-field model with a cross-attention utility, validated in 100-agent experiments.
-
Attractor States Emerge in Multi-Turn LLM Conversations
Self-play LLM trajectories form model-specific attractors that asymmetrically influence mixed-play partners' stylistic choices and stances across 7 models and 20 topics.
-
Stop Hand-Holding Your Coding Agent: Engineering the Loops that Replace Step-by-Step Prompting
Introduces loop engineering as a distinct practice layer for coding agents, supplies a taxonomy and verification ladder, and analyzes a hand-coded corpus of fifty real loops.
-
MedEvoEval: Evaluating Continual Evolution of Doctor Agents through Simulated Clinical Episodes
MedEvoEval is an executable longitudinal evaluation framework that converts medical cases into action-gated simulated episodes to track how doctor agents evolve decision-making, resource use, and experience across multiple encounters.
-
Staying In Character: Perspective-Bounded Memory For Book-Based Role-Playing Agents
REVERIEMEM is a three-layer perspective-bounded memory system that raises knowledge boundary fidelity by 34.6 points and wins ~79% of narrative comparisons on a new book-based role-playing benchmark.
-
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions
The TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood the most vulnerable stages for cognitive trust and emotional dependency.
-
Honeyquest for LLMs: Rethinking Cyber Deception for AI Attackers
LLMs fall for deceptive traps at higher rates than humans, lack the human attention-diversion effect, and exploit traps 73.4% of the time even after recognizing them in reasoning.
-
Efficient and Sound Probabilistic Verification for AI Agents
Presents a distributionally robust optimization method for sound probabilistic verification of Datalog policies in AI agents that bounds violation risk regardless of predicate correlations.
-
Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games
Equation-to-Behavior Prompting lets large LLMs match cognitive models like Bayesian updating in persuasion games; RL training cuts small-model belief error by 26.5% and improves diverse training outcomes by 2.5-12%.
-
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
-
Civil Court Simulation with Large Language Models
A multi-agent LLM framework simulates Chinese civil trials through five-stage procedures with memory and retrieval, producing judgments strong in liability allocation and multi-item decisions.
-
To Nuke or Not to Nuke: LLMs' (Missing) Ethical Reasoning and Actions in a High-Stakes Decision-Making Simulation
In Civilization V self-play, LLMs escalate to nuclear authorization and three prompt interventions do not reliably prevent it, revealing failure pathways where ethical reasoning either fails to surface, fails to appear when prompted, or fails to override strategic factors.
-
Beyond Alignment: Value Diversity as a Collective Property in Multicultural Agent Systems
Multicultural multi-agent LLM systems exhibit substantially lower value diversity than human societies on the World Values Survey, with diversity uncorrelated to per-agent alignment and further reduced by agent interactions.
-
AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents
AURA improves implicit-need coverage by 0.07 over ReAct baselines on a 100-query benchmark by inserting an intent inference step controlled by a gap score, while cutting probes 82% on factual tasks.
-
Ahoy: LLMs Enacting Multiagent Interaction Protocols
Ahoy enables LLM agents to select and enact multiple declarative interaction protocols concurrently without specialized training to achieve goals.
-
Toward Pre-Deployment Assurance for Enterprise AI Agents: Ontology-Grounded Simulation and Trust Certification
The authors introduce a three-part ontology-based verification system for AI agents that generates regulatory and adversarial test scenarios and issues machine-verifiable trust certificates, with pilot results indicating improved coverage over baselines in four industries.