ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S
Canonical reference. 94% of citing Pith papers cite this work as background.
claims ledger
- background MAS decomposes high-level objectives into coordinated subtasks executed by specialized agents under the orchestration of an LLM. This paradigm has been extensively explored in recent literature [16, 42, 59]. To facilitate the development of these complex systems, a growing array of open-source frameworks, such as Microsoft AutoGen [34], CrewAI [19], Camel [2], and Praison [40], has emerged. These frameworks offer high-level modular abstractions, enabling developers to easily integrate custom t
- background marking: we provide controlled experiments (5-12 agents, varying policies) and extended runs (2 h, 12 h) to study negotiation under sustained load. Adaptation and Feedback. Another complementary line of work focuses on improving LLM agents via instruction tuning and feedback alignment. Instruction-tuned models such as Flan-T5 [4], Alpaca [27], and GPT-4-LLM [23] democratized adaptation. RLHF and its extensions (InstructGPT [21], Constitutional AI [1], DPO [24], and related datasets like Self-Instr
- background to mitigate SHADOWMERGE. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE at https://anonymous.4open.science/status/ShadowMerge-033C. I. INTRODUCTION LLM agents are moving from single-turn chatbots [1], [2] toward long-running systems that remember, adapt, and act across repeated interactions [3], [4], [5]. Persistent memory [6], [7], [8], [9], [10] enables this shift by allowing agents to reuse past tool outcomes, maintain user prefere
- background ing on the human behavior and language in their training data [2], and can be engaged conversationally rather than read as static artifacts [8, 22]. Building on this, generative agents extend LLM personas with memory and reflection, and can serve as believable proxies of individuals and communities [18]. Follow-up work shows that grounding agents in interview and survey data improves their accuracy [19]. Design and UX researchers are also actively exploring AI personas in design workflow
- background Stylette [22] maps styling goals to CSS edits, DynaVis [43] creates manipulable widgets for visualization editing, and DirectGPT [29] supports in-place modification of selected objects. These systems show that natural language can support in-situ GUI changes, but each interaction is largely self-contained. Recent works such as IRF [34] and CARE [33] explored sustained interaction by updating interface content as users refine preferences over time. Still, these systems largely position the agen
- background 3 Provenance-Based Credit Assignment In classical TD(λ) (Sutton & Barto, 2018; Sutton, 1988), the λ-return $G^{\lambda}_t = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G^{(n)}_t$ interpolates between the one-step bootstrap $G^{(1)}_t = r_t + \gamma Q(s_{t+1}, a_{t+1})$ and the Monte Carlo return $G^{(\infty)}_t = \sum_{k=0}^{\infty}\gamma^{k} r_{t+k}$. The λ-return advantage is standardly expressed as a discounted sum of future TD errors: $G^{\lambda}_t - Q(s_t, a_t) = \sum_{k=0}^{T-t-1}(\gamma\lambda)^{k}\delta_{t+k}$ (4), where $\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$. This telescoping decomposition underpins eligibility traces, propagating credit t
roles
background 15
representative citing papers
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
LLM silicon surrogates for arts participation surveys exhibit positive liking bias, lose taste relationality, and fail to preserve known social space alignments.
BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
ZIPP conditions diffusion models on LLM-rewritten prompts derived from graph-mined natural-language personas to achieve zero-shot personalization, reporting 13-20% gains and 79% human preference win rate over generic outputs.
UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.
The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.
LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.
BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test if personalization from history improves model performance over non-personalized baselines.
Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
PerfEvolve equips LLM agents with executable skills from expert methods to enable dynamic, version-consistent, workload-specific tuning in PostgreSQL, outperforming documentation baselines by up to 35.2% on TPC-C and TPC-H.
Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed-rule or unconstrained approaches.
A scalable Aumann-Shapley attribution method for million-agent systems reveals that small-scale samples structurally misattribute emergence under nonlinear macro indicators, as shown by the Attribution Scaling Bias theorem.
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.
citing papers explorer
- GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
  GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
- LLM-Assisted Web Measurements
  LLMs achieve strong performance on website classification tasks relevant to web measurements and support a practical two-step methodology for targeted studies from the Tranco list.
- Beyond Static Responses: Multi-Agent LLM Systems as a New Paradigm for Social Science Research
  The paper maps LLM agent architectures onto a six-level continuum and argues that higher levels can enable simulation of emergent social phenomena while requiring attention to reproducibility and ethical issues.
- Characterizing Creativity in Data Visualization: Reflections and Future Directions
  A systematic review and interview study characterize creativity in visualization design, finding that design processes are undervalued compared to final artifacts with ideation as a universal bottleneck.