super hub Canonical reference

O'Brien and Carrie Jun Cai and Meredith Ringel Morris and Percy Liang and Michael S

doi: 10 · 2023 · arXiv 6183.360676

Canonical reference. 94% of citing Pith papers cite this work as background.

119 Pith papers citing it

Background 94% of classified citations


citation-role summary

background 16

citation-polarity summary

background 15 support 1

claims ledger

background MAS decomposes high-level objectives into coordinated subtasks executed by specialized agents under the orchestration of an LLM. This paradigm has been extensively explored in recent literature [16, 42, 59]. To facilitate the development of these complex systems, a growing array of open-source frameworks, such as Microsoft AutoGen [34], CrewAI [19], Camel [2], and Praison [40], has emerged. These frameworks offer high-level modular abstractions, enabling developers to easily integrate custom t
background marking: we provide controlled experiments (5-12 agents, varying policies) and extended runs (2 h, 12 h) to study negotiation under sustained load. Adaptation and Feedback. Another complementary line of work focuses on improving LLM agents via instruction tuning and feedback alignment. Instruction-tuned models such as Flan-T5 [4], Alpaca [27], and GPT-4-LLM [23] democratized adaptation. RLHF and its extensions (InstructGPT [21], Constitutional AI [1], DPO [24], and related datasets like Self-Instr
background to mitigate SHADOWMERGE. We have responsibly disclosed our findings to affected graph-memory vendors and open sourced SHADOWMERGE at https://anonymous.4open.science/status/ShadowMerge-033C. I. INTRODUCTION LLM agents are moving from single-turn chatbots [1], [2] toward long-running systems that remember, adapt, and act across repeated interactions [3], [4], [5]. Persistent memory [6], [7], [8], [9], [10] enables this shift by allowing agents to reuse past tool outcomes, maintain user prefere
background ing on the human behavior and language in their training data [2], and can be engaged conversationally rather than read as static artifacts [8, 22]. Building on this, generative agents extend LLM personas with memory and reflection, and can serve as believable proxies of individuals and communities [18]. Follow-up work shows that grounding agents in interview and survey data improves their accuracy [19]. Design and UX researchers are also actively exploring AI personas in design workflow
background Stylette [22] maps styling goals to CSS edits, DynaVis [43] creates manipulable widgets for visualization editing, and DirectGPT [29] supports in-place modification of selected objects. These systems show that natural language can support in-situ GUI changes, but each interaction is largely self-contained. Recent works such as IRF [34] and CARE [33] explored sustained interaction by updating interface content as users refine preferences over time. Still, these systems largely position the agen
background 3 Provenance-Based Credit Assignment In classical TD(λ) (Sutton & Barto, 2018; Sutton, 1988), the λ-return $G^{\lambda}_t = (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t$ interpolates between the one-step bootstrap $G^{(1)}_t = r_t + \gamma Q(s_{t+1}, a_{t+1})$ and the Monte Carlo return $G^{(\infty)}_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}$. The λ-return advantage is standardly expressed as a discounted sum of future TD errors: $G^{\lambda}_t - Q(s_t, a_t) = \sum_{k=0}^{T-t-1} (\gamma\lambda)^{k} \delta_{t+k}$ (4), where $\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$. This telescoping decomposition underpins eligibility traces, propagating credit t
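
The telescoping identity quoted in the last ledger entry (the λ-return advantage as a discounted sum of TD errors) can be checked numerically. The following is a minimal sketch, not the citing paper's implementation: it assumes a finite episode of length T with a terminal value of zero, and the function names (`n_step_return`, `lambda_return`, `td_error_sum`, `max_identity_gap`) are hypothetical names chosen for illustration.

```python
import random


def n_step_return(rewards, q, t, n, gamma):
    """G^(n)_t: n discounted rewards, then bootstrap from q[t + n]."""
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * q[t + n]


def lambda_return(rewards, q, t, gamma, lam):
    """Forward-view lambda-return for a finite episode of length T.

    Every n-step return with n >= T - t equals the Monte Carlo return, so
    their combined weight lam**(T - t - 1) is placed on that single term.
    """
    horizon = len(rewards) - t
    g = sum(
        (1 - lam) * lam ** (n - 1) * n_step_return(rewards, q, t, n, gamma)
        for n in range(1, horizon)
    )
    return g + lam ** (horizon - 1) * n_step_return(rewards, q, t, horizon, gamma)


def td_error_sum(rewards, q, t, gamma, lam):
    """Sum over k of (gamma * lam)^k * delta_{t+k} until the episode ends."""
    return sum(
        (gamma * lam) ** k * (rewards[t + k] + gamma * q[t + k + 1] - q[t + k])
        for k in range(len(rewards) - t)
    )


def max_identity_gap(T=6, gamma=0.9, lam=0.7, seed=0):
    """Largest absolute difference between both sides over all t (assumed setup)."""
    rng = random.Random(seed)
    rewards = [rng.uniform(-1, 1) for _ in range(T)]
    # q[t] stands in for Q(s_t, a_t); the extra final entry is the terminal value 0.
    q = [rng.uniform(-1, 1) for _ in range(T)] + [0.0]
    return max(
        abs(lambda_return(rewards, q, t, gamma, lam) - q[t]
            - td_error_sum(rewards, q, t, gamma, lam))
        for t in range(T)
    )


if __name__ == "__main__":
    print(max_identity_gap())
```

The gap should sit at floating-point noise for any λ in [0, 1]; λ = 0 reduces both sides to the one-step TD error and λ = 1 to the Monte Carlo advantage.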


representative citing papers

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts

cs.CR · 2026-05-09 · unverdicted · novelty 8.0 · 3 refs

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

Not-quite-human tastes: the stylized omnivorousness of LLM survey surrogates

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

LLM silicon surrogates for arts participation surveys exhibit positive liking bias, lose taste relationality, and fail to preserve known social space alignments.

BehaviorBench: Benchmarking Foundation Models for Behavioral Science Tasks

cs.CL · 2026-06-23 · unverdicted · novelty 7.0

BehaviorBench is a benchmark for foundation models on behavioral tasks that reveals fine-tuned behavioral models outperform general models on distributional alignment while general models lead on individual-level accuracy.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

ZIPP: Zero-shot Image Personalization from Personas

cs.AI · 2026-06-07 · unverdicted · novelty 7.0

ZIPP conditions diffusion models on LLM-rewritten prompts derived from graph-mined natural-language personas to achieve zero-shot personalization, reporting 13-20% gains and 79% human preference win rate over generic outputs.

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

UnpredictaBench creates 448 distributional sampling tasks and the KS@N metric to measure LLM approximation of target distributions, finding no model exceeds 40% success at N=100.

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

The paper delivers the first systems characterization of agent memory, with a four-axis taxonomy, phase-aware profiler, evaluation of ten systems on two benchmarks, and ten design recommendations.

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

cs.DB · 2026-06-04 · unverdicted · novelty 7.0

TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies while retaining a language-model judge.

Synthetic Personalities: How Well Can LLMs Mimic Individual Respondents Using Socio-Economic Microdata?

cs.CY · 2026-06-03 · unverdicted · novelty 7.0

LLMs achieve up to 78.8% accuracy and r=0.590 correlation mimicking individual SOEP respondents using cumulative microdata, with gains from more information but diminishing returns past the 75% entropy point.

BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

BehaviorBench reconstructs 2,000 real wallets into 141k belief and 1.4M trade prediction tasks to test if personalization from history improves model performance over non-personalized baselines.

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

Introduces QGP and PushBench to evaluate LLM agent persistence on quantitative goals, showing specialized controllers outperform baselines on verifier-checked artifact collection tasks.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

A Case for Agentic Tuning: From Documentation to Action in PostgreSQL

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

PerfEvolve equips LLM agents with executable skills from expert methods to enable dynamic, version-consistent, workload-specific tuning in PostgreSQL, outperforming documentation baselines by up to 35.2% on TPC-C and TPC-H.

From Role to Person: Trust Calibration Challenges in Twin Agents

cs.HC · 2026-05-19 · unverdicted · novelty 7.0

Twin agents as personal digital representations create distinct trust calibration challenges because they dissolve the boundary between AI and human decision-makers, unlike existing frameworks designed for clear separation.

Evaluating Cognitive Age Alignment in Interactive AI Agents

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.

ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

ScioMind combines anchoring-based belief updates, hierarchical memory, and dynamic profiles in LLM multi-agent systems to produce more stable, diverse, and psychologically aligned opinion trajectories than prior fixed-rule or unconstrained approaches.

Attributing Emergence in Million-Agent Systems

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

A scalable Aumann-Shapley attribution method for million-agent systems reveals that small-scale samples structurally misattribute emergence under nonlinear macro indicators, as shown by the Attribution Scaling Bias theorem.

Causal state binding predicts action control in language agents

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 3 refs

Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 3 refs

MemQ improves LLM agent performance by using eligibility traces over provenance DAGs to assign credit to dependent memories, achieving top success rates on six benchmarks with largest gains on complex multi-step tasks.

The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory

cs.AI · 2026-05-08 · unverdicted · novelty 7.0

MemoRepair formalizes the cascade update problem in agentic memory and solves it via a min-cut reduction that eliminates invalidated memory exposure to 0% while recovering 91-94% of valid successors at 57-76% of baseline repair cost.

citing papers explorer

Showing 4 of 4 citing papers after filters.

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

cs.AI · 2025-07-28 · unverdicted · none · ref 85

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.

LLM-Assisted Web Measurements

cs.CR · 2025-10-09 · unverdicted · none · ref 44

LLMs achieve strong performance on website classification tasks relevant to web measurements and support a practical two-step methodology for targeted studies from the Tranco list.

Beyond Static Responses: Multi-Agent LLM Systems as a New Paradigm for Social Science Research

cs.MA · 2025-06-02 · unverdicted · none · ref 32

The paper maps LLM agent architectures onto a six-level continuum and argues that higher levels can enable simulation of emergent social phenomena while requiring attention to reproducibility and ethical issues.

Characterizing Creativity in Data Visualization: Reflections and Future Directions

cs.HC · 2025-04-03 · accept · none · ref 44

A systematic review and interview study characterize creativity in visualization design, finding that design processes are undervalued compared to final artifacts with ideation as a universal bottleneck.
