hub Canonical reference

Agentfold: Long-horizon web agents with proactive context management

Agentfold: Long-horizon web agents with proactive context management · 2026 · arXiv 2510.24699

Canonical reference. 100% of citing Pith papers cite this work as background.

24 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 24 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

cs.AI · 2026-05-13 · unverdicted · novelty 8.0

RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.

Self-Programmed Execution for Language-Model Agents

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

Self-programmed execution lets language models act as agents by writing and executing their own orchestrator programs in a self-modifying Lisp called Spell.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation

cs.SE · 2026-05-14 · unverdicted · novelty 7.0

MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.

Analytic Concept-Centric Memory for Agentic Embodied Manipulation

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

HORMA builds a hierarchical memory structure from agent experiences and trains a lightweight RL navigator to retrieve minimal sufficient context, yielding better task performance with at most 22.17% of baseline token usage on ALFWorld, LoCoMo, and LongMemEval.

TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

TreeSeeker is an inference-time tree-structured branch-and-return framework that applies textual UCB signals for controlled trial-and-error in deep search and outperforms baselines on XBench-DeepSearch and BrowseComp.

Signal-Driven Observation for Long-Horizon Web Agents

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.

AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

cs.AI · 2026-05-23 · unverdicted · novelty 6.0

AgentFugue introduces a plug-in shared reasoning hub trained with SFT and RL that enables peer agents to share intermediate reasoning, yielding gains on long-horizon tasks over strong baselines.

SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

cs.AI · 2026-05-23 · unverdicted · novelty 6.0

SAM is a standalone memory framework for long-horizon LLM agents that creates state-adaptive cues from interactions, preserves raw trajectories for intent-driven recall, and optimizes the module via expert supervision and RL, outperforming baselines on BrowseComp and related benchmarks.

Argus: Evidence Assembly for Scalable Deep Research Agents

cs.CL · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

cs.AI · 2026-05-12 · unverdicted · novelty 6.0

ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.

PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.

LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.

POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

cs.CV · 2026-04-15 · unverdicted · novelty 6.0

POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.

LightThinker++: From Reasoning Compression to Memory Management

cs.CL · 2026-04-04 · unverdicted · novelty 6.0

LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling

cs.AI · 2026-02-15 · unverdicted · novelty 6.0

HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower computational cost on LOCOMO and LongMemEval benchmarks.

SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 5.0

SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-Bench Verified for 4B and 30B models while beating baselines on performance and token use.

When Does Overlap Help? OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents

cs.IR · 2026-06-19 · unverdicted · novelty 5.0

OSU-Mem shows overlapping memory helps retrieval when evidence shares tools or entities but hurts when steps are heterogeneous, with benefits on synthetic benchmarks vanishing on mixed real ones due to query mixing.

Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application

cs.CL · 2026-06-10 · unverdicted · novelty 5.0

This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.

ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

cs.AI · 2026-06-09 · unverdicted · novelty 5.0

ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.

ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

cs.AI · 2026-05-27 · unverdicted · novelty 5.0

ZipRL is a new RL-based adaptive compression method for multi-turn LLM agents that adds multi-granularity prompts and hindsight response replay to GRPO, reporting 27.9-34.7% gains on five agent tasks with maintained token efficiency.

Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning

cs.AI · 2026-05-14 · unverdicted · novelty 5.0

LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

cs.AI · 2026-04-13 · unverdicted · novelty 5.0

SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.

citing papers explorer

Showing 24 of 24 citing papers after filters.

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation cs.AI · 2026-05-13 · unverdicted · none · ref 36
RealICU is a new benchmark using physician hindsight labels on MIMIC-IV ICU data that exposes LLM failures in long-horizon clinical assessment, acute problem detection, action recommendation, and red-flag identification.
Self-Programmed Execution for Language-Model Agents cs.AI · 2026-05-07 · unverdicted · none · ref 5
Self-programmed execution lets language models act as agents by writing and executing their own orchestrator programs in a self-modifying Lisp called Spell.
VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions cs.AI · 2026-05-26 · unverdicted · none · ref 83
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
Remember Your Trace: Memory-Guided Long-Horizon Agentic Framework for Consistent and Hierarchical Repository-Level Code Documentation cs.SE · 2026-05-14 · unverdicted · none · ref 25
MemDocAgent generates consistent hierarchical repository-level code documentation by combining dependency-aware traversal with memory-guided agent interactions that accumulate work traces.
Analytic Concept-Centric Memory for Agentic Embodied Manipulation cs.RO · 2026-06-29 · unverdicted · none · ref 22
Proposes a structured concept-centric memory system for embodied agents that connects object, scene, transition, and skill memories to support coarse-to-fine retrieval and improve task performance over baselines.
Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents cs.AI · 2026-06-10 · unverdicted · none · ref 51
HORMA builds a hierarchical memory structure from agent experiences and trains a lightweight RL navigator to retrieve minimal sufficient context, yielding better task performance with at most 22.17% of baseline token usage on ALFWorld, LoCoMo, and LongMemEval.
TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search cs.AI · 2026-06-10 · unverdicted · none · ref 2
TreeSeeker is an inference-time tree-structured branch-and-return framework that applies textual UCB signals for controlled trial-and-error in deep search and outperforms baselines on XBench-DeepSearch and BrowseComp.
Signal-Driven Observation for Long-Horizon Web Agents cs.CL · 2026-06-04 · unverdicted · none · ref 7
Signal-Driven Observation decouples observation from action frequency in long-horizon web agents by invoking selective task-relevant DOM reads only on signals such as URL changes or action failures.
AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning cs.AI · 2026-05-23 · unverdicted · none · ref 66
AgentFugue introduces a plug-in shared reasoning hub trained with SFT and RL that enables peer agents to share intermediate reasoning, yielding gains on long-horizon tasks over strong baselines.
SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent cs.AI · 2026-05-23 · unverdicted · none · ref 42
SAM is a standalone memory framework for long-horizon LLM agents that creates state-adaptive cues from interactions, preserves raw trajectories for intent-driven recall, and optimizes the module via expert supervision and RL, outperforming baselines on BrowseComp and related benchmarks.
Argus: Evidence Assembly for Scalable Deep Research Agents cs.CL · 2026-05-15 · unverdicted · none · ref 54 · 2 links
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents cs.AI · 2026-05-12 · unverdicted · none · ref 57
ToolCUA introduces a trajectory scaling pipeline and staged RL to optimize GUI-tool switching, reaching 46.85% accuracy on OSWorld-MCP for a 66% relative gain over baseline.
PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning cs.CL · 2026-05-11 · unverdicted · none · ref 42
PruneTIR prunes erroneous tool-call trajectories during LLM inference via three trigger-based components to raise Pass@1 accuracy and efficiency while shortening context.
LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents cs.AI · 2026-05-06 · unverdicted · none · ref 14
Context-ReAct enables agents to dynamically manage context via five atomic operations, and LongSeeker fine-tuned on 10k trajectories achieves 61.5% and 62.5% on BrowseComp benchmarks, outperforming prior agents.
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch cs.CV · 2026-04-15 · unverdicted · none · ref 57
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
LightThinker++: From Reasoning Compression to Memory Management cs.CL · 2026-04-04 · unverdicted · none · ref 74
LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling cs.AI · 2026-02-15 · unverdicted · none · ref 49
HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower computational cost on LOCOMO and LongMemEval benchmarks.
SWE-MeM: Learning Adaptive Memory Management for Long-Horizon Coding Agents cs.SE · 2026-06-26 · unverdicted · none · ref 58
SWE-MeM introduces adaptive memory management for coding agents via synthesized trajectories and Memory-aware GRPO, reporting 43.4% and 60.2% resolve rates on SWE-Bench Verified for 4B and 30B models while beating baselines on performance and token use.
When Does Overlap Help? OSU-Mem and a Cell-Conditional Analysis of Trajectory Memory for LLM Agents cs.IR · 2026-06-19 · unverdicted · none · ref 16
OSU-Mem shows overlapping memory helps retrieval when evidence shares tools or entities but hurts when steps are heterogeneous, with benefits on synthetic benchmarks vanishing on mixed real ones due to query mixing.
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application cs.CL · 2026-06-10 · unverdicted · none · ref 281
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning cs.AI · 2026-06-09 · unverdicted · none · ref 9
ActiveMem proposes a heterogeneous distributed memory framework for LLM agents that separates planning from active memory management, reporting SOTA accuracy with lower overhead on BrowseComp-Plus and GAIA.
ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay cs.AI · 2026-05-27 · unverdicted · none · ref 5
ZipRL is a new RL-based adaptive compression method for multi-turn LLM agents that adds multi-granularity prompts and hindsight response replay to GRPO, reporting 27.9-34.7% gains on five agent tasks with maintained token efficiency.
Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning cs.AI · 2026-05-14 · unverdicted · none · ref 36
LaMR decomposes code context pruning into two rubrics using dedicated CRFs, a mixture-of-experts gate, and AST-derived labels to filter noise and often match or beat full-context baselines on coding benchmarks.
SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context cs.AI · 2026-04-13 · unverdicted · none · ref 8
SWE-AGILE introduces a Dynamic Reasoning Context with sliding windows of detailed steps and compressed Reasoning Digests to enable efficient long-horizon reasoning in software engineering agents, claiming new benchmark results on SWE-Bench-Verified for 7B-8B models.

Agentfold: Long-horizon web agents with proactive context management

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer