Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning
Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computational resources indiscriminately across intermediate steps. Such approaches waste a substantial share of the computational budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios. Our code and checkpoints are available at https://github.com/jinyangwu/SPARK.
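To make the idea of branching only at critical decision states concrete, here is a minimal sketch. It is not the authors' implementation. It assumes, purely for illustration, that the "intrinsic decision-making signal" is the entropy of the policy's action distribution at each step; the abstract does not say which signal Spark uses. The names `allocate_branches`, `threshold`, `branch_factor`, and `budget` are hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate_branches(
    step_action_probs: List[Sequence[float]],
    threshold: float,
    branch_factor: int,
    budget: int,
) -> List[int]:
    """Return how many rollouts to continue from each step.

    Every step gets 1 (the main trajectory). Steps whose entropy exceeds
    `threshold` get `branch_factor` instead, most uncertain first, until
    the extra rollouts would exceed `budget`. Entropy as the criticality
    signal is an assumption, not something the abstract specifies.
    """
    counts = [1] * len(step_action_probs)
    scores = [entropy(p) for p in step_action_probs]
    # Visit candidate steps from most to least uncertain.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    extra_per_branch = branch_factor - 1
    remaining = budget
    for i in order:
        if scores[i] <= threshold or remaining < extra_per_branch:
            break
        counts[i] = branch_factor
        remaining -= extra_per_branch
    return counts


# Hypothetical per-step action distributions for a 4-step trajectory.
probs = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.05, 0.03, 0.02],
    [0.40, 0.30, 0.20, 0.10],
]
plan = allocate_branches(probs, threshold=1.0, branch_factor=3, budget=4)
print(plan)  # [1, 3, 1, 3]
```

In this toy run the extra rollouts go only to the two high-entropy steps, while the near-deterministic steps keep a single continuation. Uniform allocation would instead spend the same number of rollouts on all four steps.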
Forward citations
Cited by 5 Pith papers
- Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
  Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...
- FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
  FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.
- MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments
  MetaResearcher is a proposed multi-component framework for scaling deep research agent training via adversarial virtual worlds, discovery tasks, meta-rewards, and multi-agent collaboration.
- StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
  StreamMeCo compresses agent memory by 70% in streaming video understanding, yielding 1.87x faster retrieval and 1% higher average accuracy on benchmarks.
- OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
  OPID distills episode- and step-level skills from completed on-policy trajectories, routes them via critical-first mechanism, and combines the resulting log-probability shift advantage with outcome advantage for polic...