Spark: Strategic Policy-Aware Exploration via Dynamic Branching for Long-Horizon Agentic Learning

Changpeng Yang; Jianhua Tao; Jinyang Wu; Shuai Zhang; Shuo Yang; Yuhao Shen; Zhengqi Wen

arXiv: 2601.20209 · v2 · pith:NZ7RDKPLnew · submitted 2026-01-28 · cs.LG · cs.CL

Jinyang Wu, Shuo Yang, Changpeng Yang, Yuhao Shen, Shuai Zhang, Zhengqi Wen, Jianhua Tao

keywords: exploration, spark, branching, agent, critical, decision, dynamic

Reinforcement learning has empowered large language models to act as intelligent agents, yet training them for long-horizon tasks remains challenging due to the scarcity of high-quality trajectories, especially under limited resources. Existing methods typically scale up rollout sizes and allocate computational resources indiscriminately across intermediate steps. Such approaches inherently waste a substantial share of the computation budget on trivial steps while failing to guarantee sample quality. To address this, we propose Spark (Strategic Policy-Aware exploRation via Key-state dynamic branching), a novel framework that selectively branches at critical decision states for resource-efficient exploration. Our key insight is to activate adaptive branching exploration at critical decision points to probe promising trajectories, thereby achieving precise resource allocation that prioritizes sampling quality over blind coverage. This design leverages the agent's intrinsic decision-making signals to reduce dependence on human priors, enabling the agent to autonomously expand exploration and achieve stronger generalization. Experiments across diverse tasks (e.g., embodied planning) demonstrate that Spark achieves superior success rates with significantly fewer training samples, exhibiting robust generalization even in unseen scenarios. Our code and checkpoints are available at https://github.com/jinyangwu/SPARK.
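
The idea of branching only at critical decision states can be illustrated with a small toy sketch. This is not the authors' implementation: the abstract does not say which intrinsic signal Spark uses, so the sketch assumes, purely for illustration, that a state is "critical" when the entropy of the policy's action distribution exceeds a threshold. The names `policy`, `step`, `branch_k`, `threshold`, and the shared `budget` counter are all hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rollout(policy, step, state, horizon, budget,
            branch_k=3, threshold=0.5, rng=None):
    """Collect trajectories, spending extra samples only at uncertain states.

    policy(state) -> list of action probabilities   (hypothetical interface)
    step(state, action) -> next state               (hypothetical interface)
    budget: one-element list with the number of extra branches still allowed;
            it is mutated in place so all recursive calls share it.
    """
    rng = rng or random.Random(0)
    if horizon == 0:
        return [[]]
    probs = policy(state)
    n_children = 1
    # Assumed criticality test: high policy entropy means an uncertain decision.
    if budget[0] > 0 and entropy(probs) > threshold:
        extra = min(branch_k - 1, budget[0])
        budget[0] -= extra
        n_children = 1 + extra
    # Sampled with replacement, so sibling branches may repeat an action.
    actions = rng.choices(range(len(probs)), weights=probs, k=n_children)
    trajectories = []
    for action in actions:
        tails = rollout(policy, step, step(state, action), horizon - 1,
                        budget, branch_k, threshold, rng)
        trajectories.extend([action] + tail for tail in tails)
    return trajectories


def toy_policy(state):
    """Near-deterministic everywhere except at state 2, which is uncertain."""
    if state == 2:
        return [1 / 3, 1 / 3, 1 / 3]
    return [0.98, 0.01, 0.01]


def toy_step(state, action):
    """The toy state is just the step index, independent of the action."""
    return state + 1


budget = [4]
trajectories = rollout(toy_policy, toy_step, state=0, horizon=4, budget=budget)
print(len(trajectories), budget[0])  # 3 trajectories, 2 extra branches unspent
```

In this toy setting the single uncertain state receives three samples while the three near-deterministic states receive one each, so the rollout budget is concentrated where the decision is least certain rather than spread uniformly over every step.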

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles
cs.LG 2026-05 unverdicted novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and...

FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

FACT-E uses controlled perturbations as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments
cs.AI 2026-06 unverdicted novelty 5.0

MetaResearcher is a proposed multi-component framework for scaling deep research agent training via adversarial virtual worlds, discovery tasks, meta-rewards, and multi-agent collaboration.

StreamMeCo: Long-Term Agent Memory Compression for Efficient Streaming Video Understanding
cs.CV 2026-04 unverdicted novelty 5.0

StreamMeCo compresses agent memory by 70% in streaming video understanding, yielding 1.87x faster retrieval and 1% higher average accuracy on benchmarks.

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning
cs.CL 2026-06 unverdicted novelty 4.0

OPID distills episode- and step-level skills from completed on-policy trajectories, routes them via a critical-first mechanism, and combines the resulting log-probability shift advantage with outcome advantage for polic...