SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
Narasimhan, and Yuan Cao
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: method 1
citation-polarity summary
years: 2026 (2)
verdicts: UNVERDICTED (2)
roles: method (1)
polarities: use method (1)
representative citing papers
Common gating signals for adaptive LLM compute have unstable directions across settings, and DIAL learns per-setting utility directions from signal-agnostic counterfactuals to outperform fixed-direction baselines.
citing papers explorer
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
  SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- Same Signal, Opposite Meaning: Direction-Informed Adaptive Learning for LLM Agents
  Common gating signals for adaptive LLM compute have unstable directions across settings, and DIAL learns per-setting utility directions from signal-agnostic counterfactuals to outperform fixed-direction baselines.