SCRIBE: Structured Mid-Level Supervision for Tool-Using Language Models

· 2026 · cs.AI · arXiv 2601.03555

12 Pith papers cite this work. Polarity classification is still indexing.

abstract

Training reliable tool-augmented agents remains a significant challenge, largely due to the difficulty of credit assignment in multi-step reasoning. While process-level reward models offer a promising direction, existing LLM-based judges often produce noisy and inconsistent signals because they lack fine-grained, task-specific rubrics to distinguish high-level planning from low-level execution. In this work, we introduce SCRIBE (Skill-Conditioned Reward with Intermediate Behavioral Evaluation), a reinforcement learning framework that intervenes at a novel mid-level abstraction. SCRIBE grounds reward modeling in a curated library of skill prototypes, transforming open-ended LLM evaluation into a constrained verification problem. By routing each subgoal to a corresponding prototype, the reward model is equipped with precise, structured rubrics that substantially reduce reward variance. Experimental results show that SCRIBE achieves state-of-the-art performance across a range of reasoning and tool-use benchmarks. In particular, it improves the AIME25 accuracy of a Qwen3-4B model from 43.3% to 63.3%, and significantly increases success rates in complex multi-turn tool interactions. Further analysis of training dynamics reveals a co-evolution across abstraction levels, where mastery of mid-level skills consistently precedes the emergence of effective high-level planning behaviors. Finally, we demonstrate that SCRIBE is additive to low-level tool optimizations, providing a scalable and complementary pathway toward more autonomous and reliable tool-using agents.
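
The routing idea in the abstract (send each subgoal to a skill prototype, then score the step against that prototype's rubric) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract describes an LLM-based reward model, whereas the sketch below swaps in keyword-overlap routing and simple string predicates as rubric checks. All names here (`SkillPrototype`, `route`, `mid_level_reward`, `LIBRARY`, and the two example skills) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class SkillPrototype:
    """Hypothetical skill prototype: a name, routing keywords, and a rubric."""
    name: str
    keywords: FrozenSet[str]
    rubric: List[Callable[[str], bool]]  # each check inspects one step's output


def route(subgoal: str, library: List[SkillPrototype]) -> SkillPrototype:
    """Pick the prototype whose keywords overlap most with the subgoal text.

    Assumption: keyword overlap stands in for whatever router the paper uses.
    """
    words = set(subgoal.lower().split())
    return max(library, key=lambda proto: len(proto.keywords & words))


def mid_level_reward(subgoal: str, step_output: str,
                     library: List[SkillPrototype]) -> float:
    """Fraction of rubric checks the step passes for its routed prototype."""
    proto = route(subgoal, library)
    passed = sum(1 for check in proto.rubric if check(step_output))
    return passed / len(proto.rubric)


# Toy library with two made-up skills and string-based rubric checks.
LIBRARY = [
    SkillPrototype(
        name="call_calculator",
        keywords=frozenset({"compute", "calculate", "evaluate"}),
        rubric=[lambda out: "calculator(" in out, lambda out: "=" in out],
    ),
    SkillPrototype(
        name="search_lookup",
        keywords=frozenset({"find", "search", "lookup"}),
        rubric=[lambda out: "search(" in out, lambda out: "source:" in out],
    ),
]

full = mid_level_reward("compute the sum of the roots",
                        "calculator(3+4) = 7", LIBRARY)
partial = mid_level_reward("search for the population",
                           "search(population)", LIBRARY)
print(full, partial)  # 1.0 0.5
```

Because each step is scored only against a short, fixed checklist, two evaluations of the same step give the same score. That is the sense in which constrained verification could reduce reward variance relative to open-ended judging, though the paper's actual rubrics and router may differ substantially from this sketch.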

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

LQM-ContextRoute routes LLM tool calls via latency-quality matching in a contextual bandit, improving F1 by 2.18 pp, accuracy by up to 18 pp, and NDCG by 2.91-3.22 pp over SW-UCB on web-search, StrategyQA, and retriever benchmarks.

When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems

cs.SE · 2026-05-30 · unverdicted · novelty 6.0

About 18.2% of structurally flagged skill pairs represent genuine compositional safety risks in agent skill registries, with exploitation gated by host model behavior.

Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance

cs.CL · 2026-05-29 · unverdicted · novelty 6.0 · 2 refs

TOPD improves on-policy distillation for LLM reasoning by using near-future guidance to identify divergent states, raising average accuracy from 47.8% to 52.2% on math benchmarks including AIME24 and AIME25.

Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation

cs.CL · 2026-05-10 · unverdicted · novelty 6.0 · 3 refs

Rock Tokens in on-policy distillation persist at high loss, account for up to 18% of outputs, absorb large gradient norms, but add negligible value to reasoning performance.

MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

MUSE decouples reconstruction and semantic learning in visual tokenization via topological orthogonality, yielding SOTA generation quality and improved semantic performance over its teacher model.

MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment

cs.LG · 2026-06-27 · unverdicted · novelty 5.0

MOSAIC combines frozen-LLM semantic embeddings with hierarchical consistency objectives to report up to 3.4% AUC gains on knowledge-tracing benchmarks including a new MOOC dataset.

MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens

cs.CR · 2026-06-15 · unverdicted · novelty 5.0

MASCOT-Android curates Android malware source code specimens via a GitHub collection pipeline whose README-only LinearSVC classifier on character TF-IDF features reaches 96.28% accuracy and 1.06% FPR.

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

cs.SD · 2026-04-13 · unverdicted · novelty 5.0

ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

A survey of credit assignment techniques in LLM reinforcement learning that distinguishes maturing methods for reasoning from new approaches needed for agentic settings and provides supporting resources.

RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval

cs.CV · 2026-06-10 · unverdicted · novelty 4.0

RankVR introduces GSCP and ASVC modules to improve CIR robustness by decoupling clean samples via low-rank structure and dynamically scoring triplet value in noisy datasets.

IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval

cs.CV · 2026-06-06 · unverdicted · novelty 4.0

IMAGINE uses adaptive schema-imagery via dynamic multimodal prototypes to incorporate implicit semantics into composed video retrieval, claiming SOTA results on CVR and CIR benchmarks.

Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

cs.LG · 2026-05-08 · unverdicted · novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

citing papers explorer

Showing 12 of 12 citing papers after filters.

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents · cs.LG · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems · cs.SE · 2026-05-30 · unverdicted · none · ref 38 · internal anchor
Bridging Reasoning Trajectories in On-Policy Distillation via Near-Future Guidance · cs.CL · 2026-05-29 · unverdicted · none · ref 45 · 2 links · internal anchor
Cornerstones or Stumbling Blocks? Deciphering the Rock Tokens in On-Policy Distillation · cs.CL · 2026-05-10 · unverdicted · none · ref 14 · 3 links · internal anchor
MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality · cs.CV · 2026-05-07 · unverdicted · none · ref 105 · internal anchor
MOSAIC: Orchestrating Collaborative Knowledge Tracing with Hierarchical Semantic Alignment · cs.LG · 2026-06-27 · unverdicted · none · ref 31 · internal anchor
MASCOT-Android: A Curated Dataset and Automated Collection Pipeline for Android Malware Source Code Specimens · cs.CR · 2026-06-15 · unverdicted · none · ref 15 · internal anchor
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing · cs.SD · 2026-04-13 · unverdicted · none · ref 25 · internal anchor
From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models · cs.CL · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
RankVR: Low-Rank Structure Perception and Value Recalibration for Robust Composed Image Retrieval · cs.CV · 2026-06-10 · unverdicted · none · ref 48 · internal anchor
IMAGINE: Adaptive Schema-Imagery Enhanced Composition for Composed Video Retrieval · cs.CV · 2026-06-06 · unverdicted · none · ref 51 · internal anchor
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems · cs.LG · 2026-05-08 · unverdicted · none · ref 166 · internal anchor
