AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use
Pith reviewed 2026-05-09 21:41 UTC · model grok-4.3
The pith
Small language models trained with multi-round RL on synthetic data from dual data flywheels achieve strong agentic performance and close the gap with larger models on industrial search and data analysis tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the AgenticQwen family of models trained via multi-round reinforcement learning on synthetic data and limited open-source data. The training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks.
What carries the argument
Dual data flywheels: the reasoning flywheel that increases task difficulty by learning from errors, and the agentic flywheel that expands linear workflows into multi-branch behavior trees, together generating the training data for multi-round RL.
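The agentic flywheel's expansion step can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Node` class, the `guard`/`on_fail` step keys, the action labels, and the purchase workflow are all hypothetical, chosen only to show how attaching one failure branch to each step of a linear workflow multiplies the number of distinct scenarios.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a behavior tree: a condition plus the action it leads to."""
    condition: str
    action: str  # "EXECUTE", "REFUSE", or "CLARIFY" (labels are assumptions)
    children: list["Node"] = field(default_factory=list)


def expand_workflow(steps: list[dict]) -> Node:
    """Turn a linear workflow into a tree with one failure branch per step.

    Each step is a dict with hypothetical keys:
      "name"    - the happy-path operation
      "guard"   - the condition that must hold before the step runs
      "on_fail" - what the agent should do when the guard fails
    """
    root = Node(condition="request received", action="EXECUTE")
    cursor = root
    for step in steps:
        # Failure branch: the guard does not hold, so the agent must not proceed.
        cursor.children.append(Node(f"not ({step['guard']})", step["on_fail"]))
        # Happy-path branch: the guard holds and the workflow continues.
        ok = Node(step["guard"], "EXECUTE")
        cursor.children.append(ok)
        cursor = ok
    return root


def count_paths(node: Node) -> int:
    """Number of root-to-leaf paths, i.e. distinct scenarios to sample from."""
    if not node.children:
        return 1
    return sum(count_paths(child) for child in node.children)


# Illustrative three-step purchase workflow (made up for this sketch).
workflow = [
    {"name": "identify_user", "guard": "user_id provided", "on_fail": "CLARIFY"},
    {"name": "check_inventory", "guard": "item available", "on_fail": "REFUSE"},
    {"name": "confirm_order", "guard": "payment succeeded", "on_fail": "REFUSE"},
]
tree = expand_workflow(workflow)
print(count_paths(tree))  # 4: one happy path plus three failure branches
```

The linear workflow has a single path; the expanded tree has four. Each root-to-leaf path could then serve as the seed for one training task, which is the sense in which branching raises decision complexity.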
If this is right
- Small models become viable for multi-step reasoning and tool use in real-world industrial settings under strict cost and latency constraints.
- Automatic generation of increasingly hard tasks via the two flywheels reduces the need for large amounts of human-annotated data.
- Performance on public agentic benchmarks and internal search and data analysis tasks approaches that of much larger models.
- The approach supports deployment of agentic systems where model size must remain small for efficiency.
Where Pith is reading between the lines
- The method could be tested on domains outside tool use, such as planning or code generation, to check if the flywheel structure generalizes.
- Combining the two flywheels might reduce the total compute needed for agent training compared with scaling model size alone.
- If the branching behavior trees capture decision complexity well, similar structures could improve simulation-based training in robotics or game agents.
Load-bearing premise
That the synthetic tasks generated by the dual flywheels are sufficiently representative of real industrial decision complexity, and that the performance gains transfer beyond the specific benchmarks and internal system described.
What would settle it
Evaluating the trained models on a fresh collection of industrial agent tasks drawn from a different domain or company workflow and measuring whether the performance gap to larger models on search and data analysis remains closed.
Original abstract
Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba-pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi-sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the AgenticQwen family of small language models trained via multi-round reinforcement learning on synthetic data generated by dual data flywheels (a reasoning flywheel that escalates difficulty from errors and an agentic flywheel that converts linear workflows into multi-branch behavior trees), supplemented by limited open-source data. It claims strong performance on public agentic benchmarks and that the models close the gap with much larger models on search and data analysis tasks within the authors' industrial agent system, with code, models, and partial data released publicly.
Significance. If the empirical claims hold after addressing validation gaps, the work would be significant for enabling practical, low-latency agentic capabilities in industrial settings using small models rather than relying on larger ones. The dual-flywheel mechanism for automated task generation and the public release of the data synthesis pipeline, RL training code, and model checkpoints represent clear strengths that support reproducibility and further research.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments/Industrial Evaluation): The headline claim that the models 'close the gap with much larger models on search and data analysis tasks' in the industrial system is load-bearing for the paper's contribution, yet the provided text supplies no concrete metrics (e.g., success rates, latency, cost comparisons), baselines, error bars, or ablation results; without these, the magnitude and reliability of the reported gains cannot be assessed.
- [§3.2] §3.2 (Agentic Flywheel description): The transfer claim to industrial settings depends on the generated multi-branch behavior trees matching real decision complexity, yet no quantitative statistics are reported (e.g., distribution of tree depths, branching-factor entropy, or coverage of tool-interaction failure modes) comparing the synthetic tasks to the internal evaluation distribution; this directly affects whether performance gains generalize beyond the authors' system.
minor comments (2)
- [Abstract] Abstract: The phrase 'dual data flywheels' is used without an immediate high-level diagram or concise definition, which would help readers quickly grasp the two distinct mechanisms before the detailed methods section.
- [§5] §5 (Conclusion): A brief limitations paragraph discussing potential mismatches between synthetic task distributions and real-world industrial branching/uncertainty would strengthen the manuscript.
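The descriptors requested in the second major comment (tree-depth distribution and branching-factor entropy) are straightforward to compute once a tree is available in machine-readable form. The sketch below assumes a nested-dict encoding with a `children` key; that encoding and the example tree are assumptions for illustration, not the format used in the paper's released data.

```python
import math
from collections import Counter

# Assumed encoding: a tree is a nested dict with an optional "children" list.
Tree = dict


def leaf_depths(tree: Tree, depth: int = 0) -> list[int]:
    """Depth of every leaf, with the root at depth 0."""
    children = tree.get("children", [])
    if not children:
        return [depth]
    return [d for child in children for d in leaf_depths(child, depth + 1)]


def branching_factors(tree: Tree) -> list[int]:
    """Number of children at every internal (non-leaf) node."""
    children = tree.get("children", [])
    if not children:
        return []
    factors = [len(children)]
    for child in children:
        factors.extend(branching_factors(child))
    return factors


def entropy_bits(values: list[int]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up example: a root with two leaf children and one two-leaf subtree.
example = {"children": [{}, {}, {"children": [{}, {}]}]}

print(Counter(leaf_depths(example)))               # two leaves at depth 1, two at depth 2
print(branching_factors(example))                  # [3, 2]
print(entropy_bits(branching_factors(example)))    # 1.0
```

Running the same functions over the synthetic trees and over a sample of real workflows would give the side-by-side comparison the comment asks for, provided both can be expressed in a common encoding.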
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and data characterization where possible.
- Referee: [Abstract and §4] Abstract and §4 (Experiments/Industrial Evaluation): The headline claim that the models 'close the gap with much larger models on search and data analysis tasks' in the industrial system is load-bearing for the paper's contribution, yet the provided text supplies no concrete metrics (e.g., success rates, latency, cost comparisons), baselines, error bars, or ablation results; without these, the magnitude and reliability of the reported gains cannot be assessed.
  Authors: We agree that the industrial evaluation section requires more granular quantitative support for the headline claim. In the revised manuscript we will expand §4 with a table reporting success rates, latency, and cost metrics for AgenticQwen models versus larger baselines on the search and data analysis tasks, including error bars from multiple runs and a brief ablation discussion. These additions will allow readers to directly evaluate the magnitude and reliability of the reported improvements.
  revision: yes
- Referee: [§3.2] §3.2 (Agentic Flywheel description): The transfer claim to industrial settings depends on the generated multi-branch behavior trees matching real decision complexity, yet no quantitative statistics are reported (e.g., distribution of tree depths, branching-factor entropy, or coverage of tool-interaction failure modes) comparing the synthetic tasks to the internal evaluation distribution; this directly affects whether performance gains generalize beyond the authors' system.
  Authors: We recognize that characterizing the alignment between synthetic and real tasks is important for assessing generalization. Because the internal industrial evaluation distribution is proprietary, we cannot release or directly compare statistics from the real tasks. In the revision we will add quantitative descriptors of the generated behavior trees (tree-depth distributions, branching-factor statistics, and entropy) to §3.2 and explain how the flywheel targets observed tool-interaction failure modes. This provides a transparent view of the synthetic data while respecting confidentiality.
  revision: partial
- Direct quantitative comparison between the synthetic multi-branch behavior trees and the proprietary internal evaluation distribution, which cannot be disclosed for confidentiality reasons.
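The error bars promised in the first response could be reported as a mean success rate with a standard error across independent runs. The sketch below shows that arithmetic only; the run outcomes are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math
import statistics


def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks completed successfully in one evaluation run."""
    return sum(outcomes) / len(outcomes)


def mean_and_stderr(rates: list[float]) -> tuple[float, float]:
    """Mean across runs and standard error of the mean (sample std / sqrt(n))."""
    mean = statistics.mean(rates)
    stderr = statistics.stdev(rates) / math.sqrt(len(rates))
    return mean, stderr


# Placeholder data: three runs of four tasks each (NOT results from the paper).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]
rates = [success_rate(run) for run in runs]
mean, stderr = mean_and_stderr(rates)
print(f"{mean:.3f} +/- {stderr:.3f}")  # 0.750 +/- 0.144
```

With only a handful of runs the standard error is itself noisy, so a table built this way should also state the number of runs and tasks per run.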
Circularity Check
No circularity: empirical training pipeline with no derivation chain
The paper presents an empirical method for training small agentic models via multi-round RL on synthetic data from dual flywheels (reasoning flywheel for error-driven difficulty and agentic flywheel for expanding workflows into behavior trees). No mathematical equations, first-principles derivations, or predictions are claimed that reduce by construction to fitted inputs, self-citations, or ansatzes. All claims rest on described training procedures and reported benchmark/industrial results, which are self-contained without load-bearing reductions to prior author work or tautological definitions. This is standard non-circular empirical ML research.
Axiom & Free-Parameter Ledger
free parameters (1)
- RL training hyperparameters
axioms (1)
- domain assumption: Synthetic data generated by error-driven and branching flywheels improves agentic reasoning and tool use
invented entities (1)
- Dual data flywheels: no independent evidence