arxiv: 2604.21590 · v1 · submitted 2026-04-23 · 💻 cs.CL

AgenticQwen: Training Small Agentic Language Models with Dual Data Flywheels for Industrial-Scale Tool Use

Yuanjie Lyu, Chengyu Wang, Haonan Zheng, Yuanhao Yue, Junbing Yan, Ming Wang, Jun Huang

Pith reviewed 2026-05-09 21:41 UTC · model grok-4.3

classification 💻 cs.CL

keywords agentic language models · tool use · reinforcement learning · synthetic data generation · data flywheels · small models · industrial applications · multi-step reasoning


The pith

Small language models trained with dual RL flywheels on synthetic data achieve strong agentic performance and close the gap with larger models on industrial search and data analysis tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that small models can be made effective agents for complex tool use by combining multi-round reinforcement learning with two automatic data generation loops. One loop hardens reasoning tasks by focusing on errors, while the other turns simple linear workflows into branching decision trees that mirror real-world complexity. A sympathetic reader would care because this approach reduces dependence on large human-labeled datasets and makes capable agents feasible under tight cost and latency limits common in industry.

Core claim

We introduce the AgenticQwen family of models trained via multi-round reinforcement learning on synthetic data and limited open-source data. The training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks.
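One way to read the reasoning flywheel described above is as a loop that collects the tasks a model fails and feeds harder variants of them back into the training pool. The sketch below illustrates that reading only; it is not the authors' implementation. `Task`, `harden`, `flywheel_round`, the integer `difficulty` field, and the toy solver are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Task:
    prompt: str
    difficulty: int  # hypothetical integer difficulty level


def harden(task: Task) -> Task:
    """Stand-in for a generator that writes a harder variant of a failed task."""
    return Task(prompt=task.prompt + " (harder variant)",
                difficulty=task.difficulty + 1)


def flywheel_round(
    pool: List[Task], solver: Callable[[Task], bool]
) -> Tuple[List[Task], List[Task]]:
    """Run one round: find the failures, then add a harder variant of each."""
    failures = [t for t in pool if not solver(t)]
    harder = [harden(t) for t in failures]
    return pool + harder, failures


if __name__ == "__main__":
    # Toy solver (an assumption): succeeds only on tasks of difficulty <= 2.
    def toy_solver(task: Task) -> bool:
        return task.difficulty <= 2

    pool = [Task("add two numbers", 1), Task("multi-step word problem", 3)]
    new_pool, failures = flywheel_round(pool, toy_solver)
    print([t.difficulty for t in new_pool], len(failures))
```

In a real pipeline the solver would be a model rollout scored by a verifier and `harden` would be a generation step; both are reduced to plain functions here so the control flow is easy to follow.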

What carries the argument

Dual data flywheels: the reasoning flywheel that increases task difficulty by learning from errors, and the agentic flywheel that expands linear workflows into multi-branch behavior trees, together generating the training data for multi-round RL.
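The expansion of a linear workflow into a multi-branch behavior tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Node` class, `expand_workflow`, `enumerate_paths`, the `REFUSE` and `EXECUTE` action labels, and the purchase workflow are hypothetical, and the failure branches are supplied by hand where the paper's flywheel would generate them automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str                 # step name or branch condition
    action: str = ""           # "EXECUTE" or "REFUSE" on leaves, "" otherwise
    children: List["Node"] = field(default_factory=list)


def expand_workflow(
    steps: List[str], failures: Dict[str, List[Tuple[str, str]]]
) -> Node:
    """Chain the steps, hanging each step's failure branches under that step."""
    root = Node("start")
    cursor = root
    for step in steps:
        node = Node(step)
        cursor.children.append(node)
        for condition, action in failures.get(step, []):
            node.children.append(Node(condition, action))
        cursor = node
    # The original linear workflow survives as the single success path.
    cursor.children.append(Node("all checks passed", "EXECUTE"))
    return root


def enumerate_paths(node: Node, prefix: Tuple[str, ...] = ()):
    """List every root-to-leaf path together with the leaf's action."""
    path = prefix + (node.label,)
    if not node.children:
        return [(path, node.action)]
    paths = []
    for child in node.children:
        paths.extend(enumerate_paths(child, path))
    return paths


if __name__ == "__main__":
    steps = ["check_inventory", "process_payment", "confirm_order"]
    failures = {
        "check_inventory": [("item unavailable", "REFUSE")],
        "process_payment": [("payment failed", "REFUSE")],
    }
    for path, action in enumerate_paths(expand_workflow(steps, failures)):
        print(" > ".join(path), "->", action)
```

Each enumerated path is a candidate training scenario: one success path that preserves the original workflow, plus one path per failure branch in which the correct behavior is to stop rather than proceed.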

If this is right

Small models become viable for multi-step reasoning and tool use in real-world industrial settings under strict cost and latency constraints.
Automatic generation of increasingly hard tasks via the two flywheels reduces the need for large amounts of human-annotated data.
Performance on public agentic benchmarks and internal search and data analysis tasks approaches that of much larger models.
The approach supports deployment of agentic systems where model size must remain small for efficiency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on domains outside tool use, such as planning or code generation, to check if the flywheel structure generalizes.
Combining the two flywheels might reduce the total compute needed for agent training compared with scaling model size alone.
If the branching behavior trees capture decision complexity well, similar structures could improve simulation-based training in robotics or game agents.

Load-bearing premise

That the synthetic tasks generated by the dual flywheels are sufficiently representative of real industrial decision complexity, and that the performance gains transfer beyond the specific benchmarks and internal system described.

What would settle it

Evaluating the trained models on a fresh collection of industrial agent tasks drawn from a different domain or company workflow and measuring whether the performance gap to larger models on search and data analysis remains closed.

Figures

Figures reproduced from arXiv: 2604.21590 by Chengyu Wang, Haonan Zheng, Junbing Yan, Jun Huang, Ming Wang, Yuanhao Yue, Yuanjie Lyu.

**Figure 2.** Performance gains from iterative data flywheel training. Across TAU…

**Figure 3.** Case study of AgenticQwen in a production agentic system for data analytics.

**Figure 4.** Expected execution: compliant path (left) ver…

**Figure 5.** Prompt for workflow expansion and agent-instruction generation (Part 1: Objective and tool design).

**Figure 6.** Prompt for workflow expansion and agent-instruction generation (Part 2: Behavior tree structure).

**Figure 7.** Prompt for workflow expansion and agent-instruction generation (Part 3: Output format).

**Figure 8.** Prompt for converting branches into executable test cases (Part 1: User input).

**Figure 9.** Prompt for converting branches into executable test cases (Part 2: Output format).

Original abstract

Modern industrial applications increasingly demand language models that act as agents, capable of multi-step reasoning and tool use in real-world settings. These tasks are typically performed under strict cost and latency constraints, making small agentic models highly desirable. In this paper, we introduce the AgenticQwen family of models, trained via multi-round reinforcement learning (RL) on synthetic data and a limited amount of open-source data. Our training framework combines reasoning RL and agentic RL with dual data flywheels that automatically generate increasingly challenging tasks. The reasoning flywheel increases task difficulty by learning from errors, while the agentic flywheel expands linear workflows into multi-branch behavior trees that better reflect the decision complexity of real-world applications. We validate AgenticQwen on public benchmarks and in an industrial agent system. The models achieve strong performance on multiple agentic benchmarks, and in our industrial agent system, close the gap with much larger models on search and data analysis tasks. Model checkpoints and part of the synthetic data: https://huggingface.co/collections/alibaba-pai/agenticqwen. Data synthesis and RL training code: https://github.com/haruhi-sudo/data_synth_and_rl. The data synthesis pipeline is also integrated into EasyDistill: https://github.com/modelscope/easydistill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgenticQwen gives a practical dual-flywheel recipe for training small tool-using models that close gaps on benchmarks and internal tasks, but the industrial transfer rests on unshown task similarity.


The main point is that this work trains small Qwen variants with multi-round RL on synthetic data generated by two self-improving loops: a reasoning flywheel that escalates difficulty from past errors, and an agentic flywheel that converts linear workflows into multi-branch behavior trees. The resulting models post strong numbers on public agent benchmarks and narrow the distance to much larger models inside the authors' own search and data-analysis system, all while staying under tight cost and latency limits. The authors also release code, checkpoints, and part of the data, which makes the recipe straightforward to reuse.

Referee Report

2 major / 2 minor

Summary. The paper introduces the AgenticQwen family of small language models trained via multi-round reinforcement learning on synthetic data generated by dual data flywheels (a reasoning flywheel that escalates difficulty from errors and an agentic flywheel that converts linear workflows into multi-branch behavior trees), supplemented by limited open-source data. It claims strong performance on public agentic benchmarks and that the models close the gap with much larger models on search and data analysis tasks within the authors' industrial agent system, with code, models, and partial data released publicly.

Significance. If the empirical claims hold after addressing validation gaps, the work would be significant for enabling practical, low-latency agentic capabilities in industrial settings using small models rather than relying on larger ones. The dual-flywheel mechanism for automated task generation and the public release of the data synthesis pipeline, RL training code, and model checkpoints represent clear strengths that support reproducibility and further research.

major comments (2)

[Abstract and §4] (Experiments/Industrial Evaluation): The headline claim that the models 'close the gap with much larger models on search and data analysis tasks' in the industrial system is load-bearing for the paper's contribution, yet the provided text supplies no concrete metrics (e.g., success rates, latency, cost comparisons), baselines, error bars, or ablation results; without these, the magnitude and reliability of the reported gains cannot be assessed.
[§3.2] (Agentic Flywheel description): The transfer claim to industrial settings depends on the generated multi-branch behavior trees matching real decision complexity, yet no quantitative statistics are reported (e.g., distribution of tree depths, branching-factor entropy, or coverage of tool-interaction failure modes) comparing the synthetic tasks to the internal evaluation distribution; this directly affects whether performance gains generalize beyond the authors' system.

minor comments (2)

[Abstract] The phrase 'dual data flywheels' is used without an immediate high-level diagram or concise definition; either would help readers quickly grasp the two distinct mechanisms before the detailed methods section.
[§5] (Conclusion): A brief limitations paragraph discussing potential mismatches between synthetic task distributions and real-world industrial branching/uncertainty would strengthen the manuscript.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of results and data characterization where possible.


Referee: [Abstract and §4] (Experiments/Industrial Evaluation): The headline claim that the models 'close the gap with much larger models on search and data analysis tasks' in the industrial system is load-bearing for the paper's contribution, yet the provided text supplies no concrete metrics (e.g., success rates, latency, cost comparisons), baselines, error bars, or ablation results; without these, the magnitude and reliability of the reported gains cannot be assessed.

Authors: We agree that the industrial evaluation section requires more granular quantitative support for the headline claim. In the revised manuscript we will expand §4 with a table reporting success rates, latency, and cost metrics for AgenticQwen models versus larger baselines on the search and data analysis tasks, including error bars from multiple runs and a brief ablation discussion. These additions will allow readers to directly evaluate the magnitude and reliability of the reported improvements.

Revision: yes

Referee: [§3.2] (Agentic Flywheel description): The transfer claim to industrial settings depends on the generated multi-branch behavior trees matching real decision complexity, yet no quantitative statistics are reported (e.g., distribution of tree depths, branching-factor entropy, or coverage of tool-interaction failure modes) comparing the synthetic tasks to the internal evaluation distribution; this directly affects whether performance gains generalize beyond the authors' system.

Authors: We recognize that characterizing the alignment between synthetic and real tasks is important for assessing generalization. Because the internal industrial evaluation distribution is proprietary, we cannot release or directly compare statistics from the real tasks. In the revision we will add quantitative descriptors of the generated behavior trees (tree-depth distributions, branching-factor statistics, and entropy) to §3.2 and explain how the flywheel targets observed tool-interaction failure modes. This provides a transparent view of the synthetic data while respecting confidentiality.

Revision: partial

standing simulated objections not resolved

Direct quantitative comparison between the synthetic multi-branch behavior trees and the proprietary internal evaluation distribution, which cannot be disclosed for confidentiality reasons.
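The descriptors at issue in this exchange (tree-depth distribution, branching factors, and entropy) are cheap to compute once a behavior tree is available as nested data. The sketch below is one possible way to do so, under assumptions: the nested-dict tree format and the function names are hypothetical, and "branching-factor entropy" is read here as the Shannon entropy of the empirical distribution of branching factors over internal nodes, which the paper may define differently.

```python
import math
from collections import Counter
from typing import Any, Dict, List

Tree = Dict[str, Any]  # assumed shape: {"label": str, "children": [Tree, ...]}


def leaf_depths(tree: Tree, depth: int = 0) -> List[int]:
    """Depth of every leaf, with the root at depth 0."""
    children = tree.get("children", [])
    if not children:
        return [depth]
    depths: List[int] = []
    for child in children:
        depths.extend(leaf_depths(child, depth + 1))
    return depths


def branching_factors(tree: Tree) -> List[int]:
    """Number of children of every internal (non-leaf) node."""
    children = tree.get("children", [])
    if not children:
        return []
    factors = [len(children)]
    for child in children:
        factors.extend(branching_factors(child))
    return factors


def shannon_entropy(values: List[int]) -> float:
    """Entropy in bits of the empirical distribution of `values`."""
    if not values:
        return 0.0
    counts = Counter(values)
    total = len(values)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Hand-built toy tree, for illustration only.
    tree = {"label": "root", "children": [
        {"label": "missing id", "children": []},
        {"label": "checks", "children": [
            {"label": "refuse", "children": []},
            {"label": "execute", "children": []},
        ]},
        {"label": "no permission", "children": []},
    ]}
    print(Counter(leaf_depths(tree)))
    print(branching_factors(tree), shannon_entropy(branching_factors(tree)))
```

Statistics like these describe the synthetic trees only; they do not by themselves resolve the standing objection, which concerns comparison against the undisclosed internal task distribution.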

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with no derivation chain


The paper presents an empirical method for training small agentic models via multi-round RL on synthetic data from dual flywheels (reasoning flywheel for error-driven difficulty and agentic flywheel for expanding workflows into behavior trees). No mathematical equations, first-principles derivations, or predictions are claimed that reduce by construction to fitted inputs, self-citations, or ansatzes. All claims rest on described training procedures and reported benchmark/industrial results, which are self-contained without load-bearing reductions to prior author work or tautological definitions. This is standard non-circular empirical ML research.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Central claim rests on the effectiveness of synthetic data generation via flywheels and transfer to industrial tasks; no specific numerical free parameters listed in abstract.

free parameters (1)

RL training hyperparameters
Standard in reinforcement learning but unspecified in abstract

axioms (1)

Domain assumption: Synthetic data generated by error-driven and branching flywheels improves agentic reasoning and tool use
Core premise of the training framework described in abstract

invented entities (1)

Dual data flywheels (no independent evidence)
purpose: Automatically generate increasingly challenging reasoning and multi-branch agentic tasks
New mechanism introduced to scale training data difficulty


