HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Haoxiang Ma; Hongyu Zang; Jianing Wang; Linsen Guo; Qi Guo; Wei Wang; Wenjie Shi; Xiangyu Xi; Xiaoyu Li; Xunliang Cai

arxiv: 2605.02396 · v1 · submitted 2026-05-04 · 💻 cs.AI

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Jianing Wang , Linsen Guo , Zhengyu Chen , Qi Guo , Hongyu Zang , Wenjie Shi , Haoxiang Ma , Xiangyu Xi

show 3 more authors

Xiaoyu Li Wei Wang Xunliang Cai

This is my paper

Pith reviewed 2026-05-08 19:10 UTC · model grok-4.3

classification 💻 cs.AI

keywords HeavySkillheavy thinkingagentic harnessparallel reasoningsummarizationreinforcement learningself-evolving LLMsorchestration frameworks

0 comments

The pith

Heavy thinking operates as an internal two-stage skill in LLMs that outperforms external agent orchestration and scales via reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that performance in complex agentic systems stems primarily from an internalized capability called heavy thinking rather than from elaborate external coordination of agents and tools. This capability is defined as a two-stage process in which the model first produces multiple parallel reasoning paths and then summarizes them into a final output. If models can learn and expand this internal process, they could handle difficult reasoning tasks on their own without depending on fragile orchestration layers. Experiments across domains show the approach beats standard Best-of-N sampling, and reinforcement learning can further increase its depth and width.

Core claim

HeavySkill views heavy thinking as both a minimal execution unit in any orchestration harness and an inner skill internalized within the model's parameters. The skill is identified as a two-stage pipeline of parallel reasoning followed by summarization that drives complex task solving. Empirical results demonstrate that this internalized skill consistently outperforms traditional Best-of-N strategies, with stronger LLMs approaching Pass@N performance, and that its depth and width can be scaled through reinforcement learning to support self-evolving LLMs.

What carries the argument

The two-stage pipeline of parallel reasoning then summarization, functioning as a learnable and scalable inner skill beneath any agentic harness.

If this is right

This inner skill outperforms traditional Best-of-N strategies across diverse domains.
Stronger LLMs can approach Pass@N performance levels using the internalized process.
The depth and width of heavy thinking scale further when trained with reinforcement learning.
Self-evolving LLMs become possible that internalize complex reasoning without external orchestration layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent designs could simplify to single-model prompts that directly elicit the two-stage process instead of multi-agent coordination.
Reinforcement learning on this skill might allow models to iteratively improve their own reasoning capabilities over successive training cycles.
The same internal mechanism could be tested on tasks beyond current benchmarks to check whether orchestration overhead can be eliminated entirely.

Load-bearing premise

Heavy thinking can be isolated as a consistent two-stage internal pipeline that generalizes as a skill across models and harnesses and responds to reinforcement learning.

What would settle it

An experiment in which increasing the depth and width of the two-stage pipeline through reinforcement learning produces no gain over Best-of-N sampling or fails to approach Pass@N levels.

Figures

Figures reproduced from arXiv: 2605.02396 by Haoxiang Ma, Hongyu Zang, Jianing Wang, Linsen Guo, Qi Guo, Wei Wang, Wenjie Shi, Xiangyu Xi, Xiaoyu Li, Xunliang Cai, Zhengyu Chen.

**Figure 1.** Figure 1: The overview framework of heavy thinking in LLMs test time scaling. 2.4. Readable Skill for Agentic Harness This workflow provides a concrete Python pipeline for executing heavy thinking. However, modern agentic harnesses— such as Claude Code (Claude), CodeX (Chen et al., 2021), and Hermes (Hermes-Agent, 2024)—organize capabilities as skills: human-readable, model-interpretable documents that the orchestr… view at source ↗

**Figure 2.** Figure 2: The pass rate distribution of heavy thinking in different pass rates of parallel reasoning. of 85.5% on LiveCodeBench. Similarly, R1-Distill-Qwen32B experiences a significant boost on IFEval (35.7% → 69.3%). This confirms that for tasks with clear logical or programmatic constraints, the summary model effectively distills high-quality solutions from multiple reasoning paths. Challenges in Subjective Align… view at source ↗

**Figure 3.** Figure 3: When fixing the LLM as R1-Distill-Qwen-7B in the parallel reasoning phase, the final performance of different LLMs in sequential deliberation. queries are successfully rectified through deliberation. This empirical evidence underscores the model’s ability to refine errors even when the initial success probability is low. 2) In scenarios where the parallel pass rate exceeds 0.5, it maintains high accuracy,… view at source ↗

**Figure 4.** Figure 4: The effectiveness of different numbers of iterations. trend in the HM@K metric as the number of iterations increases. This phenomenon suggests that the heavy thinking framework exhibits intrinsic scaling capabilities, where extended deliberation cycles contribute to improved collective reasoning performance. However, this gain is accompanied by a significant degradation in the HP@K metric. This divergen… view at source ↗

**Figure 5.** Figure 5: When choosing different permutations. Random: randomly select K trajectories; Max-Diversity: select K trajectories that have the highest diversity; Max-Length: select the top K trajectories based on the length; Max-Answer-Num: select the trajectories that have the highest frequency answer. A. Impact of Parallel Trajectories In this section, we aim to investigate the impact of the permutations of parallel… view at source ↗

**Figure 6.** Figure 6: The RL training on heavy thinking. Blue curve denotes the number of parallel trajectories is 8, while the green curve denotes 16. We use the VeRL framework to perform RLVR. 14 view at source ↗

**Figure 7.** Figure 7: The prompt of heavy thinking in serialized memory cache. 15 view at source ↗

**Figure 8.** Figure 8: The skill file of heavy thinking (part I). 16 view at source ↗

**Figure 9.** Figure 9: The skill file of heavy thinking (part II). 17 view at source ↗

**Figure 10.** Figure 10: The skill file of heavy thinking (part III). 18 view at source ↗

read the original abstract

Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks. However, the underlying mechanism that truly drives performance remains obscured behind intricate system designs. In this paper, we propose HeavySkill, a perspective that views heavy thinking not only as a minimal execution unit in orchestration harness but also as an inner skill internalized within the model's parameters that drives the orchestrator to solve complex tasks. We identify this skill as a two-stage pipeline, i.e., parallel reasoning then summarization, which can operate beneath any agentic harness. We present a systematic empirical study of HeavySkill across diverse domains. Our results show that this inner skill consistently outperforms traditional Best-of-N (BoN) strategies; notably, stronger LLMs can even approach Pass@N performance. Crucially, we demonstrate that the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning, offering a promising path toward self-evolving LLMs that internalize complex reasoning without relying on brittle orchestration layers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HeavySkill reframes complex reasoning as an internal two-stage skill (parallel paths then summarization) that RL can scale to cut down on external agent orchestration, but the experiments do not isolate that benefit from extra sampling or scaffolding.

read the letter

The paper's core move is to treat heavy thinking as something the model can internalize rather than something an external harness has to manage at runtime. The two-stage pipeline is presented as a minimal unit that works under any agent setup, with claims that it beats Best-of-N, approaches Pass@N on stronger models, and can have its depth and width grown through reinforcement learning for self-evolving behavior.

Referee Report

3 major / 0 minor

Summary. The manuscript proposes HeavySkill as a perspective that treats heavy thinking as an internalizable skill within LLM parameters, operationalized as a two-stage pipeline of parallel reasoning followed by summarization. This skill is claimed to operate beneath any agentic harness, consistently outperform Best-of-N sampling, approach Pass@N performance for stronger models, and be scalable in depth and width via reinforcement learning to enable self-evolving LLMs that reduce reliance on external orchestration.

Significance. If the empirical claims were substantiated with proper controls and ablations, the work could be significant for shifting agentic systems toward internalized reasoning capabilities, potentially simplifying architectures and improving robustness. The absence of any methods, datasets, results, or formal definitions in the manuscript, however, prevents assessment of whether these outcomes are achievable or novel.

major comments (3)

[Abstract] Abstract: The central claim that 'this inner skill consistently outperforms traditional Best-of-N (BoN) strategies' and that 'stronger LLMs can even approach Pass@N performance' is presented without any experimental details, datasets, baselines, controls, error bars, or quantitative results, rendering the claim unevaluable.
[Abstract] Abstract: The assertion that 'the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning' lacks any definition of depth/width metrics, RL reward formulation, training procedure, or post-RL evaluation (e.g., zero-prompt internal mode without harness), making the self-evolving LLM claim unsupported.
[Abstract] Abstract: No ablations are described to isolate the two-stage pipeline (parallel reasoning then summarization) from matched extra sampling or external scaffolding at fixed token budget, which is required to substantiate that gains arise from internalization rather than increased compute or the orchestration the paper aims to replace.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We acknowledge that the current manuscript presents HeavySkill primarily as a conceptual perspective and that the empirical claims in the abstract require additional substantiation to be fully evaluable. We will revise the manuscript to incorporate the requested details on experiments, definitions, and controls while preserving its focus on the proposed framework.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'this inner skill consistently outperforms traditional Best-of-N (BoN) strategies' and that 'stronger LLMs can even approach Pass@N performance' is presented without any experimental details, datasets, baselines, controls, error bars, or quantitative results, rendering the claim unevaluable.

Authors: The referee correctly notes that the abstract states these performance claims without accompanying details. The manuscript is structured as a perspective paper that summarizes results from a systematic empirical study across domains; however, the current version does not elaborate the datasets, baselines, or quantitative metrics. We will revise the abstract to qualify the claims and add a dedicated experimental section in the main text that reports the requested details, including controls and error bars, to enable proper evaluation. revision: yes
Referee: [Abstract] Abstract: The assertion that 'the depth and width of heavy thinking, as a learnable skill, can be further scaled via reinforcement learning' lacks any definition of depth/width metrics, RL reward formulation, training procedure, or post-RL evaluation (e.g., zero-prompt internal mode without harness), making the self-evolving LLM claim unsupported.

Authors: We agree that the abstract provides no definitions or procedures for these elements. Depth and width are intended as the number of sequential summarization steps and parallel reasoning branches, respectively, with RL scaling proposed as a direction for future work. In the revision we will add formal definitions, a high-level description of the reward formulation based on task success, the training procedure, and discussion of post-RL evaluation in zero-prompt internal mode. revision: yes
Referee: [Abstract] Abstract: No ablations are described to isolate the two-stage pipeline (parallel reasoning then summarization) from matched extra sampling or external scaffolding at fixed token budget, which is required to substantiate that gains arise from internalization rather than increased compute or the orchestration the paper aims to replace.

Authors: This observation is accurate; the manuscript does not describe such ablations. We will revise the paper to include a description of controlled ablations that match token budgets across the internalized two-stage pipeline, equivalent Best-of-N sampling, and external scaffolding, thereby isolating the contribution of internalization. revision: yes

Circularity Check

0 steps flagged

No circularity detected; proposal is conceptual and empirical

full rationale

The paper advances a perspective reframing heavy thinking as an internal two-stage skill (parallel reasoning followed by summarization) that can be scaled via RL, supported by empirical comparisons showing outperformance over Best-of-N and approach to Pass@N. No equations, formal derivations, or prediction steps exist that could reduce outputs to inputs by construction. The skill identification is an explicit definitional choice rather than a self-referential loop, and performance claims rest on experimental results across domains rather than fitted parameters or self-citation chains. This is a standard non-circular empirical framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unproven assumption that the described two-stage process constitutes a distinct internalizable skill whose scaling via RL produces self-evolving behavior; no free parameters or external benchmarks are specified.

axioms (1)

domain assumption Heavy thinking can be represented as a two-stage pipeline of parallel reasoning then summarization that can be internalized in model parameters and operate beneath any agentic harness.
Directly stated in the abstract as the identification of the skill.

invented entities (1)

HeavySkill no independent evidence
purpose: To treat heavy thinking as an inner skill in agentic harness that can be scaled via RL
New term and perspective introduced without independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5517 in / 1423 out tokens · 56190 ms · 2026-05-08T19:10:30.523351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages

[1]

**Parallel Reasoning** — Generate multiple independent reasoning trajectories for the same problem

work page
[2]

**Sequential Deliberation** — Synthesize all trajectories through critical analysis to produce a superior final answer This skill should be activated when facing complex reasoning tasks where a single chain-of-thought may be insufficient. ## When to Activate Activate HeavySkill when the task involves: - Mathematical reasoning (competition math, STEM probl...

work page
[3]

**Identify answer distribution** — What answers appear and how frequently?

work page
[4]

**Analyze reasoning quality** — Which chains are logically sound vs. flawed?

work page
[5]

**Cross-validate** — Do different approaches confirm the same result?

work page
[6]

**Critical evaluation** — Apply professional skepticism: - Majority consensus is a signal but NOT proof of correctness - A minority answer backed by rigorous logic may be correct - All trajectories may be wrong — be prepared to reason anew

work page
[7]

Analyze their reasoning: Problem: {query} Thinker #1: {trajectory_1} Thinker #2: {trajectory_2}

**Synthesize final answer** — Produce the best answer based on analysis **Deliberation prompt framework:** ``` Multiple independent thinkers have attempted this problem. Analyze their reasoning: Problem: {query} Thinker #1: {trajectory_1} Thinker #2: {trajectory_2} ... Thinker #K: {trajectory_K} Your task: - Analyze the thought processes of all thinkers -...

work page
[8]

**Identify the problem** — Extract the core reasoning task from the user's request

work page
[9]

**Spawn parallel agents** — Use the Agent tool to launch K=3 independent reasoning agents in a single message (parallel execution)

work page
[10]

**Collect results** — Wait for all agents to complete and gather their outputs

work page
[11]

**Deliberate** — Perform the sequential deliberation analysis yourself (do NOT delegate this step)

work page
[12]

**Output** — Provide the final synthesized answer to the user (...) HeavySkill.md Figure 9.The skill file of heavy thinking (part II). 17 HEAVYSKILL: Heavy Thinking as the Inner Skill in Agentic Harness (...) ### Key Principles - **Independence is critical** — Parallel agents must not share context or see each other's work - **Diversity helps** — Encourag...

work page
[13]

Run Stage 1 + Stage 2 as above

work page
[14]

expert thinker

Feed the deliberation result back as an additional "expert thinker" trajectory

work page
[15]

Re-run Stage 2 with the augmented trajectory set

work page
[16]

Repeat until convergence (typically 2-3 iterations max) ## Output Format Your final output should: - Present ONLY the final answer (not the meta-analysis) - Follow the format conventions of the domain: - Math/STEM: answer in `\boxed{}` - Code: solution in a code block - General: clean prose response - Match the language of the original query HeavySkill.md...

work page

[1] [1]

**Parallel Reasoning** — Generate multiple independent reasoning trajectories for the same problem

work page

[2] [2]

**Sequential Deliberation** — Synthesize all trajectories through critical analysis to produce a superior final answer This skill should be activated when facing complex reasoning tasks where a single chain-of-thought may be insufficient. ## When to Activate Activate HeavySkill when the task involves: - Mathematical reasoning (competition math, STEM probl...

work page

[3] [3]

**Identify answer distribution** — What answers appear and how frequently?

work page

[4] [4]

**Analyze reasoning quality** — Which chains are logically sound vs. flawed?

work page

[5] [5]

**Cross-validate** — Do different approaches confirm the same result?

work page

[6] [6]

**Critical evaluation** — Apply professional skepticism: - Majority consensus is a signal but NOT proof of correctness - A minority answer backed by rigorous logic may be correct - All trajectories may be wrong — be prepared to reason anew

work page

[7] [7]

Analyze their reasoning: Problem: {query} Thinker #1: {trajectory_1} Thinker #2: {trajectory_2}

**Synthesize final answer** — Produce the best answer based on analysis **Deliberation prompt framework:** ``` Multiple independent thinkers have attempted this problem. Analyze their reasoning: Problem: {query} Thinker #1: {trajectory_1} Thinker #2: {trajectory_2} ... Thinker #K: {trajectory_K} Your task: - Analyze the thought processes of all thinkers -...

work page

[8] [8]

**Identify the problem** — Extract the core reasoning task from the user's request

work page

[9] [9]

**Spawn parallel agents** — Use the Agent tool to launch K=3 independent reasoning agents in a single message (parallel execution)

work page

[10] [10]

**Collect results** — Wait for all agents to complete and gather their outputs

work page

[11] [11]

**Deliberate** — Perform the sequential deliberation analysis yourself (do NOT delegate this step)

work page

[12] [12]

**Output** — Provide the final synthesized answer to the user (...) HeavySkill.md Figure 9.The skill file of heavy thinking (part II). 17 HEAVYSKILL: Heavy Thinking as the Inner Skill in Agentic Harness (...) ### Key Principles - **Independence is critical** — Parallel agents must not share context or see each other's work - **Diversity helps** — Encourag...

work page

[13] [13]

Run Stage 1 + Stage 2 as above

work page

[14] [14]

expert thinker

Feed the deliberation result back as an additional "expert thinker" trajectory

work page

[15] [15]

Re-run Stage 2 with the augmented trajectory set

work page

[16] [16]

Repeat until convergence (typically 2-3 iterations max) ## Output Format Your final output should: - Present ONLY the final answer (not the meta-analysis) - Follow the format conventions of the domain: - Math/STEM: answer in `\boxed{}` - Code: solution in a code block - General: clean prose response - Match the language of the original query HeavySkill.md...

work page