Implicit Hierarchical GRPO: Decoupling Tool Invocation from Execution for Tool-Integrated Mathematical Reasoning

Guojun Yin; Jiajun Chai; Jinyang Wu; Li Wang; Wei Lin; Xiaodong Lu; Xiaohan Wang; Zipeng Zhang

arxiv: 2605.18500 · v1 · pith:N46WSXLFnew · submitted 2026-05-18 · 💻 cs.CL

Li Wang, Xiaohan Wang, Xiaodong Lu, Zipeng Zhang, Jinyang Wu, Jiajun Chai, Wei Lin, Guojun Yin

Pith reviewed 2026-05-20 11:16 UTC · model grok-4.3

classification 💻 cs.CL

keywords tool-integrated reasoning · hierarchical policy · surrogate loss · implicit hierarchy · mathematical reasoning · delayed execution · GRPO · LLM tool use

The pith

Decoupling tool invocation from execution improves mathematical reasoning in LLMs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that tightly coupling tool calls with immediate execution in LLMs disrupts reasoning coherence and limits expressivity during tool-integrated tasks such as math problem solving. It formalizes the alternative of decoupling invocation from execution through delayed execution under explicit control. A hierarchical control framework is introduced along with a theoretically derived surrogate loss that trains an implicit policy to produce the same behavior as an explicit hierarchical policy. The resulting IH-GRPO algorithm delivers measurable gains on out-of-domain benchmarks without imposing extra constraints on action spaces or rewards.

Core claim

We propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm for decoupling tool invocation from execution in tool-integrated reasoning.

What carries the argument

The surrogate loss in the hierarchical control framework that trains an implicit policy to match explicit hierarchical behavior for delayed tool execution.

If this is right

Absolute gains of 1.87 percent, 2.16 percent, and 2.53 percent on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain math benchmarks over the strongest baseline.
Consistent performance improvements appear in non-mathematical domains as well.
Reasoning coherence is preserved by allowing tool calls to be planned separately from their execution.
The approach provides the first explicit formalization of decoupling invocation from execution in tool-integrated reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Models using this method could generate longer, more structured sequences of planned tool calls before any execution occurs.
The same separation of planning and action might apply to other LLM tasks that mix internal reasoning with external tools, such as code synthesis.
Scaling the implicit hierarchy to settings with many interdependent tools could be tested by measuring how well the surrogate loss continues to enforce equivalence.

Load-bearing premise

The surrogate loss produces behavior equivalent to an explicit hierarchical policy without requiring additional constraints on the action space, reward function, or policy parameterization.

What would settle it

Train both an implicit policy with the surrogate loss and an explicit hierarchical policy on the same mathematical tasks, then compare their tool invocation sequences and final answers; large systematic differences would show the claimed equivalence does not hold.
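The comparison described above could be scored along the following lines. This is a minimal sketch, not the paper's evaluation code: the `Trace` schema, the step labels, and the `agreement` function are hypothetical names introduced for illustration, and exact sequence matching is a deliberately crude criterion because sampled rollouts are stochastic.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Trace:
    """One rollout: ordered step labels plus the final answer (hypothetical schema)."""
    steps: Sequence[str]   # e.g. ["think", "code", "execute"]
    answer: str


def agreement(implicit: List[Trace], explicit: List[Trace]) -> dict:
    """Fraction of problems on which the two policies agree on steps and on answers."""
    if len(implicit) != len(explicit) or not implicit:
        raise ValueError("need two non-empty, aligned lists of traces")
    n = len(implicit)
    same_steps = sum(list(a.steps) == list(b.steps) for a, b in zip(implicit, explicit))
    same_answer = sum(a.answer == b.answer for a, b in zip(implicit, explicit))
    return {"step_match": same_steps / n, "answer_match": same_answer / n}


# Toy traces, invented for illustration only.
imp = [Trace(["think", "code", "execute"], "42"), Trace(["think"], "7")]
exp = [Trace(["think", "code", "execute"], "42"), Trace(["code", "execute"], "7")]
print(agreement(imp, exp))  # {'step_match': 0.5, 'answer_match': 1.0}
```

In practice one would likely compare distributions over many sampled rollouts per problem rather than single traces; low step agreement alongside high answer agreement would point to behavioral differences that final accuracy hides.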

Figures

Figures reproduced from arXiv: 2605.18500 by Guojun Yin, Jiajun Chai, Jinyang Wu, Li Wang, Wei Lin, Xiaodong Lu, Xiaohan Wang, Zipeng Zhang.

**Figure 1.** (Top-left) Coupled tool invocation triggers immediate function calls, leading to empty outputs, disrupted reasoning coherence, and premature termination due to hallucinated results. In complex calculations, manual computation is often error-prone. Rigid tool-use patterns prevent the model from flexibly leveraging code tools to handle intermediate computational steps, thereby increasing the likelihood of re…

**Figure 2.** Comparison of invocation methods: (Left) tool usage patterns, (Right) tool positions. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Different step types in the reasoning process. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 4.** (Left) Performance of IH-GRPO across varying λ. (Middle) Token sensitivity and (Right) prompt sensitivity analysis of IH-GRPO on Qwen3 models.

**Figure 5.** Performance of IH-GRPO under (left) varying tool-interaction budgets and (right) training-data ratios. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]

Original abstract

Large language models (LLMs) have increasingly leveraged tool invocation to enhance their reasoning capabilities. However, existing approaches typically tightly couple tool invocation with immediate execution. Such immediate tool interaction may disrupt the reasoning coherence of LLMs and constrain their expressivity, ultimately degrading reasoning performance. To this end, for the first time, we propose and formalize the problem of decoupling tool invocation from execution during reasoning, and introduce delayed execution with explicit control to enhance tool-integrated reasoning (TIR). Furthermore, we propose a hierarchical control framework and theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm. Extensive experiments on IH-GRPO achieve absolute improvements of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B, Qwen3-4B, and Qwen3-8B across six out-of-domain mathematical reasoning benchmarks over the strongest baseline method, while also yielding consistent performance gains in other domains. Our code is available at https://github.com/Lumina04/IH-GRPO-01.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper formalizes decoupling tool invocation from execution in LLMs and derives a surrogate loss that lets an implicit hierarchy match an explicit one, with modest gains on math benchmarks.

The main contribution is that the authors formalize decoupling tool invocation from execution during LLM reasoning and derive a surrogate loss so that an implicit hierarchical policy can learn behavior equivalent to an explicit one. They package this as the IH-GRPO algorithm and test it on mathematical reasoning, using delayed execution to preserve coherence. The experiments report absolute gains of 1.87% to 2.53% on six out-of-domain math benchmarks across Qwen3 models from 1.7B to 8B, plus consistent gains in other domains. Code is released, which helps.

The decoupling idea and the hierarchical control framing look like the genuinely new pieces relative to prior TIR work. The experiments cover multiple model sizes and benchmarks, which is a reasonable setup.

The soft spots are the modest size of the gains and the need to verify the surrogate loss derivation. The abstract presents equivalence without extra constraints on action space or rewards, but RL surrogate derivations often depend on specifics of how delayed execution enters the transitions or value function. If such assumptions are hidden, the implicit policy may not reliably reproduce the explicit hierarchical behavior. That concern stands until the equations are checked.

This paper is for researchers working on tool-integrated reasoning and hierarchical RL methods for language models. A reader who follows TIR or delayed-action setups would get value from the new framing and algorithm. It deserves a serious referee to examine the derivation and experimental controls in detail. I would recommend sending it through peer review rather than a desk reject.

Referee Report

1 major / 2 minor

Summary. The manuscript formalizes the problem of decoupling tool invocation from execution in tool-integrated reasoning (TIR) for LLMs, introducing delayed execution with explicit control. It proposes a hierarchical control framework and derives a surrogate loss enabling an implicitly hierarchical policy to match the behavior of an explicit hierarchical policy, yielding the IH-GRPO algorithm. Experiments report absolute gains of 1.87%, 2.16%, and 2.53% on Qwen3-1.7B/4B/8B models across six out-of-domain math benchmarks over the strongest baseline, plus gains in other domains; code is released.

Significance. If the surrogate-loss equivalence holds without hidden restrictions, the work could meaningfully improve reasoning coherence in tool-using LLMs by avoiding immediate execution disruptions. The explicit code release is a clear strength for reproducibility.

major comments (1)

[Abstract and §3 (Theoretical Derivation)] The claim that the surrogate loss yields behavior equivalent to an explicit hierarchical policy 'without requiring additional constraints on the action space, reward function, or policy parameterization' is load-bearing. The derivation must be checked to confirm that delayed execution is folded into the MDP transition and value function without implicit restrictions on the tool-use action distribution or reward structure; otherwise the implicit policy will not reliably reproduce explicit hierarchical behavior.

minor comments (2)

[Experiments] Experimental section: absolute percentage gains are reported without variance, statistical significance tests, or detailed baseline strength comparisons; adding these would strengthen the empirical claims without altering the central contribution.
[Notation and §3] Notation: ensure consistent use of symbols for the surrogate loss, implicit vs. explicit policies, and delayed-execution MDP components across the derivation and algorithm description.
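The variance and significance reporting requested in the first minor comment could be done in several ways; one common option is a paired bootstrap over per-problem outcomes. The sketch below is illustrative only: `paired_bootstrap_gain` is a hypothetical helper, and the outcome lists are synthetic placeholders, not results from the paper.

```python
import random


def paired_bootstrap_gain(base_correct, new_correct, n_resamples=2000, seed=0):
    """Return (mean gain, 95% CI low, 95% CI high) in accuracy points.

    base_correct / new_correct are 0/1 per-problem outcomes on the SAME problems,
    so resampling problem indices preserves the pairing.
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("need two non-empty, aligned outcome lists")
    rng = random.Random(seed)
    n = len(base_correct)
    diffs = [new - base for base, new in zip(base_correct, new_correct)]
    gains = []
    for _ in range(n_resamples):
        sample_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        gains.append(100.0 * sample_sum / n)
    gains.sort()
    low = gains[int(0.025 * n_resamples)]
    high = gains[int(0.975 * n_resamples) - 1]
    return 100.0 * sum(diffs) / n, low, high


# Synthetic outcomes, invented for illustration (NOT the paper's data).
base = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0] * 20
new = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0] * 20
gain, low, high = paired_bootstrap_gain(base, new)
print(f"gain = {gain:.2f} points, 95% CI = [{low:.2f}, {high:.2f}]")
```

A confidence interval that excludes zero would support the claim that a gain is not a sampling artifact; for gains as small as those reported, averaging over several training seeds would also be worth reporting.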

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and insightful comments on our manuscript. We address the major comment point by point below.

Referee: [Abstract and §3 (Theoretical Derivation)] The claim that the surrogate loss yields behavior equivalent to an explicit hierarchical policy 'without requiring additional constraints on the action space, reward function, or policy parameterization' is load-bearing. The derivation must be checked to confirm that delayed execution is folded into the MDP transition and value function without implicit restrictions on the tool-use action distribution or reward structure; otherwise the implicit policy will not reliably reproduce explicit hierarchical behavior.

Authors: We appreciate the referee's emphasis on verifying the theoretical equivalence. In §3, we define the MDP with delayed execution by augmenting the state to include a pending tool invocation flag, and the transition function executes the tool only when the control signal is issued in subsequent steps. The surrogate loss is constructed as the difference between the implicit policy's action probabilities and the explicit hierarchical decomposition, leading to an equivalence in the policy gradient updates. Theorem 3.1 proves that under this formulation, the implicit policy achieves the same expected return as the explicit one. The derivation does not introduce constraints on the action space, as tool invocations are still sampled from the full distribution; the reward remains the task-specific reward without modification; and the policy is the standard autoregressive LLM policy. We can clarify this in a revised §3 by adding a corollary that explicitly notes the lack of such restrictions.
revision: partial
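To make the state augmentation in the rebuttal concrete, here is a toy sketch of a delayed-execution transition. It is an assumption-laden illustration, not the authors' environment: `DelayedExecState`, `step`, and the action names `text`, `invoke`, and `execute` are hypothetical, and the pending flag is modelled as a non-empty queue of unexecuted code blocks.

```python
import contextlib
import io
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DelayedExecState:
    """Toy state: generated text plus a queue of pending (unexecuted) code blocks."""
    transcript: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)
    namespace: Dict[str, object] = field(default_factory=dict)

    @property
    def has_pending(self) -> bool:
        # Plays the role of the "pending tool invocation" flag.
        return bool(self.pending)


def step(state: DelayedExecState, kind: str, payload: str = "") -> DelayedExecState:
    """Apply one action: 'text', 'invoke' (queue code), or 'execute' (control signal)."""
    if kind == "text":
        state.transcript.append(payload)
    elif kind == "invoke":
        state.pending.append(payload)            # invocation is recorded, not run
    elif kind == "execute":
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            for code in state.pending:           # run deferred blocks in order
                exec(code, state.namespace)      # toy only: never exec untrusted code
        state.pending.clear()
        state.transcript.append("result: " + buffer.getvalue().strip())
    else:
        raise ValueError(f"unknown action kind: {kind}")
    return state


s = DelayedExecState()
step(s, "text", "Compute the hypotenuse.")
step(s, "invoke", "a, b = 3, 4")
step(s, "invoke", "print((a * a + b * b) ** 0.5)")
assert s.has_pending             # nothing has executed yet
step(s, "execute")
print(s.transcript[-1])          # result: 5.0
```

The point of the sketch is that intermediate blocks share a namespace and produce no observation until the control action fires, which is the behavior the rebuttal attributes to the augmented transition function.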

Circularity Check

0 steps flagged

Theoretical derivation of surrogate loss presented as independent mathematical result with no evident reduction to inputs

full rationale

The paper's central claim rests on a theoretical derivation of a surrogate loss that makes an implicit hierarchical policy equivalent to an explicit one for decoupled tool invocation. The abstract and reader's summary describe this as a general result holding without additional constraints on action space, reward function, or policy parameterization. No equations, self-citations, or fitted parameters are shown in the provided text as load-bearing for the equivalence claim. The derivation is therefore treated as self-contained against external benchmarks rather than circular by construction, renaming, or self-referential fitting.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only view limits visibility into hyperparameters or background assumptions; the central addition is the surrogate loss and hierarchical framing.

axioms (1)

domain assumption: Surrogate loss produces policy behavior equivalent to an explicit hierarchical policy. Invoked in the theoretical derivation section referenced in the abstract.

pith-pipeline@v0.9.0 · 5760 in / 969 out tokens · 33205 ms · 2026-05-20T11:16:56.940181+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

theoretically derive a surrogate loss that enables an implicitly hierarchical policy to learn behavior equivalent to that of an explicit hierarchical policy, leading to the proposed IH-GRPO algorithm

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 2 internal anchors

[1]

Understanding R1-Zero-Like Training: A Critical Perspective

Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783. Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. 2025. Octotools: An agentic framework with extensible tools for complex reasoning. arXiv preprint arXiv:2502.11271. Xiaodong Lu, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Zhijun Chen, Y...
[2]

Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, and 1 others

Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741. Ning Shang, Yifei Liu, Yi Zhu, Li Lyna Zhang, Weijiang Xu, Xinyu Guan, Buze Zhang, Bingcheng Dong, Xudong Zhou, Bowen Zhang, and 1 others
[3]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others

rstar2-agent: Agentic reasoning technical report. arXiv preprint arXiv:2508.20722. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, and 1 others. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Guangming Sheng, Ch...
[4]

Rema: Learning to meta-think for llms with multi-agent reinforcement learning, 2025

Rema: Learning to meta-think for llms with multi-agent reinforcement learning. arXiv preprint arXiv:2503.09501. Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. 2025. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734. Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shi...
[5]

In The 2023 Conference on Empirical Methods in Natural Language Processing

Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems, 37:95266–95290. Jinyang Wu, Mingkuan Feng, Shuai Zhang, Feihu Che, Zengqi Wen, Chonghua Liao, and Jianhua Tao. 2024. Beyond examples: High-level automated reasoning paradigm in in-context learning via mcts. arXiv preprint...
[6]

Qwen3 Technical Report

Qwen3 technical report. arXiv preprint arXiv:2505.09388. Fengkai Yang, Zherui Chen, Xiaohan Wang, Xiaodong Lu, Jiajun Chai, Guojun Yin, Wei Lin, Shuai Ma, Fuzhen Zhuang, Deqing Wang, and 1 others. 2026. Your group-relative advantage is biased. arXiv preprint arXiv:2601.08521. Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, an...
[7]

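$\frac{e^{\beta_i}}{\sum_{s=1}^{V} e^{\beta_s}} = \frac{e^{\theta_i}}{\sum_{s=1}^{V} e^{\theta_s}}$

$\sigma(\theta_0) = \frac{\sum_{s=1}^{V} e^{\beta_s}}{\sum_{s=0}^{V} e^{\beta_s}} = \gamma$ and 2. $\frac{e^{\beta_i}}{\sum_{s=1}^{V} e^{\beta_s}} = \frac{e^{\theta_i}}{\sum_{s=1}^{V} e^{\theta_s}}$. Without loss of generality, assume $\theta_i = \beta_i$ for $i \ge 1$ from condition 2. Besides, we have: $\theta_0 = \ln \sum_{s=1}^{V} e^{\beta_s} - \beta_0$. Therefore, $\{\beta_0, \beta_1, \dots, \beta_V\}$ can equivalently represent $\{\theta_0, \theta_1, \dots, \theta_V\}$ from the initial condition. A.2 Step 2: Explicit Hierarchical Policy Update. We assume use p...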
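The reparameterization in this excerpt can be checked numerically. The sketch below assumes, based on the surrounding fragments, that index 0 is the tool/execution token and indices 1..V are ordinary tokens; the function name `implicit_to_explicit` and the logit values are arbitrary illustrations.

```python
import math


def implicit_to_explicit(beta):
    """Map implicit logits [beta_0, ..., beta_V] to explicit parameters [theta_0, ..., theta_V]."""
    z = sum(math.exp(b) for b in beta[1:])        # Z = sum over s >= 1 of exp(beta_s)
    theta0 = math.log(z) - beta[0]                # theta_0 = ln Z - beta_0
    return [theta0] + list(beta[1:])              # theta_i = beta_i for i >= 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


beta = [0.3, -1.2, 0.8, 2.0]                      # arbitrary example values, V = 3
theta = implicit_to_explicit(beta)

# Condition 1: sigma(theta_0) equals the implicit probability mass on tokens 1..V.
gamma = sum(math.exp(b) for b in beta[1:]) / sum(math.exp(b) for b in beta)
assert abs(sigmoid(theta[0]) - gamma) < 1e-12

# Condition 2: the softmax restricted to tokens 1..V agrees under both parameterizations.
z_beta = sum(math.exp(b) for b in beta[1:])
z_theta = sum(math.exp(t) for t in theta[1:])
for b, t in zip(beta[1:], theta[1:]):
    assert abs(math.exp(b) / z_beta - math.exp(t) / z_theta) < 1e-12
```

Both conditions hold by construction at initialization; the substantive question, taken up in the later excerpts, is whether they keep holding after a gradient step.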

[8]

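$= \frac{1}{1 + e^{-\theta'_0}} = \frac{Z_i}{Z_i + e^{\beta'_0}} = \gamma'_i$, so condition 1 holds exactly. Case 2: Sampled Token is Non-Tool ($i \ge 1$). The surrogate loss is: $L'_I(\beta_i) = -A\left[\beta_i - \log\left(\sum_{s=0}^{V} e^{\beta_s}\right)\right] + A\,(1 - \mathrm{sg}(\gamma_i)) \cdot \log Z_i - f_i \cdot \beta_0$, where $\eta$ is the learning rate, $f_i = \frac{1}{\eta} \ln \mathrm{sg}\left(\frac{Z'_i}{Z_i}\right)$, $Z_i = \sum_{s=1}^{V} e^{\beta_s}$, $Z'_i = \sum_{s=1}^{V} \exp\left(\beta_s + \eta A\left(\delta_{si} - \mathrm{softmax}_{1-V}(\beta_s)\right)\right)$, $\gamma_i = \frac{Z_i}{e^{\beta_0} + Z_i}$, and $\delta_{si}$ denote...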
[9]

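Thus, $\gamma'_i = \sigma(\theta'_0)$

Substituting: $\ln Z'_i - \beta'_0 = \ln Z'_i - \left[\beta_0 + \ln \frac{Z'_i}{Z_i} - \eta A (1 - \gamma_i)\right] = \ln Z'_i - \beta_0 - \ln Z'_i + \ln Z_i + \eta A (1 - \gamma_i) = \ln Z_i - \beta_0 + \eta A (1 - \gamma_i) = \theta'_0$, so $\theta'_0 = \ln Z'_i - \beta'_0$ holds exactly. Thus, $\gamma'_i = \sigma(\theta'_0)$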
[10]

Logical Deduction

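$= \frac{Z'_i}{e^{\beta'_0} + Z'_i}$, satisfying condition 1 strictly. Summary: The surrogate loss function $L'_I(\beta_i)$ for the implicit hierarchical policy is defined as follows:

$$L'_I(\beta_i) = \begin{cases} -A\left[\beta_0 - \log \sum_{s=0}^{V} e^{\beta_s}\right] - A \cdot \mathrm{sg}(\gamma_i) \cdot \log Z_i, & \text{if } i = 0 \ (E), \\ -A\left[\beta_i - \log \sum_{s=0}^{V} e^{\beta_s}\right] - A \cdot \mathrm{sg}(\gamma_i) \cdot \log Z_i + A \log Z_i - f_i \cdot \beta_0, & \text{if } i \ge 1 \ (C), \end{cases}$$

$= -A\left[\beta_i - \log\left(\sum_{s=0}^{V} e^{\beta_s}\right)\right] - A \cdot \mathrm{sg}(\gamma_i) \cdot \log Z_i +$ (Al...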
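The $i \ge 1$ branch of the surrogate loss can be sanity-checked numerically. The sketch below assumes a single plain gradient-descent step with learning rate $\eta$, treats the stop-gradient quantities $\mathrm{sg}(\gamma_i)$ and $f_i$ as constants, and uses finite differences instead of an autodiff library; all numeric values and helper names are arbitrary illustrations, not the authors' implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def surrogate_loss(beta, i, A, gamma_sg, f_sg):
    """Surrogate loss for a sampled non-tool token i >= 1; gamma_sg and f_sg are constants."""
    log_norm = math.log(sum(math.exp(b) for b in beta))       # log sum over s = 0..V
    log_z = math.log(sum(math.exp(b) for b in beta[1:]))      # log Z, s = 1..V
    return -A * (beta[i] - log_norm) + A * (1.0 - gamma_sg) * log_z - f_sg * beta[0]


def numeric_grad(fn, x, h=1e-6):
    """Central finite-difference gradient of fn at x."""
    grad = []
    for k in range(len(x)):
        up, down = list(x), list(x)
        up[k] += h
        down[k] -= h
        grad.append((fn(up) - fn(down)) / (2 * h))
    return grad


beta, i, A, eta = [0.3, -1.2, 0.8, 2.0], 2, 0.7, 0.1          # arbitrary example values

z = sum(math.exp(b) for b in beta[1:])
gamma = z / (math.exp(beta[0]) + z)
q = softmax(beta[1:])                                          # softmax over tokens 1..V
z_new = sum(math.exp(b + eta * A * ((1.0 if s == i else 0.0) - q[s - 1]))
            for s, b in enumerate(beta[1:], start=1))          # Z' from the definition
f = math.log(z_new / z) / eta

# Implicit update: one gradient-descent step on the surrogate loss.
grad = numeric_grad(lambda b: surrogate_loss(b, i, A, gamma, f), beta)
beta_new = [b - eta * g for b, g in zip(beta, grad)]

# Explicit gate update from the excerpt: theta_0' = ln Z - beta_0 + eta * A * (1 - gamma).
theta0_new = math.log(z) - beta[0] + eta * A * (1.0 - gamma)
z_after = sum(math.exp(b) for b in beta_new[1:])

assert abs(z_after - z_new) < 1e-6                             # updated logits reproduce Z'
assert abs((math.log(z_after) - beta_new[0]) - theta0_new) < 1e-6
```

If the two assertions pass, the implicit logits after the step satisfy $\theta'_0 = \ln Z'_i - \beta'_0$, matching the substitution worked out in the preceding excerpt. The check covers only one step with one set of values, so it illustrates the claim without proving it.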

[12]

Code execution result:

By default, a `python` code block is executed in a deferred manner. This design reflects the fact that many variables serve as intermediate results and do not need to be evaluated immediately, nor do they require printed outputs. When immediate execution is necessary, append the `<tool_call>` tag after the code...

[14]

Code execution result:

By default, a `python` code block is executed in a deferred manner. This design reflects the fact that many variables serve as intermediate results and do not need to be evaluated immediately, nor do they require printed outputs. When immediate execution is necessary, append the `<tool_call>` tag after the code...

[15]

If you need to execute a block immediately, append `<tool_call>` right after the code block

By default, when you write a ```python``` code block, it is executed in a delay manner, because some values are intermediate variables and do not need to be known immediately for subsequent reasoning, and therefore do not require print output. If you need to execute a block immediately, append `<tool_call>` right after ...

[16]

Code execution result:

By default, a `python` code block is executed in a deferred manner. This design reflects the fact that many variables serve as intermediate results and do not need to be evaluated immediately, nor do they require printed outputs. When immediate execution is necessary, append the `<tool_call>` tag after the code...
