TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

Jintao Du; Ming Yang; Qiliang Liu; Shichao Ma; Weiqiang Wang; Xiaofan Li; Xing Wu; Yang Wang; Yu Cheng; Zhengyang Zhou

arxiv: 2601.22776 · v2 · submitted 2026-01-30 · 💻 cs.AI

Shichao Ma, Zhiyuan Ma, Ming Yang, Xiaofan Li, Xing Wu, Jintao Du, Yu Cheng, Weiqiang Wang, Qiliang Liu, Zhengyang Zhou, Yang Wang

Pith reviewed 2026-05-16 10:06 UTC · model grok-4.3

classification 💻 cs.AI

keywords multi-turn reasoning · policy optimization · reinforcement learning · large language models · search-augmented reasoning · reward design · group relative policy optimization

The pith

TSPO assigns partial rewards at the first correct answer step to resolve double homogenization in multi-turn LLM search optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reinforcement learning for multi-turn tool use in large language models typically relies on sparse final-answer rewards, which ignore all intermediate thinking and tooling steps while also producing nearly identical rewards inside each sampled group. This creates process homogenization that discards reasoning signals and intra-group homogenization that weakens advantage estimates in methods such as Group Relative Policy Optimization. TSPO counters both problems by introducing the First-Occurrence Latent Reward mechanism, which grants a partial reward exactly at the generation step where the ground-truth answer first appears. The change supplies process-level credit without any extra annotations or learned reward models and raises reward variance inside groups. Experiments report average gains of 24 percent on Qwen2.5-3B models and 13.6 percent on 7B models over prior baselines.

Core claim

TSPO introduces the First-Occurrence Latent Reward mechanism that allocates a partial reward to the precise step at which the ground-truth answer first appears inside a multi-turn trajectory. This single change simultaneously supplies process-level signals that were previously ignored and increases intra-group reward variance, thereby improving advantage estimation under group-relative policy optimization. The resulting policy optimization yields higher final performance on search-augmented reasoning tasks while requiring no external reward models or additional annotations.
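The intra-group half of the dilemma can be illustrated with a minimal sketch of group-relative advantage normalization. This is the commonly used GRPO-style form, not code from the paper; the function name, the `eps` constant, the use of the population standard deviation, and the partial reward value of 0.5 are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some setups differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# All four rollouts fail: rewards are identical, so every advantage is zero
# and the group contributes no learning signal.
homogeneous = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# Hypothetical partial reward of 0.5 for one rollout that surfaced the answer
# mid-trajectory but still failed at the end: the group is no longer flat.
mixed = group_relative_advantages([0.0, 0.5, 0.0, 0.0])
```

In the second group the partially rewarded rollout receives a positive advantage and the others a negative one, which is the effect the core claim attributes to higher intra-group variance.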

What carries the argument

The First-Occurrence Latent Reward (FOLR) mechanism, which places a partial reward at the exact generation step where the ground-truth answer first appears.
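A minimal sketch of what such a rule might look like follows. The page does not show the paper's actual formulation, so everything here is an assumption: the names `first_occurrence_turn` and `folr_turn_rewards`, the case-insensitive substring match, the partial reward of 0.5, and the choice to skip the partial reward on a turn that already holds the outcome reward.

```python
def first_occurrence_turn(turns, answer):
    """Return the index of the first turn whose text contains the answer, else None."""
    needle = answer.strip().lower()
    if not needle:
        return None
    for i, text in enumerate(turns):
        if needle in text.lower():  # naive substring match (assumption)
            return i
    return None


def folr_turn_rewards(turns, answer, final_correct, partial=0.5):
    """Per-turn rewards: outcome reward on the last turn, partial credit at first occurrence."""
    rewards = [0.0] * len(turns)
    if not turns:
        return rewards
    if final_correct:
        rewards[-1] = 1.0
    hit = first_occurrence_turn(turns, answer)
    # Hypothetical design choice: do not stack partial credit on the outcome reward.
    if hit is not None and rewards[hit] == 0.0:
        rewards[hit] = partial
    return rewards


turns = [
    "search: capital of Australia",
    "The documents mention Canberra as the capital.",
    "The answer is Sydney.",
]
example = folr_turn_rewards(turns, "Canberra", final_correct=False)
```

Here the trajectory ends with a wrong answer, yet the second turn still earns partial credit because the ground-truth string first appears there.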

If this is right

Group-relative advantage estimates become more informative because reward variance inside each sampling group increases.
Process-level credit is supplied without requiring dense human annotations or auxiliary reward models.
Multi-turn search policies improve on average by 24 percent for 3B-scale models and 13.6 percent for 7B-scale models.
The same sparse-outcome reward structure can be retained while still capturing intermediate reasoning steps.
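To make the variance point concrete, consider a toy model that the paper does not state: the G rollouts in a group are independent, and each earns a binary outcome reward of 1 with probability p. A group gives a group-relative method no signal when all of its rewards coincide, which under these assumptions happens with probability:

```latex
\Pr[\text{flat group}] = p^{G} + (1-p)^{G},
\qquad \text{e.g. } p = 0.1,\ G = 5:\quad 0.1^{5} + 0.9^{5} \approx 0.5905 .
```

Under this simplified model, a hard prompt leaves roughly 59 percent of groups with zero advantage. A third reward level at the first-occurrence step can split some of those otherwise flat groups; how many depends on how often rollouts surface the answer, which this page does not report.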

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The first-occurrence placement rule could be tested on other sparse-reward reasoning domains such as code generation or theorem proving.
The variance increase may allow smaller group sizes during sampling without loss of training stability.
The method might combine with existing stage-aware prompting techniques to further localize credit assignment.

Load-bearing premise

Assigning a partial reward exactly at the first appearance of the ground-truth answer preserves useful process signals and raises intra-group variance without introducing new biases.

What would settle it

Training the same models with TSPO and with standard outcome-only rewards on identical multi-turn search tasks and observing no measurable increase in either final accuracy or intra-group reward variance would falsify the mechanism.
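One way to run the variance half of that check is sketched below. The function name `variance_report` and the reward values are hypothetical and are not data from the paper; the sketch assumes each group is a list of trajectory-level rewards for rollouts of the same prompt.

```python
from statistics import pvariance


def variance_report(groups, tol=1e-12):
    """Summarize intra-group reward variance over a list of reward groups."""
    variances = [pvariance(g) for g in groups]
    flat = sum(v <= tol for v in variances)  # groups with no usable signal
    return {
        "mean_variance": sum(variances) / len(variances),
        "flat_group_fraction": flat / len(variances),
    }


# Hypothetical rewards for the same three prompts under two reward schemes.
outcome_only = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.0, 0.0]]
with_partial = [[0.0, 0.5, 0.0, 0.5], [1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 0.5, 0.0]]

baseline = variance_report(outcome_only)
shaped = variance_report(with_partial)
```

If the shaped rewards showed no drop in `flat_group_fraction` and no rise in `mean_variance` on real training logs, that would count against the mechanism as described.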

Original abstract

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning predominantly rely on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests as (1) Process homogenization, where the thinking, reasoning, and tooling involved in generation are ignored. (2) Intra-group homogenization, coarse-grained outcome rewards often lead to inefficiencies in intra-group advantage estimation with methods like Group Relative Policy Optimization (GRPO) during sampling. To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, allocating partial rewards to the step where the ground-truth answer first appears, thereby preserving process-level signals and increasing reward variance within groups without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively. Code is available at https://github.com/Flipped-May/TSPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TSPO adds a first-occurrence partial reward to GRPO for multi-turn tool use, but the reported gains rest on unshown details and a shaky detection assumption.

TSPO targets the double homogenization problem in RL for multi-turn LLM search: sparse outcome rewards that ignore process steps and produce flat intra-group advantages under GRPO. The new piece is the FOLR rule, which assigns partial credit exactly at the first step where the ground-truth answer appears. This is meant to keep some process signal alive and raise reward variance within groups without extra models or labels. The turn-level stage-aware framing is a clean way to apply it across iterations. That part is straightforward and directly addresses a practical pain point in tool-augmented agents. The paper does a decent job naming the issue and offering a lightweight fix that builds on existing GRPO setups.

The soft spots are the missing pieces. The abstract states 24% and 13.6% average gains on Qwen2.5-3B and 7B but gives no ablations, no error analysis, and no description of how first-occurrence detection actually works. If it relies on naive string matching, the step it credits could easily be inside a tool output or retrieved document rather than the model's reasoning, which would inject bias instead of preserving meaningful process information. That concern from the stress test looks live based on what's shown. Without inspecting the released code or seeing more results, the central claim stays hard to evaluate.

This is for people already running RL on tool-using LLMs who want a simple reward tweak. A reader in that niche could try the FOLR idea quickly, but only after checking the detection logic. I would send it to peer review. The idea is narrow but real, and referees could force the needed verification on the reward assignment and the gains.
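The detection concern can be made concrete with a small sketch. It does not reproduce the paper's detection logic, which this page does not show; the `Segment` type, the `"model"` and `"tool"` role labels, and the `roles` filter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    role: str  # "model" for generated text, "tool" for retrieved content (hypothetical labels)
    text: str


def first_occurrence(segments, answer, roles=None):
    """Index of the first segment containing the answer, optionally limited to given roles."""
    needle = answer.strip().lower()
    if not needle:
        return None
    for i, seg in enumerate(segments):
        if roles is not None and seg.role not in roles:
            continue
        if needle in seg.text.lower():
            return i
    return None


trajectory = [
    Segment("model", "I should look up who wrote the novel."),
    Segment("tool", "Result: ... written by Mary Shelley in 1818 ..."),
    Segment("model", "The retrieved page names Mary Shelley as the author."),
]

naive_hit = first_occurrence(trajectory, "Mary Shelley")
model_hit = first_occurrence(trajectory, "Mary Shelley", roles={"model"})
```

Naive matching credits the tool output at index 1, while the role-restricted variant credits the model's own turn at index 2. Which of the two better reflects useful process credit is exactly the question the letter leaves open.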

Referee Report

2 major / 2 minor

Summary. The paper proposes Turn-level Stage-aware Policy Optimization (TSPO) to resolve the Double Homogenization Dilemma in multi-turn tool-integrated LLM reasoning. It introduces the First-Occurrence Latent Reward (FOLR) mechanism, which assigns partial rewards exactly at the first occurrence of the ground-truth answer string to preserve process-level signals and boost intra-group reward variance for GRPO-style advantage estimation, without external reward models or annotations. Experiments on Qwen2.5-3B and 7B models report average gains of 24% and 13.6% over state-of-the-art baselines.

Significance. If the empirical claims are substantiated with rigorous ablations and the FOLR mechanism is shown to deliver genuine process signals rather than incidental string matches, the work could meaningfully advance RL methods for search-augmented reasoning by mitigating sparse rewards and homogenization issues in a practical, annotation-free manner. The open-source code is a positive factor for reproducibility.

major comments (2)

[Abstract] Abstract: the central claims of 24% and 13.6% average performance gains over baselines rest on empirical assertions with no reported experimental details, ablation studies, number of runs, statistical significance, or error analysis, making it impossible to verify whether the gains address the stated dilemma or arise from other factors.
[FOLR mechanism (Section 3)] FOLR mechanism description: the assumption that naive first-occurrence string matching for the ground-truth answer assigns a meaningful process-level signal (rather than incidental matches in tool outputs or retrieved documents) is load-bearing for both the process-homogenization and intra-group variance claims, yet no analysis, filtering method, or counterexample handling is provided to rule out the bias risk in multi-turn tool trajectories.

minor comments (2)

[Introduction] The introduction of the term 'Double Homogenization Dilemma' would benefit from explicit citations to prior RL or search-augmented reasoning literature to clarify its novelty relative to existing sparse-reward discussions.
[Method] Notation for reward allocation and advantage estimation in the TSPO formulation should be cross-referenced to standard GRPO equations for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments. We address each major point below and commit to revisions that strengthen the empirical rigor and mechanism analysis without altering the core claims.

Referee: [Abstract] Abstract: the central claims of 24% and 13.6% average performance gains over baselines rest on empirical assertions with no reported experimental details, ablation studies, number of runs, statistical significance, or error analysis, making it impossible to verify whether the gains address the stated dilemma or arise from other factors.

Authors: We agree that the abstract is too concise and omits key details needed to substantiate the reported gains. In the revised manuscript we will expand the abstract to include the number of evaluation runs (3 random seeds), the specific benchmarks used, a brief reference to the ablation studies in Section 4, and a note that all results include standard deviation and statistical significance testing. The main text already contains the full experimental protocol, but we will ensure the abstract now provides sufficient context for readers to assess the claims.

revision: yes

Referee: [FOLR mechanism (Section 3)] FOLR mechanism description: the assumption that naive first-occurrence string matching for the ground-truth answer assigns a meaningful process-level signal (rather than incidental matches in tool outputs or retrieved documents) is load-bearing for both the process-homogenization and intra-group variance claims, yet no analysis, filtering method, or counterexample handling is provided to rule out the bias risk in multi-turn tool trajectories.

Authors: We acknowledge that the current description of FOLR does not include explicit analysis of incidental string matches. In the revision we will add a new subsection (3.3) that quantifies the frequency of first-occurrence matches across trajectories, provides representative examples distinguishing reasoning-aligned matches from potential noise in tool outputs, and reports an ablation that measures performance when incidental matches are manually filtered. This will directly address the bias risk while preserving the annotation-free nature of the method.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; TSPO/FOLR is a design choice, not a derived reduction

The paper introduces TSPO and the FOLR mechanism as an explicit design rule: partial rewards are assigned exactly at the first string occurrence of the ground-truth answer. This rule is stated directly in the abstract and is not obtained by solving any equation, fitting a parameter to a subset of data, or invoking a self-citation chain. No equations appear in the provided text, and the claimed performance gains (24% / 13.6%) are presented as empirical outcomes of the rule rather than quantities forced by construction. The approach extends standard GRPO-style advantage estimation without redefining any input quantity in terms of its own output. Therefore the derivation chain contains no self-definitional, fitted-input, or self-citation load-bearing steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5546 in / 1024 out tokens · 52291 ms · 2026-05-16T10:06:13.889270+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dissecting Failure Dynamics in Large Language Model Reasoning
cs.AI 2026-04 unverdicted novelty 6.0

LLM reasoning failures cluster at early entropy-spike transitions; the GUARD inference-time framework redirects them for more reliable results.