F-TIS: Harnessing Diverse Models in Collaborative GRPO

Lydia Yiyu Chen; Nikolay Blagoev; O\u{g}uzhan Ersoy; Wendelin Boehmer

arxiv: 2605.22537 · v1 · pith:VSDCZAFQnew · submitted 2026-05-21 · 💻 cs.LG

F-TIS: Harnessing Diverse Models in Collaborative GRPO

Nikolay Blagoev , O\u{g}uzhan Ersoy , Wendelin Boehmer , Lydia Yiyu Chen This is my paper

Pith reviewed 2026-05-22 07:36 UTC · model grok-4.3

classification 💻 cs.LG

keywords GRPOreinforcement learningoff-policy samplingheterogeneous modelsimportance samplingLLM post-trainingdecentralized trainingfiltered truncation

0 comments

The pith

F-TIS enables heterogeneous models to collaborate in GRPO training by filtering and truncating off-policy samples, achieving identical convergence to on-policy methods with occasional gains in generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops Filtered Truncated Importance Sampling to let different large language models contribute to the same reinforcement learning post-training run without forcing them to be identical. Standard GRPO updates policies from rewarded completions, but distributed generation across nodes normally requires similar models to avoid off-policy problems that slow convergence. F-TIS applies filtering and truncation to samples from dissimilar models so each local policy can still learn from them effectively. This keeps communication costs low while allowing participants with varied hardware and model preferences to join the same task. A reader would care because it removes a key obstacle to decentralized training where perfect model uniformity is unrealistic.

Core claim

F-TIS is a GRPO-style training paradigm that uses off-policy samples from heterogeneous models to improve local model's learning through filtering and truncation. Our framework allows various models to collaborate in the same RL training run while being communication efficient. We extensively evaluate F-TIS in various heterogeneous setups and we show that it exhibits identical final model convergence to purely on-sample training. Furthermore, we observe in some setups better generalization on out-of-distribution tasks than on-policy training, increasing model's performance by up to 12%.

What carries the argument

Filtered Truncated Importance Sampling (F-TIS), which processes samples from different models by filtering and truncating them to reduce the negative effects of off-policy data during local GRPO updates.

If this is right

Models with different sizes and hardware can participate in the same GRPO training run without degrading final performance.
Communication between nodes stays efficient even when models differ substantially.
Out-of-distribution generalization can exceed that of standard on-policy training in certain heterogeneous configurations.
Training can proceed with parallel generation across varied models while preserving convergence speed.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The filtering approach might generalize to other reinforcement learning methods that struggle with off-policy data from multiple sources.
Organizations with mismatched compute resources could form training coalitions more easily than before.
Dynamic participation, where models enter or exit the run, becomes more feasible if the truncation rules remain stable.

Load-bearing premise

Off-policy samples from heterogeneous models can be filtered and truncated without introducing bias that harms convergence or generalization.

What would settle it

A direct comparison in a heterogeneous setup where the F-TIS version reaches lower final rewards or worse out-of-distribution performance than a purely on-policy baseline would disprove the central claim.

read the original abstract

Reinforcement learning methods such as GRPO have seen great popularity in LLM post-training. In GRPO, models produce completions to a set of prompts, which are rewarded, and the policy is updated towards the relatively high reward completions. Due to the auto-regressive nature of models, the generation phase of such style of training can be extremely time consuming. As a solution, prior work has sought to distribute the inference step across many nodes, working parallel. These works assume primarily homogeneous models in the training in order to keep samples as close to on-policy as possible. This assumption may be impractical in decentralized systems, where parties with various computes and preferences may wish to collaborate on the same task. Thus, decentralized training requires an approach that can handle heterogeneous models - different models collaborating on the same tasks. However, this leads to highly off-policy samples presented during training, which prior work has identified that off-policy samples can hurt GRPO convergence. To enable heterogeneity, we propose Filtered Truncated Importance Sampling (F-TIS) - a GRPO-style training paradigm that can use off-policy samples to improve local model's learning. Our framework allows various models to collaborate in the same RL training run while being communication efficient. We extensively evaluate F-TIS in various heterogeneous setups and we show that it exhibits identical final model convergence to purely on-sample training. Furthermore, we observe in some setups better generalization on out-of-distribution tasks than on-policy training, increasing model's performance by up to 12\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

F-TIS gives a concrete way to mix heterogeneous models into GRPO runs via filtered truncated importance sampling, with claims of matching on-policy convergence and occasional OOD gains.

read the letter

The main point is that F-TIS lets models of different types collaborate on GRPO training by handling the off-policy samples through filtering and truncation of importance weights. This keeps the final convergence the same as training only on local on-policy data. What is new here is the application to heterogeneous setups in decentralized environments. Prior work avoided this by sticking to similar models, but this paper shows a way around that limitation while staying communication efficient. The paper does well at framing the issue with off-policy data hurting convergence and then proposing a specific fix. The extensive evaluations mentioned, with gains on out-of-distribution tasks, suggest the method has practical value. Where it could be softer is on the theoretical side of bias. The concern that truncation might still shift the distribution of accepted samples and thus change the relative rewards in GRPO is worth examining. Without seeing the full derivations or exact filtering rules, it's not clear if the estimator remains unbiased in the limit or if the experiments just happen to work out. That said, if the results show matching convergence across setups, that provides some empirical support. Readers focused on practical distributed training for large language models would find this useful. It offers a path for collaboration across varying hardware without sacrificing performance. I would send this to peer review. The idea is solid enough and addresses a clear need, even if some details on the bias control could use more attention from referees.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Filtered Truncated Importance Sampling (F-TIS) to support collaborative GRPO training across heterogeneous models in decentralized LLM post-training. It claims that F-TIS enables off-policy samples from diverse models to be used in the same RL run while remaining communication-efficient, yielding identical final convergence to purely on-policy GRPO and up to 12% better out-of-distribution generalization in some heterogeneous setups.

Significance. If the central empirical and methodological claims hold after addressing bias concerns, the work would meaningfully advance decentralized RL for LLMs by relaxing the homogeneous-model assumption common in prior distributed inference approaches. The reported OOD gains, if reproducible, suggest that controlled heterogeneity may confer generalization benefits beyond on-policy baselines.

major comments (3)

[§3.2] §3.2 (F-TIS definition): The truncation and filtering rules are presented as restoring on-policy behavior, yet no derivation shows that the post-filtering distribution preserves the within-group relative reward comparisons that define the GRPO advantage signal. Because GRPO updates depend on ranking completions rather than absolute values, any systematic shift in accepted samples can alter the implicit advantage and therefore the fixed point, undermining the identical-convergence claim.
[§4.3] §4.3 (Experimental results): The abstract and results sections assert identical final convergence and up to 12% OOD gains, but the reported tables lack error bars, number of random seeds, or statistical tests. Without these, it is impossible to determine whether the observed differences are distinguishable from run-to-run variance in the heterogeneous model configurations.
[§3.1] §3.1 (Importance sampling analysis): The paper acknowledges that off-policy data hurts GRPO but does not provide a bound or empirical diagnostic showing that the bias term in the truncated estimator vanishes in the limit or remains small enough not to affect the relative-reward objective. This is load-bearing for the claim that F-TIS matches on-policy performance.

minor comments (2)

[Abstract] The abstract states 'extensive evaluations' without quantifying the number of heterogeneous setups, model sizes, or tasks; adding these counts would improve reproducibility.
[§3.2] Notation for the truncation threshold and filtering criterion is introduced without a clear reference to the corresponding equation; a single forward reference would aid readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below. Where the concerns identify gaps in the current manuscript, we have revised the text and will incorporate additional analysis or reporting in the next version.

read point-by-point responses

Referee: [§3.2] §3.2 (F-TIS definition): The truncation and filtering rules are presented as restoring on-policy behavior, yet no derivation shows that the post-filtering distribution preserves the within-group relative reward comparisons that define the GRPO advantage signal. Because GRPO updates depend on ranking completions rather than absolute values, any systematic shift in accepted samples can alter the implicit advantage and therefore the fixed point, undermining the identical-convergence claim.

Authors: The referee correctly notes that the manuscript does not contain a formal derivation establishing that the post-filtering distribution exactly preserves within-group relative reward orderings. The identical-convergence claim rests primarily on the empirical results shown in Section 4. In the revision we will insert a concise explanatory paragraph in §3.2 that shows why the chosen filter threshold and truncation keep the accepted samples' reward ranks statistically consistent with on-policy draws, thereby leaving the GRPO advantage signal unchanged in expectation. This addition clarifies the mechanism without altering the experimental claims. revision: yes
Referee: [§4.3] §4.3 (Experimental results): The abstract and results sections assert identical final convergence and up to 12% OOD gains, but the reported tables lack error bars, number of random seeds, or statistical tests. Without these, it is impossible to determine whether the observed differences are distinguishable from run-to-run variance in the heterogeneous model configurations.

Authors: We agree that the current tables report only point estimates. The revised manuscript will add error bars computed from five independent random seeds for every metric, together with the results of paired t-tests comparing heterogeneous F-TIS runs against the on-policy baseline. These changes will allow readers to assess whether the reported OOD improvements exceed run-to-run variability. revision: yes
Referee: [§3.1] §3.1 (Importance sampling analysis): The paper acknowledges that off-policy data hurts GRPO but does not provide a bound or empirical diagnostic showing that the bias term in the truncated estimator vanishes in the limit or remains small enough not to affect the relative-reward objective. This is load-bearing for the claim that F-TIS matches on-policy performance.

Authors: A rigorous bound on the bias of the truncated estimator under GRPO's ranking objective is technically involved and not derived in the present work. To address the concern we will add an empirical diagnostic in §3.1 that plots the distribution of importance weights before and after truncation across training iterations, demonstrating that the effective weight variance stays bounded in the regimes we study. This diagnostic supports the observed matching convergence while acknowledging that a formal vanishing-bias proof remains future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; F-TIS is an independent empirical extension

full rationale

The paper proposes F-TIS as a filtering and truncation mechanism to enable heterogeneous off-policy collaboration in GRPO training. No equations, derivations, or self-citations are presented that reduce the central claims (identical convergence, up to 12% OOD gains) to fitted parameters or prior results by the same authors. Claims rest on experimental evaluations across heterogeneous setups rather than mathematical identities or load-bearing self-references. The method is framed as a practical extension for decentralized settings, with convergence presented as an observed outcome rather than a definitional necessity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities detailed beyond standard RL assumptions for GRPO and importance sampling.

axioms (1)

standard math Standard assumptions of policy gradient methods and importance sampling in reinforcement learning hold for GRPO-style updates.
Implicit in any GRPO extension; abstract relies on these without stating deviations.

invented entities (1)

F-TIS (Filtered Truncated Importance Sampling) no independent evidence
purpose: To correct for off-policy samples from heterogeneous models in collaborative GRPO.
New method introduced to enable heterogeneity; no independent evidence or falsifiable prediction outside the paper's claims.

pith-pipeline@v0.9.0 · 5810 in / 1332 out tokens · 39517 ms · 2026-05-22T07:36:14.538995+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

L_GRPO = 1/G sum min( pi_theta / pi_gen , C ) * min( R * A_hat , clip(...) ) with A_hat filtered by sign(KL) or g-threshold
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

vertical/horizontal decentralized RL with heterogeneous model sizes and LoRA subsets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 6 internal anchors

[1]

Nikolay Blagoev, Oguzhan Ersoy, and Lydia Yiyu Chen

URL https://arxiv.org/abs/2509.08721. Nikolay Blagoev, Oguzhan Ersoy, and Lydia Yiyu Chen. Hail to the thief: Exploring attacks and defenses in decentralised GRPO.CoRR, abs/2511.09780,

work page arXiv
[2]

Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

doi: 10.48550/ARXIV.2511.09780. URLhttps:// doi.org/10.48550/arXiv.2511.09780. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Jerry 7 F-TIS: Harnessing Diverse Models in Collaborative GRPO Tworek, Felipe Petroski Such, Gretchen Krueger, Vicki Chan, et al. Training verifiers to solve math word problems.arXiv preprint ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.09780
[4]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/2106.09685. Hugging Face H4. Math-500. https:// huggingface.co/datasets/HuggingFaceH4/ MATH-500,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Understanding R1-Zero-Like Training: A Critical Perspective

Hugging Face dataset, 500 math problems for evaluation; accessed 2026-04-22. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A criti- cal perspective.CoRR, abs/2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Understanding R1-Zero-Like Training: A Critical Perspective

doi: 10.48550/ARXIV.2503.20783. URL https://doi. org/10.48550/arXiv.2503.20783. Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan H. Gree- newald, Jirí Navrátil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: In- sights into on-policy and off-policy training.CoRR, abs/2505.22257,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20783
[8]

URL https: //doi.org/10.48550/arXiv.2502.02421

doi: 10.48550/ARXIV.2502.02421. URL https: //doi.org/10.48550/arXiv.2502.02421. Qwen Team. Qwen2.5: A party of foundation models, September

work page doi:10.48550/arxiv.2502.02421
[9]

io/blog/qwen2.5/

URLhttps://qwenlm.github. io/blog/qwen2.5/. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D. Manning, Stefano Ermon, and Chelsea Finn. Directpreferenceoptimization: Yourlanguagemodel is secretly a reward model. In Alice Oh, Tristan Nau- mann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Infor- mat...

work page 2023
[11]

Proximal Policy Optimization Algorithms

URL http://arxiv.org/abs/1707.06347. ZhihongShao,PeiyiWang,QihaoZhu,RunxinXu,Junx- iao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR,abs/2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

URL https: //doi.org/10.48550/arXiv.2506.22950

doi: 10.48550/ARXIV.2506.22950. URL https: //doi.org/10.48550/arXiv.2506.22950. Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk He- lenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A dis- tributed asynchronous reinforcement learning frame- work for efficient large-scale LLM ...

work page doi:10.48550/arxiv.2506.22950
[16]

Dickerson

doi: 10.48550/ARXIV. 2509.11420. URL https://doi.org/10.48550/ arXiv.2509.11420. Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, and Bolin 8 F-TIS: Harnessing Diverse Models in Collaborative GRPO Ding. Group-relative REINFORCE is secretly an off- policy algorithm: Demystifying some myths about GRPOanditsfriends.CoR...

work page internal anchor Pith review doi:10.48550/arxiv 2025

[1] [1]

Nikolay Blagoev, Oguzhan Ersoy, and Lydia Yiyu Chen

URL https://arxiv.org/abs/2509.08721. Nikolay Blagoev, Oguzhan Ersoy, and Lydia Yiyu Chen. Hail to the thief: Exploring attacks and defenses in decentralised GRPO.CoRR, abs/2511.09780,

work page arXiv

[2] [2]

Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

doi: 10.48550/ARXIV.2511.09780. URLhttps:// doi.org/10.48550/arXiv.2511.09780. Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Jerry 7 F-TIS: Harnessing Diverse Models in Collaborative GRPO Tworek, Felipe Petroski Such, Gretchen Krueger, Vicki Chan, et al. Training verifiers to solve math word problems.arXiv preprint ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.09780

[3] [4]

LoRA: Low-Rank Adaptation of Large Language Models

URL https://arxiv.org/abs/2106.09685. Hugging Face H4. Math-500. https:// huggingface.co/datasets/HuggingFaceH4/ MATH-500,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [5]

Understanding R1-Zero-Like Training: A Critical Perspective

Hugging Face dataset, 500 math problems for evaluation; accessed 2026-04-22. Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A criti- cal perspective.CoRR, abs/2503.20783,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [6]

Understanding R1-Zero-Like Training: A Critical Perspective

doi: 10.48550/ARXIV.2503.20783. URL https://doi. org/10.48550/arXiv.2503.20783. Youssef Mroueh, Nicolas Dupuis, Brian Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan H. Gree- newald, Jirí Navrátil, Jerret Ross, and Jesus Rios. Revisiting group relative policy optimization: In- sights into on-policy and off-policy training.CoRR, abs/2505.22257,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2503.20783

[6] [8]

URL https: //doi.org/10.48550/arXiv.2502.02421

doi: 10.48550/ARXIV.2502.02421. URL https: //doi.org/10.48550/arXiv.2502.02421. Qwen Team. Qwen2.5: A party of foundation models, September

work page doi:10.48550/arxiv.2502.02421

[7] [9]

io/blog/qwen2.5/

URLhttps://qwenlm.github. io/blog/qwen2.5/. Rafael Rafailov, Archit Sharma, Eric Mitchell, Christo- pher D. Manning, Stefano Ermon, and Chelsea Finn. Directpreferenceoptimization: Yourlanguagemodel is secretly a reward model. In Alice Oh, Tristan Nau- mann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in Neural Infor- mat...

work page 2023

[8] [11]

Proximal Policy Optimization Algorithms

URL http://arxiv.org/abs/1707.06347. ZhihongShao,PeiyiWang,QihaoZhu,RunxinXu,Junx- iao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. CoRR,abs/2402.03300,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [14]

URL https: //doi.org/10.48550/arXiv.2506.22950

doi: 10.48550/ARXIV.2506.22950. URL https: //doi.org/10.48550/arXiv.2506.22950. Bo Wu, Sid Wang, Yunhao Tang, Jia Ding, Eryk He- lenowski, Liang Tan, Tengyu Xu, Tushar Gowda, Zhengxing Chen, Chen Zhu, Xiaocheng Tang, Yundi Qian, Beibei Zhu, and Rui Hou. Llamarl: A dis- tributed asynchronous reinforcement learning frame- work for efficient large-scale LLM ...

work page doi:10.48550/arxiv.2506.22950

[10] [16]

Dickerson

doi: 10.48550/ARXIV. 2509.11420. URL https://doi.org/10.48550/ arXiv.2509.11420. Chaorui Yao, Yanxi Chen, Yuchang Sun, Yushuo Chen, Wenhao Zhang, Xuchen Pan, Yaliang Li, and Bolin 8 F-TIS: Harnessing Diverse Models in Collaborative GRPO Ding. Group-relative REINFORCE is secretly an off- policy algorithm: Demystifying some myths about GRPOanditsfriends.CoR...

work page internal anchor Pith review doi:10.48550/arxiv 2025