CoFi-PGMA: Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs
Pith reviewed 2026-05-13 19:23 UTC · model grok-4.3
The pith
In multi-agent LLM systems, a counterfactual objective based on each agent's marginal contribution corrects the filtered training signal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoFi-PGMA derives a single counterfactual per-agent policy gradient objective whose updates equal the difference in expected reward when that agent's action is included versus excluded, thereby correcting the misspecified signal that arises when routing selects one response or when agents share a final reward.
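In symbols (our own notation; the page reproduces no equations from the paper): write $h$ for the context, $\mathbf{a}$ for the joint action, and $\mathbf{a}_{-i}$ for the joint action with agent $i$'s response removed or resampled. The claimed per-agent objective would then take the form of a marginal contribution and its policy gradient:

```latex
% Sketch in our notation, not the paper's; \Delta_i is agent i's marginal contribution.
\Delta_i(h,\mathbf{a}) \;=\; r(h,\mathbf{a}) \;-\; \mathbb{E}\!\left[ r(h,\mathbf{a}_{-i}) \right],
\qquad
\nabla_{\theta_i} J_i \;=\; \mathbb{E}\!\left[ \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid h)\, \Delta_i(h,\mathbf{a}) \right].
```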
What carries the argument
The counterfactual per-agent objective based on marginal contribution, which reweights or subtracts the filtered reward to isolate each agent's incremental effect on the outcome.
If this is right
- Routing systems receive off-policy corrections that account for the fact that only the chosen response is evaluated.
- Collaborative systems obtain leave-one-out difference rewards that isolate each agent's credit (see the sketch after this list).
- Softmax routing creates risk-sensitive incentives that the same objective can quantify.
- The framework supplies practical estimators that combine with multiturn reward models and standard policy optimizers.
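A minimal sketch of the leave-one-out difference reward from the second bullet, assuming the shared reward can be re-evaluated on a counterfactual joint action; `reward`, `joint_action`, and `baselines` are illustrative names of ours, not the paper's API:

```python
# Leave-one-out difference rewards for shared-reward collaboration (our sketch).
# credit_i = R(a) - R(a with agent i's action swapped for a baseline action).
def difference_rewards(reward, joint_action, baselines):
    full = reward(joint_action)
    credits = []
    for i, baseline_action in enumerate(baselines):
        counterfactual = list(joint_action)
        counterfactual[i] = baseline_action   # "remove" agent i's contribution
        credits.append(full - reward(counterfactual))
    return credits
```

As a sanity check, with an additive reward such as `reward=lambda a: float(sum(a))`, each agent's credit is exactly the value of its own action, as an additive reward should give.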
Where Pith is reading between the lines
- The same marginal-contribution correction could apply to non-LLM multi-agent systems that use selection or shared scoring.
- Teams could scale the number of agents without requiring full observability of every contribution.
- Dynamic or learned routing policies might be trained end-to-end under the same objective without separate credit-assignment modules.
Load-bearing premise
Each agent's marginal contribution can be recovered in an unbiased way from the filtered reward without needing further assumptions about reward structure or agent independence.
What would settle it
Run a controlled multi-agent routing experiment where ground-truth per-agent contributions are known in advance, then check whether the counterfactual objective assigns higher probability mass to the truly higher-contributing agents than standard RLHF does.
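A toy harness of our own for this test (the agent qualities, router, and credit rules below are illustrative assumptions, not the paper's experiment): only the routed agent is scored, naive credit accumulates the raw filtered reward, and the counterfactual credit applies an inverse-propensity correction.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 4, 50_000
q = np.array([0.9, 0.6, 0.4, 0.1])            # ground-truth per-agent contribution
route_p = np.array([0.05, 0.15, 0.30, 0.50])  # router favors the weaker agents

naive = np.zeros(K)
counterfactual = np.zeros(K)
for _ in range(T):
    k = rng.choice(K, p=route_p)              # selection-gated: only agent k is scored
    r = rng.binomial(1, q[k])
    naive[k] += r                             # raw filtered signal
    counterfactual[k] += r / route_p[k]       # inverse-propensity correction

print("true ranking:        ", np.argsort(-q))
print("naive credit ranking:", np.argsort(-naive))           # confounded by route_p
print("corrected ranking:   ", np.argsort(-counterfactual))  # recovers q's order
```

Under this setup the naive accumulated credit ranks frequently-routed weak agents above rarely-routed strong ones, while the corrected credit recovers the ground-truth ordering.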
Original abstract
Large language model (LLM) deployments increasingly rely on multi-agent architectures in which multiple models either compete through routing mechanisms or collaborate to produce a final answer. In both settings, the learning signal received by each agent is filtered by the system mechanism. Routing produces selection-gated feedback where only the chosen response is evaluated, while collaboration produces shared rewards that obscure the individual contribution of each agent. As a result, standard RLHF objectives designed for a single deployed policy become misspecified. We introduce CoFi-PGMA (Counterfactual Policy Gradients under Filtered Feedback for Multi-Agent LLMs), a unified framework for learning under filtered feedback in multi-agent LLM systems. Our approach derives a counterfactual per-agent training objective based on marginal contribution, which corrects the learning signal under both routing and collaborative mechanisms. For routing systems, the objective corresponds to off-policy corrections for selection-gated feedback, while for collaborative systems it reduces to leave-one-out difference rewards for credit assignment. We further analyze how softmax routing induces risk-sensitive incentives and provide practical training algorithms that integrate counterfactual estimators, multiturn-aware rewards, and policy optimization methods, and demonstrate the approach on a real-world reasoning dataset.
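The abstract's "off-policy corrections for selection-gated feedback" suggests an importance-weighted policy-gradient term. A minimal sketch, assuming the router's selection probability is observable; the function and argument names are ours, not the paper's estimator:

```python
import torch

def gated_pg_loss(logp_action, reward, p_select, baseline=0.0):
    """REINFORCE-style loss for the *selected* agent only, reweighted so that
    the gated signal is unbiased in expectation (our sketch). `p_select` is
    the router's probability of having chosen this agent's response."""
    weight = 1.0 / p_select.clamp_min(1e-6)   # inverse-propensity weight
    return -(weight * (reward - baseline) * logp_action).mean()
```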
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoFi-PGMA, a unified framework for training multi-agent LLM systems under filtered feedback. It claims to derive a counterfactual per-agent training objective based on marginal contribution that corrects learning signals for routing (off-policy selection-gated feedback) and collaboration (leave-one-out difference rewards for credit assignment). The work further analyzes risk-sensitive incentives induced by softmax routing and provides practical algorithms integrating counterfactual estimators, multiturn-aware rewards, and policy optimization, with a demonstration on a real-world reasoning dataset.
Significance. If the derivation is sound and the estimators are unbiased without hidden assumptions on reward decomposability, this framework could meaningfully advance multi-agent RL for LLMs by providing a principled correction for credit assignment in routed and collaborative deployments, where standard single-policy RLHF objectives are misspecified. It unifies two common mechanisms under one counterfactual approach and could support more scalable training of such systems.
major comments (2)
- [Abstract / §3] Abstract and derivation section: The central claim that a counterfactual objective based on marginal contribution corrects the learning signal is asserted without any equations, proof steps, or estimator definitions. This prevents verification of whether the approach avoids bias under selection-gated rewards (where non-selected agents receive no signal) or non-additive shared rewards, as required by standard multi-agent RL results on identifiability.
- [§3 / §4] Identifiability claim: The assumption that marginal contribution remains identifiable and unbiased from filtered (selection-gated or shared) rewards without additional modeling of the routing policy, joint distribution, or reward structure is load-bearing for the unified framework but is neither stated nor justified. A concrete derivation or counterexample analysis is needed to support the reduction to off-policy corrections and leave-one-out rewards.
minor comments (2)
- [Experiments] The demonstration on the real-world reasoning dataset lacks any description of the dataset, baselines, metrics, or quantitative results, making it impossible to assess practical impact.
- [Abstract] Terms such as 'multiturn-aware rewards' and 'filtered feedback' are introduced in the abstract without precise definitions or notation; defining them early would aid clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We agree that the derivation of the counterfactual objective and the identifiability assumptions require explicit mathematical presentation. We will revise the manuscript to include the requested equations, proof steps, estimator definitions, and analysis.
Point-by-point responses
-
Referee: [Abstract / §3] Abstract and derivation section: The central claim that a counterfactual objective based on marginal contribution corrects the learning signal is asserted without any equations, proof steps, or estimator definitions. This prevents verification of whether the approach avoids bias under selection-gated rewards (where non-selected agents receive no signal) or non-additive shared rewards, as required by standard multi-agent RL results on identifiability.
Authors: We acknowledge that the abstract and §3 present the central claim at a high level. In the revised manuscript we will expand §3 with the full derivation: starting from the marginal contribution definition, we will show the step-by-step reduction to the off-policy correction for selection-gated routing feedback and to leave-one-out difference rewards for collaboration. Explicit estimator formulas will be provided together with a discussion of unbiasedness conditions drawn from multi-agent RL identifiability results. revision: yes
-
Referee: [§3 / §4] Identifiability claim: The assumption that marginal contribution remains identifiable and unbiased from filtered (selection-gated or shared) rewards without additional modeling of the routing policy, joint distribution, or reward structure is load-bearing for the unified framework but is neither stated nor justified. A concrete derivation or counterexample analysis is needed to support the reduction to off-policy corrections and leave-one-out rewards.
Authors: We agree the identifiability assumptions must be stated explicitly. The revision will add a subsection in §3 that (i) lists the assumptions on reward decomposability and routing-policy knowledge, (ii) supplies the concrete derivation linking marginal contribution to the two corrected objectives, and (iii) includes a brief counterexample analysis illustrating bias when the assumptions are violated. This will clarify the scope of the unified framework. revision: yes
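The promised counterexample is easy to sketch. In our own toy construction below (not the paper's appendix), two mechanisms yield the identical reward distribution, yet agent 1 carries the reward in one and is irrelevant in the other; shared utility cannot tell them apart, while leave-one-out credit can.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
a1, a2, a1_cf = (rng.integers(0, 2, n) for _ in range(3))  # binary actions; a1_cf resamples agent 1

mechanisms = {"A: r = a1 (agent 1 matters)":    lambda x, y: x,
              "B: r = a2 (agent 1 irrelevant)": lambda x, y: y}

for name, r in mechanisms.items():
    shared_utility = 0.5 * r(a1, a2).mean()    # ~0.25 under both mechanisms
    credit = r(a1, a2) - r(a1_cf, a2)          # leave-one-out difference reward
    print(name, shared_utility.round(3),
          credit[a1 == 1].mean().round(3),     # ~ +0.5 under A, ~ 0 under B
          credit[a1 == 0].mean().round(3))     # ~ -0.5 under A, ~ 0 under B
```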
Circularity Check
No significant circularity; the derivation is presented as built from marginal contribution and other standard RL concepts, with no self-referential definitions.
Full rationale
The abstract states that the approach 'derives a counterfactual per-agent training objective based on marginal contribution' which 'corrects the learning signal under both routing and collaborative mechanisms' and 'corresponds to off-policy corrections' or 'reduces to leave-one-out difference rewards'. No equations, self-citations, or fitted parameters are shown in the provided text that would make any claimed prediction equivalent to its inputs by construction. The central claim relies on standard multi-agent RL notions (marginal contribution, counterfactuals, leave-one-out) without evidence of self-definitional reduction or load-bearing self-citation chains. Since the provided text contains no equations that could reveal circularity, the derivation appears self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: filtered feedback in routing and collaboration distorts standard single-policy RLHF objectives.
Reference graph
Works this paper leans on
- [1] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- [2] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [3] Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. CollabLLM: From passive responders to active collaborators. arXiv preprint arXiv:2502.00640, 2025.
- [4] Kagan Tumer and David H Wolpert. Collective intelligence and Braess' paradox. In AAAI/IAAI, pages 104–109, 2000.
- [5] Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson. Counterfactual multi-agent policy gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [6] Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. arXiv preprint arXiv:1103.4601, 2011.
- [7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [8] Stela Tong, Elai Ben-Gal. Experiments implementation repository. https://colab.research.google.com/drive/1jag9nMNN0NJs193wYvGQX6og4VBzMic8?usp=sharing, 2026.
- [9] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.