The Weakest Link Tells It All: Outcome-Supervised Process Reward Modeling via Learnable Credit Assignment

Hongxin Ding; Junfeng Zhao; Rihong Qiu; Tianyu Jia; Xu Chu; Yasha Wang; Yue Fang; Zhibang Yang; Zhijing Wu

arxiv: 2606.27739 · v1 · pith:DDIXXFDKnew · submitted 2026-06-26 · 💻 cs.LG

Tianyu Jia, Yue Fang, Hongxin Ding, Rihong Qiu, Zhibang Yang, Zhijing Wu, Xu Chu, Junfeng Zhao, Yasha Wang

Pith reviewed 2026-06-29 05:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords: process reward models, outcome supervision, credit assignment, multiple instance learning, weakest link assignment, LLM reasoning, reward modeling, Bayes consistency

The pith

Outcome-supervised process reward models improve by learning credit assignment from the weakest link in reasoning chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to train process reward models for large language models using only final-answer correctness, without expensive step-by-step labels. It identifies the core difficulty as credit assignment, that is, deciding which reasoning steps deserve blame or credit for the outcome, and argues that prior uniform or causal methods fail to tie credit to actual step quality. The proposed solution models the task as multiple instance learning under a weakest-link rule, where the chain is only as good as its weakest step, and introduces a tailored pooling function to jointly optimize credit and rewards. A consistency proof is given under mild assumptions, and experiments report better performance than prior outcome-supervised methods across tasks and backbones. The approach matters because it removes the annotation bottleneck while still producing fine-grained step feedback that can guide LLM reasoning.

Core claim

We introduce LCA, an outcome-supervised framework that jointly learns credit assignment and reward modeling by formalizing the problem as multiple instance learning and applying Softmax-Weighted-Sum pooling under the Weakest Link Assignment principle. This resolves the mutual dependence between credit and reward estimation and yields Bayes-consistent estimates. The resulting models outperform existing outcome-supervised process reward models on multiple reasoning tasks and backbones.

What carries the argument

Softmax-Weighted-Sum (SWS) pooling inside the multiple instance learning formulation of outcome-supervised PRM, which enforces weakest-link credit assignment.
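
To make the pooling idea concrete, here is a minimal Python sketch. This page does not reproduce the paper's SWS formula, so the form used here is an assumption: each per-prefix error probability is weighted by a softmax over those same probabilities, and the weighted sum is the trajectory-level error score. The names `sws_pool`, `trajectory_cross_entropy`, and the `temperature` parameter are hypothetical and introduced only for illustration.

```python
import math


def sws_pool(probs, temperature=1.0):
    """Pool per-prefix error probabilities with softmax weights (assumed SWS form)."""
    if not probs:
        raise ValueError("a bag needs at least one prefix")
    # Weights are a softmax over the probabilities themselves, so prefixes
    # that look most erroneous receive the most credit.
    peak = max(probs)  # subtracted for numerical stability
    exps = [math.exp((p - peak) / temperature) for p in probs]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = sum(w * p for w, p in zip(weights, probs))
    return pooled, weights


def trajectory_cross_entropy(pooled, label, eps=1e-12):
    """Binary cross-entropy on the pooled score; label 1 means the final answer is wrong."""
    pooled = min(max(pooled, eps), 1.0 - eps)
    return -(label * math.log(pooled) + (1 - label) * math.log(1.0 - pooled))


# Hypothetical per-prefix error probabilities for a four-step chain.
probs = [0.05, 0.10, 0.90, 0.85]
pooled, weights = sws_pool(probs, temperature=0.1)
loss = trajectory_cross_entropy(pooled, label=1)
```

Under this assumed form, a very small temperature makes the pooled score approach max pooling, while a very large one makes it approach mean pooling, so the weights act as a soft version of weakest-link credit assignment. Whether the paper uses a temperature at all is not stated here.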

If this is right

Process reward models can be trained at scale using only final-answer correctness.
Credit assignment and reward estimation can be learned jointly rather than sequentially.
The weakest-link principle yields step-level signals that improve error identification over uniform or causal baselines.
Bayes consistency holds under mild assumptions, supporting reliability of the learned assignments.
The method applies across multiple reasoning tasks and LLM backbones without task-specific step annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same weakest-link MIL setup might transfer to other sequence tasks that observe only end-of-sequence outcomes, such as certain planning or verification problems.
If SWS pooling proves robust, variants could be tested on chains longer than those in current benchmarks to check sensitivity to redundancy.
The joint learning objective could be combined with small amounts of step-level data to create hybrid supervision regimes.
The approach suggests that credit assignment in reasoning may be more about identifying failure points than distributing positive credit evenly.

Load-bearing premise

The mutual dependence between credit assignment and reward modeling can be resolved by casting the task as multiple instance learning with softmax-weighted-sum pooling under the weakest-link principle.

What would settle it

A controlled test set with independent human step-correctness labels in which LCA's learned credit scores show no better alignment with true step quality than uniform assignment, or in which final task accuracy fails to exceed strong outcome-supervised baselines.
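
One way such a test could be scored is sketched below. It assumes each incorrect trajectory comes with the index of its first human-labeled erroneous step, and it measures alignment as how often the highest-scored step coincides with that index. Both the metric and the toy data are illustrative assumptions, not the paper's evaluation protocol; the function names are hypothetical.

```python
def first_error_hit_rate(trajectories):
    """Fraction of trajectories whose highest-scored step is the first labeled error."""
    hits = 0
    for scores, first_error in trajectories:
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == first_error)
    return hits / len(trajectories)


def uniform_hit_rate(trajectories):
    """Expected hit rate when blame is spread uniformly over the steps of each chain."""
    return sum(1.0 / len(scores) for scores, _ in trajectories) / len(trajectories)


# Hypothetical data: (per-step credit scores, index of first human-labeled error).
labeled = [
    ([0.1, 0.8, 0.7], 1),
    ([0.2, 0.1, 0.9, 0.6], 2),
    ([0.6, 0.3], 1),
]
model_rate = first_error_hit_rate(labeled)
baseline_rate = uniform_hit_rate(labeled)
```

If `model_rate` were no higher than `baseline_rate` on a real labeled set, that would count against the claim that the learned credit tracks step quality.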

Figures

Figures reproduced from arXiv: 2606.27739 by Hongxin Ding, Junfeng Zhao, Rihong Qiu, Tianyu Jia, Xu Chu, Yasha Wang, Yue Fang, Zhibang Yang, Zhijing Wu.

**Figure 2.** Overview of LCA. The PRM outputs per-prefix error probabilities p, which are then aggregated via Softmax-Weighted-Sum (SWS) pooling. The PRM is trained end-to-end with a trajectory-level cross-entropy loss. • Assumption. By the definition of prefix correctness, the bag-level label is the maximum over instance labels (Eq. 2); this exactly matches the Standard Multi-Instance Assumption. • Learning objective…

**Figure 4.** Average F1 scores of different pooling tech…

**Figure 5.** Average predicted error probability and ground-truth error rate across relative step positions, computed over 1024 incorrect trajectories sampled from the MATH split of Math-Shepherd. The PRM is trained on Math-Shepherd with max pooling for 1 epoch.

**Figure 6.** F1 score on the MATH split of ProcessBench during training the PRM on Math-Shepherd with max pooling. …taining errors), which we refer to as PRM800K-BALANCED. We then train PRMs on this dataset with different pooling techniques, using Qwen3-4B as the backbone. Training details are provided in Appendix C.2. For the stepwise-supervised baseline, we retain the step-level labels from PRM800K-BALANCED. Steps before …
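
The Figure 2 caption says that the bag-level label is the maximum over instance labels and that training uses a trajectory-level cross-entropy loss. In symbols this can be sketched as follows, where $y_t$ (label of prefix $t$), $Y$ (trajectory label), $n$ (number of prefixes), and $\hat{p}(B)$ (pooled error probability of bag $B$) are notation assumed here, since the paper's Eq. 2 is not reproduced on this page:

```latex
Y = \max_{1 \le t \le n} y_t,
\qquad
\mathcal{L}(B, Y) = -\,Y \log \hat{p}(B) \;-\; (1 - Y)\,\log\bigl(1 - \hat{p}(B)\bigr)
```

For a trajectory with a wrong final answer ($Y = 1$) the loss reduces to $-\log \hat{p}(B)$, so lowering it requires at least one prefix to carry a high error probability.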

Original abstract

Process reward models (PRMs) enhance the reasoning capabilities of large language models (LLMs) by providing fine-grained feedback, yet training PRMs typically requires expensive stepwise annotations. Outcome-supervised PRMs offer a scalable alternative by learning from final-answer correctness alone, but this introduces a fundamental *credit assignment* challenge, i.e., attributing outcomes to responsible reasoning steps. Existing approaches rely on either uniform or causal assignment, both of which fail to anchor credit in step correctness and thus hinder process error identification. In this work, we propose Outcome-Supervised Process Reward Modeling via **L**earnable **C**redit **A**ssignment (**LCA**), an outcome-supervised PRM framework that jointly learns credit assignment and reward modeling under the principle of *Weakest Link Assignment: a reasoning chain is as strong as its weakest link*. To address mutual dependence between credit assignment and reward modeling, we formalize outcome-supervised PRM as a Multiple Instance Learning (MIL) problem and introduce Softmax-Weighted-Sum (SWS) pooling, an MIL pooling technique tailored for strong dependence and redundancy among reasoning states. We prove Bayes consistency of our algorithm under mild assumptions. Extensive experiments demonstrate that **LCA** consistently outperforms state-of-the-art outcome-supervised PRMs across multiple tasks and backbones. Code is available at https://anonymous.4open.science/r/LCA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LCA's MIL framing plus SWS pooling is a concrete attempt at learnable credit assignment for outcome-supervised PRMs, but the circular dependence may persist if the mild assumptions do not cover strong step redundancy.

The paper's main move is to cast outcome-supervised PRM training as a multiple instance learning problem and introduce Softmax-Weighted-Sum pooling to implement weakest-link credit assignment. This lets them jointly optimize the reward model and the per-step credits without step-level labels. The SWS operator is the specific new piece, designed to handle the dependence and redundancy that exist among reasoning steps.

They lay out the problem cleanly: uniform and causal baselines do not tie credit to actual step quality, so the model cannot reliably spot process errors. Framing it as MIL with a tailored pooling function is a direct response to that gap, and the Bayes consistency claim under mild assumptions adds a theoretical anchor.

The soft spot is the joint learning loop itself. Credit assignment depends on the current reward model, and the reward model depends on the credits; the stress-test note is right to flag whether the mild assumptions actually break that circle when dependence among steps is strong. If the proof treats the reward model as roughly correct early on, or bounds the redundancy too tightly, the guarantee will not cover the practical regime the method targets. The abstract says the experiments show consistent gains, but without seeing the ablations on SWS versus other pooling choices or controls for the joint optimization, it is hard to know how much of the lift comes from the new formulation.

This work is for groups already running outcome-supervised PRM pipelines and looking for cheaper ways to get process-level signal. A reader working on MIL for sequential or chain-structured data might also pick up the pooling idea.

It has a clear technical proposal and a stated theoretical result, so it deserves a serious referee who can check the proof assumptions and the experimental attribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces LCA, an outcome-supervised PRM framework that jointly optimizes credit assignment and reward modeling by casting the problem as MIL with Softmax-Weighted-Sum (SWS) pooling under the weakest-link principle. It claims a Bayes-consistency proof under mild assumptions and reports consistent empirical outperformance over prior outcome-supervised PRMs across tasks and backbones.

Significance. If the consistency result and gains hold, the work supplies a scalable, annotation-light route to process-level supervision for LLM reasoning, with a novel MIL formulation that directly targets step dependence; the combination of a learnable credit mechanism and a stated consistency guarantee would be a substantive advance over uniform or causal baselines.

major comments (2)

[§4] Bayes consistency theorem: the proof is stated under 'mild assumptions,' yet these assumptions are not shown to be sufficient when steps exhibit the strong dependence and redundancy that SWS is explicitly designed to handle; without an explicit bound on inter-step dependence or a demonstration that the joint optimization converges even when the reward model is initially inaccurate, the guarantee does not clearly transfer to the regime motivating the method.
[§5] Experiments: the claim of consistent outperformance is presented without reported ablations isolating the contribution of learnable credit assignment versus the choice of SWS pooling or the MIL objective; this leaves open whether the gains are attributable to the central modeling choice or to other implementation details.

minor comments (2)

[Abstract] Wording: the statement that LCA 'consistently outperforms' is given without any numerical deltas, task names, or backbone details, reducing immediate readability.
[§3] Notation: the definition of SWS pooling and its relation to standard MIL pooling functions could be stated more explicitly with a short comparison equation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We respond point-by-point to the major comments below.

Referee: [§4] Bayes consistency theorem: the proof is stated under 'mild assumptions,' yet these assumptions are not shown to be sufficient when steps exhibit the strong dependence and redundancy that SWS is explicitly designed to handle; without an explicit bound on inter-step dependence or a demonstration that the joint optimization converges even when the reward model is initially inaccurate, the guarantee does not clearly transfer to the regime motivating the method.

Authors: The Bayes consistency result is established under mild assumptions that are compatible with the dependence and redundancy among steps, as these are directly encoded in the MIL formulation and the SWS pooling operator under the weakest-link principle. The proof already shows convergence of the joint optimization from arbitrary initial reward models. To address the request for greater clarity, we will add a short discussion paragraph in §4 of the revised manuscript explaining how the stated assumptions cover strong inter-step dependence without requiring additional explicit bounds.

Revision: partial

Referee: [§5] Experiments: the claim of consistent outperformance is presented without reported ablations isolating the contribution of learnable credit assignment versus the choice of SWS pooling or the MIL objective; this leaves open whether the gains are attributable to the central modeling choice or to other implementation details.

Authors: We agree that targeted ablations would strengthen the empirical claims. In the revised manuscript we will add ablation experiments that isolate the learnable credit assignment mechanism, the SWS pooling function, and the MIL objective, reporting their individual contributions to performance.

Revision: yes

Circularity Check

0 steps flagged

No circularity; new MIL formulation with SWS pooling stands as independent modeling choice.

The abstract presents LCA as a novel outcome-supervised PRM framework that formalizes the problem as MIL and introduces SWS pooling under the weakest-link principle to handle credit assignment. Bayes consistency is asserted under mild assumptions, with empirical gains reported across tasks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce the central claims to tautological inputs by construction. The modeling choice addresses the stated dependence without evidence of self-definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the weakest-link assignment principle and the suitability of the MIL formulation with SWS pooling for dependent reasoning steps; these are domain assumptions rather than derived quantities.

axioms (2)

domain assumption: Reasoning chains obey the weakest-link principle for credit assignment.
Invoked to motivate the credit-assignment rule and the choice of pooling.
domain assumption: SWS pooling resolves mutual dependence between credit assignment and reward modeling.
Stated as the solution to the core technical challenge.


