Expected Gain-based Escalation in Vertical Federated Learning

Mohamad Mestoukirdi; Vincent Corlay

arxiv: 2606.31331 · v1 · pith:L7LLRGW3new · submitted 2026-06-30 · 💻 cs.LG

Pith reviewed 2026-07-01 06:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords: vertical federated learning, selective escalation, expected gain, collaborative inference, calibration, routing, multi-view classification

The pith

An expected-gain analytical score enables selective escalation in vertical federated learning to balance accuracy and communication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for deciding when to perform a second round of embedding fusion in vertical federated learning inference. It formulates the decision as estimating the expected gain in correctness for a sample, using only held-out calibration data to build an analytical router. This avoids the need for a separately trained routing network and targets the trade-off between predictive performance and communication overhead. A reader would care because collaborative inference across agents often involves redundant communication for samples where local views already suffice.

Core claim

The central claim is that an interpretable router based on an expected-gain score, which merges a calibrated pooled posterior with classwise reliability estimates from calibration data, can decide escalation in a two-round VFL protocol such that communication is used only when it is expected to improve the final decision, yielding superior communication-accuracy trade-offs on multi-view benchmarks compared to baselines.

What carries the argument

The expected-gain score estimation that combines a calibrated pooled posterior with classwise reliability estimates of the VFL model from held-out calibration data.
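A minimal sketch of how such a score could be computed, under one plausible reading of the description above. The page does not give the paper's formula, so the expression used here (estimated round-2 correctness minus estimated round-1 correctness) and every function name are illustrative assumptions, not the authors' method.

```python
from typing import List, Sequence


def classwise_reliability(true_labels: Sequence[int],
                          round2_predictions: Sequence[int],
                          num_classes: int) -> List[float]:
    """Per-class accuracy of the round-2 (fusion) model on held-out calibration data."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for y, y_hat in zip(true_labels, round2_predictions):
        total[y] += 1
        correct[y] += int(y == y_hat)
    # Assumption: a class never seen in calibration gets reliability 0.0.
    return [c / t if t else 0.0 for c, t in zip(correct, total)]


def expected_gain(pooled_posterior: Sequence[float],
                  round2_reliability: Sequence[float]) -> float:
    """Estimated change in correctness probability if this sample is escalated.

    pooled_posterior[k]: calibrated round-1 probability that the true class is k.
    round2_reliability[k]: estimate of P(round 2 correct | true class is k).
    """
    if len(pooled_posterior) != len(round2_reliability):
        raise ValueError("inputs must cover the same classes")
    # If the posterior is calibrated, round 1 is correct with probability max_k p_k.
    p_round1_correct = max(pooled_posterior)
    # Round 2 is correct with probability sum_k p_k * r_k.
    p_round2_correct = sum(p * r for p, r in zip(pooled_posterior, round2_reliability))
    return p_round2_correct - p_round1_correct


def should_escalate(pooled_posterior: Sequence[float],
                    round2_reliability: Sequence[float],
                    threshold: float = 0.0) -> bool:
    """Escalate only when the estimated gain exceeds a communication-cost threshold."""
    return expected_gain(pooled_posterior, round2_reliability) > threshold
```

With a uniform round-2 reliability of 0.9, an uncertain posterior such as `[0.5, 0.3, 0.2]` gives a gain of about 0.4 and is escalated, while a confident posterior such as `[0.95, 0.03, 0.02]` gives a negative gain and stays local. The threshold is a hypothetical knob standing in for the communication cost.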

If this is right

The proposed router improves the communication-accuracy trade-off over confidence-, learned-gain-, and deferral-based baselines.
It requires no separately trained routing network.
It performs well in settings with test-time view degradation.
It applies to multi-view classification tasks in VFL.
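The communication-accuracy trade-off named in the first point can be traced by sweeping the escalation threshold. The sketch below assumes per-sample routing scores and per-sample correctness indicators for both rounds are already available; the function name and the use of escalation rate as the communication measure are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple


def tradeoff_curve(scores: Sequence[float],
                   round1_correct: Sequence[int],
                   round2_correct: Sequence[int],
                   thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """Return one (escalation_rate, accuracy) pair per threshold.

    A sample is escalated when its score exceeds the threshold. Escalated
    samples are judged by the round-2 outcome, all others by round 1.
    """
    n = len(scores)
    curve = []
    for t in thresholds:
        escalated = [s > t for s in scores]
        n_correct = sum(c2 if e else c1
                        for e, c1, c2 in zip(escalated, round1_correct, round2_correct))
        curve.append((sum(escalated) / n, n_correct / n))
    return curve
```

A very low threshold reproduces the always-escalate policy and a very high one the never-escalate policy, so both reference points fall out of the same sweep. Any router that outputs a scalar score can be compared on the same axes.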

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This calibration-based approach might reduce the need for additional training in other selective inference systems.
The score could be extended to incorporate more complex gain estimates beyond classwise reliability.
The method highlights the value of held-out data for routing in distributed learning without extra models.

Load-bearing premise

Held-out calibration data is representative of the deployment distribution so that classwise reliability estimates accurately predict the correctness improvement from second-round fusion.

What would settle it

A test showing that on data from a shifted distribution, the escalation decisions based on the score lead to worse accuracy-communication trade-off than always escalating or using a learned router.

Figures

Figures reproduced from arXiv: 2606.31331 by Mohamad Mestoukirdi, Vincent Corlay.

**Figure 1.** Comparison of the proposed analytical expected-gain router, the learned router, the oracle, and a confidence-based …

**Figure 2.** Comparison of the proposed analytical expected-gain router, the learned router, the oracle, and a confidence-based …

Original abstract

Collaborative inference can improve predictive performance by integrating complementary information across agents, but applying collaborative fusion to every sample can incur unnecessary communication and computational overhead. This trade-off is particularly relevant in vertical federated learning (VFL), where clients observe different views of the same sample and fusion typically requires transmitting intermediate representations to a server. We study selective escalation in a two-round VFL inference protocol, in which a low-cost first round produces a prediction from client posteriors and a second embedding-fusion round is invoked only when it is expected to improve the final decision. We formulate routing as expected-gain score estimation: a sample is escalated when a predicted improvement in correctness justifies the additional communication. The proposed analytical score combines a calibrated pooled posterior with classwise reliability estimates of the VFL model, both obtained from held-out calibration data, yielding an interpretable router that requires no separately trained routing network. Experiments on multi-view classification benchmarks, including controlled test-time view degradation settings, show that the proposed router improves the communication-accuracy trade-off over confidence-, learned-gain-, and deferral-based baselines.
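To make the "calibrated pooled posterior" concrete, here is a hedged sketch of one way it could be obtained. The abstract names neither the pooling rule nor the calibrator, so both choices below are assumptions: client posteriors are pooled by summing log-probabilities, and the pooled logits are calibrated with temperature scaling fitted by grid search on held-out data. All function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pool_log_posteriors(client_posteriors: Sequence[Sequence[float]],
                        eps: float = 1e-12) -> List[float]:
    """Sum client log-probabilities per class (assumed pooling rule)."""
    num_classes = len(client_posteriors[0])
    return [sum(math.log(max(p[k], eps)) for p in client_posteriors)
            for k in range(num_classes)]


def nll(pooled_logits: Sequence[Sequence[float]],
        labels: Sequence[int],
        temperature: float) -> float:
    """Mean negative log-likelihood of the labels at a given temperature."""
    total = 0.0
    for logits, y in zip(pooled_logits, labels):
        probs = softmax([z / temperature for z in logits])
        total -= math.log(max(probs[y], 1e-12))
    return total / len(labels)


def fit_temperature(pooled_logits: Sequence[Sequence[float]],
                    labels: Sequence[int],
                    grid: Optional[Sequence[float]] = None) -> float:
    """Pick the temperature that minimizes calibration-set NLL over a grid."""
    grid = grid or [0.25 * i for i in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(grid, key=lambda t: nll(pooled_logits, labels, t))
```

At inference time the calibrated posterior for a sample would be `softmax([z / T for z in pool_log_posteriors(client_posteriors)])` with the fitted `T`. A fitted temperature above 1 indicates that the raw pooled posterior was overconfident on the calibration set.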

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean analytical expected-gain router for selective escalation in two-round VFL that skips any learned routing network, but the central assumption about calibration data forecasting per-sample fusion gains is the part that needs real checking.

The one thing to know is that this work replaces a trained router with an analytical expected-gain score built from a calibrated pooled posterior and classwise reliability numbers pulled from held-out data. That formulation looks new relative to the usual learned-gain or threshold baselines in the VFL setting.

What the paper does well is keep the router interpretable and training-free on the routing side while still targeting the communication-accuracy trade-off in the two-round protocol. Extending selective-inference style ideas to vertical FL with explicit expected improvement is a reasonable step, and the controlled view-degradation experiments sound like a useful testbed.

The soft spot is exactly the one the stress-test flags: the claim stands or falls on whether the classwise reliability estimates from calibration data actually predict when the second-round fusion will correct an individual sample. If the held-out set does not match deployment or if the aggregates miss instance-level variation, the router will escalate the wrong cases and the reported gains disappear. The abstract does not show the equations or the calibration procedure in enough detail to judge how robust this mapping is, and no error bars or ablation on the calibration set size appear in the summary.

This is for researchers already working on efficient collaborative inference or selective prediction in federated setups. It is narrow enough that it will not move the broader field, but the no-extra-network angle is practical enough that a serious referee should see the full methods and results to decide if the calibration assumption holds in the experiments. I would send it to review rather than desk-reject, with the expectation that the authors will need to strengthen the evidence around that weakest link.

Referee Report

2 major / 1 minor

Summary. The paper proposes an expected-gain score for selective escalation in a two-round VFL inference protocol. A sample is escalated to embedding fusion only when the analytically computed expected improvement in correctness (from a calibrated pooled posterior combined with classwise reliability estimates, both derived from held-out calibration data) justifies the added communication cost. This yields an interpretable router without a separately trained routing network. Experiments on multi-view classification benchmarks, including test-time view degradation, claim improved communication-accuracy trade-offs over confidence-, learned-gain-, and deferral-based baselines.

Significance. If the calibration-based estimates reliably predict per-sample gains, the work offers a lightweight, interpretable alternative to learned routers in VFL that could meaningfully reduce unnecessary communication in collaborative inference without sacrificing accuracy. The parameter-free nature of the router (no additional training) is a clear strength relative to learned baselines.

major comments (2)

[Method / Experiments] The central claim that the expected-gain router improves the communication-accuracy trade-off rests on the assumption that classwise reliability estimates from held-out calibration data accurately forecast per-sample correctness improvements from second-round fusion. The manuscript should include direct evidence for this mapping (e.g., a plot or table correlating estimated gains against observed accuracy deltas on held-out test samples) to substantiate the router's decisions.
[Abstract / Experiments] The abstract states improvements over three baseline classes on multi-view benchmarks but supplies no quantitative results, error bars, or details on expected-gain computation and calibration procedure. The experiments section must report these metrics (including statistical significance) to allow verification of the claimed trade-off gains.
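The evidence requested in the first major comment could take the form of a binned comparison between estimated gains and observed accuracy deltas. The sketch below is one hypothetical way to build such a table; the equal-count binning and the function name are assumptions, not something the paper or the referee specifies.

```python
import math
from typing import List, Sequence, Tuple


def gain_calibration_table(estimated_gains: Sequence[float],
                           round1_correct: Sequence[int],
                           round2_correct: Sequence[int],
                           num_bins: int = 5) -> List[Tuple[int, float, float]]:
    """Bin samples by estimated gain and compare with the observed accuracy delta.

    Returns one row per bin: (bin_size, mean_estimated_gain, observed_delta),
    where observed_delta is round-2 accuracy minus round-1 accuracy in the bin.
    """
    order = sorted(range(len(estimated_gains)), key=lambda i: estimated_gains[i])
    bin_size = max(1, math.ceil(len(order) / num_bins))
    rows = []
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        mean_est = sum(estimated_gains[i] for i in idx) / len(idx)
        observed = sum(round2_correct[i] - round1_correct[i] for i in idx) / len(idx)
        rows.append((len(idx), mean_est, observed))
    return rows
```

If the router's estimates are informative, the observed delta should rise with the mean estimated gain across bins. A flat or non-monotone table would point to the weakness the referee describes.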

minor comments (1)

[Method] Clarify whether the pooled posterior calibration and classwise reliability estimates share the same held-out set or use separate splits, and state any assumptions about distribution shift between calibration and deployment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional evidence and reporting as outlined.

Referee: [Method / Experiments] The central claim that the expected-gain router improves the communication-accuracy trade-off rests on the assumption that classwise reliability estimates from held-out calibration data accurately forecast per-sample correctness improvements from second-round fusion. The manuscript should include direct evidence for this mapping (e.g., a plot or table correlating estimated gains against observed accuracy deltas on held-out test samples) to substantiate the router's decisions.

Authors: We agree that direct validation of the mapping between estimated gains and observed improvements would strengthen the central claim. We will add a new figure (or table) in the experiments section that plots or tabulates the correlation between the analytically computed expected-gain scores (from calibration data) and the actual per-sample accuracy deltas observed when escalation is performed on a separate held-out test set. This will provide the requested empirical support for the router's decisions.

Revision: yes

Referee: [Abstract / Experiments] The abstract states improvements over three baseline classes on multi-view benchmarks but supplies no quantitative results, error bars, or details on expected-gain computation and calibration procedure. The experiments section must report these metrics (including statistical significance) to allow verification of the claimed trade-off gains.

Authors: We acknowledge the need for quantitative detail. We will expand the experiments section to report specific trade-off metrics (e.g., accuracy at given communication budgets), error bars from repeated runs with different random seeds, full details of the expected-gain formula and calibration procedure, and statistical significance tests (e.g., paired t-tests or Wilcoxon tests) against the confidence, learned-gain, and deferral baselines. Key quantitative highlights will also be added to the abstract if space allows.

Revision: yes
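As an illustration of the promised per-seed reporting, the sketch below computes a mean with a sample standard deviation and a paired significance test. Note the swap: instead of the paired t-test or Wilcoxon test named in the response, it uses an exact sign-flip permutation test, chosen only because it needs nothing beyond the standard library. Function names and the example numbers in the usage note are hypothetical.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least 2 seeds)."""
    return mean(scores), stdev(scores)


def paired_sign_flip_pvalue(method_scores: Sequence[float],
                            baseline_scores: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on per-seed paired differences.

    Under the null hypothesis each paired difference is equally likely to have
    either sign, so all 2**n sign assignments are enumerated. This is only
    practical for a small number of seeds.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            count += 1
    return count / total
```

With five seeds and a method that wins on every seed, the smallest attainable two-sided p-value is 2/32 = 0.0625, which shows why more seeds are needed before a 0.05 threshold can be met with this test.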

Circularity Check

0 steps flagged

No circularity: expected-gain router derived from external held-out calibration data

The paper formulates the routing score from a calibrated pooled posterior and classwise reliability estimates, both explicitly obtained from held-out calibration data treated as external to the router. No equations, self-citations, or fitted parameters are shown that reduce the gain score to its own inputs by construction. The central claim therefore retains independent content from the calibration step and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Approach rests on standard domain assumptions about calibration data; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Held-out calibration data is drawn from the same distribution as test data and is sufficient to estimate classwise reliability.
Invoked to obtain both the pooled posterior and the reliability estimates used in the router.
