Recognition: no theorem link
Efficient Ensemble Selection from Binary and Pairwise Feedback
Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3
The pith
Failure-conditioned greedy selects ensembles from binary feedback with the (1-1/e) guarantee and instance-dependent query savings, while the pairwise setting uses a submodular relaxation to recover θ-type guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the binary setting the induced objective is coverage; a failure-conditioned greedy algorithm preserves the (1-1/e) guarantee while obtaining instance-dependent query savings. In the pairwise setting, full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and the objective is monotone but not submodular; a weighted ordinal coverage relaxation supports a failure-conditioned greedy oracle that converts back to θ-type guarantees through finite-family auditing or a minimax wrapper.
What carries the argument
A failure-conditioned greedy algorithm that chooses the next expert by its coverage conditioned on prior failures, yielding query savings while preserving the (1-1/e) approximation.
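The mechanism can be sketched in a few lines. This is a reconstruction under stated assumptions, not the paper's pseudocode: the name `fail_greedy`, the per-round sample budget `m`, and the toy skill sets are all illustrative, and the paper's actual elicitation protocol may differ. The query-saving idea is that candidate experts are only ever queried on tasks the current committee still fails.

```python
import random

def fail_greedy(sample_task, query, experts, k, m):
    """Failure-conditioned greedy committee builder (illustrative sketch).

    sample_task() draws a task id from the (unknown) task distribution;
    query(e, t) returns 1 if expert e solves task t, else 0.  Each round
    draws m fresh tasks, checks them against the current committee
    (stopping at the first success), then queries the remaining candidate
    experts only on tasks the committee failed.
    Returns (committee, total_queries_used).
    """
    committee, queries = [], 0
    for _ in range(k):
        failures = []
        for _ in range(m):
            t = sample_task()
            covered = False
            for e in committee:            # cheap coverage check,
                queries += 1               # stop at the first success
                if query(e, t):
                    covered = True
                    break
            if not covered:
                failures.append(t)
        gains = {}
        for e in experts:
            if e in committee:
                continue
            g = 0
            for t in failures:             # queries only on failures
                queries += 1
                g += query(e, t)
            gains[e] = g
        if not gains or max(gains.values()) == 0:
            break                          # nothing left to gain
        committee.append(max(gains, key=gains.get))
    return committee, queries

# toy usage: four experts with known skill sets over six task types
rng = random.Random(0)
skills = {0: {0, 1, 2}, 1: {3, 4, 5}, 2: {0}, 3: set()}
committee, q = fail_greedy(
    sample_task=lambda: rng.randrange(6),
    query=lambda e, t: int(t in skills[e]),
    experts=list(skills), k=2, m=60)
```

In this toy instance the two complementary experts (0 and 1) are selected, and round-two candidate queries are charged only on the roughly half of sampled tasks the first pick fails, which is where the instance-dependent savings come from.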
Load-bearing premise
A committee's value on a task equals the performance of its single best member and tasks arrive independently from an unknown fixed distribution.
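In symbols (a hedged rendering, borrowing the utility notation $u(d, e)$ that appears in the paper's proof fragments, with $\mathcal{D}$ the unknown task distribution):

$$ f(S) \;=\; \mathbb{E}_{d \sim \mathcal{D}}\Big[\max_{e \in S} u(d, e)\Big]. $$

With binary $u(d, e) \in \{0, 1\}$, the max acts as an OR, so $f(S)$ is the probability that at least one committee member solves the task, i.e. the coverage objective the binary-setting results rely on.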
What would settle it
Measure the number of queries actually used by the greedy algorithm versus exhaustive elicitation on a benchmark dataset in which task outcomes are strongly correlated rather than i.i.d.; the claimed instance-dependent savings should disappear if the distributional assumption fails.
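One way to operationalize this check is a small simulation; the sampler, burst length, and parameters below are invented for illustration, not taken from the paper. The point is that under a bursty, correlated task stream the committee's failure rate per batch (the quantity that drives candidate-side query costs) becomes highly variable, so the instance-dependent savings stop being predictable even when the mean is unchanged.

```python
import random
import statistics

def failure_fraction(stream, covered_types, n=60):
    # fraction of sampled tasks the current committee fails; candidate
    # experts would only be queried on these tasks
    return sum(stream() not in covered_types for _ in range(n)) / n

rng = random.Random(7)
covered = {0, 1, 2}              # task types the committee already solves

iid = lambda: rng.randrange(6)   # i.i.d. uniform task stream

state = {"t": 0, "left": 0}
def bursty():
    # correlated stream: one task type repeats for 15 draws at a time
    if state["left"] == 0:
        state["t"], state["left"] = rng.randrange(6), 15
    state["left"] -= 1
    return state["t"]

iid_sd = statistics.stdev(failure_fraction(iid, covered) for _ in range(200))
burst_sd = statistics.stdev(failure_fraction(bursty, covered) for _ in range(200))
print(f"i.i.d. sd={iid_sd:.3f}  bursty sd={burst_sd:.3f}")
```

Both streams fail about half the time on average, but the bursty stream's per-batch failure fraction fluctuates far more, which is the signature the proposed benchmark experiment would look for.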
Original abstract
Organizations increasingly deploy multiple AI systems across task domains, but selecting a small, high-performing ensemble can require costly model calls, benchmark runs, and human evaluation. We study this selection problem as a distributional variant of multiwinner voting: tasks are drawn from an unknown domain distribution, each task induces feedback over candidate experts, and a committee's value on a task is determined by its best-performing member. We analyze both binary feedback, for tasks with correct/incorrect outcomes, and pairwise feedback, for tasks where candidate outputs are compared by preference. In the binary setting, the induced objective is coverage. We give exhaustive-elicitation baselines and matching worst-case query lower bounds, and we design a failure-conditioned greedy algorithm that preserves the standard $(1-1/e)$ guarantee while obtaining instance-dependent query savings. In the pairwise setting, we study $\theta$-winning committees. We show that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and that the objective is monotone but not submodular. This motivates a weighted ordinal coverage relaxation, which is submodular and supports a failure-conditioned greedy oracle under pairwise feedback. We then convert this oracle back into $\theta$-type guarantees through finite-family auditing or a minimax wrapper. We also provide small-scale LLM experiments illustrating the predicted query savings and the role of complementarity in committee selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames ensemble selection for AI systems as a distributional multiwinner voting problem where tasks are drawn i.i.d. from an unknown domain and a committee's value on a task equals the performance of its best member. For binary (correct/incorrect) feedback the induced objective is coverage; the authors give exhaustive-elicitation baselines with matching query lower bounds and a failure-conditioned greedy algorithm that retains the classic (1-1/e) guarantee while realizing instance-dependent query savings. For pairwise (preference) feedback they study θ-winning committees, prove that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, establish monotonicity without submodularity, and introduce a weighted ordinal coverage relaxation that is submodular and admits a failure-conditioned greedy oracle; θ-type guarantees are recovered via finite-family auditing or a minimax wrapper. Small-scale LLM experiments illustrate the predicted query savings and the importance of complementarity.
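For readers checking the summary, one plausible formalization of the $\theta$-objective, consistent with the evaluation metric that appears in the paper's experiment section (where $\operatorname{rank}_i$ is the ranking induced by task $i$), is

$$ \theta(S) \;=\; \min_{x \notin S} \; \mathbb{E}_{i}\Big[\mathbf{1}\big\{\min_{c \in S} \operatorname{rank}_i(c) \le \operatorname{rank}_i(x)\big\}\Big], $$

i.e. a committee is $\theta$-winning when its best member matches or beats every outside candidate on at least a $\theta$ fraction of tasks; this is how the experiments score held-out committees.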
Significance. If the stated approximation guarantees and hardness results hold, the work is a useful bridge between submodular optimization, multiwinner voting, and practical AI deployment. The failure-conditioned greedy is a natural, query-efficient adaptation of the standard greedy analysis for coverage functions; the pairwise PTAS/no-EPTAS dichotomy and the submodular relaxation provide clean complexity insights. The explicit modeling choices (best-member value function, i.i.d. tasks) are stated up front and do not introduce internal inconsistency. The small-scale experiments, while limited, supply concrete evidence of query savings.
minor comments (3)
- [Abstract / Introduction] The abstract and introduction state that the failure-conditioned greedy “preserves the standard (1-1/e) guarantee”; a pointer to the precise theorem (presumably in the binary-feedback section) that carries out the marginal-gain analysis under the conditioning would help readers verify the claim without re-deriving it.
- [Pairwise-feedback section] The conversion from the weighted ordinal coverage relaxation back to θ-winning guarantees via “finite-family auditing or a minimax wrapper” is central to the pairwise contribution; a short paragraph or lemma clarifying the sample complexity or computational overhead of the auditing step would strengthen the presentation.
- [Experiments] The LLM experiments are described as “small-scale” and illustrative; adding a table or paragraph that reports the exact models, task distribution, number of queries saved, and how complementarity was quantified would improve reproducibility and allow readers to assess whether the observed savings match the instance-dependent analysis.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment of the modeling choices and technical contributions, and the recommendation for minor revision. The summary accurately captures the binary-feedback coverage results, the pairwise PTAS/no-EPTAS dichotomy, the submodular relaxation, and the failure-conditioned greedy approach. Since the report lists no specific major comments, we have no point-by-point rebuttals to provide at this stage.
Circularity Check
No significant circularity; derivation relies on standard external techniques
full rationale
The paper's algorithmic results (failure-conditioned greedy preserving 1-1/e for coverage, PTAS for θ-winning committees, monotone non-submodular objective, weighted ordinal coverage relaxation) are derived from classical submodular maximization and known voting-theory reductions. These steps use explicit modeling choices (i.i.d. tasks, best-member value) and standard marginal-gain analysis without reducing any prediction or guarantee to a fitted parameter, self-citation chain, or internal redefinition. No load-bearing self-citations or ansatz smuggling appear in the provided derivation chain.
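For reference, the standard marginal-gain chain the rationale appeals to: if $f$ is monotone submodular, $S_t$ is the greedy set after $t$ picks, and $S^*$ is an optimal size-$k$ set, then

$$ f(S^*) - f(S_t) \;\le\; \sum_{e \in S^* \setminus S_t} \big(f(S_t \cup \{e\}) - f(S_t)\big) \;\le\; k\,\big(f(S_{t+1}) - f(S_t)\big), $$

and unrolling this recursion gives $f(S_k) \ge \big(1 - (1 - 1/k)^k\big)\, f(S^*) \ge (1 - 1/e)\, f(S^*)$. For a coverage objective, a candidate's marginal gain equals its coverage of currently-uncovered tasks, which is exactly what failure-conditioned queries measure; restricting queries to failures therefore does not change the greedy argmax, which is why the bound can survive the query savings.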
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard assumptions of submodular optimization and greedy-algorithm analysis hold for the coverage objective.
- domain assumption: The objective functions are monotone in both the binary and pairwise settings.
Reference graph
Works this paper leans on
- [1] Amazon Web Services. Beyond the basics: A comprehensive foundation model selection framework for generative AI. https://aws.amazon.com/blogs/machine-learning/beyond-the-basics-a-comprehensive-foundation-model-selection-framework-for-generative-ai/
- [2] https://a16z.com/ai-enterprise-2025/, 2025.
- [3] Learning Unanimously Acceptable Lotteries via Queries. Extended version available as arXiv:2604.17505. Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
- [4] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031.
- [5] Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Mosen Alharthi, Bang An, Juncai He, Ziche Liu, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, and Jinchao Xu. AceGPT, localizing large language models in Arabic. In Proceedings of the 2024 Conference of the North American Chapter..., 2024.
- [6] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
- [7] Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, and Doina Precup. Soft Condorcet optimization for ranking of general agents. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1253–1262.
- [8] Feline Lindeboom, Martijn Brehm, Davide Grossi, and Pradeep Murukannaiah. Diverse committees with incomplete or inaccurate approval ballots. arXiv preprint arXiv:2506.10843.
- [9] Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, et al. A survey on large language model benchmarks. arXiv preprint arXiv:2508.15361.
- [10] Haoyu Song, Thành Nguyen, and Young-San Lin. A few good choices. In Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4861–4874, 2026.
- [11] Itay Asher Zimet, Shiri Alouf-Heffetz, and Nimrod Talmon. Query-based committee selection. arXiv preprint arXiv:2603.29729.