Recognition: no theorem link
Efficient Ensemble Selection from Binary and Pairwise Feedback
Pith reviewed 2026-05-12 04:31 UTC · model grok-4.3
The pith
Failure-conditioned greedy selects ensembles from binary feedback with the (1-1/e) guarantee and instance-dependent query savings, while the pairwise setting uses a submodular relaxation to recover θ-type guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the binary setting the induced objective is coverage; a failure-conditioned greedy algorithm preserves the (1-1/e) guarantee while obtaining instance-dependent query savings. In the pairwise setting, full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and the objective is monotone but not submodular; a weighted ordinal coverage relaxation supports a failure-conditioned greedy oracle that converts back to θ-type guarantees through finite-family auditing or a minimax wrapper.
What carries the argument
A failure-conditioned greedy algorithm that chooses the next expert by its coverage conditioned on prior failures, yielding query savings while preserving the (1-1/e) approximation.
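The mechanism can be sketched in a few lines. This is a reconstruction under stated assumptions, not the paper's pseudocode: the name `fail_greedy`, the per-round sample budget `m`, and the toy skill sets are all illustrative, and the paper's actual elicitation protocol may differ. The query-saving idea is that candidate experts are only ever queried on tasks the current committee still fails.

```python
import random

def fail_greedy(sample_task, query, experts, k, m):
    """Failure-conditioned greedy committee builder (illustrative sketch).

    sample_task() draws a task id from the (unknown) task distribution;
    query(e, t) returns 1 if expert e solves task t, else 0.  Each round
    draws m fresh tasks, checks them against the current committee
    (stopping at the first success), then queries the remaining candidate
    experts only on tasks the committee failed.
    Returns (committee, total_queries_used).
    """
    committee, queries = [], 0
    for _ in range(k):
        failures = []
        for _ in range(m):
            t = sample_task()
            covered = False
            for e in committee:            # cheap coverage check,
                queries += 1               # stop at the first success
                if query(e, t):
                    covered = True
                    break
            if not covered:
                failures.append(t)
        gains = {}
        for e in experts:
            if e in committee:
                continue
            g = 0
            for t in failures:             # queries only on failures
                queries += 1
                g += query(e, t)
            gains[e] = g
        if not gains or max(gains.values()) == 0:
            break                          # nothing left to gain
        committee.append(max(gains, key=gains.get))
    return committee, queries

# toy usage: four experts with known skill sets over six task types
rng = random.Random(0)
skills = {0: {0, 1, 2}, 1: {3, 4, 5}, 2: {0}, 3: set()}
committee, q = fail_greedy(
    sample_task=lambda: rng.randrange(6),
    query=lambda e, t: int(t in skills[e]),
    experts=list(skills), k=2, m=60)
```

In this toy instance the two complementary experts (0 and 1) are selected, and round-two candidate queries are charged only on the roughly half of sampled tasks the first pick fails, which is where the instance-dependent savings come from.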
Load-bearing premise
A committee's value on a task equals the performance of its single best member and tasks arrive independently from an unknown fixed distribution.
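In symbols (a hedged rendering, borrowing the utility notation $u(d, e)$ that appears in the paper's proof fragments, with $\mathcal{D}$ the unknown task distribution):

$$ f(S) \;=\; \mathbb{E}_{d \sim \mathcal{D}}\Big[\max_{e \in S} u(d, e)\Big]. $$

With binary $u(d, e) \in \{0, 1\}$, the max acts as an OR, so $f(S)$ is the probability that at least one committee member solves the task, i.e. the coverage objective the binary-setting results rely on.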
What would settle it
Measure the number of queries actually used by the greedy algorithm versus exhaustive elicitation on a benchmark dataset in which task outcomes are strongly correlated rather than i.i.d.; the claimed instance-dependent savings should disappear if the distributional assumption fails.
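One way to operationalize this check is a small simulation; the sampler, burst length, and parameters below are invented for illustration, not taken from the paper. The point is that under a bursty, correlated task stream the committee's failure rate per batch (the quantity that drives candidate-side query costs) becomes highly variable, so the instance-dependent savings stop being predictable even when the mean is unchanged.

```python
import random
import statistics

def failure_fraction(stream, covered_types, n=60):
    # fraction of sampled tasks the current committee fails; candidate
    # experts would only be queried on these tasks
    return sum(stream() not in covered_types for _ in range(n)) / n

rng = random.Random(7)
covered = {0, 1, 2}              # task types the committee already solves

iid = lambda: rng.randrange(6)   # i.i.d. uniform task stream

state = {"t": 0, "left": 0}
def bursty():
    # correlated stream: one task type repeats for 15 draws at a time
    if state["left"] == 0:
        state["t"], state["left"] = rng.randrange(6), 15
    state["left"] -= 1
    return state["t"]

iid_sd = statistics.stdev(failure_fraction(iid, covered) for _ in range(200))
burst_sd = statistics.stdev(failure_fraction(bursty, covered) for _ in range(200))
print(f"i.i.d. sd={iid_sd:.3f}  bursty sd={burst_sd:.3f}")
```

Both streams fail about half the time on average, but the bursty stream's per-batch failure fraction fluctuates far more, which is the signature the proposed benchmark experiment would look for.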
Original abstract
Organizations increasingly deploy multiple AI systems across task domains, but selecting a small, high-performing ensemble can require costly model calls, benchmark runs, and human evaluation. We study this selection problem as a distributional variant of multiwinner voting: tasks are drawn from an unknown domain distribution, each task induces feedback over candidate experts, and a committee's value on a task is determined by its best-performing member. We analyze both binary feedback, for tasks with correct/incorrect outcomes, and pairwise feedback, for tasks where candidate outputs are compared by preference. In the binary setting, the induced objective is coverage. We give exhaustive-elicitation baselines and matching worst-case query lower bounds, and we design a failure-conditioned greedy algorithm that preserves the standard $(1-1/e)$ guarantee while obtaining instance-dependent query savings. In the pairwise setting, we study $\theta$-winning committees. We show that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, and that the objective is monotone but not submodular. This motivates a weighted ordinal coverage relaxation, which is submodular and supports a failure-conditioned greedy oracle under pairwise feedback. We then convert this oracle back into $\theta$-type guarantees through finite-family auditing or a minimax wrapper. We also provide small-scale LLM experiments illustrating the predicted query savings and the role of complementarity in committee selection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames ensemble selection for AI systems as a distributional multiwinner voting problem where tasks are drawn i.i.d. from an unknown domain and a committee's value on a task equals the performance of its best member. For binary (correct/incorrect) feedback the induced objective is coverage; the authors give exhaustive-elicitation baselines with matching query lower bounds and a failure-conditioned greedy algorithm that retains the classic (1-1/e) guarantee while realizing instance-dependent query savings. For pairwise (preference) feedback they study θ-winning committees, prove that full-information optimization admits a PTAS but no EPTAS under Gap-ETH, establish monotonicity without submodularity, and introduce a weighted ordinal coverage relaxation that is submodular and admits a failure-conditioned greedy oracle; θ-type guarantees are recovered via finite-family auditing or a minimax wrapper. Small-scale LLM experiments illustrate the predicted query savings and the importance of complementarity.
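For readers checking the summary, one plausible formalization of the $\theta$-objective, consistent with the evaluation metric that appears in the paper's experiment section (where $\operatorname{rank}_i$ is the ranking induced by task $i$), is

$$ \theta(S) \;=\; \min_{x \notin S} \; \mathbb{E}_{i}\Big[\mathbf{1}\big\{\min_{c \in S} \operatorname{rank}_i(c) \le \operatorname{rank}_i(x)\big\}\Big], $$

i.e. a committee is $\theta$-winning when its best member matches or beats every outside candidate on at least a $\theta$ fraction of tasks; this is how the experiments score held-out committees.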
Significance. If the stated approximation guarantees and hardness results hold, the work is a useful bridge between submodular optimization, multiwinner voting, and practical AI deployment. The failure-conditioned greedy is a natural, query-efficient adaptation of the standard greedy analysis for coverage functions; the pairwise PTAS/no-EPTAS dichotomy and the submodular relaxation provide clean complexity insights. The explicit modeling choices (best-member value function, i.i.d. tasks) are stated up front and do not introduce internal inconsistency. The small-scale experiments, while limited, supply concrete evidence of query savings.
minor comments (3)
- [Abstract / Introduction] The abstract and introduction state that the failure-conditioned greedy “preserves the standard (1-1/e) guarantee”; a pointer to the precise theorem (presumably in the binary-feedback section) that carries out the marginal-gain analysis under the conditioning would help readers verify the claim without re-deriving it.
- [Pairwise-feedback section] The conversion from the weighted ordinal coverage relaxation back to θ-winning guarantees via “finite-family auditing or a minimax wrapper” is central to the pairwise contribution; a short paragraph or lemma clarifying the sample complexity or computational overhead of the auditing step would strengthen the presentation.
- [Experiments] The LLM experiments are described as “small-scale” and illustrative; adding a table or paragraph that reports the exact models, task distribution, number of queries saved, and how complementarity was quantified would improve reproducibility and allow readers to assess whether the observed savings match the instance-dependent analysis.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment of the modeling choices and technical contributions, and the recommendation for minor revision. The summary accurately captures the binary-feedback coverage results, the pairwise PTAS/no-EPTAS dichotomy, the submodular relaxation, and the failure-conditioned greedy approach. Since the report lists no specific major comments, we have no point-by-point rebuttals to provide at this stage.
Circularity Check
No significant circularity; derivation relies on standard external techniques
full rationale
The paper's algorithmic results (failure-conditioned greedy preserving 1-1/e for coverage, PTAS for θ-winning committees, monotone non-submodular objective, weighted ordinal coverage relaxation) are derived from classical submodular maximization and known voting-theory reductions. These steps use explicit modeling choices (i.i.d. tasks, best-member value) and standard marginal-gain analysis without reducing any prediction or guarantee to a fitted parameter, self-citation chain, or internal redefinition. No load-bearing self-citations or ansatz smuggling appear in the provided derivation chain.
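For reference, the standard marginal-gain chain the rationale appeals to: if $f$ is monotone submodular, $S_t$ is the greedy set after $t$ picks, and $S^*$ is an optimal size-$k$ set, then

$$ f(S^*) - f(S_t) \;\le\; \sum_{e \in S^* \setminus S_t} \big(f(S_t \cup \{e\}) - f(S_t)\big) \;\le\; k\,\big(f(S_{t+1}) - f(S_t)\big), $$

and unrolling this recursion gives $f(S_k) \ge \big(1 - (1 - 1/k)^k\big)\, f(S^*) \ge (1 - 1/e)\, f(S^*)$. For a coverage objective, a candidate's marginal gain equals its coverage of currently-uncovered tasks, which is exactly what failure-conditioned queries measure; restricting queries to failures therefore does not change the greedy argmax, which is why the bound can survive the query savings.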
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard assumptions of submodular optimization and greedy-algorithm analysis hold for the coverage objective.
- domain assumption: The objective functions are monotone in both the binary and pairwise settings.
Reference graph
Works this paper leans on
- [1] Amazon Web Services. Beyond the basics: A comprehensive foundation model selection framework for generative AI. https://aws.amazon.com/blogs/machine-learning/beyond-the-basics-a-comprehensive-foundation-model-selection-framework-for-generative-ai/
- [2] https://a16z.com/ai-enterprise-2025/, 2025.
- [3] Learning Unanimously Acceptable Lotteries via Queries. Extended version available as arXiv:2604.17505. Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.
- [4] Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. RouterBench: A benchmark for multi-LLM routing system. arXiv preprint arXiv:2403.12031.
- [5] Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Mosen Alharthi, Bang An, Juncai He, Ziche Liu, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, and Jinchao Xu. AceGPT, localizing large language models in Arabic. In Proceedings of the 2024 Conference of the North American Chapter..., 2024.
- [6] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714.
- [7] Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, and Doina Precup. Soft Condorcet optimization for ranking of general agents. In Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems, pages 1253–1262.
- [8] Feline Lindeboom, Martijn Brehm, Davide Grossi, and Pradeep Murukannaiah. Diverse committees with incomplete or inaccurate approval ballots. arXiv preprint arXiv:2506.10843.
- [9] Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, et al. A survey on large language model benchmarks. arXiv preprint arXiv:2508.15361.
- [10] Haoyu Song, Thành Nguyen, and Young-San Lin. A few good choices. In Proceedings of the 2026 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 4861–4874, 2026.
- [11] Itay Asher Zimet, Shiri Alouf-Heffetz, and Nimrod Talmon. Query-based committee selection. arXiv preprint arXiv:2603.29729.