pith. machine review for the scientific record.

arxiv: 2601.02535 · v2 · submitted 2026-01-05 · 💻 cs.CL · cs.AI

Recognition: no theorem link

ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 17:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords ModeX · best-of-N · spectral clustering · evaluator-free · semantic consensus · LLM generation · open-ended tasks

The pith

ModeX selects the representative output from LLM generations using spectral clustering on a similarity graph without any evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Selecting a single good response from multiple stochastic generations is hard for language models on open-ended tasks such as summarization, where no canonical answer exists. ModeX links the generated texts into a graph weighted by semantic similarity, then recursively clusters the graph to find the most central text of the dominant group. That central text is treated as the modal, most agreed-upon answer. ModeX and its lighter variant outperform single-generation and simple voting baselines on several standard tasks.

Core claim

ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. This generalizes majority voting to open-ended text by identifying the dominant semantic consensus among generated texts.

What carries the argument

Similarity graph constructed from candidate generations, processed via recursive spectral clustering to extract the modal centroid.
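As a concrete sketch of that machinery: build a cosine-similarity graph over embedded candidates, split it with the sign of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue), keep the larger side, and return the member closest to the surviving cluster's mean. The function names, the fixed recursion depth standing in for the paper's ϕ ≤ τ stopping rule, and the toy data are illustrative assumptions, not the released implementation.

```python
import numpy as np

def fiedler_split(A):
    """Split a graph in two using the sign of the Fiedler vector, i.e. the
    eigenvector of the unnormalized Laplacian L = D - A with the
    second-smallest eigenvalue."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, 1] >= 0               # boolean mask: one side of the cut

def modal_centroid(embeddings, depth=2):
    """Recursively keep the larger spectral cluster, then return the index
    of the candidate closest to the surviving cluster's mean."""
    idx = np.arange(len(embeddings))
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for _ in range(depth):
        A = np.clip(X @ X.T, 0.0, None)  # cosine similarities, negatives dropped
        np.fill_diagonal(A, 0.0)
        mask = fiedler_split(A)
        keep = mask if mask.sum() >= (~mask).sum() else ~mask
        if keep.all():
            break                        # degenerate cut: stop recursing
        X, idx = X[keep], idx[keep]
    center = X.mean(axis=0)
    return int(idx[np.argmax(X @ center)])

# Toy example: four near-paraphrases vs. two outliers in embedding space.
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal([1.0, 0.0, 0.0], 0.05, size=(4, 3)),  # dominant "semantic mode"
    rng.normal([0.0, 1.0, 0.0], 0.05, size=(2, 3)),  # minority cluster
])
print(modal_centroid(emb))  # an index in 0..3, i.e. from the dominant group
```

The first split separates the two loosely connected groups (the Fiedler vector takes opposite signs on them); the second split refines within the majority group, so the returned index always comes from the dominant cluster.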

If this is right

  • ModeX works without needing extra inference steps or auxiliary models for evaluation.
  • It outperforms single-path and multi-path baselines on text summarization, code generation, and mathematical reasoning.
  • ModeX-Lite adds early pruning to improve efficiency while keeping the performance benefits.
  • The approach extends majority voting ideas to cases where exact string matches do not apply.
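The last bullet is the motivating gap: majority voting à la self-consistency needs exact string agreement to count votes. A minimal illustration (not from the paper) of why that fails on open-ended text:

```python
from collections import Counter

def exact_match_vote(answers):
    """Classic self-consistency: the most frequent exact string wins."""
    return Counter(answers).most_common(1)[0][0]

# Works when answers are short and canonical:
print(exact_match_vote(["42", "42", "41"]))  # → 42

# Collapses on open-ended text: near-paraphrases never match exactly,
# so every candidate gets one "vote" and the winner is arbitrary.
summaries = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "The cat is on the mat.",
]
print(exact_match_vote(summaries))
```

ModeX's similarity graph replaces the exact-match counting step, so near-paraphrases can pool their "votes" into one semantic mode.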

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the similarity measure fails to capture semantics accurately, the selected centroid may not represent true consensus.
  • This selection principle could be tested on other generation domains such as dialogue responses or creative writing.
  • Combining the graph construction with different clustering algorithms might yield further improvements in selection accuracy.

Load-bearing premise

The output identified as the modal semantic consensus via spectral clustering on the similarity graph will be the highest-quality generation when no ground-truth answer or external evaluator exists.

What would settle it

Human evaluation where the ModeX centroid is rated worse than randomly chosen generations or other baselines on open-ended tasks would falsify the claim that it selects the best output.

Figures

Figures reproduced from arXiv: 2601.02535 by Hyeong Kyu Choi, Sharon Li.

Figure 1
Figure 1. Single Path Generation vs. Mode Extraction (ModeX). While single-path text generation commits to a single trajectory, ModeX leverages the structural information across multiple generation paths to select a “modal” output. Rather than relying on an external evaluator, ModeX operates directly within the set of generated texts to identify a representative, high-quality solution. … view at source ↗
Figure 2
Figure 2. Overview of the ModeX framework. In standard ModeX, (1) adjacency matrix construction and (2) spectral graph clustering are iterated recursively as long as ϕ ≤ τ. Then (3) centroid selection is performed. In the ModeX–Lite variant, (1) → (2) is performed only once, without recursion, for each pruning interval. … view at source ↗
Figure 3
Figure 3. Qualitative Examination. In the text summarization task, “rejected” samples often miss keywords, include incorrect or less precise information, and contain repetitive and verbose text, whereas samples “chosen” by our method are overall concise. … view at source ↗
Figure 4
Figure 4. Math reasoning accuracy at various stages of text generation. Our mode selection approach consistently identifies high-quality samples early in the trajectory, maintaining high accuracy even with partial outputs. … view at source ↗
Figure 5
Figure 5. Sensitivity analysis. ModeX–Lite shows performance consistently above the single-path baseline in all settings. … view at source ↗
read the original abstract

Selecting a single high-quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open-ended tasks where no canonical answer exists. While Best-of-N and self-consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string-match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX-Lite, an improved version of ModeX with early pruning for efficiency. Across open-ended tasks -- including text summarization, code generation, and mathematical reasoning -- our approaches consistently outperform standard single- and multi-path baselines, providing a computationally efficient solution for robust open-ended text generation. Code is released in https://github.com/deeplearning-wisc/ModeX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ModeX, an evaluator-free Best-of-N selection method for open-ended LLM generation. It builds a similarity graph over multiple candidate outputs and recursively applies spectral clustering to identify the modal semantic consensus, selecting the representative centroid as the final output. A lighter variant, ModeX-Lite, adds early pruning for efficiency. The approach is evaluated on text summarization, code generation, and mathematical reasoning tasks, where it reportedly outperforms single-generation and standard multi-path baselines without requiring external evaluators, reward models, or exact-match voting.

Significance. If the central results hold under rigorous validation, ModeX would provide a practical, low-overhead way to aggregate stochastic generations in open-ended settings by generalizing majority voting to semantic similarity spaces. The code release supports reproducibility. However, the significance is tempered by reliance on proxy metrics that inherently favor consensus, leaving the no-ground-truth quality claim under-tested.

major comments (3)
  1. [§3] §3 (Method): The core assumption that the centroid of the largest spectral cluster corresponds to highest quality (rather than merely the most frequent semantic variant) is load-bearing for the evaluator-free claim but receives no theoretical justification or counterexample analysis. In open-ended regimes where multiple distinct high-quality outputs coexist, surface similarity may cluster safe but mediocre phrasings; this is not addressed.
  2. [§4] §4 (Experiments): Performance is reported on proxy metrics (ROUGE, pass@k, exact match) that reward consensus by construction. These do not test the no-ground-truth regime advertised in the abstract; a human preference study or quality annotation on truly open-ended outputs (e.g., creative summarization) is required to substantiate the central claim.
  3. [§3.1] §3.1 (Similarity graph construction): The method is described as parameter-free, yet clustering hyperparameters (number of clusters, recursion depth, similarity threshold) are listed as free parameters. The paper must clarify whether these are fixed across tasks or tuned, as this directly affects the reproducibility and generality of the reported gains.
minor comments (2)
  1. [Abstract] Abstract and §1: The phrasing 'without requiring additional inference or auxiliary models' is slightly overstated if the similarity graph relies on embeddings from a separate encoder; clarify the exact embedding source.
  2. [§4] §4: Include error bars or multiple random seeds for the clustering step, as spectral clustering can be sensitive to initialization.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The core assumption that the centroid of the largest spectral cluster corresponds to highest quality (rather than merely the most frequent semantic variant) is load-bearing for the evaluator-free claim but receives no theoretical justification or counterexample analysis. In open-ended regimes where multiple distinct high-quality outputs coexist, surface similarity may cluster safe but mediocre phrasings; this is not addressed.

    Authors: We agree that the manuscript provides no formal theoretical justification for equating the largest cluster centroid with highest quality, relying instead on the empirical generalization of majority voting to semantic similarity. In cases with multiple distinct high-quality outputs, the method could indeed favor a frequent but mediocre variant. In the revision we will add a dedicated limitations subsection in §3 discussing this assumption, potential counterexamples, and the conditions under which it is expected to hold, supported by a brief synthetic-data illustration. revision: partial

  2. Referee: [§4] §4 (Experiments): Performance is reported on proxy metrics (ROUGE, pass@k, exact match) that reward consensus by construction. These do not test the no-ground-truth regime advertised in the abstract; a human preference study or quality annotation on truly open-ended outputs (e.g., creative summarization) is required to substantiate the central claim.

    Authors: We acknowledge that ROUGE, pass@k, and exact-match metrics inherently favor consensus and therefore do not fully substantiate quality claims in a true no-ground-truth setting. While these proxies are standard for the evaluated tasks, a human preference study on open-ended outputs would indeed provide stronger evidence. Such a study lies outside the scope and resources of the current revision; we will instead expand the limitations and discussion sections to explicitly note the reliance on proxy metrics and flag human evaluation as important future work. revision: no

  3. Referee: [§3.1] §3.1 (Similarity graph construction): The method is described as parameter-free, yet clustering hyperparameters (number of clusters, recursion depth, similarity threshold) are listed as free parameters. The paper must clarify whether these are fixed across tasks or tuned, as this directly affects the reproducibility and generality of the reported gains.

    Authors: We thank the referee for identifying this inconsistency. The method uses fixed default hyperparameters across all tasks and experiments: cosine similarity threshold of 0.75 on sentence embeddings, recursion depth capped at 2, and initial spectral clustering into 3 clusters. These values were selected once on a small validation split and held constant; no per-task tuning was performed. In the revised manuscript we will update §3.1 to state these fixed values explicitly, justify their choice, and confirm that the released code uses precisely these settings. revision: yes
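Pinning those defaults in one place would look something like the following; the names are illustrative and the values are the simulated rebuttal's claims, not verified against the released repository:

```python
# Fixed, task-independent defaults per the simulated rebuttal (illustrative names).
MODEX_DEFAULTS = {
    "similarity_threshold": 0.75,  # cosine threshold on sentence embeddings
    "max_recursion_depth": 2,      # cap on recursive spectral clustering
    "n_clusters": 3,               # initial spectral partition
}

def uses_fixed_defaults(cfg):
    """True iff a run's config matches the fixed defaults exactly."""
    return all(cfg.get(k) == v for k, v in MODEX_DEFAULTS.items())

print(uses_fixed_defaults(MODEX_DEFAULTS))                        # → True
print(uses_fixed_defaults({**MODEX_DEFAULTS, "n_clusters": 5}))   # → False
```

A check like this, run in CI against the released configs, would make the "no per-task tuning" claim mechanically verifiable.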

standing simulated objections not resolved
  • Conducting a new human preference study on truly open-ended outputs to validate the no-ground-truth quality claim.

Circularity Check

0 steps flagged

No significant circularity; selection procedure defined independently of reported outcomes

full rationale

The paper defines ModeX directly as a similarity-graph construction followed by recursive spectral clustering to extract a centroid representing semantic consensus; this algorithmic specification stands on its own without reference to any quality metric, ground-truth label, or fitted parameter that would later be called a prediction. Performance claims are obtained by running the fixed procedure on standard benchmarks (ROUGE, pass@k, exact match) and comparing against baselines, which constitutes an external empirical test rather than a reduction to the input data by construction. No self-citation chain, ansatz smuggling, or renaming of known results is required for the core derivation, and the method remains falsifiable on any new open-ended task.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only view limits visibility; main unstated assumptions concern the reliability of pairwise similarity for capturing semantic consensus and the validity of clustering as a quality proxy.

free parameters (1)
  • clustering hyperparameters
    Spectral clustering typically requires choices such as number of clusters or similarity threshold that may be tuned to data.
axioms (1)
  • domain assumption: Pairwise similarity between generated texts accurately reflects semantic equivalence without external models
    Required to construct the similarity graph that drives the selection.

pith-pipeline@v0.9.0 · 5494 in / 1174 out tokens · 51491 ms · 2026-05-16T17:21:37.624085+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  3. [3]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  4. [4]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

  5. [5]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  6. [6]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

  7. [7]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  8. [8]

    Slim-sc: Thought pruning for efficient scaling with self-consistency

    Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, and Dmitrii Ustiugov. Slim-sc: Thought pruning for efficient scaling with self-consistency. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34488–34505, 2025

  9. [9]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

  10. [10]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

  12. [12]

    Algebraic connectivity of graphs

    Miroslav Fiedler. Algebraic connectivity of graphs.Czechoslovak mathematical journal, 23(2):298–305, 1973

  13. [13]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models.Transactions on Machine Learning Research, 2024, 2024

  14. [14]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InForty-first International Conference on Machine Learning, 2023

  15. [15]

    Voting or consensus? Decision-making in multi-agent debate

    Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. Voting or consensus? decision-making in multi-agent debate.arXiv e-prints, pages arXiv–2502, 2025

  16. [16]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    Hyeong Kyu Choi, Xiaojin Zhu, and Sharon Li. Debate or vote: Which yields better decisions in multi-agent large language models? In Advances in Neural Information Processing Systems, 2025

  17. [17]

    Improved bounds for mixing rates of markov chains and multicommodity flow

    Alistair Sinclair. Improved bounds for mixing rates of markov chains and multicommodity flow. Combinatorics, probability and Computing, 1(4):351–370, 1992

  18. [18]

    Teaching machines to read and comprehend

    Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693–1701

  19. [19]

    Get to the point: Summarization with pointer-generator networks

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics

  20. [20]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  21. [21]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  22. [22]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  23. [23]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  24. [24]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  25. [25]

    Scalable best-of-N selection for large language models via self-certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty.Advances in neural information processing systems, 2025

  26. [26]

    Normalized cuts and image segmentation

    Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation.IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000

  27. [27]

    Deal: Decoding-time alignment for large language models

    James Y Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. Deal: Decoding-time alignment for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26280–26300, 2025

  28. [28]

    Reward-guided tree search for inference time alignment of large language models

    Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Reward-guided tree search for inference time alignment of large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12575–12593, 2025

  29. [29]

    Args: Alignment as reward-guided search

    Maxim Khanov, Jirayu Burapacheep, and Yixuan Li. Args: Alignment as reward-guided search. In The Twelfth International Conference on Learning Representations, 2024

  30. [30]

    Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model

    Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11781–11791, 2023

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  32. [32]

    Adaptive inference-time compute: LLMs can predict if they can do better, even mid-generation

    Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.arXiv preprint arXiv:2410.02725, 2024

  33. [33]

    Dola: Decoding by contrasting layers improves factuality in large language models

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InThe Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Collab: Controlled decoding using mixture of agents for llm alignment

    Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, and Sumitra Ganesh. Collab: Controlled decoding using mixture of agents for llm alignment. InThe Thirteenth International Conference on Learning Representations, 2025

  35. [35]

    Fast best-of-N decoding via speculative rejection

    Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast best-of-n decoding via speculative rejection.Advances in Neural Information Processing Systems, 37:32630–32652, 2024

  36. [36]

    Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Tianxiang Sun, Cheng Chang, Qinyuan Cheng, Ding Wang, Xiaofeng Mou, et al. Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and ...

  37. [37]

    Do we truly need so many samples? Multi-LLM repeated sampling efficiently scales test-time compute

    Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, et al. Do we truly need so many samples? multi-llm repeated sampling efficiently scales test-time compute.arXiv preprint arXiv:2504.00762, 2025

  38. [38]

    Clue: Non-parametric verification from experience via hidden-state clustering

    Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, and Dong Yu. Clue: Non-parametric verification from experience via hidden-state clustering.arXiv preprint arXiv:2510.01591, 2025

  39. [39]

    Mixture-of-agents enhances large language model capabilities

    Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025

  40. [40]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  41. [41]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  42. [42]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  43. [43]

    Skywork-Reward-V2: Scaling preference data curation via human-AI synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

  44. [44]

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025

    Overall quality Your response should be ONLY the number (1, 2, 3, etc.) corresponding to the best response. For example, if you think Response 2 is the best, respond with just “2". C Why the Second Eigenvector of the Laplacian Acts as a Clusterer? Let G = ( V, E)be a graph with adjacency matrix A and degree matrix D, and define the unnormalized Laplacian ...