pith. machine review for the scientific record.

arxiv: 2601.02535 · v2 · submitted 2026-01-05 · 💻 cs.CL · cs.AI

Recognition: no theorem link

ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 17:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords ModeX · best-of-N · spectral clustering · evaluator-free · semantic consensus · LLM generation · open-ended tasks

The pith

ModeX selects the representative output from LLM generations using spectral clustering on a similarity graph without any evaluators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Selecting a single good response from multiple stochastic generations is hard for language models on open-ended tasks such as summarization, where no canonical answer exists. ModeX links the generated texts into a graph weighted by semantic similarity, then recursively clusters the graph to find the most central text of the dominant group. That central text is treated as the modal, most agreed-upon answer. ModeX and its lighter variant outperform single-generation and simple voting baselines on several standard tasks.

Core claim

ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. This generalizes majority voting to open-ended text by identifying the dominant semantic consensus among generated texts.

What carries the argument

Similarity graph constructed from candidate generations, processed via recursive spectral clustering to extract the modal centroid.
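As a concrete sketch of that machinery: build a cosine-similarity graph over embedded candidates, split it with the sign of the Fiedler vector (the eigenvector of the graph Laplacian with the second-smallest eigenvalue), keep the larger side, and return the member closest to the surviving cluster's mean. The function names, the fixed recursion depth standing in for the paper's ϕ ≤ τ stopping rule, and the toy data are illustrative assumptions, not the released implementation.

```python
import numpy as np

def fiedler_split(A):
    """Split a graph in two using the sign of the Fiedler vector, i.e. the
    eigenvector of the unnormalized Laplacian L = D - A with the
    second-smallest eigenvalue."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return vecs[:, 1] >= 0               # boolean mask: one side of the cut

def modal_centroid(embeddings, depth=2):
    """Recursively keep the larger spectral cluster, then return the index
    of the candidate closest to the surviving cluster's mean."""
    idx = np.arange(len(embeddings))
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for _ in range(depth):
        A = np.clip(X @ X.T, 0.0, None)  # cosine similarities, negatives dropped
        np.fill_diagonal(A, 0.0)
        mask = fiedler_split(A)
        keep = mask if mask.sum() >= (~mask).sum() else ~mask
        if keep.all():
            break                        # degenerate cut: stop recursing
        X, idx = X[keep], idx[keep]
    center = X.mean(axis=0)
    return int(idx[np.argmax(X @ center)])

# Toy example: four near-paraphrases vs. two outliers in embedding space.
rng = np.random.default_rng(0)
emb = np.vstack([
    rng.normal([1.0, 0.0, 0.0], 0.05, size=(4, 3)),  # dominant "semantic mode"
    rng.normal([0.0, 1.0, 0.0], 0.05, size=(2, 3)),  # minority cluster
])
print(modal_centroid(emb))  # an index in 0..3, i.e. from the dominant group
```

The first split separates the two loosely connected groups (the Fiedler vector takes opposite signs on them); the second split refines within the majority group, so the returned index always comes from the dominant cluster.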

If this is right

  • ModeX works without needing extra inference steps or auxiliary models for evaluation.
  • It outperforms single-path and multi-path baselines on text summarization, code generation, and mathematical reasoning.
  • ModeX-Lite adds early pruning to improve efficiency while keeping the performance benefits.
  • The approach extends majority voting ideas to cases where exact string matches do not apply.
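The last bullet is the motivating gap: majority voting à la self-consistency needs exact string agreement to count votes. A minimal illustration (not from the paper) of why that fails on open-ended text:

```python
from collections import Counter

def exact_match_vote(answers):
    """Classic self-consistency: the most frequent exact string wins."""
    return Counter(answers).most_common(1)[0][0]

# Works when answers are short and canonical:
print(exact_match_vote(["42", "42", "41"]))  # → 42

# Collapses on open-ended text: near-paraphrases never match exactly,
# so every candidate gets one "vote" and the winner is arbitrary.
summaries = [
    "The cat sat on the mat.",
    "A cat was sitting on the mat.",
    "The cat is on the mat.",
]
print(exact_match_vote(summaries))
```

ModeX's similarity graph replaces the exact-match counting step, so near-paraphrases can pool their "votes" into one semantic mode.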

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the similarity measure fails to capture semantics accurately, the selected centroid may not represent true consensus.
  • This selection principle could be tested on other generation domains such as dialogue responses or creative writing.
  • Combining the graph construction with different clustering algorithms might yield further improvements in selection accuracy.

Load-bearing premise

The output identified as the modal semantic consensus via spectral clustering on the similarity graph will be the highest-quality generation when no ground-truth answer or external evaluator exists.

What would settle it

Human evaluation where the ModeX centroid is rated worse than randomly chosen generations or other baselines on open-ended tasks would falsify the claim that it selects the best output.

Figures

Figures reproduced from arXiv: 2601.02535 by Hyeong Kyu Choi, Sharon Li.

Figure 1
Figure 1. Single Path Generation vs. Mode Extraction (ModeX). While single-path text generation commits to a single trajectory, ModeX leverages the structural information across multiple generation paths to select a “modal” output. Rather than relying on an external evaluator, ModeX operates directly within the set of generated texts to identify a representative, high-quality solution. … view at source ↗
Figure 2
Figure 2. Overview of the ModeX framework. In standard ModeX, (1) adjacency matrix construction and (2) spectral graph clustering are iterated recursively as long as ϕ ≤ τ. Then (3) centroid selection is performed. In the ModeX–Lite variant, (1) → (2) is performed only once, without recursion, for each pruning interval. … view at source ↗
Figure 3
Figure 3. Qualitative Examination. In the text summarization task, “rejected” samples often miss keywords, include incorrect or less precise information, and contain repetitive and verbose text, whereas samples “chosen” by our method are overall concise. … view at source ↗
Figure 4
Figure 4. Math reasoning accuracy at various stages of text generation. Our mode selection approach consistently identifies high-quality samples early in the trajectory, maintaining high accuracy even with partial outputs. … view at source ↗
Figure 5
Figure 5. Sensitivity analysis. ModeX–Lite shows performance consistently above the single-path baseline in all settings. … view at source ↗
read the original abstract

Selecting a single high-quality output from multiple stochastic generations remains a fundamental challenge for large language models (LLMs), particularly in open-ended tasks where no canonical answer exists. While Best-of-N and self-consistency methods show that aggregating multiple generations can improve performance, existing approaches typically rely on external evaluators, reward models, or exact string-match voting, limiting their applicability and efficiency. We propose Mode Extraction (ModeX), an evaluator-free Best-of-N selection framework that generalizes majority voting to open-ended text generation by identifying the modal output representing the dominant semantic consensus among generated texts. ModeX constructs a similarity graph over candidate generations and recursively applies spectral clustering to select a representative centroid, without requiring additional inference or auxiliary models. We further instantiate this selection principle as ModeX-Lite, an improved version of ModeX with early pruning for efficiency. Across open-ended tasks -- including text summarization, code generation, and mathematical reasoning -- our approaches consistently outperform standard single- and multi-path baselines, providing a computationally efficient solution for robust open-ended text generation. Code is released in https://github.com/deeplearning-wisc/ModeX.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ModeX, an evaluator-free Best-of-N selection method for open-ended LLM generation. It builds a similarity graph over multiple candidate outputs and recursively applies spectral clustering to identify the modal semantic consensus, selecting the representative centroid as the final output. A lighter variant, ModeX-Lite, adds early pruning for efficiency. The approach is evaluated on text summarization, code generation, and mathematical reasoning tasks, where it reportedly outperforms single-generation and standard multi-path baselines without requiring external evaluators, reward models, or exact-match voting.

Significance. If the central results hold under rigorous validation, ModeX would provide a practical, low-overhead way to aggregate stochastic generations in open-ended settings by generalizing majority voting to semantic similarity spaces. The code release supports reproducibility. However, the significance is tempered by reliance on proxy metrics that inherently favor consensus, leaving the no-ground-truth quality claim under-tested.

major comments (3)
  1. [§3] §3 (Method): The core assumption that the centroid of the largest spectral cluster corresponds to highest quality (rather than merely the most frequent semantic variant) is load-bearing for the evaluator-free claim but receives no theoretical justification or counterexample analysis. In open-ended regimes where multiple distinct high-quality outputs coexist, surface similarity may cluster safe but mediocre phrasings; this is not addressed.
  2. [§4] §4 (Experiments): Performance is reported on proxy metrics (ROUGE, pass@k, exact match) that reward consensus by construction. These do not test the no-ground-truth regime advertised in the abstract; a human preference study or quality annotation on truly open-ended outputs (e.g., creative summarization) is required to substantiate the central claim.
  3. [§3.1] §3.1 (Similarity graph construction): The method is described as parameter-free, yet clustering hyperparameters (number of clusters, recursion depth, similarity threshold) are listed as free parameters. The paper must clarify whether these are fixed across tasks or tuned, as this directly affects the reproducibility and generality of the reported gains.
minor comments (2)
  1. [Abstract] Abstract and §1: The phrasing 'without requiring additional inference or auxiliary models' is slightly overstated if the similarity graph relies on embeddings from a separate encoder; clarify the exact embedding source.
  2. [§4] §4: Include error bars or multiple random seeds for the clustering step, as spectral clustering can be sensitive to initialization.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [§3] §3 (Method): The core assumption that the centroid of the largest spectral cluster corresponds to highest quality (rather than merely the most frequent semantic variant) is load-bearing for the evaluator-free claim but receives no theoretical justification or counterexample analysis. In open-ended regimes where multiple distinct high-quality outputs coexist, surface similarity may cluster safe but mediocre phrasings; this is not addressed.

    Authors: We agree that the manuscript provides no formal theoretical justification for equating the largest cluster centroid with highest quality, relying instead on the empirical generalization of majority voting to semantic similarity. In cases with multiple distinct high-quality outputs, the method could indeed favor a frequent but mediocre variant. In the revision we will add a dedicated limitations subsection in §3 discussing this assumption, potential counterexamples, and the conditions under which it is expected to hold, supported by a brief synthetic-data illustration. revision: partial

  2. Referee: [§4] §4 (Experiments): Performance is reported on proxy metrics (ROUGE, pass@k, exact match) that reward consensus by construction. These do not test the no-ground-truth regime advertised in the abstract; a human preference study or quality annotation on truly open-ended outputs (e.g., creative summarization) is required to substantiate the central claim.

    Authors: We acknowledge that ROUGE, pass@k, and exact-match metrics inherently favor consensus and therefore do not fully substantiate quality claims in a true no-ground-truth setting. While these proxies are standard for the evaluated tasks, a human preference study on open-ended outputs would indeed provide stronger evidence. Such a study lies outside the scope and resources of the current revision; we will instead expand the limitations and discussion sections to explicitly note the reliance on proxy metrics and flag human evaluation as important future work. revision: no

  3. Referee: [§3.1] §3.1 (Similarity graph construction): The method is described as parameter-free, yet clustering hyperparameters (number of clusters, recursion depth, similarity threshold) are listed as free parameters. The paper must clarify whether these are fixed across tasks or tuned, as this directly affects the reproducibility and generality of the reported gains.

    Authors: We thank the referee for identifying this inconsistency. The method uses fixed default hyperparameters across all tasks and experiments: cosine similarity threshold of 0.75 on sentence embeddings, recursion depth capped at 2, and initial spectral clustering into 3 clusters. These values were selected once on a small validation split and held constant; no per-task tuning was performed. In the revised manuscript we will update §3.1 to state these fixed values explicitly, justify their choice, and confirm that the released code uses precisely these settings. revision: yes
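Pinning those defaults in one place would look something like the following; the names are illustrative and the values are the simulated rebuttal's claims, not verified against the released repository:

```python
# Fixed, task-independent defaults per the simulated rebuttal (illustrative names).
MODEX_DEFAULTS = {
    "similarity_threshold": 0.75,  # cosine threshold on sentence embeddings
    "max_recursion_depth": 2,      # cap on recursive spectral clustering
    "n_clusters": 3,               # initial spectral partition
}

def uses_fixed_defaults(cfg):
    """True iff a run's config matches the fixed defaults exactly."""
    return all(cfg.get(k) == v for k, v in MODEX_DEFAULTS.items())

print(uses_fixed_defaults(MODEX_DEFAULTS))                        # → True
print(uses_fixed_defaults({**MODEX_DEFAULTS, "n_clusters": 5}))   # → False
```

A check like this, run in CI against the released configs, would make the "no per-task tuning" claim mechanically verifiable.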

standing simulated objections not resolved
  • Conducting a new human preference study on truly open-ended outputs to validate the no-ground-truth quality claim.

Circularity Check

0 steps flagged

No significant circularity; selection procedure defined independently of reported outcomes

full rationale

The paper defines ModeX directly as a similarity-graph construction followed by recursive spectral clustering to extract a centroid representing semantic consensus; this algorithmic specification stands on its own without reference to any quality metric, ground-truth label, or fitted parameter that would later be called a prediction. Performance claims are obtained by running the fixed procedure on standard benchmarks (ROUGE, pass@k, exact match) and comparing against baselines, which constitutes an external empirical test rather than a reduction to the input data by construction. No self-citation chain, ansatz smuggling, or renaming of known results is required for the core derivation, and the method remains falsifiable on any new open-ended task.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract-only view limits visibility; main unstated assumptions concern the reliability of pairwise similarity for capturing semantic consensus and the validity of clustering as a quality proxy.

free parameters (1)
  • clustering hyperparameters
    Spectral clustering typically requires choices such as number of clusters or similarity threshold that may be tuned to data.
axioms (1)
  • domain assumption: Pairwise similarity between generated texts accurately reflects semantic equivalence without external models
    Required to construct the similarity graph that drives the selection.

pith-pipeline@v0.9.0 · 5494 in / 1174 out tokens · 51491 ms · 2026-05-16T17:21:37.624085+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 12 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  3. [3]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  4. [4]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2024

  5. [5]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024

  6. [6]

    Self-consistency improves chain of thought reasoning in language models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InThe Eleventh International Conference on Learning Representations, 2023

  7. [7]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

  8. [8]

    Slim-sc: Thought pruning for efficient scaling with self-consistency

    Colin Hong, Xu Guo, Anand Chaanan Singh, Esha Choukse, and Dmitrii Ustiugov. Slim-sc: Thought pruning for efficient scaling with self-consistency. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34488–34505, 2025

  9. [9]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314, 2024

  10. [10]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  11. [11]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023

  12. [12]

    Algebraic connectivity of graphs

    Miroslav Fiedler. Algebraic connectivity of graphs.Czechoslovak mathematical journal, 23(2):298–305, 1973

  13. [13]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Generating with confidence: Uncertainty quantification for black-box large language models.Transactions on Machine Learning Research, 2024, 2024

  14. [14]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. InForty-first International Conference on Machine Learning, 2023

  15. [15]

    Voting or consensus? Decision-making in multi-agent debate

    Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, and Bela Gipp. Voting or consensus? decision-making in multi-agent debate.arXiv e-prints, pages arXiv–2502, 2025

  16. [16]

    Debate or vote: Which yields better decisions in multi-agent large language models?

    Hyeong Kyu Choi, Xiaojin Zhu, and Sharon Li. Debate or vote: Which yields better decisions in multi-agent large language models? In Advances in Neural Information Processing Systems, 2025

  17. [17]

    Improved bounds for mixing rates of markov chains and multicommodity flow

    Alistair Sinclair. Improved bounds for mixing rates of markov chains and multicommodity flow. Combinatorics, probability and Computing, 1(4):351–370, 1992

  18. [18]

    Teaching machines to read and comprehend

    Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In NIPS, pages 1693–1701

  19. [19]

    Get to the point: Summarization with pointer-generator networks

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics

  20. [20]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

  21. [21]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  22. [22]

    Let's Verify Step by Step

    Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

  23. [23]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems, 36:46534–46594, 2023

  24. [24]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023

  25. [25]

    Scalable best-of-N selection for large language models via self-certainty

    Zhewei Kang, Xuandong Zhao, and Dawn Song. Scalable best-of-n selection for large language models via self-certainty.Advances in neural information processing systems, 2025

  26. [26]

    Normalized cuts and image segmentation

    Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation.IEEE Transactions on pattern analysis and machine intelligence, 22(8):888–905, 2000

  27. [27]

    Deal: Decoding-time alignment for large language models

    James Y Huang, Sailik Sengupta, Daniele Bonadiman, Yi-an Lai, Arshit Gupta, Nikolaos Pappas, Saab Mansour, Katrin Kirchhoff, and Dan Roth. Deal: Decoding-time alignment for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26280–26300, 2025

  28. [28]

    Reward-guided tree search for inference time alignment of large language models

    Chia-Yu Hung, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Reward-guided tree search for inference time alignment of large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 12575–12593, 2025

  29. [29]

    Args: Alignment as reward-guided search

    Maxim Khanov, Jirayu Burapacheep, and Yixuan Li. Args: Alignment as reward-guided search. In The Twelfth International Conference on Learning Representations, 2024

  30. [30]

    Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model

    Haikang Deng and Colin Raffel. Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11781–11791, 2023

  31. [31]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  32. [32]

    Adaptive inference-time compute: LLMs can predict if they can do better, even mid-generation

    Rohin Manvi, Anikait Singh, and Stefano Ermon. Adaptive inference-time compute: Llms can predict if they can do better, even mid-generation.arXiv preprint arXiv:2410.02725, 2024

  33. [33]

    Dola: Decoding by contrasting layers improves factuality in large language models

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R Glass, and Pengcheng He. Dola: Decoding by contrasting layers improves factuality in large language models. InThe Twelfth International Conference on Learning Representations, 2023

  34. [34]

    Collab: Controlled decoding using mixture of agents for llm alignment

    Souradip Chakraborty, Sujay Bhatt, Udari Madhushani Sehwag, Soumya Suvra Ghosal, Jiahao Qiu, Mengdi Wang, Dinesh Manocha, Furong Huang, Alec Koppel, and Sumitra Ganesh. Collab: Controlled decoding using mixture of agents for llm alignment. InThe Thirteenth International Conference on Learning Representations, 2025

  35. [35]

    Fast best-of-N decoding via speculative rejection

    Hanshi Sun, Momin Haider, Ruiqi Zhang, Huitao Yang, Jiahao Qiu, Ming Yin, Mengdi Wang, Peter Bartlett, and Andrea Zanette. Fast best-of-n decoding via speculative rejection.Advances in Neural Information Processing Systems, 37:32630–32652, 2024

  36. [36]

    Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models

    Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Xiaonan Li, Tianxiang Sun, Cheng Chang, Qinyuan Cheng, Ding Wang, Xiaofeng Mou, et al. Aggregation of reasoning: A hierarchical framework for enhancing answer selection in large language models. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and ...

  37. [37]

    Do we truly need so many samples? Multi-LLM repeated sampling efficiently scales test-time compute

    Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Hangfan Zhang, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, et al. Do we truly need so many samples? multi-llm repeated sampling efficiently scales test-time compute.arXiv preprint arXiv:2504.00762, 2025

  38. [38]

    Clue: Non-parametric verification from experience via hidden-state clustering

    Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, and Dong Yu. Clue: Non-parametric verification from experience via hidden-state clustering.arXiv preprint arXiv:2510.01591, 2025

  39. [39]

    Mixture-of-agents enhances large language model capabilities

    Junlin Wang, Jue Wang, Ben Athiwaratkun, Ce Zhang, and James Zou. Mixture-of-agents enhances large language model capabilities. In The Thirteenth International Conference on Learning Representations, 2025

  40. [40]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

  41. [41]

    Qwen2.5 Technical Report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  42. [42]

    Code Llama: Open Foundation Models for Code

    Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023

  43. [43]

    Skywork-Reward-V2: Scaling preference data curation via human-AI synergy

    Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, and Yahui Zhou. Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352, 2025

  44. [44]

    The Lessons of Developing Process Reward Models in Mathematical Reasoning

    Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. arXiv preprint arXiv:2501.07301, 2025

    Overall quality Your response should be ONLY the number (1, 2, 3, etc.) corresponding to the best response. For example, if you think Response 2 is the best, respond with just “2". C Why the Second Eigenvector of the Laplacian Acts as a Clusterer? Let G = ( V, E)be a graph with adjacency matrix A and degree matrix D, and define the unnormalized Laplacian ...