FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
arXiv preprint arXiv:2502.01839 , year=
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.
Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.
citing papers explorer
-
FUSE: Ensembling Verifiers with Zero Labeled Data
FUSE ensembles verifiers unsupervisedly by controlling their conditional dependencies to improve spectral ensembling algorithms, matching or exceeding semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.
-
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
-
Pseudo-Formalization for Automatic Proof Verification
Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.
-
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.
- When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models