arXiv preprint arXiv:2502.01839

Eric Zhao, Pranjal Awasthi, Sreenivas Gollapudi · 2025 · arXiv 2502.01839

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FUSE: Ensembling Verifiers with Zero Labeled Data

stat.ML · 2026-04-20 · unverdicted · novelty 6.0

FUSE ensembles verifiers without any labeled data by controlling their conditional dependencies, which improves spectral ensembling algorithms. It matches or exceeds semi-supervised baselines on benchmarks including GPQA Diamond and Humanity's Last Exam.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

cs.AI · 2025-06-07 · unverdicted · novelty 6.0

Large reasoning models (LRMs) exhibit complete accuracy collapse beyond certain puzzle complexities. Their reasoning effort rises and then declines as complexity grows, and they outperform standard LLMs only on medium-complexity tasks.

Pseudo-Formalization for Automatic Proof Verification

cs.LO · 2026-05-19 · unverdicted · novelty 5.0

Pseudo-Formalization decomposes natural language proofs into modular blocks for independent LLM verification via Block Verification, outperforming LLM-as-judge baselines on error detection in olympiad and research math benchmarks.

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

cs.CL · 2026-03-15 · unverdicted · novelty 5.0

Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

cs.CL · 2026-05-01 · unreviewed

citing papers explorer

FUSE: Ensembling Verifiers with Zero Labeled Data
stat.ML · 2026-04-20 · unverdicted · none · ref 24

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
cs.AI · 2025-06-07 · unverdicted · none · ref 21

Pseudo-Formalization for Automatic Proof Verification
cs.LO · 2026-05-19 · unverdicted · none · ref 37

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
cs.CL · 2026-03-15 · unverdicted · none · ref 41

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL · 2026-05-01 · unreviewed · ref 21
