Inverse scaling in test-time compute.arXiv preprint arXiv:2507.14417

Aryo Pradipta Gema, Alexander Hägele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, et al · 2025 · arXiv 2507.14417

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

representative citing papers

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations

cs.SE · 2026-02-05 · unverdicted · novelty 7.0

Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks, enabled by a new Evolve-CTF tool for generating equivalent challenge families.

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

cs.SE · 2025-12-16 · unverdicted · novelty 7.0

A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.

Stateful Reasoning via Insight Replay

cs.AI · 2026-05-14 · unverdicted · novelty 6.0 · 2 refs

InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.

Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

cs.AI · 2026-05-10 · unverdicted · novelty 5.0

Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.

citing papers explorer

Showing 5 of 5 citing papers.

Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations cs.SE · 2026-02-05 · unverdicted · none · ref 16
Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks, enabled by a new Evolve-CTF tool for generating equivalent challenge families.
Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings cs.SE · 2025-12-16 · unverdicted · none · ref 23
A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
Stateful Reasoning via Insight Replay cs.AI · 2026-05-14 · unverdicted · none · ref 7 · 2 links
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities cs.AI · 2026-05-10 · unverdicted · none · ref 4
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models cs.CL · 2026-04-07 · unverdicted · none · ref 7
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.

Inverse scaling in test-time compute.arXiv preprint arXiv:2507.14417

fields

years

verdicts

representative citing papers

citing papers explorer