Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks, enabled by a new Evolve-CTF tool for generating equivalent challenge families.
Inverse scaling in test-time compute.arXiv preprint arXiv:2507.14417
5 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 5representative citing papers
A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.
citing papers explorer
-
Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Agentic LLMs remain robust to renaming and insertion but degrade on composed transformations and deeper obfuscation in CTF tasks, enabled by a new Evolve-CTF tool for generating equivalent challenge families.
-
Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings
A new dataset and nine-metric majority-vote procedure show that existing code-reasoning benchmarks are dominated by lower-complexity problems that do not reflect real-world code.
-
Stateful Reasoning via Insight Replay
InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.
-
Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
-
Understanding Performance Gap Between Parallel and Sequential Sampling in Large Reasoning Models
Lack of exploration from conditioning on prior answers is the primary reason parallel sampling outperforms sequential sampling in large reasoning models.