hub Mixed citations

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks

Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al · 2025 · arXiv 2502.08235

Mixed citation behavior. Most common role is background (67%).

15 Pith papers citing it

Background 67% of classified citations

read on arXiv browse 15 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 other 1

citation-polarity summary

background 4 unclear 2

representative citing papers

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.

AcademiClaw: When Students Set Challenges for AI Agents

cs.AI · 2026-05-04 · unverdicted · novelty 7.0

AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.

Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

cs.SE · 2026-04-05 · accept · novelty 7.0

SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.

Dynamic Rollout Editing for Reducing Overthinking in RL-Trained Reasoning Models

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

Dynamic Rollout Editing reduces overthinking in RL-trained LLMs by editing post-answer continuations in successful rollouts and preferring the edited versions within GRPO groups.

Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Stopping large reasoning models at the first correct reasoning prefix improves accuracy up to 21% by avoiding harmful overthinking that destabilizes correct trajectories.

Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

cs.CR · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.

Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

cs.AI · 2026-05-04 · unverdicted · novelty 6.0 · 2 refs

FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.

Efficient Test-Time Scaling via Temporal Reasoning Aggregation

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.

Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.

Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs

cs.CL · 2026-04-11 · unverdicted · novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.

When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning

cs.CL · 2026-04-08 · unverdicted · novelty 5.0

DTSR enables large reasoning models to dynamically assess chain-of-thought sufficiency via reflection signals and a sufficiency check, reducing reasoning length by 28.9-34.9% with minimal performance loss on Qwen3 models.

Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation

cs.AI · 2025-09-08 · conditional · novelty 5.0

Introduces PAS and FAS task abstractions plus the LLM-S^3 benchmark to evaluate LLMs on generating sociodemographic survey responses across 11 real datasets and multiple models.

Self-Aligned Reward: Towards Effective and Efficient Reasoners

cs.LG · 2025-09-05 · unverdicted · novelty 5.0

Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

cs.AI · 2025-03-12 · unverdicted · novelty 5.0

The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

citing papers explorer

Showing 7 of 7 citing papers after filters.

TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints cs.AI · 2026-05-13 · unverdicted · none · ref 5
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
AcademiClaw: When Students Set Challenges for AI Agents cs.AI · 2026-05-04 · unverdicted · none · ref 3
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models cs.AI · 2026-06-01 · unverdicted · none · ref 6
Stopping large reasoning models at the first correct reasoning prefix improves accuracy up to 21% by avoiding harmful overthinking that destabilizes correct trajectories.
Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training cs.AI · 2026-05-08 · unverdicted · none · ref 19
ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval cs.AI · 2026-05-04 · unverdicted · none · ref 5 · 2 links
FitText embeds evolutionary retrieval of tool descriptions into the agent loop, yielding 2.7-10.6 point NDCG@5 gains on ToolRet and 26.7-point pass-rate gains on StableToolBench.
Efficient Test-Time Scaling via Temporal Reasoning Aggregation cs.AI · 2026-04-19 · unverdicted · none · ref 48
TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full reasoning.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 145
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer