{"total":13,"items":[{"citing_arxiv_id":"2605.13414","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints","primary_cat":"cs.AI","submitted_at":"2026-05-13T12:10:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"As Voracle and Vrandom approach each other (the denominator shrinks toward zero from above), ηM can take arbitrarily large negative values for any fixed VM < Vrandom. We therefore characterize ηM ∈ (−∞, 1]. D.5. Sensitivity check: regret as an alternative normalization. The asymmetric range of ηM motivates a second, bounded metric for robustness: the normalized regret ˜RM = (Voracle − VM) / Voracle, (5) defined whenever Voracle ≥ 1. By construction ˜RM ∈ [0, 1], with ˜RM = 0 at oracle and ˜RM = 1 when the planner captures none of the oracle's value. Unlike ηM, ˜RM has no denominator singularity, is continuous at α = 1, and admits the same interpretation across all α.
We report ˜RM alongside ηM and confirm that the conclusions are robust to the choice of"},{"citing_arxiv_id":"2605.13338","ref_index":5,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models","primary_cat":"cs.CR","submitted_at":"2026-05-13T10:57:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"This term captures the degree to which the model produces extended or unnecessarily long reasoning chains. Reflective Marker Score. Overthinking often manifests through explicit hesitation or self-correction markers. Let V be a predefined vocabulary of overthinking markers. The reflective marker score is defined as score2(x) = Σ_{w∈V} Count(w, R(x)), (5) where Count(w, R(x)) denotes the frequency of token w appearing in the generated reasoning trace. This term quantifies explicit signs of hesitation or re-evaluation (Ge et al., 2025). Both score1 and score2 are normalized within each generation using z-score normalization.
For a given score score_i of individual i, the normalized value ŝcore_i is defined as:"},{"citing_arxiv_id":"2605.07316","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training","primary_cat":"cs.AI","submitted_at":"2026-05-08T06:25:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and knowledge benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[17] Chenrui Fan, Ming Li, Lichao Sun, and Tianyi Zhou. Missing premise exacerbates overthinking: Are reasoning models losing critical thinking skill? arXiv preprint arXiv:2504.06514, 2025. [18] Renfei Dang, Zhening Li, Shujian Huang, and Jiajun Chen. The first impression problem: Internal bias triggers overthinking in reasoning models. arXiv preprint arXiv:2505.16448, 2025. [19] Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, et al. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks. arXiv preprint arXiv:2502.08235, 2025. [20] Daman Arora and Andrea Zanette.
Training language models to reason efficiently."},{"citing_arxiv_id":"2605.02661","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AcademiClaw: When Students Set Challenges for AI Agents","primary_cat":"cs.AI","submitted_at":"2026-05-04T14:40:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02411","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FitText: Evolving Agent Tool Ecologies via Memetic Retrieval","primary_cat":"cs.AI","submitted_at":"2026-05-04T10:01:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"LLM cost per lineage: n_t calls; retrievals: n_t + 1. IBI (Iterative Bootstrapped Improvement). IBI applies a single global refinement prompt across all descriptions simultaneously. It accumulates exemplar tool blurbs across iterations and optionally uses an LLM judge to check sufficiency. Let D^(0) be the initial description set. At each turn t: B^(t) = B^(t−1) ∪ ⋃_{d∈D^(t−1)} {blurb(r) : r ∈ RET(d, T, k)}, (12) D^(t) = PARSEBLOCKS(LLMrefine(D^(t−1), B^(t), q)). (13) LLM cost (worst case, n_t turns, m descriptions): n_t refinement calls + n_t optional judge calls; retrievals: (n_t + 1)·m.
Scattershot. Scattershot introduces population-level diversity by sampling multiple candidate descriptions per ancestor in parallel at high temperature, then aggregating via population-level voting. Algorithm 2: DFSDT Reasoning Loop with Dynamic Retrieval. Require: node n, D_max, W, Q_max, answer set S, memory M, function schema F. Ensure: backtrack distance b; modifies S, M, F. 1: if δ(n) ≥ D_max or n.pruned then 2: return ℓ_prune 3: end if 4: if n.terminal then 5: S.add(n) 6: return ℓ_ans 7: end if 8: for i ← 1 to W do 9: if |S| ≥ S_max or Q > Q_max then return ∞ 10: end if 11: if ch(n) ≠ ∅ then 12: Inject diversity prompt summarising prior children 13: end if 14: o ← LLM(H(n), F(n)); Q ← Q + 1 15: if o contains pseudo-tool blocks then 16: n_th ← Thought child 17: if not duplicate in M then 18: tools ← SELECTANDRUNSTRATEGY(o, F, LLM, M) 19: F(n_th) ← F(n_th) ∪ tools 20: M.update(o) 21: end if 22: end if 23: for each function call (a, x) ∈ o do 24: n_a ← Action child; n_x ← ActionInput child 25: (obs, κ) ← EXEC(a, x, F(n)) 26: if κ = 1 then replace a with sentinel 27: end if 28: if κ = 3 then n_x.terminal ← True 29: end if 30: if κ ∈ {1, 2, 4} then n_x.pruned ← True 31: end if 32: end for 33: for each child c ∈ ch(n) do 34: b ← DFSDT(c, D_max, W, Q_max, S, M, F) 35: if |S| ≥ S_max then return ∞ 36: end if 37: if b > 1 then return b − 1 38: end if 39: end for 40: end for 41: return 1 The following subroutines are shared between Scattershot and Memetic (Algorithm 1): For each ancestor a, let E_a = {blurb(r) : r ∈ RET(a, T, k)"},{"citing_arxiv_id":"2604.17304","ref_index":48,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Test-Time Scaling via Temporal Reasoning Aggregation","primary_cat":"cs.AI","submitted_at":"2026-04-19T07:39:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRACE aggregates answer consistency and confidence trajectory over multiple reasoning steps to decide when to halt inference, reducing token usage by 25-30% while keeping accuracy within 1-2% of full
reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04066","ref_index":249,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04065","ref_index":264,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06787","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"When Is Thinking Enough? 
Early Exit via Sufficiency Assessment for Efficient Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-08T07:56:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DTSR enables large reasoning models to dynamically assess chain-of-thought sufficiency via reflection signals and a sufficiency check, reducing reasoning length by 28.9-34.9% with minimal performance loss on Qwen3 models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04009","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":false,"paper_title":"Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding","primary_cat":"cs.SE","submitted_at":"2026-04-05T07:54:18+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SADU benchmark shows top VLMs reach only 70% accuracy on software architecture diagram tasks, revealing gaps in visual reasoning for engineering artifacts.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"gemini-3.1-flash-lite-preview 51.36% 60.97% 69.47% 60.47% gemini-3-flash-preview 46.05% 49.17% 52.78% 49.28% (68.40%). Overall, these results suggest that increasing the thinking level does not consistently improve performance, and that a lower thinking setting may be more effective in this task configuration. These results suggest that overthinking [24, 27] is a major source of error in this setting. To further investigate its impact, we conduct a deeper analysis of all results between low and high thinking levels. Through two authors' manual analysis of the responses, we identify 470 cases exhibiting overthinking-related errors.
For counting problems (262 cases), overthinking mainly manifests in two forms."},{"citing_arxiv_id":"2509.06337","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation","primary_cat":"cs.AI","submitted_at":"2025-09-08T04:59:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces PAS and FAS task abstractions plus the LLM-S^3 benchmark to evaluate LLMs on generating sociodemographic survey responses across 11 real datasets and multiple models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05489","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Aligned Reward: Towards Effective and Efficient Reasoners","primary_cat":"cs.LG","submitted_at":"2025-09-05T20:39:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Self-aligned reward uses relative perplexity differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.09567","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-03-12T17:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena
and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Tianyu Yu, Qixin Xu, Weize Chen, et al. Process reinforcement through implicit rewards. arXiv preprint arXiv:2502.01456, 2025. [144] Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, et al. The entropy mechanism of reinforcement learning for reasoning language models. arXiv preprint arXiv:2505.22617, 2025. [145] Yingqian Cui, Pengfei He, Jingying Zeng, Hui Liu, Xianfeng Tang, Zhenwei Dai, Yan Han, Chen Luo, Jing Huang, Zhen Li, et al. Stepwise perplexity-guided refinement for efficient chain-of-thought reasoning in large language models. arXiv preprint arXiv:2502.13260, 2025. [146] Yu Cui and Cong Zuo. Practical reasoning interruption attacks on reasoning large language"}],"limit":50,"offset":0}