{"total":68,"items":[{"citing_arxiv_id":"2606.05976","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Self-Correction Illusion: LLMs Correct Others but Not Themselves","primary_cat":"cs.AI","submitted_at":"2026-06-04T10:17:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Relabeling an identical erroneous claim from the model's own thought role to an external chat role increases explicit correction rates by 23-93 percentage points across 13 model-domain cells, indicating a chat-template artifact rather than a cognitive deficit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01046","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents","primary_cat":"cs.AI","submitted_at":"2026-05-31T06:29:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TravelEval is a new benchmark with a six-dimensional evaluation framework, realistic data sandbox, and simulation-based global assessment for LLM-powered travel planning agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00726","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Reward Steering: An Adaptive Inference-Time Framework that Implicitly Promotes Cognitive Behaviors in Reasoning LLMs","primary_cat":"cs.AI","submitted_at":"2026-05-30T13:38:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LRS trains a latent reward model on final-answer correctness to steer SAE states during inference, improving reasoning performance and 
implicitly encouraging better cognitive behaviors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30512","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PhyDrawGen: Physically Grounded Diagram Generation from Natural Language","primary_cat":"cs.AI","submitted_at":"2026-05-28T19:49:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PhyDrawGen is a neuro-symbolic pipeline that extracts typed scene graphs via LLM, converts them to physically constrained PSLGs via deterministic solver, and refines via fine-tuned Qwen-VL, claiming superior performance over GPT-5-image and Gemini models on 1,449 physics problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29491","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF","primary_cat":"cs.AI","submitted_at":"2026-05-28T07:18:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DistractionIF benchmark reveals inverse scaling in LLM robustness to distractors in reference text, with GRPO RL as a mitigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29027","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mind Your Tone: Does Tone Alter LLM Performance?","primary_cat":"cs.AI","submitted_at":"2026-05-27T19:23:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Tonal variations in prompts cause systematic but model-dependent accuracy changes in LLMs on objective 
multiple-choice questions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20924","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Strategy-Induct: Task-Level Strategy Induction for Instruction Generation","primary_cat":"cs.CL","submitted_at":"2026-05-20T09:10:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Strategy-Induct induces task-level instructions from question-only examples by generating reasoning strategies first, then using those pairs to create a guiding instruction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19859","ref_index":34,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:50:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19824","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T13:18:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Temporal conditioning in three LLM-based planner architectures for AV scene-to-plan reasoning yields no statistically significant gains on NLP correctness metrics but 
enables predictive hazard reasoning and stable corrections on BDD-X subsets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19627","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Helpful is LLM Assistance in Network Operations? A Case Study at a Large Demonstration Network","primary_cat":"cs.NI","submitted_at":"2026-05-19T10:06:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A case study with 105 network engineers found that an LLM chatbot with RAG, CLI control, and ticket access received positive evaluations in 68.1% of interactions while assisting with building and operating a large demonstration network.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16205","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP","primary_cat":"cs.AI","submitted_at":"2026-05-15T17:23:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"In CybORG CAGE-2, programmatic state abstraction improves mean return up to 76% over raw observations while adding deliberation tools to hierarchies degrades performance up to 3.4x and increases token use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15156","ref_index":1,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MeMo: Memory as a Model","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MeMo encodes new knowledge into a separate memory 
model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06638","ref_index":21,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06225","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-07T13:19:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Memory Inception is a training-free method that injects latent KV banks at chosen layers to steer LLMs, achieving superior control-drift balance and up to 118x storage reduction on personality and structured-reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05715","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure 
Regimes","primary_cat":"cs.AI","submitted_at":"2026-05-07T05:58:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01048","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Compared to What? Baselines and Metrics for Counterfactual Prompting","primary_cat":"cs.CL","submitted_at":"2026-05-01T19:23:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02939","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework","primary_cat":"cs.LG","submitted_at":"2026-05-01T07:57:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming prior static methods on a public 
dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13398","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-15T01:55:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ABSA-R1 uses RL with a cognition-aligned reward model and rejection sampling to generate consistent reasoning paths for sentiment predictions, improving interpretability and performance on ABSA benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13371","ref_index":42,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints","primary_cat":"cs.CL","submitted_at":"2026-04-15T00:35:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Large reasoning models exhibit reasoning collapse, with accuracy dropping sharply beyond task-specific complexity thresholds in controlled versions of nine classical reasoning tasks using strict validity validators.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10693","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FACT-E: Causality-Inspired Evaluation for Trustworthy Chain-of-Thought Reasoning","primary_cat":"cs.AI","submitted_at":"2026-04-12T15:35:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FACT-E uses controlled perturbations 
as an instrumental signal to measure intra-chain faithfulness in CoT reasoning and combines it with answer consistency to select trustworthy trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08094","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MedThink: Enhancing Diagnostic Accuracy in Small Models via Teacher-Guided Reasoning Correction","primary_cat":"cs.CY","submitted_at":"2026-04-09T18:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MedThink, a two-stage teacher-guided reasoning correction distillation framework, boosts small language models' medical diagnostic accuracy by up to 12.7% on benchmarks and achieves 56.4% on a gastroenterology dataset.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Winther, Can large language models reason about medical questions?, Patterns 5 (2024) 100943. [2] C.-Y. Hsieh, C.-L. Li, C.-K. Yeh, H. Nakano, Y. Lee, M. Nezhurina, A. Iscen, X. Zhang, H. Pfister, Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes, arXiv preprint arXiv:2305.02301 (2023). arXiv:2305.02301. [3] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, arXiv:2205.11916 (2022). [4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, arXiv:2201.11903 (2022). 
[5] Y. Tian, Y. Han, X. Chen, W. Wang, N. V. Chawla, Tinyllm: Learning a small student from multiple large language models, arXiv:2402."},{"citing_arxiv_id":"2604.04854","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software","primary_cat":"cs.SE","submitted_at":"2026-04-06T16:57:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs match or exceed state-of-the-art traditional methods for stabilizing numerical expressions in scientific software, succeeding on 97.9% of expressions where baselines fail to improve accuracy, but struggle with control flow and high-precision literals.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"merical expressions and programs to reduce rounding error and improve stability. These approaches rely on domain-specific algorithms and handcrafted search strategies, forming a strong baseline for evaluating numerical correctness and stability. Large language models (LLMs) have demonstrated strong performance in symbolic reasoning and structured problem solving [21, 28]. However, their reliability in scientific computing, particularly numerical reasoning over floating-point expressions, remains largely underexplored. 
Existing evidence suggests that performance on numerical tasks is sensitive to task complexity [11], and incorrect outputs in testing or repair workflows can have serious consequences [14]."},{"citing_arxiv_id":"2604.04852","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework","primary_cat":"cs.CR","submitted_at":"2026-04-06T16:53:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Cohen's k > 0.80.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In terms of prompt strategy design, we divided it onto two sections: input structure and output sensitivity control. Although zero-shot prompting, structured prompting, role-play prompting, and other techniques have been extensively explored, there still has been limited evaluation of the reasoning quality and practical applicability in cybersecurity practice [41-43]. Second, human-centered demands such as the interpretability and traceability of evidence receive insufficient attention. With these considerations, our study views prompt design as a structured control mechanism used to constrain the reasoning process and maintain logical integrity in locally hosted LLMs. 
We propose three representative strategies (see Fig."},{"citing_arxiv_id":"2604.16421","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Representation Robustness in Large Language Models for Geometry","primary_cat":"cs.CL","submitted_at":"2026-04-03T11:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capacity models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the same problem, limiting invariance assessment. GeoRepEval fills this gap with problem-level, representation-aligned evaluation and paired statistical testing. 2.3 Prompting and Reasoning Strategies Chain-of-thought prompting [36], self-consistency [35], tree-of-thought [37], and tool-augmented methods [28] have improved reasoning performance, with zero-shot and least-to-most strategies [15, 39] unlocking latent ability without exemplars. However, these studies implicitly assume a fixed representation and focus on how models reason rather than what form reasoning is conditioned on. Our work isolates representation as a controlled variable, testing whether LLM reasoning is representation-invariant. 
3 Methodology GeoRepEval probes whether LLMs exhibit invariant reasoning across equivalent geometric represen-"},{"citing_arxiv_id":"2604.02699","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trivial Vocabulary Bans Improve LLM Reasoning More Than Deep Linguistic Constraints","primary_cat":"cs.CL","submitted_at":"2026-04-03T03:48:27+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Banning filler words like 'very' and 'just' improved LLM reasoning by 6.7 percentage points while E-Prime improved it by only 3.7, with gains ranking in exact inverse order of theoretical depth across models and tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14197","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure","primary_cat":"cs.CL","submitted_at":"2026-04-03T03:06:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.22816","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring and curing reasoning rigidity: from decorative chain-of-thought to genuine faithfulness","primary_cat":"cs.CL","submitted_at":"2026-03-24T05:38:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SLRC quantifies genuine step necessity in LLM 
reasoning as a causal estimator, LC-CoSR training reduces rigidity with stability guarantees, and evaluations reveal a faithfulness-sycophancy paradox across frontier models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12358","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles","primary_cat":"cs.CV","submitted_at":"2026-01-18T11:32:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"An agentic LLM/LVM framework generates adaptive behavior trees on-the-fly for AV navigation in CARLA+Nav2 simulation, succeeding in obstacle avoidance where static BTs fail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.08919","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs as Assessors: Right for the Right Reason?","primary_cat":"cs.IR","submitted_at":"2026-01-13T19:01:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs judge document relevance at a level comparable to humans but frequently highlight different passages, indicating they are often not right for the right reasons and cannot fully replace human assessors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.19078","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM 
Reasoning","primary_cat":"cs.CL","submitted_at":"2025-11-24T13:18:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GraphMind models multi-step reasoning as an evolving heterogeneous graph, using GNN encoding and semantic matching to select theorems and generate conclusions iteratively, reporting performance gains over baselines on QA datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.17171","ref_index":27,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle","primary_cat":"cs.CV","submitted_at":"2025-11-21T11:45:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FireScope trains a VLM on US data to output wildfire risk rasters with reasoning traces and shows improved cross-continental performance on European events compared with prior approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.05746","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems","primary_cat":"cs.AI","submitted_at":"2025-10-07T10:04:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.21465","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Talking Trees: 
Reasoning-Assisted Induction of Decision Trees for Tabular Data","primary_cat":"cs.LG","submitted_at":"2025-09-25T19:30:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Reasoning LLMs with minimal tools for tree construction and analysis induce decision trees that outperform CART, compete with ensembles on low-resource tabular data, and provide human-readable reasoning traces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09505","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference","primary_cat":"cs.AR","submitted_at":"2025-09-11T14:49:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PLENA introduces a co-designed system with three optimization pathways for long-context agentic LLM inference, claiming up to 2.23x throughput over A100 and 4.04x energy efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.18864","ref_index":145,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Towards an AI co-scientist","primary_cat":"cs.AI","submitted_at":"2025-02-26T06:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.13171","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compressed Chain of 
Thought: Efficient Reasoning Through Dense Representations","primary_cat":"cs.CL","submitted_at":"2024-12-17T18:50:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CCoT generates variable-length continuous contemplation tokens that compress explicit reasoning chains, enabling additional dense reasoning and accuracy gains in off-the-shelf language models while allowing adaptive control of token count.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.05229","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2024-10-07T17:36:37+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs display high variance and major accuracy drops on GSM-Symbolic variants of grade-school math problems, indicating they replicate training patterns rather than execute logical reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.00724","ref_index":173,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models","primary_cat":"cs.AI","submitted_at":"2024-08-01T17:16:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2310.03714","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines","primary_cat":"cs.CL","submitted_at":"2023-10-05T17:37:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.16797","ref_index":183,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution","primary_cat":"cs.CL","submitted_at":"2023-09-28T19:01:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2309.03409","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Models as Optimizers","primary_cat":"cs.LG","submitted_at":"2023-09-07T00:07:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed 
baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2307.05973","ref_index":132,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models","primary_cat":"cs.RO","submitted_at":"2023-07-12T07:40:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoxPoser uses LLMs to compose 3D value maps via VLM interaction for model-based synthesis of robust robot trajectories on open-set language-specified manipulation tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022. [131] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022. [132] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022. [133] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023. 
[134] S."},{"citing_arxiv_id":"2306.13549","ref_index":184,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Survey on Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2023-06-23T15:21:52+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"As the pioneer work [8] points out, CoT is \"a series of intermediate reasoning steps\", which has been proven to be effective in complex reasoning tasks [8], [182], [183]. The main idea of CoT is to prompt LLMs to output not only the final answer but also the reasoning process that leads to the answer, resembling the cognitive process of humans. Inspired by the success in NLP, multiple works [184], [185], [186], [187] have been proposed to extend the unimodal CoT to Multimodal CoT (M-CoT). We first introduce different paradigms for acquiring the M-CoT ability (§7.2.1). Then, we delineate more specific aspects of M-CoT, including the chain configuration (§7.2.2) and the pattern (§7.2.3). 
7.2.1 Learning Paradigms The learning paradigm is also an aspect worth investigating."},{"citing_arxiv_id":"2305.20050","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Let's Verify Step by Step","primary_cat":"cs.LG","submitted_at":"2023-05-31T17:24:00+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Process supervision significantly outperforms outcome supervision for training models on the MATH dataset, achieving 78% accuracy on a representative test subset with active learning and a released 800k step-label dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16264","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Data-Constrained Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-25T17:18:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71, Brussels, Belgium. Association for Computational Linguistics. [53] Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. 2020. WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization. arXiv preprint arXiv:2010.03093. [54] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, et al. 2022. 
The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track."},{"citing_arxiv_id":"2305.15334","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gorilla: Large Language Model Connected with Massive APIs","primary_cat":"cs.CL","submitted_at":"2023-05-24T16:48:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Iyer, A., Natarajan, N., Parthasarathy, S., Rajamani, S., and Sharma, R. (2022). Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering, pages 1219-1231. [17] Kim, G., Baldi, P., and McAleer, S. (2023). Language models can solve computer tasks. arXiv preprint arXiv:2303.17491. [18] Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916. [19] Komeili, M., Shuster, K., and Weston, J. (2021). Internet-augmented dialogue generation. arXiv preprint arXiv:2107.07566. [20] Lachaux, M.-A., Roziere, B., Chanussot, L., and Lample, G. 
(2020)."},{"citing_arxiv_id":"2305.14992","ref_index":114,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reasoning with Language Model is Planning with World Model","primary_cat":"cs.CL","submitted_at":"2023-05-24T10:28:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.14325","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Improving Factuality and Reasoning in Language Models through Multiagent Debate","primary_cat":"cs.CL","submitted_at":"2023-05-23T17:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multiagent debate among LLMs improves mathematical reasoning, strategic reasoning, and factual accuracy while reducing hallucinations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"and our multi-agent debate over six benchmarks (chess move optimality reported as a normalized score) light of the responses of other agents. The resulting quorum of models can hold and maintain multiple chains of reasoning and possible answers simultaneously before proposing the final answer. We find that our debate approach outperforms single model baselines such as zero-shot chain of thought [11] and reflection [26, 18] on a variety of six reasoning, factuality, and question-answering tasks. Using both multiple model agents and multiple rounds of debate are important to achieve the best performance. 
Given an initial query, we find that individual model instances propose a diverse range of answers despite being the same model class (although we also investigate the case of"},{"citing_arxiv_id":"2305.09617","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Expert-Level Medical Question Answering with Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-05-16T17:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.06161","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StarCoder: may the source be with you!","primary_cat":"cs.CL","submitted_at":"2023-05-09T08:16:42+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}