pith. sign in

hub Canonical reference

Agentic Reasoning for Large Language Models

Canonical reference. 86% of citing Pith papers cite this work as background.

18 Pith papers citing it
Background 86% of classified citations
abstract

Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, they struggle in open-ended and dynamic environments. Agentic reasoning marks a paradigm shift by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we organize agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning, which establishes core single-agent capabilities including planning, tool use, and search in stable environments; self-evolving agentic reasoning, which studies how agents refine these capabilities through feedback, memory, and adaptation; and collective multi-agent reasoning, which extends intelligence to collaborative settings involving coordination, knowledge sharing, and shared goals. Across these layers, we distinguish in-context reasoning, which scales test-time interaction through structured orchestration, from post-training reasoning, which optimizes behaviors via reinforcement learning and supervised fine-tuning. We further review representative agentic reasoning frameworks across real-world applications and benchmarks, including science, robotics, healthcare, autonomous research, and mathematics. This survey synthesizes agentic reasoning methods into a unified roadmap bridging thought and action, and outlines open challenges and future directions, including personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance for real-world deployment.

hub tools

citation-role summary

background 6 dataset 1

citation-polarity summary

years

2026 18

verdicts

UNVERDICTED 18

clear filters

representative citing papers

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

Learning Agentic Policy from Action Guidance

cs.CL · 2026-05-12 · unverdicted · novelty 7.0

ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

Confidence Estimation in Automatic Short Answer Grading with LLMs

cs.CL · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

A hybrid confidence framework for LLM-based short answer grading combines model signals with aleatoric uncertainty from semantic clustering of responses and improves selective grading reliability over single-source methods.

Agentic Frameworks for Reasoning Tasks: An Empirical Study

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.

Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

cs.AI · 2026-05-07 · unverdicted · novelty 5.0 · 3 refs

Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency variation to credit distillation, outperforming baselines on ALFWorld and WebShop.

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

cs.SE · 2026-04-29 · unverdicted · novelty 5.0

An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code generation.

ActionNex: A Virtual Outage Manager for Cloud Computing

cs.AI · 2026-04-03 · unverdicted · novelty 4.0

ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • TDD Governance for Multi-Agent Code Generation via Prompt Engineering cs.SE · 2026-04-29 · unverdicted · none · ref 23 · internal anchor

    An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code generation.