hub Canonical reference

Towards Reasoning in Large Language Models: A Survey

Jie Huang, Kevin Chen-Chuan Chang · 2022 · cs.CL · arXiv 2212.10403

Canonical reference. 91% of citing Pith papers cite this work as background.

27 Pith papers citing it

Background 91% of classified citations

open full Pith review browse 27 citing papers arXiv PDF

abstract

Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 9 dataset 2

citation-polarity summary

background 10 use dataset 1

representative citing papers

OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

cs.SE · 2025-04-22 · accept · novelty 7.0

OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

cs.LG · 2024-10-05 · unverdicted · novelty 7.0

TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.

STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.

OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

cs.AI · 2026-05-09 · unverdicted · novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

GEM: Graph-Enhanced Mixture-of-Experts with ReAct Agents for Dialogue State Tracking

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.

Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.

TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models

cs.CL · 2026-03-13 · unverdicted · novelty 6.0

TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.

CURE-Med: Curriculum-Informed Reinforcement Learning for Multilingual Medical Reasoning

cs.AI · 2026-01-19 · unverdicted · novelty 6.0

CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

cs.CL · 2025-11-25 · unverdicted · novelty 6.0

Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.

A Survey on Large Language Model based Autonomous Agents

cs.AI · 2023-08-22 · accept · novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.

Reasoning with Language Model is Planning with World Model

cs.CL · 2023-05-24 · unverdicted · novelty 6.0

RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.

Multimodal Chain-of-Thought Reasoning in Language Models

cs.CL · 2023-02-02 · accept · novelty 6.0

Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.

Semantic-Aware Logical Reasoning via a Semiotic Framework

cs.AI · 2025-09-29 · conditional · novelty 5.0

LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.

Direct Reasoning Optimization: Token-Level Reasoning Reflectivity Meets Rubric Gates for Unverifiable Tasks

cs.CL · 2025-06-16 · unverdicted · novelty 5.0

Direct Reasoning Optimization applies token-level Reasoning Reflection Reward (R3) focused on high-variance tokens and rubric-gating constraints to improve sample-efficient RL training of LLMs on unverifiable tasks.

On the Diagram of Thought

cs.CL · 2024-09-16 · unverdicted · novelty 5.0

Diagram of Thought (DoT) is a controller-light framework in which an LLM builds typed reasoning diagrams validated online and interpreted as diagrams in a slice topos whose synthesis is a finite limit.

CHESS: Contextual Harnessing for Efficient SQL Synthesis

cs.LG · 2024-05-27 · conditional · novelty 5.0

CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.

Agentic Reasoning for Large Language Models

cs.AI · 2026-01-18 · unverdicted · novelty 4.0

The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.

Aligning Perception, Reasoning, Modeling and Interaction: A Survey on Physical AI

cs.AI · 2025-10-06 · unverdicted · novelty 4.0

A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

cs.CV · 2025-03-13 · unverdicted · novelty 4.0

R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.

Large Language Model-Brained GUI Agents: A Survey

cs.AI · 2024-11-27 · unverdicted · novelty 4.0

A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

cs.CL · 2025-03-27 · accept · novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

A Survey on Large Language Models for Code Generation

cs.CL · 2024-06-01 · unverdicted · novelty 3.0

A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

Large Language Models: A Survey

cs.CL · 2024-02-09 · accept · novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

citing papers explorer

Showing 3 of 3 citing papers after filters.

TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis cs.LG · 2024-10-05 · unverdicted · none · ref 36 · internal anchor
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.
Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds cs.LG · 2026-06-30 · unverdicted · none · ref 18 · internal anchor
Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.
CHESS: Contextual Harnessing for Efficient SQL Synthesis cs.LG · 2024-05-27 · conditional · none · ref 77 · internal anchor
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.

Towards Reasoning in Large Language Models: A Survey

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer