OpenClassGen supplies 324,843 real-world Python classes with self-contained skeletons and static metrics to support LLM class generation research and evaluation.
hub Canonical reference
Towards Reasoning in Large Language Models: A Survey
Canonical reference. 91% of citing Pith papers cite this work as background.
abstract
Reasoning is a fundamental aspect of human intelligence that plays a crucial role in activities such as problem solving, decision making, and critical thinking. In recent years, large language models (LLMs) have made significant progress in natural language processing, and there is observation that these models may exhibit reasoning abilities when they are sufficiently large. However, it is not yet clear to what extent LLMs are capable of reasoning. This paper provides a comprehensive overview of the current state of knowledge on reasoning in LLMs, including techniques for improving and eliciting reasoning in these models, methods and benchmarks for evaluating reasoning abilities, findings and implications of previous research in this field, and suggestions on future directions. Our aim is to provide a detailed and up-to-date review of this topic and stimulate meaningful discussion and future work.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.
Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.
STOP uses structured on-policy analysis to prune long reasoning traces to their earliest correct node, cutting token usage 19-42% with little accuracy loss on math benchmarks.
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
GEM achieves 65.19% joint goal accuracy on MultiWOZ 2.2 by routing between a graph neural network expert for dialogue structure and a T5 expert for sequences, plus ReAct agents for value generation, outperforming prior SOTA methods.
Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.
TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
CURE-MED pairs a new 13-language medical reasoning benchmark with curriculum RL to raise logical correctness to 70% and language consistency to 95% at 32B scale while outperforming baselines.
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
RAP turns LLMs into dual world-model and planning agents via MCTS to generate better reasoning paths, outperforming CoT baselines and achieving 33% relative gains over GPT-4 CoT using LLaMA-33B on plan generation.
Multimodal-CoT achieves state-of-the-art on ScienceQA by using a two-stage process that incorporates vision into chain-of-thought rationale generation for models under 1 billion parameters.
LogicAgent uses a semiotic-square-guided approach to enhance logical reasoning in LLMs on the new RepublicQA benchmark and others, reporting average gains of 6.25% and 7.05% respectively.
Direct Reasoning Optimization applies token-level Reasoning Reflection Reward (R3) focused on high-variance tokens and rubric-gating constraints to improve sample-efficient RL training of LLMs on unverifiable tasks.
Diagram of Thought (DoT) is a controller-light framework in which an LLM builds typed reasoning diagrams validated online and interpreted as diagrams in a slice topos whose synthesis is a finite limit.
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.
The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.
A survey of physical AI that distinguishes theoretical physics reasoning from applied understanding and synthesizes advances in symbolic reasoning, embodied systems, and generative models to advocate for physics-grounded world models.
R1-Onevision turns images into structured text for multimodal reasoning, trains on a custom dataset with RL, and claims SOTA results on an educational benchmark.
A survey consolidating frameworks, data practices, large action models, benchmarks, applications, and research gaps in LLM-brained GUI agents.
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
citing papers explorer
-
TS-Reasoner: Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis
TS-Reasoner is a domain-oriented agent using LLMs, computational tools, and error feedback for multi-step time series inference, showing better performance than general LLMs on understanding and reasoning benchmarks.
-
Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds
Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.
-
CHESS: Contextual Harnessing for Efficient SQL Synthesis
CHESS deploys four LLM agents to retrieve information, prune schemas, generate refined SQL candidates, and validate via unit tests, reporting up to 71.10% accuracy on BIRD with 83% fewer calls than leading proprietary baselines.