super hub Mixed citations

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Brian Ichter, Dale Schuurmans, Fei Xia, Jason Wei, Maarten Bosma, Xuezhi Wang · 2022 · cs.CL · arXiv 2201.11903

Mixed citation behavior. Most common role is background (65%).

345 Pith papers citing it

Background 65% of classified citations

open full Pith review browse 345 citing papers more from Brian Ichter arXiv PDF

abstract

We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter language model with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 50 method 13 baseline 5

citation-polarity summary

background 44 use method 13 baseline 5 support 4 unclear 2

claims ledger

abstract We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large language models via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. Th

authors

Brian Ichter Dale Schuurmans Fei Xia Jason Wei Maarten Bosma Xuezhi Wang

co-cited works

representative citing papers

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

cs.AI · 2026-06-04 · accept · novelty 8.0

Across 30 LLMs and 205 TLA+ tasks, syntactic correctness reaches at most 26.6% and semantic correctness 8.6%, with all successes limited to progressive prompting and no advantage from larger models.

Slot Machines: How LLMs Keep Track of Multiple Entities

cs.CL · 2026-04-22 · unverdicted · novelty 8.0

LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

Stability and Generalization in Looped Transformers

cs.LG · 2026-04-16 · unverdicted · novelty 8.0

Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.

GIANTS: Generative Insight Anticipation from Scientific Literature

cs.CL · 2026-04-10 · unverdicted · novelty 8.0

GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.

On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication

cs.LG · 2026-03-30 · unverdicted · novelty 8.0

Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator

physics.acc-ph · 2025-09-21 · unverdicted · novelty 8.0

A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

cs.CL · 2023-05-17 · accept · novelty 8.0

Tree of Thoughts enables language models to solve complex planning tasks by generating, evaluating, and searching over coherent intermediate thoughts in a tree, raising Game of 24 success from 4% to 74% with GPT-4.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

PAL: Program-aided Language Models

cs.CL · 2022-11-18 · conditional · novelty 8.0

PAL improves few-shot reasoning accuracy by having LLMs generate executable programs rather than text-based chains of thought, outperforming much larger models on math and logic benchmarks.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

cs.CL · 2022-02-25 · accept · novelty 8.0

Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

TO-Master: an LLM-agent framework for automated topology optimization

cs.CE · 2026-07-02 · unverdicted · novelty 7.0

TO-Master is an LLM agent framework that orchestrates finite-element topology optimization from conversational inputs, supporting 2D/3D compliance, thermal, stress-constrained, and multi-load cases while reproducing benchmarks without user code.

DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.

What Drives Interactive Improvement from Feedback?

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.

Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents

cs.AI · 2026-06-26 · unverdicted · novelty 7.0 · 2 refs

GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.

Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation

physics.soc-ph · 2026-06-20 · unverdicted · novelty 7.0

A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.

A Verifiable Search Is Not a Learnable Chain-of-Thought

cs.LG · 2026-06-20 · unverdicted · novelty 7.0

Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.

Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

physics.comp-ph · 2026-06-17 · unverdicted · novelty 7.0

PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.

Beyond Prediction: Tail-Aware Scheduling for LLM Inference

cs.LG · 2026-06-16 · unverdicted · novelty 7.0

Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.

citing papers explorer

Showing 50 of 286 citing papers after filters.

Slot Machines: How LLMs Keep Track of Multiple Entities cs.CL · 2026-04-22 · unverdicted · none · ref 18 · internal anchor
LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.
Stability and Generalization in Looped Transformers cs.LG · 2026-04-16 · unverdicted · none · ref 21 · internal anchor
Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant performs competitively or better.
GIANTS: Generative Insight Anticipation from Scientific Literature cs.CL · 2026-04-10 · unverdicted · none · ref 29 · internal anchor
GIANTS-4B, trained with RL on a new 17k-example benchmark of parent-to-child paper insights, achieves 34% relative improvement over gemini-3-pro in LM-judge similarity and is rated higher-impact by a citation predictor.
On the Mirage of Long-Range Dependency, with an Application to Integer Multiplication cs.LG · 2026-03-30 · unverdicted · none · ref 25 · internal anchor
Long-range dependency in integer multiplication is a mirage from 1D representation; a 2D grid reduces it to local 3x3 operations, letting a 321-parameter neural cellular automaton generalize perfectly to inputs 683 times longer than training while Transformers fail.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 12 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Agentic AI for Multi-Stage Physics Experiments at a Large-Scale User Facility Particle Accelerator physics.acc-ph · 2025-09-21 · unverdicted · none · ref 13 · internal anchor
A language-model-driven agentic AI system autonomously executes multi-stage physics experiments at a production synchrotron light source, reducing preparation time by two orders of magnitude while upholding safety constraints.
Instruction Tuning with GPT-4 cs.CL · 2023-04-06 · unverdicted · none · ref 13 · internal anchor
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
TO-Master: an LLM-agent framework for automated topology optimization cs.CE · 2026-07-02 · unverdicted · none · ref 11 · internal anchor
TO-Master is an LLM agent framework that orchestrates finite-element topology optimization from conversational inputs, supporting 2D/3D compliance, thermal, stress-constrained, and multi-load cases while reproducing benchmarks without user code.
DART: Difficulty-Adaptive Routing for Zero-Shot Video Temporal Grounding cs.CV · 2026-07-01 · unverdicted · none · ref 55 · internal anchor
DART routes zero-shot video temporal grounding queries by difficulty using DPP entropy, achieving up to 3.5 mIoU gains with 7x fewer frames on Charades-STA and ActivityNet Captions.
MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning cs.AI · 2026-06-30 · unverdicted · none · ref 32 · internal anchor
MultiUAV-Plat supplies a new RESTful simulation platform and 1500-task benchmark where Agent4Drone reaches 57.9% task pass rate versus 30.6% for ReAct baseline across 75 multi-UAV missions.
What Drives Interactive Improvement from Feedback? cs.AI · 2026-06-29 · unverdicted · none · ref 21 · internal anchor
Controlled student-teacher experiments across four benchmarks show interactive gains are driven more by the student's ability to use feedback than by teacher quality, with self-feedback adding little beyond unguided retries.
Grounded Iterative Language Planning: How Parameterized World Models Reduce Hallucination Propagation in LLM Agents cs.AI · 2026-06-26 · unverdicted · none · ref 3 · 2 links · internal anchor
GILP trains a parameterized backbone for valid actions and state predictions, then uses a consistency gate with LLM drafts to reduce hallucinated-state rate from 0.176 to 0.035 on GPT-4o-mini while raising success from 0.668 to 0.838.
Lost in Aggregation: A Multi-Scale Diagnostic Benchmark for LLM Spatial Navigation physics.soc-ph · 2026-06-20 · unverdicted · none · ref 27 · internal anchor
A new diagnostic benchmark decomposes LLM spatial navigation into three cognitive scales and shows that cross-scale aggregation, not single-level deficits, causes failure beyond small mazes.
A Verifiable Search Is Not a Learnable Chain-of-Thought cs.LG · 2026-06-20 · unverdicted · none · ref 32 · internal anchor
Verifiable search procedures cannot be learned as forward chain-of-thought by language models; they instead learn memorization, verification, or require precomputed catalogs.
Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark physics.comp-ph · 2026-06-17 · unverdicted · none · ref 9 · internal anchor
PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.
Beyond Prediction: Tail-Aware Scheduling for LLM Inference cs.LG · 2026-06-16 · unverdicted · none · ref 48 · internal anchor
Presents a distribution-aware scheduling framework for LLM inference that reduces P99 TTLT by 35-50% and TTFT by 34-47% versus SRPT with perfect length knowledge using statistical signals instead of predictions.
Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text cs.AI · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
Optical reasoning encodes rationales in images rather than text, matching or exceeding text-based performance on math, science, and multimodal benchmarks while cutting tokens by 28.57% on language tasks and 16% on multimodal tasks.
Self-Harness: Harnesses That Improve Themselves cs.CL · 2026-06-08 · unverdicted · none · ref 25 · internal anchor
Self-Harness lets LLM agents autonomously refine their interaction harnesses through weakness mining, proposal generation, and validation, raising held-out pass rates on Terminal-Bench-2.0 from 40.5% to 61.9%, 23.8% to 38.1%, and 42.9% to 57.1% across three models.
Continuous Language Diffusion as a Decoder-Interface Problem cs.CL · 2026-06-07 · unverdicted · none · ref 72 · internal anchor
Continuous language diffusion works by entering high-margin decoder basins where frozen T5 embeddings recover 93-96% of native decisions and linear readouts reach 97.9% agreement, implying models should be evaluated as representation-decoder systems.
SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents cs.AI · 2026-06-06 · unverdicted · none · ref 5 · internal anchor
SciTrace embeds cumulative safety deliberation and trajectory-aware verification into scientific agent pipelines, claiming SOTA safety gains and detection of 78.8% of compositional risks missed by single-step checks.
Toward Calibrated, Fair, and accurate Deepfake Detection cs.LG · 2026-06-03 · unverdicted · none · ref 36 · internal anchor
Face-Feature Tuning is a label-free logit remapping method that reduces FPR/TPR gaps across groups in deepfake detection while preserving overall accuracy.
Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks cs.CR · 2026-06-02 · unverdicted · none · ref 31 · internal anchor
An automatic numeric-remapping attack generator reveals 12-26 point accuracy drops on GSM8K for three LLMs while MAWPS and MultiArith stay near 98%.
Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling cs.CL · 2026-06-01 · unverdicted · none · ref 93 · internal anchor
Audit finds 36-39% incorrect FOL labels in FOLIO and MALLS; corrections raise LLM accuracy 9-22 points and an LLM-guided review framework achieves 90% dataset quality after checking fewer than 24% of examples.
The Regularizing Power of Language-Training Deepfake Detectors cs.CV · 2026-05-29 · unverdicted · none · ref 61 · internal anchor
A dual-encoder deepfake detector pairs a frozen specialist with a LoRA-tuned MLLM, trained first via binary alignment then via RL to reward explain-then-classify behavior, yielding improved cross-dataset performance and interpretability.
Social Reasoning in Machines: Investigating Collective Truth-Seeking Dynamics in Large Language Model Debate cs.MA · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
LLM multi-agent debate with epistemically diverse models improves truth-seeking performance on questionnaires and is claimed to be mechanistically grounded in the Argumentative Theory of Reasoning.
Memory Architectures for Multi-Turn Text-to-SQL: A Benchmark and Empirical Study cs.CL · 2026-05-25 · unverdicted · none · ref 8 · internal anchor
EnterpriseMem-Bench shows stateless multi-turn Text-to-SQL accuracy drops to zero by turn 3, working memory is the main driver of gains, and additional memory components yield model- and dataset-dependent effects from +14 to -16 percentage points.
ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling cs.LG · 2026-05-25 · unverdicted · none · ref 28 · internal anchor
ARBITER models reasoning trajectory basins in test-time sampling and uses model-internal signals to correct majority-vote failures, recovering part of the oracle gap on math benchmarks.
Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems cs.IR · 2026-05-23 · unverdicted · none · ref 14 · internal anchor
Spectral Retrieval uses multi-scale sinc convolutions on token embeddings to interpolate between per-token MaxSim and mean-pooling, achieving large gains on synthetic and LIMIT-small benchmarks for localized retrieval.
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems cs.AI · 2026-05-22 · unverdicted · none · ref 53 · internal anchor
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 25 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 47 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels cs.LG · 2026-05-18 · unverdicted · none · ref 10 · internal anchor
Counterfactual likelihood tests detect indirect influence through public channels in private reasoning models, validated on a 7B role-channel model showing asymmetric A-to-B influence and complete pathway identification via graph-separation controls.
Test-Time Hinting for Black-Box Vision-Language Models cs.CV · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
Test-Time Hinting trains a hint generator to prepend contextual guidance to VLM prompts, improving accuracy on natural-image VQA benchmarks with generalization to unseen tasks and models.
Learning, Fast and Slow: Towards LLMs That Adapt Continually cs.LG · 2026-05-12 · unverdicted · none · ref 61 · 2 links · internal anchor
Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 1 · internal anchor
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 22 · internal anchor
Frontier LLMs achieve 95-100% accuracy on AMC/AIME problems but recover far fewer distinct valid strategies than human references, while collectively generating 50 novel strategies.
CA-SQL: Complexity-Aware Inference Time Reasoning for Text-to-SQL via Exploration and Compute Budget Allocation cs.CL · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
CA-SQL achieves 51.72% execution accuracy on the challenging tier of the BIRD benchmark using GPT-4o-mini by scaling exploration breadth according to estimated task difficulty, evolutionary prompt seeding, and candidate voting.
KL for a KL: On-Policy Distillation with Control Variate Baseline cs.LG · 2026-05-08 · unverdicted · none · ref 42 · internal anchor
vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensive full-vocabulary methods.
XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation cs.AI · 2026-04-27 · unverdicted · none · ref 26 · internal anchor
XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.
RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow cs.SE · 2026-04-24 · unverdicted · none · ref 50 · internal anchor
RAG-Reflect achieves F1=0.78 on valid comment-edit prediction using retrieval-augmented reasoning and self-reflection, outperforming baselines and approaching fine-tuned models without retraining.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 16 · internal anchor
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
Navigating the Conceptual Multiverse cs.HC · 2026-04-20 · unverdicted · none · ref 22 · internal anchor
The conceptual multiverse system with a verification framework for decision structures helps users in philosophy, AI alignment, and poetry build clearer working maps of open-ended problems by making implicit LLM choices explicit and changeable.
RAGognizer: Hallucination-Aware Fine-Tuning via Detection Head Integration cs.CL · 2026-04-17 · unverdicted · none · ref 35 · internal anchor
RAGognizer adds a detection head to LLMs for joint training on generation and token-level hallucination detection, yielding SOTA detection and fewer hallucinations in RAG while preserving output quality.
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees cs.LG · 2026-04-17 · unverdicted · none · ref 28 · internal anchor
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
IE as Cache: Information Extraction Enhanced Agentic Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 41 · internal anchor
IE-as-Cache framework repurposes information extraction as a dynamic cognitive cache to improve agentic reasoning accuracy in LLMs on challenging benchmarks.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 39 · internal anchor
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
An End-to-End Approach for Fixing Concurrency Bugs via SHB-Based Context Extractor cs.SE · 2026-04-07 · unverdicted · none · ref 84 · internal anchor
ConFixAgent repairs diverse concurrency bugs end-to-end by using Static Happens-Before graphs to extract relevant code context for LLMs, outperforming prior tools in benchmarks.
Profile-Then-Reason: Bounded Semantic Complexity for Tool-Augmented Language Agents cs.AI · 2026-04-05 · unverdicted · none · ref 1 · internal anchor
PTR framework profiles a workflow upfront then executes it deterministically with bounded verification and repair, limiting LM calls to 2-3 while outperforming ReAct in 16 of 24 tested configurations.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 23 · internal anchor
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer