hub Canonical reference

Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa · 2022

Canonical reference. 83% of citing Pith papers cite this work as background.

15 Pith papers citing it

Background 83% of classified citations

browse 15 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 5 method 1

citation-polarity summary

background 5 use method 1

representative citing papers

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.

CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.

Query-Conditioned Test-Time Self-Training for Large Language Models

cs.CL · 2026-05-13 · conditional · novelty 7.0 · 2 refs

QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.

MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning

cs.AI · 2026-05-13 · unverdicted · novelty 6.0

MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.

Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs

cs.CL · 2026-04-14 · unverdicted · novelty 6.0

RALP learns string-based chain-of-thought prompts as scoring functions for knowledge graph triples using Bayesian optimization from fewer than 30 examples, improving link prediction MRR by over 5% and achieving over 88% Jaccard similarity on complex OWL reasoning tasks.

Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate

cs.MA · 2026-04-03 · unverdicted · novelty 6.0

HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.

Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning

cs.CL · 2026-03-29 · unverdicted · novelty 6.0

A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.

Evaluating LLM-generated code for domain-specific languages: molecular dynamics with LAMMPS

cs.SE · 2026-03-21 · unverdicted · novelty 6.0

LLM syntax accuracy for LAMMPS scripts improved to 91% parser pass rate, yet only 1/80 scripts were scientifically correct on the hardest prompt; an agentic verification skill raised success to 5/6.

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

cs.RO · 2026-02-09 · unverdicted · novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

cs.AI · 2025-06-07 · unverdicted · novelty 6.0

LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.

SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening

cs.CV · 2026-05-17 · unverdicted · novelty 5.0

SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.

Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

EMS: Multi-Agent Voting via Efficient Majority-then-Stopping

cs.AI · 2026-04-03 · unverdicted · novelty 4.0

EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

citing papers explorer

Showing 15 of 15 citing papers.

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving cs.CV · 2026-05-22 · unverdicted · none · ref 105
DriveSpatial benchmark shows the best of 15 VLMs trails humans by 28.4 points on spatiotemporal driving tasks, with cognitive scene construction as the main failure mode.
CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning cs.AI · 2026-05-15 · unverdicted · none · ref 17
CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.
Query-Conditioned Test-Time Self-Training for Large Language Models cs.CL · 2026-05-13 · conditional · none · ref 12 · 2 links
QueST adapts LLMs at test time by generating query-specific problem-solution pairs for self-supervised fine-tuning, improving reasoning performance without external data.
MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning cs.AI · 2026-05-13 · unverdicted · none · ref 13
MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K dataset.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 24
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
Retrieve, Integrate, and Synthesize: Spatial-Semantic Grounded Latent Visual Reasoning cs.CL · 2026-05-08 · unverdicted · none · ref 2
RIS improves MLLM latent visual reasoning by retrieving spatial-semantic evidence, integrating it via attention bottlenecks, and synthesizing it with language transition tokens, yielding gains on V*, HRBench, MMVP, and BLINK benchmarks.
Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs cs.CL · 2026-04-14 · unverdicted · none · ref 11
RALP learns string-based chain-of-thought prompts as scoring functions for knowledge graph triples using Bayesian optimization from fewer than 30 examples, improving link prediction MRR by over 5% and achieving over 88% Jaccard similarity on complex OWL reasoning tasks.
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate cs.MA · 2026-04-03 · unverdicted · none · ref 1
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning cs.CL · 2026-03-29 · unverdicted · none · ref 14
A new counterfactual multi-agent framework improves LLM diagnostic accuracy by quantifying confidence shifts from edited clinical findings and guiding specialist discussions.
Evaluating LLM-generated code for domain-specific languages: molecular dynamics with LAMMPS cs.SE · 2026-03-21 · unverdicted · none · ref 24
LLM syntax accuracy for LAMMPS scripts improved to 91% parser pass rate, yet only 1/80 scripts were scientifically correct on the hardest prompt; an agentic verification skill raised success to 5/6.
Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning cs.RO · 2026-02-09 · unverdicted · none · ref 47
R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.
The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity cs.AI · 2025-06-07 · unverdicted · none · ref 18
LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.
SafeLens: Deliberate and Efficient Video Guardrails with Fast-and-Slow Screening cs.CV · 2026-05-17 · unverdicted · none · ref 60
SafeLens presents a fast-and-slow video guardrail framework that filters the SafeWatch dataset to 2.4% and adds Chain-of-Thought traces to achieve state-of-the-art moderation performance at reduced inference cost.
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models cs.CV · 2026-04-09 · unverdicted · none · ref 11
HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.
EMS: Multi-Agent Voting via Efficient Majority-then-Stopping cs.AI · 2026-04-03 · unverdicted · none · ref 1
EMS reduces the average number of agents invoked for majority voting by 32% via reliability-aware prioritization and early stopping on six benchmarks.

Large language models are zero-shot reasoners.Advances in neural information processing systems, 35:22199–22213

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer