Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou · 2023

13 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Benchmarking LLM-Driven Network Configuration Repair

cs.NI · 2026-04-24 · unverdicted · novelty 8.0

Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.

Latent Chain-of-Thought Improves Structured-Data Transformers

cs.LG · 2026-05-11 · conditional · novelty 7.0 · 2 refs

Latent chain-of-thought via recurrent feedback tokens from compressed hidden states improves transformer performance on time-series forecasting and tabular prediction across 36 datasets.

StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

cs.AI · 2025-10-11 · conditional · novelty 7.0

StepFly automates TSG execution via TSG Mentor, LLM-based DAG extraction with QPPs, and a DAG-guided parallel scheduler, reaching 94% success on GPT-4.1 with 32.9-70.4% time savings on parallelizable guides.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

BEDTime: A Unified Benchmark for Automatically Describing Time Series

cs.CL · 2025-09-05 · conditional · novelty 6.0

BEDTime benchmark tests 17 models on describing time series structure and finds vision-language models outperform dedicated time-series-language models and language-only approaches, with all models fragile to robustness tests.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

cs.LG · 2024-07-31 · unverdicted · novelty 6.0

Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

cs.CL · 2024-02-20 · conditional · novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods, outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

Llemma: An Open Language Model For Mathematics

cs.CL · 2023-10-16 · unverdicted · novelty 6.0

Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.

Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation

cs.CL · 2026-03-15 · unverdicted · novelty 5.0

A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.

Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning

cs.LG · 2025-06-09 · unverdicted · novelty 5.0

Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.

TableMaster: A Recipe to Advance Table Understanding with Language Models

cs.CL · 2025-01-31 · unverdicted · novelty 5.0

TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.

Reinforcement Learning for LLM Post-Training: A Survey

cs.CL · 2024-07-23 · unverdicted · novelty 3.0

A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, and DPO variants.

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

cs.AI · 2023-06-05 · unverdicted · novelty 2.0

The paper introduces a collaborative multi-agent framework for LLMs and applies it conceptually to existing models like Auto-GPT, BabyAGI, and Gorilla through case studies in domains such as courtroom simulations and software development.

citing papers explorer

Showing 13 of 13 citing papers.

Benchmarking LLM-Driven Network Configuration Repair · cs.NI · 2026-04-24 · unverdicted · none · ref 36
Cornetto is the first benchmark that synthesizes 231 network misconfiguration problems across topologies of 20-754 nodes and uses formal verification to show that nine state-of-the-art LLMs often introduce regressions and degrade at scale.
Latent Chain-of-Thought Improves Structured-Data Transformers · cs.LG · 2026-05-11 · conditional · none · ref 4 · 2 links
Latent chain-of-thought via recurrent feedback tokens from compressed hidden states improves transformer performance on time-series forecasting and tabular prediction across 36 datasets.
StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis · cs.AI · 2025-10-11 · conditional · none · ref 37
StepFly automates TSG execution via TSG Mentor, LLM-based DAG extraction with QPPs, and a DAG-guided parallel scheduler, reaching 94% success on GPT-4.1 with 32.9-70.4% time savings on parallelizable guides.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? · cs.AI · 2024-07-01 · accept · none · ref 17
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
BEDTime: A Unified Benchmark for Automatically Describing Time Series · cs.CL · 2025-09-05 · conditional · none · ref 52
BEDTime benchmark tests 17 models on describing time series structure and finds vision-language models outperform dedicated time-series-language models and language-only approaches, with all models fragile to robustness tests.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling · cs.LG · 2024-07-31 · unverdicted · none · ref 64
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive · cs.CL · 2024-02-20 · conditional · none · ref 62
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods, outperforms standard DPO on diverse datasets and MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Llemma: An Open Language Model For Mathematics · cs.CL · 2023-10-16 · unverdicted · none · ref 190
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
Tool-MCoT: Tool Augmented Multimodal Chain-of-Thought for Content Safety Moderation · cs.CL · 2026-03-15 · unverdicted · none · ref 11
A small language model fine-tuned on tool-augmented chain-of-thought data generated by a larger LLM learns to selectively call tools, delivering better content moderation accuracy at lower inference cost.
Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning · cs.LG · 2025-06-09 · unverdicted · none · ref 39
Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.
TableMaster: A Recipe to Advance Table Understanding with Language Models · cs.CL · 2025-01-31 · unverdicted · none · ref 14
TableMaster improves LM table understanding by verbalizing tables with enriched semantics and using adaptive textual-symbolic reasoning, reaching 78.13% accuracy on WikiTQ with GPT-4o-mini.
Reinforcement Learning for LLM Post-Training: A Survey · cs.CL · 2024-07-23 · unverdicted · none · ref 46
A survey deriving a unified policy gradient framework for LLM post-training methods and providing technical comparisons of PPO, GRPO, and DPO variants.
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents · cs.AI · 2023-06-05 · unverdicted · none · ref 7
The paper introduces a collaborative multi-agent framework for LLMs and applies it conceptually to existing models like Auto-GPT, BabyAGI, and Gorilla through case studies in domains such as courtroom simulations and software development.
