Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks

Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks · 2024 · arXiv 2307.02477

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

cs.LG · 2026-06-06 · unverdicted · novelty 7.0

DICE formalizes multi-agent LLM coordination as discounted incomplete-information Markov games and introduces Heterogeneous Quantal Response Equilibrium (HQRE) to achieve unique stable equilibria with bounded regret, demonstrated via prompt-control and fine-tuning algorithms on eleven benchmarks.

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

cs.SE · 2025-10-16 · unverdicted · novelty 7.0

LLMs achieve 81% coherent execution simulation on HumanEval but show mostly random or weak consistency across tests, with frontier models relying on natural language shortcuts instead of true program analysis.

CodeMind: Evaluating Large Language Models for Code Reasoning

cs.SE · 2024-02-15 · unverdicted · novelty 7.0

CodeMind evaluates ten LLMs on four benchmarks using three new code reasoning tasks, finding performance varies by model size and drops with complexity while showing no correlation with bug repair ability.

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts

cs.CL · 2026-05-08 · conditional · novelty 6.0

Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models

cs.AI · 2026-06-29 · unverdicted · novelty 5.0

CRISTAL is a neurosymbolic framework that synthesizes interpretable probabilistic world models from language priors for full Bayesian analysis and budget-aware data acquisition, claiming Bayes-optimal accuracy on synthetic equity classification with 5 examples.

Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures

cs.SE · 2026-04-15 · unverdicted · novelty 4.0

Proposes autopoietic architectures for self-constructing software as a fundamental shift in the SDLC, leveraging foundation models for autonomous evolution and maintenance.

citing papers explorer

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination · cs.LG · 2026-06-06 · unverdicted · none · ref 70
Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models · cs.SE · 2025-10-16 · unverdicted · none · ref 46
CodeMind: Evaluating Large Language Models for Code Reasoning · cs.SE · 2024-02-15 · unverdicted · none · ref 6
Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds · cs.LG · 2026-06-30 · unverdicted · none · ref 44
Consistency Training while Mitigating Obfuscation via Rate Matching · cs.CL · 2026-06-01 · unverdicted · none · ref 71
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code · cs.SE · 2024-03-12 · unverdicted · none · ref 83
The CRISTAL Method: Neurosymbolic analysis from AI-synthesized world models · cs.AI · 2026-06-29 · unverdicted · none · ref 3
Towards Enabling An Artificial Self-Construction Software Life-cycle via Autopoietic Architectures · cs.SE · 2026-04-15 · unverdicted · none · ref 67
