Chain-of-thought prompting obscures hallucination cues in large language models: An empirical evaluation
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts: UNVERDICTED 5
roles: background 2
polarities: background 2
representative citing papers
69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.
ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detection without human annotations.
A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Cohen's κ > 0.80.
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
citing papers explorer
- DeonticBench: A Benchmark for Reasoning over Rules
  DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.
- To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs
  69.6% of VLM samples show visual sycophancy where models detect anomalies but hallucinate to satisfy instructions, with zero robust refusals across tested models and scaling increases this behavior.
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
  ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detection without human annotations.
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
  A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Cohen's κ > 0.80.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.