CoT traces align with internal answer commitment in only 61.9% of steps on average, dominated by confabulated continuations after commitment has stabilized.
Large language models are zero-shot reasoners
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
citation-role summary
background 2
citation-polarity summary
years
2026 3roles
background 2polarities
background 2representative citing papers
Systematic testing of prompt engineering for LLM equational reasoning finds a performance ceiling of 60-79% accuracy that extensive engineering cannot exceed, driven by undecidability and model capacity limits.
A systematic survey of 93 studies that maps the bidirectional relationship between metamorphic testing and LLMs, proposing a taxonomy for MT applied to LLMs and LLMs applied to MT.
citing papers explorer
No citing papers match the current filters.