Evaluating step-by-step reasoning traces: A survey

Jinu Lee, Julia Hockenmaier · 2025 · arXiv 2502.12289

4 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spatial reasoning in LLMs.

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

cs.CL · 2026-04-09 · unverdicted · novelty 5.0

Larger differences in generator capability between chosen and rejected reasoning traces improve out-of-domain performance, while filtering pairs by sample-level quality deltas enables more data-efficient training.

Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

cs.CR · 2026-04-06 · unverdicted · novelty 5.0

A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Cohen's κ > 0.80.

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

cs.AI · 2025-10-23

citing papers explorer

Showing 4 of 4 citing papers.

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04 · unverdicted · none · ref 10

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

cs.CL · 2026-04-09 · unverdicted · none · ref 10

Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework

cs.CR · 2026-04-06 · unverdicted · none · ref 35

The Shape of Reasoning: Topological Analysis of Reasoning Traces in Large Language Models

cs.AI · 2025-10-23 · unreviewed · ref 7
