
arxiv: 2510.09351 · v2 · submitted 2025-10-10 · 💻 cs.CL

ReTraceQA: Evaluating Reasoning Traces of Small Language Models in Commonsense Question Answering

Pith reviewed 2026-05-18 08:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords small language models · commonsense reasoning · reasoning evaluation · question answering · process supervision · benchmark

The pith

Small language models often reach correct commonsense answers through flawed reasoning steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ReTraceQA, a benchmark that adds expert-annotated checks on the full reasoning trace rather than only the final answer in commonsense question answering. It reports that small language models produce correct answers despite invalid reasoning in 14-24 percent of cases across tested models and datasets, so standard accuracy metrics overestimate model capabilities. When stronger large language models serve as judges of reasoning quality, measured performance falls by as much as 25 percent relative to answer-only scoring. The central finding is that answer-only evaluation hides weaknesses in how these models actually reason.

Core claim

Expert-annotated traces in ReTraceQA show that small language models deliver correct final answers on commonsense questions yet rely on flawed reasoning processes in 14-24 percent of instances, so answer-only accuracy metrics systematically overestimate their true reasoning ability.
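The gap between the two kinds of metric can be made concrete with a minimal sketch. The `Instance` record, the function names, and the toy data below are all hypothetical and are not taken from the paper or its dataset. The source also does not say whether the 14-24 percent figure is measured over all instances or only over correctly answered ones, so the sketch reports both.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical record: one model response to one question.
    answer_correct: bool  # final answer matches the ground truth
    trace_valid: bool     # an annotator judged the reasoning trace free of flaws


def answer_only_accuracy(instances):
    """Share of instances whose final answer is correct."""
    return sum(i.answer_correct for i in instances) / len(instances)


def reasoning_aware_accuracy(instances):
    """Share of instances with a correct answer AND a valid trace."""
    return sum(i.answer_correct and i.trace_valid for i in instances) / len(instances)


def flawed_correct_share(instances):
    """Share of ALL instances that are correct but reached by a flawed trace."""
    return sum(i.answer_correct and not i.trace_valid for i in instances) / len(instances)


def flawed_among_correct(instances):
    """Share of CORRECT answers that were reached by a flawed trace."""
    correct = [i for i in instances if i.answer_correct]
    if not correct:
        raise ValueError("no correctly answered instances")
    return sum(not i.trace_valid for i in correct) / len(correct)


# Toy data (invented): 6 correct with valid traces, 2 correct with flawed
# traces, 2 incorrect answers.
instances = (
    [Instance(True, True)] * 6
    + [Instance(True, False)] * 2
    + [Instance(False, False)] * 2
)

print(f"answer-only accuracy:     {answer_only_accuracy(instances):.2f}")
print(f"reasoning-aware accuracy: {reasoning_aware_accuracy(instances):.2f}")
print(f"flawed share (all):       {flawed_correct_share(instances):.2f}")
print(f"flawed share (correct):   {flawed_among_correct(instances):.2f}")
```

In this sketch the reasoning-aware score can never exceed the answer-only score, which is the direction of the overestimation the core claim describes.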

What carries the argument

ReTraceQA benchmark, which supplies expert-annotated reasoning traces paired with commonsense question-answer pairs for process-level evaluation.

Load-bearing premise

Expert annotations of reasoning traces are consistent and accurate enough to establish the reported rate of flawed reasoning.

What would settle it

A replication study in which a second independent group of experts re-annotates the same reasoning traces and obtains a markedly different flawed-reasoning rate would falsify the central percentage claim.

Original abstract

While Small Language Models (SLMs) have demonstrated promising performance on an increasingly wide array of commonsense reasoning benchmarks, current evaluation practices rely almost exclusively on the accuracy of their final answers, neglecting the validity of the reasoning processes that lead to those answers. To address this issue, we present ReTraceQA, a novel benchmark that introduces process-level evaluation for commonsense reasoning tasks. Our expert-annotated dataset reveals that in a substantial portion of instances (14-24%), SLMs provide correct final answers despite flawed reasoning processes, suggesting that the capabilities of SLMs are often overestimated by evaluation metrics that focus only on comparing the final answer with the ground truth. Indeed, we show that, when employing strong Large Language Models (LLMs) as automated judges for reasoning-aware evaluation rather than answer-only metrics, SLM performance drops significantly across all models and datasets, with scores decreasing by up to 25%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ReTraceQA, a benchmark for process-level evaluation of reasoning traces produced by small language models (SLMs) on commonsense question answering. Expert annotations indicate that in 14-24% of instances, SLMs arrive at correct final answers through flawed reasoning. The authors further show that employing LLMs as automated judges for reasoning validity results in performance drops of up to 25% relative to standard answer-only metrics.

Significance. This work highlights a potential overestimation of SLM capabilities when evaluations focus solely on final answers rather than the validity of reasoning processes. The creation of an expert-annotated dataset for reasoning traces is a valuable contribution that could promote more rigorous assessment practices in commonsense reasoning research. Strengths include the use of human annotations and the demonstration of discrepancies with LLM judges.

major comments (2)
  1. [Dataset Construction and Annotation] The headline result of 14-24% correct but flawed reasoning cases is derived from expert labeling of reasoning traces. However, the manuscript provides no inter-annotator agreement metrics (e.g., Cohen's kappa) or explicit annotation guidelines and adjudication procedures. This is a load-bearing issue for the central claim, as the percentage is sensitive to subjective definitions of 'flawed' reasoning.
  2. [LLM Judge Evaluation] Details on the prompts provided to the LLM judges and any calibration against expert labels are missing. Since the reported up to 25% performance drop depends on these judges inheriting the expert labeling standard, lack of transparency here affects the reliability of the comparison to answer-only metrics.
minor comments (2)
  1. [Abstract] Specify the exact models and datasets achieving the maximum 25% drop for clarity.
  2. [Introduction] Ensure all acronyms (SLM, LLM) are defined on first use.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and will incorporate the requested clarifications and additions in the revised version to improve transparency and strengthen the presentation of our results.

point-by-point responses
  1. Referee: [Dataset Construction and Annotation] The headline result of 14-24% correct but flawed reasoning cases is derived from expert labeling of reasoning traces. However, the manuscript provides no inter-annotator agreement metrics (e.g., Cohen's kappa) or explicit annotation guidelines and adjudication procedures. This is a load-bearing issue for the central claim, as the percentage is sensitive to subjective definitions of 'flawed' reasoning.

    Authors: We agree that reporting inter-annotator agreement and explicit annotation guidelines is essential for establishing the reliability of the expert annotations underlying our headline result. The original submission omitted these details to keep the main text concise. In the revised manuscript we will add a new subsection (and corresponding appendix material) that describes the full annotation guidelines provided to experts, the adjudication procedure used to resolve disagreements, and quantitative inter-annotator agreement statistics including Cohen's kappa. These additions will directly support the validity of the reported 14-24% range. revision: yes

  2. Referee: [LLM Judge Evaluation] Details on the prompts provided to the LLM judges and any calibration against expert labels are missing. Since the reported up to 25% performance drop depends on these judges inheriting the expert labeling standard, lack of transparency here affects the reliability of the comparison to answer-only metrics.

    Authors: We concur that full transparency regarding the LLM judge prompts and their calibration against expert labels is necessary to justify the reported performance drops. The revised version will include the exact prompts used for the LLM judges in an appendix. We will also add calibration results that quantify agreement between the LLM judgments and the expert annotations (e.g., agreement rates and Cohen's kappa), thereby demonstrating how closely the automated judges follow the human standard. revision: yes
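Both responses name Cohen's kappa as the agreement statistic. As a minimal sketch of what that computation involves, assuming two raters (two experts, or one expert and one LLM judge) who each assign a single categorical label per trace: the function name, the label strings, and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    rate and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one and the same label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy labels (invented): the raters agree on 8 of 10 traces.
rater_1 = ["valid"] * 6 + ["flawed"] * 4
rater_2 = ["valid"] * 5 + ["flawed"] + ["flawed"] * 3 + ["valid"]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.3f}")
```

Raw agreement here is 0.80, but kappa is lower because some of that agreement would be expected by chance given how often each rater uses each label. This is why the referee asks for kappa rather than a plain agreement rate.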

Circularity Check

0 steps flagged

No significant circularity; results from new annotations and external judges

full rationale

The paper's core empirical claims (14-24% correct-answer/flawed-reasoning cases and up to 25% performance drop under LLM judges) are produced by a newly created expert-annotated dataset and evaluations using external LLMs as judges. These elements are independent of any prior fitted parameters, self-definitional constructs, or load-bearing self-citations from the authors' previous work. The derivation chain does not reduce any reported quantity to its own inputs by construction; the benchmark and annotation process supply fresh data against which the final-answer-only metrics are compared. This is the most common honest finding for papers that introduce new human-annotated resources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central percentages rest on the assumption that human experts can reliably label reasoning traces as flawed or valid; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Expert annotations reliably distinguish valid from flawed reasoning traces in commonsense QA.
    The 14-24% figure and the performance drop depend directly on this judgment quality.

pith-pipeline@v0.9.0 · 5699 in / 1152 out tokens · 42556 ms · 2026-05-18T08:05:01.028711+00:00 · methodology

