
arxiv: 2603.07267 · v2 · pith:IZLFCKOUnew · submitted 2026-03-07 · 💻 cs.CR

How to Steal Reasoning Without Reasoning Traces

Pith reviewed 2026-05-15 14:21 UTC · model grok-4.3

classification 💻 cs.CR
keywords trace inversion · reasoning traces · chain of thought · model distillation · black-box LLMs · LLM security · synthetic data · fine-tuning

The pith

Trace inversion models recover detailed reasoning traces from only the inputs and final answers of black-box LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLMs can hide their full reasoning traces yet still expose enough information for others to reconstruct similar reasoning steps. Trace inversion models are trained to generate synthetic detailed traces using only the query, the target model's answer, and optionally its short summary. When ground-truth traces exist, the synthetic versions overlap substantially with them. Fine-tuning student models on these inverted traces measurably raises their reasoning accuracy and transfers capabilities from closed proprietary models.

Core claim

We introduce trace inversion models that, given only the inputs, answers, and optional reasoning summaries from a target model, generate detailed synthetic reasoning traces. These traces show high overlap with ground-truth reasoning when available, and fine-tuning student models on the inverted traces substantially improves their reasoning performance while enabling distillation from proprietary black-box LLMs.

What carries the argument

Trace inversion models that generate synthetic reasoning traces from the limited outputs exposed by a target LLM.
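
To make the input/output contract concrete, here is a minimal sketch of how one supervised example for a trace inversion model might be assembled. It assumes that ground-truth traces from some model that does expose them are available as training targets; the `Exposed` record, the prompt layout, and the `prompt`/`completion` keys are hypothetical illustrations, not the paper's actual format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Exposed:
    """What a target model exposes (hypothetical field names)."""
    query: str
    answer: str
    summary: Optional[str] = None  # brief reasoning summary, if the target provides one


def build_inversion_prompt(x: Exposed) -> str:
    """Format the exposed fields as the inversion model's input (assumed layout)."""
    parts = [f"Question:\n{x.query}", f"Final answer:\n{x.answer}"]
    if x.summary:
        parts.append(f"Reasoning summary:\n{x.summary}")
    parts.append("Detailed reasoning trace:")
    return "\n\n".join(parts)


def build_training_pair(x: Exposed, trace: str) -> dict:
    """Pair the prompt with a known trace as the supervised target."""
    return {"prompt": build_inversion_prompt(x), "completion": trace}
```

At inference time, the same prompt would be built from a black-box model's outputs, and the inversion model's completion would serve as the synthetic trace.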

If this is right

  • Fine-tuning on inverted traces substantially improves reasoning performance in student models.
  • The method enables distillation of reasoning capabilities from proprietary black-box LLMs without access to their internal traces.
  • Synthetic traces exhibit high overlap with ground-truth reasoning traces when those are available.
  • Hiding full reasoning traces does not fully prevent extraction of a model's reasoning abilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open models could approach the performance of closed ones by inverting outputs from the closed models.
  • Model providers may require additional protections against inversion attacks to safeguard proprietary reasoning processes.
  • The technique could be tested for cross-family transfer, such as inverting traces from one model family to improve models from another.
  • Widespread use might shift incentives toward releasing more reasoning traces or toward building inversion-resistant output formats.

Load-bearing premise

Synthetic traces produced without any ground-truth reasoning steps are still high enough in quality to transfer genuine reasoning improvements during fine-tuning.

What would settle it

A student model fine-tuned on the inverted traces shows no accuracy gain on reasoning benchmarks compared to the same model fine-tuned on the raw inputs and answers without any synthetic traces.
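
The comparison described above can be sketched as two fine-tuning datasets built from the same records, one with and one without the synthetic trace, plus a simple scoring function. The record keys (`query`, `answer`, `inverted_trace`), the completion layout, and the use of exact-match accuracy are assumptions made for illustration; the paper's actual setup may differ.

```python
from typing import Dict, List, Tuple


def make_sft_variants(records: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Build the two training sets for the ablation from the same records."""
    answer_only, with_trace = [], []
    for r in records:
        # Control arm: the student sees only the query and the exposed answer.
        answer_only.append({"prompt": r["query"], "completion": r["answer"]})
        # Treatment arm: the inverted trace precedes the same answer.
        with_trace.append({
            "prompt": r["query"],
            "completion": f"{r['inverted_trace']}\n\nAnswer: {r['answer']}",
        })
    return answer_only, with_trace


def exact_match_accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predictions that equal the gold answer after stripping whitespace."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equal in length")
    hits = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return hits / len(golds)
```

If the student trained on the treatment arm scores no higher on held-out reasoning benchmarks than the one trained on the control arm, the load-bearing premise fails.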

Original abstract

Many large language models (LLMs) use reasoning to generate responses but do not reveal their full reasoning traces (a.k.a. chains of thought), instead outputting only final answers and brief reasoning summaries. To demonstrate that hiding reasoning traces does not prevent users from "stealing" a model's reasoning capabilities, we introduce trace inversion models that, given only the inputs, answers, and (optionally) reasoning summaries exposed by a target model, generate detailed, synthetic reasoning traces. We show that (1) traces synthesized by trace inversion have high overlap with the ground-truth reasoning traces (when available), and (2) fine-tuning student models on inverted traces substantially improves their reasoning and enables distillation from proprietary, black-box LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces trace inversion models that, given only inputs, final answers, and optional brief reasoning summaries from a target LLM, generate detailed synthetic reasoning traces. It claims these traces show high overlap with ground-truth traces (when available) and that fine-tuning student models on the inverted traces substantially boosts their reasoning performance, enabling effective distillation from proprietary black-box LLMs.

Significance. If the empirical claims hold after addressing the noted gaps, the work is significant for LLM security and intellectual property: it shows that concealing full reasoning traces does not prevent extraction of reasoning capabilities, with direct implications for how proprietary models can be protected or distilled. The approach could influence deployment practices for closed-source reasoning models.

major comments (2)
  1. Abstract: the central empirical claims of high overlap with ground-truth traces and substantial student-model gains are stated without any metrics, baselines, data splits, or significance tests, leaving the support for the distillation result only moderately grounded.
  2. Experiments (implied by the abstract's performance claims): no ablation compares fine-tuning on (input, answer) pairs alone versus (input, answer, inverted trace). Without this isolation, gains could arise simply from mimicking the exposed answer distribution rather than from the quality of the synthetic reasoning chains, directly weakening the claim that inverted traces transfer genuine reasoning.
minor comments (2)
  1. Provide explicit details on the trace inversion model architecture, training data construction, and exact overlap metrics (e.g., token-level F1 or step-level accuracy) used for validation.
  2. Clarify the scope of 'proprietary black-box LLMs' and how the method can be evaluated when no ground-truth traces exist for validation.
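
As a point of reference for the first minor comment, token-level F1 between a synthetic trace and a ground-truth trace could be computed along the following lines. This is a common bag-of-tokens formulation with lowercased whitespace tokenization, chosen here as an assumption; the page does not state which overlap metric the paper actually uses.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Bag-of-tokens F1 between two strings (lowercased, whitespace-split)."""
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty counts as zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A metric like this ignores token order, so it would need to be paired with a step-level or semantic measure to show that the reasoning itself, not just the vocabulary, is recovered.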

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment point by point below and will revise the manuscript to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: Abstract: the central empirical claims of high overlap with ground-truth traces and substantial student-model gains are stated without any metrics, baselines, data splits, or significance tests, leaving the support for the distillation result only moderately grounded.

    Authors: We agree that the abstract would be strengthened by including specific quantitative metrics. In the revised manuscript, we will update the abstract to report key results such as the average overlap with ground-truth traces (e.g., token overlap or BLEU scores), the magnitude of student-model gains on reasoning benchmarks, the data splits used, and any statistical significance tests performed. This will provide clearer grounding for the distillation claims.
    revision: yes

  2. Referee: Experiments (implied by the abstract's performance claims): no ablation compares fine-tuning on (input, answer) pairs alone versus (input, answer, inverted trace). Without this isolation, gains could arise simply from mimicking the exposed answer distribution rather than from the quality of the synthetic reasoning chains, directly weakening the claim that inverted traces transfer genuine reasoning.

    Authors: We acknowledge the importance of isolating the contribution of the inverted traces. In the revised version, we will add an ablation study that directly compares fine-tuning student models on (input, answer) pairs alone versus (input, answer, inverted trace). The results will demonstrate that performance gains are attributable to the reasoning content in the traces rather than answer distribution alone, with appropriate baselines and controls included.
    revision: yes

Circularity Check

0 steps flagged

No circularity detected in empirical pipeline

Full rationale

The paper describes an empirical pipeline: training inversion models on observable (input, answer, optional summary) pairs to produce synthetic traces, then fine-tuning student models and measuring downstream accuracy gains. No equations, predictions, or first-principles derivations are present that reduce by construction to the inputs. Claims rest on reported overlap metrics and performance deltas rather than self-definitional loops or fitted parameters renamed as predictions. No load-bearing self-citations or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach assumes standard supervised fine-tuning can transfer reasoning from synthetic data and that overlap with ground-truth traces correlates with downstream utility; no free parameters, axioms, or invented entities are introduced beyond ordinary ML components.

pith-pipeline@v0.9.0 · 5415 in / 973 out tokens · 33247 ms · 2026-05-15T14:21:31.248944+00:00 · methodology
