Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Jing Lei; Mengyu Yang; Ruiying Peng; Xiaohui Li; Xinlei Chen; Xueyu Wu

arXiv: 2605.09270 · v2 · pith:GIDL7EDInew · submitted 2026-05-10 · cs.LG · cs.AI

Ruiying Peng, Mengyu Yang, Jing Lei, Xiaohui Li, Xueyu Wu, Xinlei Chen

Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

keywords supervised fine-tuning · mathematical reasoning · generalization · theorem application · memorization · SFT · MATH benchmark · GeoQA

The pith

Supervised fine-tuning for math reasoning succeeds when models learn to apply theorems explicitly instead of memorizing individual problem-answer pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that SFT harms generalization in mathematical reasoning because it leads models to latch onto surface correlations in specific examples rather than general rules. Theorem-SFT counters this by redirecting training signals to show how theorems are invoked during problem solving. This shift produces measurable gains on MATH and GeoQA across several model families and works even when only MLP layers are updated. The authors conclude that the problem with SFT is not memorization itself but the choice of what gets memorized.

Core claim

Vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to input variations; Theorem-SFT reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like, yielding consistent gains such as +8.8% on MATH and +20.27% on GeoQA while implicating feed-forward components as the primary locus of reasoning rules.

What carries the argument

Theorem-SFT, which shifts supervision from answer patterns in individual instances to explicit teaching of theorem invocation rules.
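
To make the contrast concrete, here is a minimal sketch of how the two kinds of training target might differ. The only detail taken from the paper is the `<Cond,Map,Exec>` decomposition quoted further down this page; the field meanings, the class name `TheoremStep`, the functions `vanilla_target` and `theorem_target`, and the text layout are all assumptions made for illustration, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TheoremStep:
    """One theorem application (hypothetical reading of <Cond,Map,Exec>)."""
    theorem: str    # name of the rule being invoked
    cond: str       # Cond: premises that must hold for the rule to apply
    mapping: str    # Map: how objects in the problem bind to the rule's symbols
    execution: str  # Exec: the computation performed once the binding is fixed


def vanilla_target(solution: str, answer: str) -> str:
    """Instance-level target: a worked solution followed by the answer."""
    return f"{solution}\nAnswer: {answer}"


def theorem_target(steps: List[TheoremStep], answer: str) -> str:
    """Theorem-level target: each step states why and how a rule is invoked."""
    lines = []
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: {step.theorem}")
        lines.append(f"  Cond: {step.cond}")
        lines.append(f"  Map: {step.mapping}")
        lines.append(f"  Exec: {step.execution}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


# Toy example, not drawn from MATH or GeoQA.
step = TheoremStep(
    theorem="Pythagorean theorem",
    cond="Triangle ABC has a right angle at C.",
    mapping="Legs a = BC = 3 and b = AC = 4; hypotenuse c = AB.",
    execution="c = sqrt(3**2 + 4**2) = 5",
)
print(theorem_target([step], answer="5"))
```

In this sketch the theorem-level target forces the premise check and the object binding to appear in the supervised text, which is one plausible way to read "how rules are invoked rather than what answers look like".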

Load-bearing premise

The reported performance gains are caused by the shift to theorem-level supervision rather than by other unspecified differences in data construction, prompting, or training hyperparameters.

What would settle it

A controlled replication that holds data volume, format, prompting, and all hyperparameters fixed while changing only the supervision target from answer pairs to theorem applications and observes no accuracy difference.
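
A replication of this kind can guard against accidental confounds by checking, before training, that the two run configurations differ in exactly one field. The sketch below assumes each run is described by a flat dictionary; the key names and all values are placeholders chosen for illustration, not settings reported in the paper.

```python
# Sentinel so that a key missing from one config counts as a difference.
_MISSING = object()


def differing_keys(config_a: dict, config_b: dict) -> set:
    """Return every key whose value differs between the two run configs."""
    all_keys = set(config_a) | set(config_b)
    return {
        key for key in all_keys
        if config_a.get(key, _MISSING) != config_b.get(key, _MISSING)
    }


def assert_single_factor(config_a: dict, config_b: dict, allowed: set) -> None:
    """Raise if the runs differ in anything other than the allowed keys."""
    extra = differing_keys(config_a, config_b) - set(allowed)
    if extra:
        raise ValueError(f"Uncontrolled differences between runs: {sorted(extra)}")


# Placeholder values only; none of these come from the paper.
vanilla = {
    "base_model": "<base-model>",
    "num_examples": 1000,
    "prompt_template": "<template>",
    "learning_rate": 1e-5,
    "epochs": 1,
    "seed": 0,
    "supervision_target": "solution_and_answer",
}
theorem = dict(vanilla, supervision_target="theorem_application")

assert_single_factor(vanilla, theorem, allowed={"supervision_target"})
```

If any other field drifts between the two conditions, the check raises before compute is spent on a comparison that could not isolate the supervision target.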

Figures

Figures reproduced from arXiv: 2605.09270 by Jing Lei, Mengyu Yang, Ruiying Peng, Xiaohui Li, Xinlei Chen, Xueyu Wu.

**Figure 2.** Errors in single-step reasoning. We identify two main failure modes: premise oversight and …

**Figure 3.** Model responses to the applicability of the …

**Figure 4.** Object referential error on a geometric right-angle identification task. Here, …

**Figure 5.** Illustration of theorem extraction and theorem-chain construction from problem-solution …

**Figure 6.** This figure compares standard responses in problem-solution datasets and Theorem-SFT …

**Figure 7.** Where are theorems stored? Evidence from parameter-efficient tuning. The figure compares …

Original abstract

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that supervised fine-tuning (SFT) impairs reasoning generalization because models memorize spurious surface correlations in problem-solution pairs rather than general rules. It introduces Theorem-SFT to reorient supervision toward explicit theorem application and invocation, reporting consistent gains of +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct). It further finds that fine-tuning MLP layers alone matches full-layer performance, implicating feed-forward components as the primary site for encoding reasoning rules, and reframes generalization failures as arising from memorizing incorrect inductive targets.

Significance. If the reported gains can be isolated to the theorem-supervision shift, the work would offer a useful reframing of SFT limitations in mathematical reasoning and a practical intervention that avoids modality-specific retraining. The MLP ablation provides a mechanistic hint about where rules may be stored, which could inform targeted fine-tuning strategies. The empirical scope across two benchmarks and multiple models is a strength, though the absence of detailed controls limits immediate impact.
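
For readers unfamiliar with the MLP-only setting, the following sketch shows one way such a restriction could be expressed: pick trainable parameters by name. It assumes the Hugging Face style naming used by LLaMA-like checkpoints, where feed-forward weights contain `.mlp.` in their names; the paper's actual selection rule is not given on this page, and the helpers `is_mlp_param` and `trainable_mask` are hypothetical.

```python
from typing import Callable, Dict, Iterable


def is_mlp_param(name: str) -> bool:
    """Assumed convention: feed-forward weights live under a '.mlp.' module."""
    return ".mlp." in name


def trainable_mask(
    param_names: Iterable[str],
    predicate: Callable[[str], bool] = is_mlp_param,
) -> Dict[str, bool]:
    """Map each parameter name to True if it should receive gradient updates."""
    return {name: predicate(name) for name in param_names}


# Illustrative names following the assumed convention.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "lm_head.weight",
]
mask = trainable_mask(names)
for name, trainable in mask.items():
    print(f"{'train ' if trainable else 'freeze'}  {name}")

# With a PyTorch model the same predicate would be applied as:
#   for name, param in model.named_parameters():
#       param.requires_grad = is_mlp_param(name)
```

Keeping the rule as a plain predicate makes it easy to run the complementary ablation (attention-only, for instance) by swapping in a different function.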

major comments (3)

[Abstract] The central causal claim, that the performance deltas (+8.8% MATH, +20.27% GeoQA) result from reorienting supervision to theorem application rather than spurious correlations, requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, making the attribution load-bearing and currently unsupported.
[Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.
[Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.
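
One check that speaks to the significance concern without requiring new training runs is a paired bootstrap over per-problem correctness on the shared test set. The sketch below is a generic percentile bootstrap, offered as an assumption about what such an analysis could look like; the function name `paired_bootstrap_ci` and its defaults are illustrative, and it is not something the paper reports.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    base_correct: Sequence[bool],
    new_correct: Sequence[bool],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed accuracy gain, CI lower bound, CI upper bound).

    Both sequences must be aligned: index i is the same test problem.
    """
    if len(base_correct) != len(new_correct) or not base_correct:
        raise ValueError("Need two non-empty, equally long result lists.")
    n = len(base_correct)
    rng = random.Random(seed)
    # Per-problem difference: +1 gained, 0 unchanged, -1 lost.
    diffs = [int(new) - int(old) for old, new in zip(base_correct, new_correct)]
    deltas = []
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping the pairing intact.
        resampled = [diffs[rng.randrange(n)] for _ in range(n)]
        deltas.append(sum(resampled) / n)
    deltas.sort()
    lower = deltas[int((alpha / 2) * n_resamples)]
    upper = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

A confidence interval that excludes zero would indicate the gain is unlikely to be an artifact of which test problems happened to be sampled, although it says nothing about variance across training seeds.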

minor comments (1)

[Abstract] The phrase 'consistent gains across benchmarks and model families' would be clearer if the exact model-benchmark pairs and the definition of 'Theorem-SFT' were stated more explicitly.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor needed to support our claims about Theorem-SFT. We respond to each major comment below, indicating revisions where the manuscript will be updated to provide clearer controls and details.

Point-by-point responses

Referee: [Abstract] The central causal claim, that the performance deltas (+8.8% MATH, +20.27% GeoQA) result from reorienting supervision to theorem application rather than spurious correlations, requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, making the attribution load-bearing and currently unsupported.

Authors: We agree that the manuscript must demonstrate the two regimes differed only in the supervision target. Both conditions used identical base models, the same number of examples sampled from the same problem distributions, matching prompt templates (differing solely in target output format), and the same training hyperparameters. We will add an explicit comparison table and description in the Methods section to document these controls.
revision: yes

Referee: [Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.

Authors: The MLP ablation was intended as a mechanistic follow-up to locate where rules are encoded after the main comparison. We accept that extending it to the vanilla SFT baseline with matched controls would strengthen the link. In revision we will report MLP-only fine-tuning results under the vanilla SFT regime for direct comparison.
revision: yes

Referee: [Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.

Authors: Data distributions were matched by sampling from identical problem sources with equal example counts; we will clarify this in Methods. However, multiple independent runs were not performed owing to compute limits, so statistical significance and variance were not computed. We will add a limitations statement and, if resources allow, include variance from additional seeds in an appendix.
revision: partial

standing simulated objections not resolved

Run-to-run variance and statistical significance, which cannot be supplied without new multi-seed experiments beyond the original study.
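
Should additional seeds become available, the missing variance figures could be reported with a few lines of standard-library code. This is a minimal sketch under the assumption that each seed yields one accuracy value per condition; the helper names `summarize_seeds` and `format_report` are hypothetical, and no numbers are supplied here because the study ran a single seed.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_seeds(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("At least two seeds are needed to estimate variance.")
    return mean(accuracies), stdev(accuracies)


def format_report(name: str, accuracies: Sequence[float]) -> str:
    """Render one condition as 'name: mean ± std (n=k seeds)'."""
    avg, spread = summarize_seeds(accuracies)
    return f"{name}: {avg:.2f} ± {spread:.2f} (n={len(accuracies)} seeds)"
```

Reporting both conditions in this form would let readers judge whether the headline deltas are large relative to seed-to-seed noise.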

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study with no derivation chain

Full rationale

The paper is an empirical investigation comparing vanilla SFT against Theorem-SFT on mathematical reasoning benchmarks (MATH, GeoQA). It reports performance deltas (+8.8%, +20.27%) and an MLP-layer ablation but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claim rests on experimental contrasts rather than any self-referential logical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on experimental comparisons rather than formal axioms or derivations; no free parameters, axioms, or invented entities are introduced in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear

Paper passage: We propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like... decompose theorems into <Cond,Map,Exec>

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear

Paper passage: Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.