Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning
Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3
The pith
Supervised fine-tuning for math reasoning succeeds when models learn to apply theorems explicitly instead of memorizing individual problem-answer pairs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to input variations; Theorem-SFT reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like, yielding consistent gains such as +8.8% on MATH and +20.27% on GeoQA while implicating feed-forward components as the primary locus of reasoning rules.
What carries the argument
Theorem-SFT, which shifts supervision from answer patterns in individual instances to explicit teaching of theorem invocation rules.
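To make the contrast concrete, the sketch below builds two supervision targets for the same problem: an answer-style target, and a theorem-style target that spells out when a rule applies, how the problem's quantities bind to it, and what computation follows. The `<Cond,Map,Exec>` triple appears in the paper passage quoted further down this page; the field meanings, the `TheoremStep` class, the example problem, and the text layout are all illustrative assumptions, not the authors' data format.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TheoremStep:
    """Hypothetical container for one theorem application (field meanings assumed)."""
    theorem: str    # name of the rule being invoked
    cond: str       # Cond: why the rule is applicable here
    mapping: str    # Map: how problem quantities bind to the rule's variables
    execution: str  # Exec: the computation the rule licenses


def answer_target(solution: str, answer: str) -> str:
    """Answer-style target: a worked solution followed by the final answer."""
    return f"{solution}\nAnswer: {answer}"


def theorem_target(steps: Sequence[TheoremStep], answer: str) -> str:
    """Theorem-style target: each step names the rule and how it is invoked."""
    lines = []
    for i, s in enumerate(steps, start=1):
        lines.append(f"Step {i} [{s.theorem}]")
        lines.append(f"  Cond: {s.cond}")
        lines.append(f"  Map: {s.mapping}")
        lines.append(f"  Exec: {s.execution}")
    lines.append(f"Answer: {answer}")
    return "\n".join(lines)


# Illustrative problem: a right triangle with legs 3 and 4; find the hypotenuse.
step = TheoremStep(
    theorem="Pythagorean theorem",
    cond="the triangle has a right angle",
    mapping="a = 3, b = 4, c = hypotenuse",
    execution="c = sqrt(3**2 + 4**2) = 5",
)
vanilla = answer_target("The hypotenuse is sqrt(9 + 16) = 5.", "5")
theorem_style = theorem_target([step], "5")
print(theorem_style)
```

Both targets end in the same answer; only the theorem-style one exposes the rule and its invocation as tokens the model is trained to produce.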
Load-bearing premise
The reported performance gains are caused by the shift to theorem-level supervision rather than by other unspecified differences in data construction, prompting, or training hyperparameters.
What would settle it
A controlled replication that holds data volume, format, prompting, and all hyperparameters fixed while changing only the supervision target from answer pairs to theorem applications and observes no accuracy difference.
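One way to make such a replication auditable is to record both runs as structured configurations and check mechanically that they differ only in the supervision target. The sketch below is a minimal illustration; `RunConfig`, its field names, and every value except the model name are hypothetical placeholders, not settings reported by the paper.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical training-run description; field names are illustrative."""
    base_model: str
    num_examples: int
    problem_source: str
    prompt_template: str
    learning_rate: float
    epochs: int
    seed: int
    supervision_target: str  # "answer" or "theorem"


def differing_fields(a: RunConfig, b: RunConfig) -> set:
    """Return the names of fields whose values differ between two runs."""
    da, db = asdict(a), asdict(b)
    return {name for name in da if da[name] != db[name]}


def is_controlled(a: RunConfig, b: RunConfig,
                  allowed=frozenset({"supervision_target"})) -> bool:
    """True when the runs differ in at least one field, and only in allowed ones."""
    diff = differing_fields(a, b)
    return bool(diff) and diff <= allowed


# Placeholder values for illustration only.
vanilla_cfg = RunConfig(
    base_model="LLaMA3.2-3B-Instruct",
    num_examples=10_000,
    problem_source="same-problem-pool",
    prompt_template="shared-template-v1",
    learning_rate=1e-5,
    epochs=3,
    seed=0,
    supervision_target="answer",
)
theorem_cfg = replace(vanilla_cfg, supervision_target="theorem")
leaky_cfg = replace(theorem_cfg, num_examples=20_000)

print(differing_fields(vanilla_cfg, theorem_cfg))
print(is_controlled(vanilla_cfg, leaky_cfg))
```

A check like this does not prove the data distributions match, but it makes any unintended difference in the recorded settings visible before results are compared.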
Original abstract
Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that supervised fine-tuning (SFT) impairs reasoning generalization because models memorize spurious surface correlations in problem-solution pairs rather than general rules. It introduces Theorem-SFT to reorient supervision toward explicit theorem application and invocation, reporting consistent gains of +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct). It further finds that fine-tuning MLP layers alone matches full-layer performance, implicating feed-forward components as the primary site for encoding reasoning rules, and reframes generalization failures as arising from memorizing incorrect inductive targets.
Significance. If the reported gains can be isolated to the theorem-supervision shift, the work would offer a useful reframing of SFT limitations in mathematical reasoning and a practical intervention that avoids modality-specific retraining. The MLP ablation provides a mechanistic hint about where rules may be stored, which could inform targeted fine-tuning strategies. The empirical scope across two benchmarks and multiple models is a strength, though the absence of detailed controls limits immediate impact.
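For readers unfamiliar with what an MLP-only ablation involves in practice, the sketch below partitions parameter names into a trainable feed-forward group and a frozen remainder. The rule that feed-forward weights sit under a `.mlp.` module matches the naming used by LLaMA-style checkpoints, but how the paper actually selected its MLP layers is not documented on this page, so the predicate and the name list are assumptions.

```python
def is_mlp_param(name: str) -> bool:
    """Assumed naming rule: feed-forward weights live under a '.mlp.' module."""
    return ".mlp." in name


def split_trainable(param_names):
    """Partition parameter names into (trainable, frozen) for MLP-only tuning."""
    trainable = [n for n in param_names if is_mlp_param(n)]
    frozen = [n for n in param_names if not is_mlp_param(n)]
    return trainable, frozen


# Example names in the style of LLaMA-type checkpoints (illustrative subset).
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
    "lm_head.weight",
]
trainable, frozen = split_trainable(names)
print(trainable)
```

With a PyTorch model, the same predicate could be applied while iterating over `model.named_parameters()`, setting `requires_grad = False` on every parameter that falls in the frozen group.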
Major comments (3)
- [Abstract] The central causal claim (that the performance deltas of +8.8% on MATH and +20.27% on GeoQA result from reorienting supervision to theorem application rather than spurious correlations) requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, making the attribution load-bearing and currently unsupported.
- [Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.
- [Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.
Minor comments (1)
- [Abstract] The phrase 'consistent gains across benchmarks and model families' would be clearer if the exact model-benchmark pairs and the definition of 'Theorem-SFT' were stated more explicitly.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor needed to support our claims about Theorem-SFT. We respond to each major comment below, indicating revisions where the manuscript will be updated to provide clearer controls and details.
Point-by-point responses
- Referee: [Abstract] The central causal claim (that the performance deltas of +8.8% on MATH and +20.27% on GeoQA result from reorienting supervision to theorem application rather than spurious correlations) requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, making the attribution load-bearing and currently unsupported.
  Authors: We agree that the manuscript must demonstrate the two regimes differed only in the supervision target. Both conditions used identical base models, the same number of examples sampled from the same problem distributions, matching prompt templates (differing solely in target output format), and the same training hyperparameters. We will add an explicit comparison table and description in the Methods section to document these controls.
  Revision: yes
- Referee: [Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.
  Authors: The MLP ablation was intended as a mechanistic follow-up to locate where rules are encoded after the main comparison. We accept that extending it to the vanilla SFT baseline with matched controls would strengthen the link. In revision we will report MLP-only fine-tuning results under the vanilla SFT regime for direct comparison.
  Revision: yes
- Referee: [Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.
  Authors: Data distributions were matched by sampling from identical problem sources with equal example counts; we will clarify this in Methods. However, multiple independent runs were not performed owing to compute limits, so statistical significance and variance were not computed. We will add a limitations statement and, if resources allow, include variance from additional seeds in an appendix.
  Revision: partial
- Run-to-run variance and statistical significance, which cannot be supplied without new multi-seed experiments beyond the original study.
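Where extra seeds are out of reach, a paired bootstrap over per-item outcomes from a single run can at least bound test-set sampling noise on the accuracy gap. The sketch below is one such estimate under stated assumptions: the function name, the resample count, and the synthetic 0/1 outcomes are illustrative and not drawn from the paper. It does not substitute for run-to-run variance, which still requires multiple training seeds.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(B) - accuracy(A) on shared test items.

    correct_a and correct_b are 0/1 lists aligned by test item.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long outcome lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-item differences keep the pairing between the two systems.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, lo, hi


# Synthetic outcomes for illustration: 200 items, system A solves the first 100,
# system B solves the first 120.
outcomes_a = [1] * 100 + [0] * 100
outcomes_b = [1] * 120 + [0] * 80
observed, lo, hi = paired_bootstrap_ci(outcomes_a, outcomes_b)
print(observed, lo, hi)
```

An interval that excludes zero suggests the gap is unlikely to be an artifact of which test items were drawn, though it says nothing about sensitivity to initialization or data order.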
Circularity Check
No circularity: purely empirical benchmark study with no derivation chain
Full rationale
The paper is an empirical investigation comparing vanilla SFT against Theorem-SFT on mathematical reasoning benchmarks (MATH, GeoQA). It reports performance deltas (+8.8%, +20.27%) and an MLP-layer ablation but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claim rests on experimental contrasts rather than any self-referential logical reduction, making the work self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  We propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like... decompose theorems into <Cond,Map,Exec>
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.