arxiv: 2605.09270 · v1 · submitted 2026-05-10 · 💻 cs.LG · cs.AI

Memorize Theorems, Not Instances: Probing SFT Generalization through Mathematical Reasoning

Pith reviewed 2026-05-12 04:47 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords supervised fine-tuning · mathematical reasoning · generalization · theorem application · memorization · SFT · MATH benchmark · GeoQA

The pith

Supervised fine-tuning for math reasoning succeeds when models learn to apply theorems explicitly instead of memorizing individual problem-answer pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that SFT harms generalization in mathematical reasoning because it leads models to latch onto surface correlations in specific examples rather than general rules. Theorem-SFT counters this by redirecting training signals to show how theorems are invoked during problem solving. This shift produces measurable gains on MATH and GeoQA across several model families and works even when only MLP layers are updated. The authors conclude that the problem with SFT is not memorization itself but the choice of what gets memorized.

Core claim

Vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to input variations. Theorem-SFT reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. It yields consistent gains, such as +8.8% on MATH and +20.27% on GeoQA, and the MLP-only fine-tuning result implicates feed-forward components as the primary locus of reasoning rules.

What carries the argument

Theorem-SFT, which shifts supervision from answer patterns in individual instances to explicit teaching of theorem invocation rules.

Load-bearing premise

The reported performance gains are caused by the shift to theorem-level supervision rather than by other unspecified differences in data construction, prompting, or training hyperparameters.

What would settle it

A controlled replication that holds data volume, format, prompting, and all hyperparameters fixed while changing only the supervision target from answer pairs to theorem applications and observes no accuracy difference.
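One way to make such a replication auditable is to record both runs as configuration objects and check mechanically that they differ in exactly one field. The sketch below is illustrative only: the class `SFTRunConfig`, its field names, and every value are hypothetical placeholders, not settings taken from the paper.

```python
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class SFTRunConfig:
    # Every field name and value here is a placeholder, not the paper's setup.
    base_model: str
    num_examples: int
    prompt_template: str
    learning_rate: float
    epochs: int
    seed: int
    supervision_target: str  # "answer_pairs" or "theorem_applications"


def differing_fields(a, b):
    """Return the sorted names of fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


vanilla = SFTRunConfig(
    base_model="placeholder-model",
    num_examples=1000,
    prompt_template="placeholder-template",
    learning_rate=1e-5,
    epochs=1,
    seed=0,
    supervision_target="answer_pairs",
)
# Copy the baseline and change only the supervision target.
theorem = replace(vanilla, supervision_target="theorem_applications")

# A controlled contrast varies exactly one field.
assert differing_fields(vanilla, theorem) == ["supervision_target"]
```

If the check lists any field besides `supervision_target`, the comparison is confounded by that field and the accuracy difference cannot be attributed to the supervision target alone.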

Figures

Figures reproduced from arXiv: 2605.09270 by Jing Lei, Mengyu Yang, Ruiying Peng, Xiaohui Li, Xinlei Chen, Xueyu Wu.

Figure 1: Left: Comparison between Vanilla SFT and Theorem-SFT (ours), illustrating the goal of …
Figure 2: Errors in single-step reasoning. We identify two main failure modes: premise oversight and …
Figure 3: Model responses to the applicability of the …
Figure 4: Object referential error on a geometric right-angle identification task. Here, …
Figure 5: Illustration of theorem extraction and theorem-chain construction from problem-solution …
Figure 6: This figure compares standard responses in problem-solution datasets and Theorem-SFT …
Figure 7: Where are theorems stored? Evidence from parameter-efficient tuning. The figure compares …
Original abstract

Supervised Fine-Tuning (SFT) is widely used for task-specific adaptation, yet recent work shows it systematically undermines reasoning generalization. We argue the root cause is not memorization itself, but its target: vanilla SFT drives models to exploit and memorize spurious surface correlations in problem-solution pairs, leaving them brittle to superficial input variations. To address this, we propose Theorem-SFT, which reorients supervision toward explicit theorem application by teaching models how rules are invoked rather than what answers look like. Theorem-SFT yields consistent gains across benchmarks and model families: +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct) without modality-specific re-training. Fine-tuning MLP layers alone matches full-layers performance, implicating feed-forward components as the primary locus of reasoning rules. Our findings reframe the debate: Generalization failures stem not from memorization as a mechanism, but from memorizing the wrong inductive targets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that supervised fine-tuning (SFT) impairs reasoning generalization because models memorize spurious surface correlations in problem-solution pairs rather than general rules. It introduces Theorem-SFT to reorient supervision toward explicit theorem application and invocation, reporting consistent gains of +8.8% on MATH (LLaMA3.2-3B-Instruct) and +20.27% on GeoQA (Qwen2.5-VL-7B-Instruct). It further finds that fine-tuning MLP layers alone matches full-layer performance, implicating feed-forward components as the primary site for encoding reasoning rules, and reframes generalization failures as arising from memorizing incorrect inductive targets.

Significance. If the reported gains can be isolated to the theorem-supervision shift, the work would offer a useful reframing of SFT limitations in mathematical reasoning and a practical intervention that avoids modality-specific retraining. The MLP ablation provides a mechanistic hint about where rules may be stored, which could inform targeted fine-tuning strategies. The empirical scope across two benchmarks and multiple models is a strength, though the absence of detailed controls limits immediate impact.
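To make the MLP-only setting concrete, the sketch below shows one plausible way to decide which parameters stay trainable: filter parameter names by a substring marker. The helper `select_trainable`, the `.mlp.` marker, and the example parameter names are assumptions for illustration; the summary above does not describe how the authors implemented the ablation.

```python
def select_trainable(param_names, marker=".mlp."):
    """Map each parameter name to True (train) or False (freeze)."""
    return {name: marker in name for name in param_names}


# Hypothetical names in the style of common decoder-only checkpoints.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
]

flags = select_trainable(names)
trainable = [name for name, on in flags.items() if on]

# In PyTorch, the same flags could be applied by iterating over
# model.named_parameters() and setting requires_grad on each parameter.
print(trainable)
```

Name-based filtering is simple but depends on the checkpoint's naming convention, so the list of selected names is worth inspecting before training.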

major comments (3)
  1. [Abstract] The central causal claim, that the performance deltas (+8.8% MATH, +20.27% GeoQA) result from reorienting supervision to theorem application rather than spurious correlations, requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, so the attribution is load-bearing and currently unsupported.
  2. [Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.
  3. [Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.
minor comments (1)
  1. [Abstract] The phrase 'consistent gains across benchmarks and model families' would be clearer if the exact model-benchmark pairs and the definition of 'Theorem-SFT' were stated more explicitly.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of experimental rigor needed to support our claims about Theorem-SFT. We respond to each major comment below, indicating revisions where the manuscript will be updated to provide clearer controls and details.

  1. Referee: [Abstract] The central causal claim, that the performance deltas (+8.8% MATH, +20.27% GeoQA) result from reorienting supervision to theorem application rather than spurious correlations, requires that vanilla SFT and Theorem-SFT differ only in the inductive target. No evidence is given that dataset curation, example count, prompt phrasing, or training hyperparameters were held fixed, so the attribution is load-bearing and currently unsupported.

    Authors: We agree that the manuscript must demonstrate the two regimes differed only in the supervision target. Both conditions used identical base models, the same number of examples sampled from the same problem distributions, matching prompt templates (differing solely in target output format), and the same training hyperparameters. We will add an explicit comparison table and description in the Methods section to document these controls. revision: yes

  2. Referee: [Experimental results] The MLP-layer ablation is presented as implicating feed-forward components, but it does not address the primary baseline comparison between the two SFT regimes; without matched controls on the main contrast, the ablation cannot confirm that theorem-level supervision is the operative factor.

    Authors: The MLP ablation was intended as a mechanistic follow-up to locate where rules are encoded after the main comparison. We accept that extending it to the vanilla SFT baseline with matched controls would strengthen the link. In revision we will report MLP-only fine-tuning results under the vanilla SFT regime for direct comparison. revision: yes

  3. Referee: [Methods] No information is supplied on statistical significance, run-to-run variance, or whether data distributions were matched between conditions, which is required to evaluate whether the benchmark improvements reliably support the generalization claims.

    Authors: Data distributions were matched by sampling from identical problem sources with equal example counts; we will clarify this in Methods. However, multiple independent runs were not performed owing to compute limits, so statistical significance and variance were not computed. We will add a limitations statement and, if resources allow, include variance from additional seeds in an appendix. revision: partial

standing simulated objections not resolved
  • Run-to-run variance and statistical significance, which cannot be supplied without new multi-seed experiments beyond the original study.
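For reference, the missing statistics are cheap to compute once multi-seed runs or per-problem outcomes exist. The sketch below shows two standard tools: mean and sample standard deviation of accuracy across seeds, and an exact two-sided McNemar test on paired per-problem correctness. The function names and all input values are hypothetical and synthetic; they are not results from the paper, and the choice of McNemar's test is an assumption about what a suitable paired test would be.

```python
from math import comb
from statistics import mean, stdev


def summarize(accuracies):
    """Mean and sample standard deviation of accuracy across seeds."""
    return mean(accuracies), stdev(accuracies)


def mcnemar_exact(baseline_correct, treated_correct):
    """Two-sided exact McNemar p-value on paired per-problem outcomes."""
    pairs = list(zip(baseline_correct, treated_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, treated wrong
    c = sum(1 for x, y in pairs if y and not x)  # treated right, baseline wrong
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Binomial(n, 0.5) tail probability up to the smaller discordant count.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic placeholder outcomes for six problems (not real data).
baseline = [True, False, False, False, False, True]
treated = [True, True, True, True, True, True]
p_value = mcnemar_exact(baseline, treated)
```

The paired test uses only problems where the two models disagree, which is why it needs per-problem outcomes from the same evaluation set rather than aggregate accuracies.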

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark study with no derivation chain

The paper is an empirical investigation comparing vanilla SFT against Theorem-SFT on mathematical reasoning benchmarks (MATH, GeoQA). It reports performance deltas (+8.8%, +20.27%) and an MLP-layer ablation but contains no equations, first-principles derivations, or predictions that reduce to fitted inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The central claim rests on experimental contrasts rather than any self-referential logical reduction, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on experimental comparisons rather than formal axioms or derivations; no free parameters, axioms, or invented entities are introduced in the abstract.

pith-pipeline@v0.9.0 · 5501 in / 1016 out tokens · 26625 ms · 2026-05-12T04:47:26.502142+00:00 · methodology
