GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations
Pith reviewed 2026-05-11 00:50 UTC · model grok-4.3
The pith
GSM-SEM creates fresh math problem variants by changing facts and entities while preserving the original answers and calculations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GSM-SEM is a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. It perturbs problem statements by modifying entities, attributes, and relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations, answer, and approximate problem difficulty. When applied to GSM8K and existing variation suites, the resulting GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets reveal consistent performance drops in 14 SOTA LLMs, with an average drop rate of 28 percent in the maximum-
What carries the argument
The GSM-SEM framework, a stochastic generator that applies constrained modifications to entities, attributes, and relationships in problem statements to produce new variants on each run.
If this is right
- Models achieving high scores on static GSM8K versions may exhibit lower accuracy when facts change but the underlying math remains identical.
- Fresh variants generated on each run reduce the long-term value of memorizing any single public test set.
- Combining semantic perturbations with symbolic or plus-style changes produces larger performance declines than either type alone.
- The same generation process can be reused on other reasoning benchmarks without requiring new human annotation for each release.
Where Pith is reading between the lines
- Integrating stochastic variant generation into model training loops could encourage learning of reasoning patterns that generalize across changed contexts rather than surface forms.
- If the preserved difficulty claim holds, the observed drops point to limits in how current models handle recomputation under altered problem conditions.
- Extending the approach to domains with different reasoning structures, such as code or planning tasks, would test whether similar memorization vulnerabilities exist outside math word problems.
Load-bearing premise
Modifications to entities, attributes, and relationships can alter underlying facts and require recomputation while still preserving the original calculations, answer, and approximate problem difficulty.
What would settle it
A set of generated variants where human validators confirm that the required calculations or final answer have changed, or where the 14 evaluated LLMs show no measurable accuracy drop relative to the original problems.
Figures
read the original abstract
Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GSM-SEM, a reusable stochastic framework for generating semantically variant augmentations of math reasoning benchmarks (GSM8K, GSM-Symbolic, GSM-Plus) by perturbing entities/attributes/relationships to increase semantic variance while constraining generation to preserve original calculations, answers, and approximate difficulty. It produces three new human-validated datasets, evaluates 14 SOTA LLMs showing consistent performance drops (larger when semantic perturbations are combined with symbolic/plus variations, averaging 28% in the maximum-strictness configuration), and demonstrates extension to other benchmarks such as BigBenchHard, LogicBench, and NLR-BIRD.
Significance. If the preservation constraints hold, GSM-SEM provides a practical, on-demand method for creating dynamic benchmarks that reduce memorization bias and better isolate true reasoning generalization; the reported drops when semantic changes are layered on symbolic variations would constitute useful evidence of current model limitations. The public release of validated datasets and the framework's applicability beyond GSM-style problems are concrete strengths.
major comments (2)
- [§3] §3 (GSM-SEM framework description): the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
- [Results] Results (evaluation on 14 LLMs and the 28% figure): without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
minor comments (2)
- [Abstract] Abstract: the human-validation claim would be strengthened by a brief statement of validation criteria or inter-annotator statistics.
- [Throughout] Throughout: notation for the three released variants (GSM8K-SEM, etc.) should be introduced once and used consistently; a small number of figure captions could be expanded to clarify what 'maximum strictness configuration' entails.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to strengthen the presentation of the GSM-SEM framework and results. We address each major comment below and will incorporate the requested details into the revised manuscript.
read point-by-point responses
-
Referee: [§3] §3 (GSM-SEM framework description): the central claim that perturbations 'frequently alter underlying facts and require models to recompute solutions' while 'constraining generation to preserve the original calculations/answer and approximate problem difficulty' is load-bearing for interpreting the 28% drop as semantic-robustness evidence rather than difficulty inflation or answer mismatch; the manuscript provides no concrete description of the enforcement mechanism (template rules, symbolic equivalence checks, post-generation filtering, or verification steps).
Authors: We agree that Section 3 would benefit from a more explicit account of the enforcement mechanisms. In the revision we will expand the framework description to detail the template rules governing entity/attribute/relationship perturbations, the symbolic equivalence checks that verify answer preservation, the post-generation filtering criteria, and the verification steps used to maintain approximate problem difficulty. These additions will directly support the interpretation of performance drops as arising from increased semantic variance. revision: yes
-
Referee: [Results] Results (evaluation on 14 LLMs and the 28% figure): without explicit reporting of how answer equivalence and difficulty preservation were measured or validated on the generated sets (beyond the high-level human validation statement), the cross-configuration drops cannot be unambiguously attributed to semantic variance.
Authors: We acknowledge that the Results section should report the validation procedures more explicitly. We will add a dedicated subsection describing the automated answer-equivalence checks (exact numerical match after recomputation), the human validation protocol (including inter-annotator agreement on answer correctness and difficulty), and how these steps were applied across the GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM datasets. This will allow readers to attribute the observed drops unambiguously to semantic variance. revision: yes
Circularity Check
No significant circularity in GSM-SEM framework or empirical evaluation
full rationale
The paper introduces a stochastic generation framework that perturbs entities/attributes/relationships in existing benchmarks while enforcing preservation of answers and difficulty, applies it to produce new variant sets (GSM8K-SEM etc.), and reports empirical LLM performance drops on those sets. This chain relies on external model evaluations and human validation rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation. The central claims are observational results from applying the defined process to independent test items, making the derivation self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Semantic perturbations can be applied to alter underlying facts while preserving original calculations, answers, and approximate difficulty
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants...
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.