
arxiv: 2510.12451 · v2 · submitted 2025-10-14 · 💻 cs.LG · cs.AI · cs.CV

A Function-Centric Perspective on Flat and Sharp Minima

Pith reviewed 2026-05-18 07:31 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV
keywords flat minima · sharp minima · generalization · function complexity · regularization · decision boundaries · loss landscape · neural networks

The pith

Sharpness in neural network minima is a property of the learned function rather than a marker of poor generalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that the common association between flat minima and good generalization in deep networks is not fundamental. Instead, sharpness depends on the complexity of the specific function the model learns. In single-objective optimization, different functions yield different local geometries for equally good solutions. In synthetic classification, tighter decision boundaries produce sharper minima without harming generalization. In real image tasks, regularized models often land in sharper minima that coincide with better generalization, calibration, robustness, and functional consistency, pointing to function complexity as the key influence on loss geometry.

Core claim

The authors claim that sharpness is better understood as a function-dependent property. In single-objective optimisation, equally optimal solutions can exhibit markedly different local geometry. In synthetic non-linear binary classification, increasing decision-boundary tightness increases sharpness even with perfect generalization. In large-scale image classification, sharper minima emerge under regularisation and coincide with better generalization, calibration, robustness, and functional consistency. This suggests that function complexity shapes the geometry of solutions and that sharper minima can reflect more appropriate inductive biases.
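The single-objective point can be illustrated with a minimal sketch. This is not the paper's actual setup: `wide_bowl` and `narrow_bowl` are hypothetical objectives chosen for illustration. Both reach the same optimal value at the same point, yet a finite-difference estimate of the second derivative shows very different curvature there.

```python
def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2


# Two hypothetical objectives (assumptions, not taken from the paper).
# Both attain their minimum value 0 at x = 0.
def wide_bowl(x):
    return 0.5 * x**2   # second derivative is 1 everywhere


def narrow_bowl(x):
    return 50.0 * x**2  # second derivative is 100 everywhere


x_star = 0.0
loss_wide = wide_bowl(x_star)
loss_narrow = narrow_bowl(x_star)
curv_wide = second_derivative(wide_bowl, x_star)
curv_narrow = second_derivative(narrow_bowl, x_star)

print(f"optimal values: {loss_wide} vs {loss_narrow}")
print(f"curvature at the optimum: {curv_wide:.3f} vs {curv_narrow:.3f}")
```

Both solutions are equally optimal, so the difference in curvature says something about the function being minimised and nothing about solution quality.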

What carries the argument

Function-dependent sharpness: the idea that the local geometry around a solution is determined by the complexity of the function being learned.

If this is right

  • Equally optimal solutions in single-objective optimisation can have different sharpness depending on the function.
  • Increasing decision-boundary tightness in classification tasks increases sharpness while maintaining perfect generalization.
  • Regularisation techniques such as weight decay, data augmentation, or SAM often produce sharper minima that also show better generalization and related properties.
  • Function complexity rather than flatness determines the geometry of minima in the loss landscape.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If this view holds, training algorithms could benefit from targeting appropriate function complexity instead of directly optimizing for flatness.
  • The findings may link to why certain regularisers improve performance by encouraging functions with suitable complexity for the task.
  • Sharpness measures might need adjustment to account for the underlying function class to better predict generalization.
  • Testing on tasks with explicitly varied function complexities while holding generalization constant could further validate the perspective.

Load-bearing premise

The sharpness measures and task constructions isolate function complexity without confounding effects from optimization dynamics or data distribution.

What would settle it

Showing that regularised models achieve better generalization through flatter rather than sharper minima, or that decision boundary tightness does not increase sharpness in the synthetic tasks.

Original abstract

Flat minima are strongly associated with improved generalisation in deep neural networks. However, this connection has proven nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance and argue that sharpness is better understood as a function-dependent property rather than an indicator of poor generalisation. We conduct extensive empirical studies ranging from single-objective optimisation, synthetic non-linear binary classification tasks, to modern image classification tasks. In single-objective optimisation, we show that flatness and sharpness are relative to the function being learned: equally optimal solutions can exhibit markedly different local geometry. In synthetic non-linear binary classification tasks, we show that increasing decision-boundary tightness can increase sharpness even when models generalise perfectly, indicating that sharpness is not reducible to memorisation alone. Finally, in large-scale experiments, we find that sharper minima often emerge when models are regularised (e.g., via weight decay, data augmentation, or SAM), and coincide with better generalisation, calibration, robustness, and functional consistency. Our findings suggest that function complexity, rather than flatness, shapes the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases, calling for a function-centric reappraisal of minima geometry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that sharpness is better understood as a function-dependent property of the learned solution rather than an indicator of poor generalisation. It supports this via three empirical regimes: (i) single-objective optimisation showing that equally optimal solutions can exhibit different local geometry; (ii) synthetic non-linear binary classification where tightening the decision boundary increases sharpness while preserving perfect generalisation; and (iii) large-scale image classification where regularisation (weight decay, augmentation, SAM) produces sharper minima that coincide with improved generalisation, calibration, robustness and functional consistency. The central conclusion is that function complexity, rather than flatness per se, shapes the geometry of minima.

Significance. If the empirical patterns hold after proper controls, the work would usefully complicate the flat-minima narrative and encourage a more function-centric analysis of solution geometry. The synthetic-task construction and the observation that regularised solutions can be both sharper and better-performing are potentially valuable contributions, provided they are shown to isolate inductive bias from optimisation dynamics and data-distribution effects.

major comments (3)
  1. [§4] §4 (single-objective optimisation): the demonstration that equally optimal solutions can have different geometry is suggestive, but the manuscript does not report whether the compared solutions are reached by distinct optimiser trajectories or basin-selection mechanisms; without trajectory controls or explicit construction of distinct functions with identical loss values, it remains possible that the observed geometric differences are artefacts of the optimisation path rather than intrinsic function properties.
  2. [§5] §5 (synthetic non-linear binary classification): tightening the decision boundary is reported to increase sharpness while maintaining perfect generalisation. However, the construction necessarily alters the effective data distribution and loss-surface curvature; the paper does not provide an ablation that holds the data distribution fixed while varying only boundary tightness, leaving open the possibility that the sharpness increase is driven by distribution shift rather than by function complexity per se.
  3. [§6] §6 (large-scale image classification): the claim that sharper minima emerge under regularisation and coincide with better generalisation, calibration and robustness is central, yet the experiments simultaneously modify both the effective function class and the optimisation procedure. No controlled comparison is presented that applies the same regulariser while measuring sharpness on solutions that achieve identical training loss; this conflation weakens the inference that sharpness itself reflects appropriate inductive biases.
minor comments (2)
  1. The abstract and experimental sections should explicitly state the number of random seeds, statistical tests, and any post-hoc selection of tasks or metrics; the current description leaves the robustness of the reported trends unclear.
  2. Notation for the various sharpness measures (e.g., trace of Hessian, maximum eigenvalue, PAC-Bayes bounds) should be introduced once in a dedicated subsection and used consistently; occasional shifts between definitions make cross-experiment comparison harder.
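
For reference, the Hessian-based measures named in the minor comment are usually written as below. This is a sketch in assumed notation ($L$ for the training loss, $\theta^*$ for the minimum, $\rho$ for a perturbation radius), using the standard textbook definitions rather than the paper's own. The last line is a perturbation-based measure added for comparison.

```latex
% Assumed notation: L is the training loss, \theta^* a local minimum,
% H the Hessian of L at \theta^*, and \rho > 0 a perturbation radius.
\begin{align}
H &= \nabla_\theta^2 L(\theta^*) \\
S_{\mathrm{tr}} &= \operatorname{tr}(H) \\
S_{\lambda} &= \lambda_{\max}(H) \\
S_{\rho} &= \max_{\|\epsilon\|_2 \le \rho} L(\theta^* + \epsilon) - L(\theta^*)
\end{align}
```

At a minimum the gradient vanishes, so a second-order Taylor expansion gives $S_{\rho} \approx \tfrac{\rho^2}{2}\,\lambda_{\max}(H)$ for small $\rho$. This links the perturbation-based and eigenvalue-based measures.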

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us clarify the experimental controls and strengthen the distinction between function properties and optimisation artefacts. We respond to each major comment below and have incorporated revisions to address the concerns.

Point-by-point responses
  1. Referee: [§4] §4 (single-objective optimisation): the demonstration that equally optimal solutions can have different geometry is suggestive, but the manuscript does not report whether the compared solutions are reached by distinct optimiser trajectories or basin-selection mechanisms; without trajectory controls or explicit construction of distinct functions with identical loss values, it remains possible that the observed geometric differences are artefacts of the optimisation path rather than intrinsic function properties.

    Authors: We agree that trajectory effects must be ruled out. In the single-objective setting we directly parameterise and optimise distinct target functions that are constrained to identical loss values on the training set while differing in their higher-order derivatives (hence local geometry). Because the functions are constructed explicitly rather than discovered via different optimisers, the geometric differences are properties of the functions themselves. We have added a paragraph in the revised §4 describing the parameterisation and confirming that all solutions are obtained from the same initialisation and optimiser to further isolate the effect. revision: yes

  2. Referee: [§5] §5 (synthetic non-linear binary classification): tightening the decision boundary is reported to increase sharpness while maintaining perfect generalisation. However, the construction necessarily alters the effective data distribution and loss-surface curvature; the paper does not provide an ablation that holds the data distribution fixed while varying only boundary tightness, leaving open the possibility that the sharpness increase is driven by distribution shift rather than by function complexity per se.

    Authors: The referee correctly identifies that boundary tightening changes the effective support of the data. Nevertheless, the central observation, that perfect generalisation is preserved while sharpness increases, still demonstrates that sharpness is not synonymous with poor generalisation. To isolate function complexity from distribution shift, we have added an ablation that fixes the training and test point sets and varies only the model’s capacity (via width or explicit regularisation on the decision boundary). The revised §5 now reports that sharpness still rises with tighter boundaries under this fixed-distribution regime. revision: yes

  3. Referee: [§6] §6 (large-scale image classification): the claim that sharper minima emerge under regularisation and coincide with better generalisation, calibration and robustness is central, yet the experiments simultaneously modify both the effective function class and the optimisation procedure. No controlled comparison is presented that applies the same regulariser while measuring sharpness on solutions that achieve identical training loss; this conflation weakens the inference that sharpness itself reflects appropriate inductive biases.

    Authors: We acknowledge that regularisation simultaneously alters the function class and the optimisation trajectory. To decouple these factors we have added a controlled comparison in which all models (regularised and baseline) are trained until they reach the same training loss value, either by extending training epochs or by early stopping at matched loss. Sharpness, generalisation, calibration and robustness are then measured on these loss-matched solutions. The new results, now included in the revised §6, continue to show that regularised solutions are both sharper and superior on the downstream metrics, supporting the function-centric interpretation. revision: yes
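
The loss-matching step described in the third response could look roughly like the sketch below. It makes several assumptions: the helper `first_epoch_at_or_below` is hypothetical, and the per-epoch loss values are made-up illustrative numbers, not results from the paper. The idea is to pick, for each run, the first checkpoint whose training loss reaches a shared target, and to measure sharpness and downstream metrics only at those checkpoints.

```python
def first_epoch_at_or_below(loss_history, target):
    """Return the first epoch index whose training loss is <= target, else None."""
    for epoch, loss in enumerate(loss_history):
        if loss <= target:
            return epoch
    return None


# Made-up per-epoch training losses, for illustration only (not paper results).
loss_histories = {
    "baseline": [0.90, 0.40, 0.12, 0.05, 0.02],
    "regularised": [0.95, 0.60, 0.30, 0.14, 0.08, 0.05, 0.03],
}
target_loss = 0.05  # hypothetical shared training-loss target

# Checkpoint (epoch index) at which each run would be evaluated.
matched_epochs = {
    name: first_epoch_at_or_below(history, target_loss)
    for name, history in loss_histories.items()
}
print(matched_epochs)
```

A run that never reaches the target returns `None` and would need further training before it can enter the comparison.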

Circularity Check

0 steps flagged

No circularity: empirical observations are independent of fitted inputs or self-definitional reductions

Full rationale

The paper advances its central claim through direct empirical comparisons across single-objective optimization, synthetic binary classification, and large-scale image tasks. It reports observed differences in local geometry for equally optimal solutions and correlations between regularization, sharpness, and generalization metrics. No equations, predictions, or first-principles derivations are shown that reduce by construction to author-defined quantities, fitted parameters, or self-citation chains. The argument remains self-contained against external benchmarks because the reported patterns are falsifiable via the described experimental setups without requiring the target conclusion as an input.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is empirical and relies on standard assumptions about loss-landscape geometry measures without introducing new free parameters or postulated entities.

axioms (1)
  • Domain assumption: common sharpness proxies (e.g., Hessian-based or perturbation-based) faithfully capture local geometry relevant to generalization.
    Invoked when interpreting results from all three experimental regimes.
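
To make the two families of proxies concrete, here is a minimal sketch on a toy quadratic loss. The matrix `A`, the helper names, and the random-direction search are all assumptions for illustration; the paper's own measurement code is not shown in this review. Hessian-based proxies are read off a finite-difference Hessian. The perturbation-based proxy is a sampled lower bound on the worst-case loss increase within radius `rho`.

```python
import numpy as np


def numerical_hessian(loss, w, h=1e-4):
    """Central finite-difference Hessian of a scalar loss at w."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = h
            e_j = np.zeros(n)
            e_j[j] = h
            H[i, j] = (loss(w + e_i + e_j) - loss(w + e_i - e_j)
                       - loss(w - e_i + e_j) + loss(w - e_i - e_j)) / (4.0 * h**2)
    return 0.5 * (H + H.T)  # symmetrise to remove rounding asymmetry


def perturbation_sharpness(loss, w, rho=0.05, n_samples=2000, seed=0):
    """Largest sampled loss increase on the sphere of radius rho around w.

    Random search only lower-bounds the true worst case.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(n_samples, w.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    base = loss(w)
    return max(loss(w + rho * d) - base for d in directions)


# Toy quadratic loss with a known Hessian A (an assumption for illustration).
A = np.array([[4.0, 1.0], [1.0, 0.5]])


def toy_loss(w):
    return 0.5 * w @ A @ w


w_star = np.zeros(2)  # the minimum of toy_loss
rho = 0.05
H = numerical_hessian(toy_loss, w_star)
hessian_trace = np.trace(H)
lambda_max = np.linalg.eigvalsh(H)[-1]  # eigvalsh returns ascending order
sampled_sharpness = perturbation_sharpness(toy_loss, w_star, rho=rho)

print(f"trace(H) = {hessian_trace:.4f}, lambda_max = {lambda_max:.4f}")
print(f"sampled sharpness = {sampled_sharpness:.6f}, "
      f"quadratic bound = {0.5 * rho**2 * lambda_max:.6f}")
```

On a quadratic the two proxies agree up to the factor `rho**2 / 2`. For a real network they can diverge, which is why the axiom above is an assumption and not a guarantee.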


