Enhancing Robustness of Offline Reinforcement Learning Under Data Corruption via Sharpness-Aware Minimization

Jiayu Chen; Le Xu

arxiv: 2511.17568 · v1 · submitted 2025-11-14 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-17 22:35 UTC · model grok-4.3

keywords offline reinforcement learning · data corruption · sharpness-aware minimization · robustness · generalization · IQL · RIQL · D4RL

The pith

Integrating Sharpness-Aware Minimization into offline RL baselines like IQL and RIQL finds flatter loss minima and improves robustness to data corruption.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that data corruption in offline reinforcement learning produces sharp minima in the loss landscape, which harms generalization even in robust algorithms. To counter this, it applies Sharpness-Aware Minimization (SAM) for the first time as a plug-and-play optimizer that explicitly searches for flatter minima and more robust parameter regions. The approach is tested by adding SAM to IQL and to RIQL, then evaluating on D4RL tasks under both random and adversarial corruptions in observations and mixtures. Performance gains are reported as consistent and significant, with reward-surface visualizations offered as direct evidence that SAM produces smoother solutions.

Core claim

The central claim is that data corruption creates sharp minima leading to poor generalization in offline RL, and that SAM, by seeking flatter minima, serves as an effective general-purpose addition that makes strong baselines like IQL and RIQL more robust, as shown by superior results on D4RL benchmarks with random and adversarial corruptions and by smoother reward surfaces.

What carries the argument

Sharpness-Aware Minimization (SAM), used as a plug-and-play optimizer that perturbs parameters to find flatter minima in the loss landscape.
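
As a rough illustration of what "perturbs parameters to find flatter minima" means in practice, the sketch below runs the standard two-step SAM update on a toy quadratic loss. It is not the paper's implementation: the names `sam_step`, `toy_loss`, `toy_grad`, and `CURV`, and the values of `lr` and `rho`, are hypothetical choices made for this example, and in IQL or RIQL the same step would presumably wrap the critic and actor losses instead.

```python
import math

def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One SAM update on a parameter list w (hypothetical helper).

    Step 1: move to an approximate worst-case point within radius rho,
            following the normalized gradient direction.
    Step 2: apply the gradient computed at that perturbed point to the
            original parameters.
    """
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

# Toy loss with one sharp direction (curvature 10) and one flat direction (0.1).
CURV = [10.0, 0.1]

def toy_loss(w):
    return 0.5 * sum(c * wi * wi for c, wi in zip(CURV, w))

def toy_grad(w):
    return [c * wi for c, wi in zip(CURV, w)]

w = [1.0, 1.0]
initial_loss = toy_loss(w)
for _ in range(200):
    w = sam_step(w, toy_grad)
final_loss = toy_loss(w)
```

Setting `rho=0.0` reduces `sam_step` to plain gradient descent, which is one way to see why SAM can be described as a drop-in change to the optimizer rather than to the RL objective. Each SAM step costs two gradient evaluations instead of one.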

If this is right

SAM-enhanced versions of IQL and RIQL achieve higher returns than the originals under both random and adversarial data corruptions on D4RL.
Visualizations of the reward surface confirm that SAM-trained policies occupy smoother regions of the loss landscape.
The same SAM integration can be applied to other offline RL algorithms as a general robustness tool.
The method addresses both observation corruptions and mixture corruptions without requiring changes to the underlying RL objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Loss-landscape geometry may be a more general lever for robustness than designing corruption-specific objectives.
SAM could be tested in online RL settings where data collection itself is subject to noise or sensor faults.
If flatter minima improve corruption robustness, similar sharpness-aware training might help in other sequential decision problems with imperfect data.

Load-bearing premise

Data corruption creates sharp minima in the loss landscape that cause poor generalization, and seeking flatter minima via SAM will reliably improve robustness in offline RL under both random and adversarial corruptions.

What would settle it

A direct comparison on the same D4RL corrupted datasets where the SAM-augmented IQL and RIQL versions fail to outperform their non-SAM counterparts, or where reward-surface plots do not show reduced sharpness for the SAM models.

Figures

Figures reproduced from arXiv: 2511.17568 by Jiayu Chen, Le Xu.

Original abstract

Offline reinforcement learning (RL) is vulnerable to real-world data corruption, with even robust algorithms failing under challenging observation and mixture corruptions. We posit this failure stems from data corruption creating sharp minima in the loss landscape, leading to poor generalization. To address this, we are the first to apply Sharpness-Aware Minimization (SAM) as a general-purpose, plug-and-play optimizer for offline RL. SAM seeks flatter minima, guiding models to more robust parameter regions. We integrate SAM into strong baselines for data corruption: IQL, a top-performing offline RL algorithm in this setting, and RIQL, an algorithm designed specifically for data-corruption robustness. We evaluate them on D4RL benchmarks with both random and adversarial corruption. Our SAM-enhanced methods consistently and significantly outperform the original baselines. Visualizations of the reward surface confirm that SAM finds smoother solutions, providing strong evidence for its effectiveness in improving the robustness of offline RL agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that data corruption induces sharp minima in the offline RL loss landscape, leading to poor generalization, and proposes Sharpness-Aware Minimization (SAM) as a general-purpose optimizer to seek flatter minima and improve robustness. It integrates SAM into IQL and RIQL, reports consistent outperformance over baselines on D4RL under random and adversarial corruptions, and uses reward-surface visualizations to support that SAM yields smoother solutions.

Significance. If the results and mechanism hold, the work provides a practical, architecture-agnostic way to enhance robustness in offline RL via optimization rather than algorithmic redesign. It leverages strong existing baselines and standard benchmarks with both corruption types. Credit is due for the plug-and-play framing and empirical evaluation, but significance is reduced by the lack of direct evidence tying performance gains to the flat-minima hypothesis.

major comments (1)

Abstract and reward-surface visualizations: the claim that SAM improves robustness by locating flatter minima induced by data corruption is not supported by any direct sharpness metric (e.g., SAM sharpness, neighborhood loss, or Hessian trace) evaluated on the critic or actor training loss. The reported reward-surface plots measure a different quantity and therefore do not confirm the posited mechanism; gains could arise from altered gradient dynamics or implicit regularization instead.
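
One of the direct metrics named in this comment, neighborhood loss, can be estimated as sketched here. This is an illustrative assumption about how such a measurement might look, not a procedure taken from the paper: `neighborhood_sharpness` and `make_quadratic` are hypothetical helpers, the quadratic losses stand in for a critic or actor training loss, and random sampling on the sphere only gives a lower bound on the true worst-case increase.

```python
import math
import random

def neighborhood_sharpness(loss_fn, w, rho=0.05, n_samples=256, seed=0):
    """Estimate max over ||e|| = rho of loss_fn(w + e) - loss_fn(w).

    Directions are sampled uniformly on the sphere of radius rho, so the
    result is a lower bound on the exact worst-case loss increase.
    """
    rng = random.Random(seed)
    base = loss_fn(w)
    worst = 0.0
    for _ in range(n_samples):
        d = [rng.gauss(0.0, 1.0) for _ in w]
        norm = math.sqrt(sum(x * x for x in d)) or 1.0
        candidate = [wi + rho * di / norm for wi, di in zip(w, d)]
        worst = max(worst, loss_fn(candidate) - base)
    return worst

def make_quadratic(curvatures):
    """Toy stand-in for a training loss with the given per-axis curvatures."""
    return lambda w: 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))

# Same minimizer (the origin), very different curvature around it.
sharp = neighborhood_sharpness(make_quadratic([10.0, 10.0]), [0.0, 0.0])
flat = neighborhood_sharpness(make_quadratic([0.1, 0.1]), [0.0, 0.0])
```

Reporting a number like this for SAM and non-SAM checkpoints on the same corrupted dataset would measure sharpness of the training loss itself, which is the quantity the reward-surface plots do not capture.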

minor comments (1)

Missing or insufficient details on statistical significance of reported improvements, precise definitions and implementation of the random/adversarial corruption models, sensitivity to the SAM radius hyperparameter, and ablation controls isolating SAM's contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comment regarding the evidence for the flat-minima mechanism below. We agree that direct sharpness metrics would strengthen the paper and will incorporate them in the revision.

Referee: Abstract and reward-surface visualizations: the claim that SAM improves robustness by locating flatter minima induced by data corruption is not supported by any direct sharpness metric (e.g., SAM sharpness, neighborhood loss, or Hessian trace) evaluated on the critic or actor training loss. The reported reward-surface plots measure a different quantity and therefore do not confirm the posited mechanism; gains could arise from altered gradient dynamics or implicit regularization instead.

Authors: We thank the referee for highlighting this important point. Our reward-surface visualizations were intended to provide qualitative support by showing smoother performance landscapes under SAM, which aligns with the expected outcome of flatter minima in the optimization landscape. However, we acknowledge that these plots do not directly quantify sharpness in the training loss (e.g., via Hessian trace or neighborhood loss). To more rigorously tie the performance gains to the flat-minima hypothesis, we will add direct sharpness measurements on the critic and actor losses in the revised version of the manuscript. This will help rule out alternative explanations such as altered gradient dynamics.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical application of existing optimizer to offline RL benchmarks

The paper is an empirical application study that integrates the established SAM optimizer into existing offline RL baselines (IQL, RIQL) and evaluates performance on D4RL benchmarks under random and adversarial data corruptions. No mathematical derivation chain, fitted parameters, or predictions are present that reduce by construction to the paper's own inputs or self-citations. The posited link between data corruption and sharp minima is stated as a hypothesis, supported by reward-surface visualizations and external benchmark results rather than any self-referential equations or load-bearing prior work by the authors. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that flatter minima generalize better under data corruption; no new entities are postulated and free parameters are limited to standard optimizer hyperparameters such as the SAM neighborhood size.

free parameters (1)

SAM neighborhood radius rho
Standard SAM hyperparameter that controls the size of the perturbation used to estimate sharpness; its value must be chosen or tuned.
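
For reference, the role of rho can be written out using the standard SAM formulation from the optimization literature. The symbols theta (parameters), L (training loss), and eta (learning rate) are notation assumed here for illustration and are not taken from the paper, which may use a different norm or a variant of the update.

```latex
% SAM objective: minimize the worst-case loss inside a ball of radius \rho
\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} L(\theta + \epsilon)

% First-order approximation of the inner maximizer
\hat{\epsilon}(\theta) = \rho \, \frac{\nabla_{\theta} L(\theta)}{\|\nabla_{\theta} L(\theta)\|_2}

% Parameter update using the gradient evaluated at the perturbed point
\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta) \Big|_{\theta + \hat{\epsilon}(\theta)}
```

Under this formulation, rho = 0 recovers ordinary gradient descent, while larger rho penalizes sharper regions more heavily, which is why sensitivity to this value is worth reporting.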

axioms (1)

Domain assumption: flatter minima in the loss landscape correlate with better generalization under distribution shift induced by data corruption.
Invoked in the abstract to motivate why SAM should help; this is a common but unproven premise in sharpness-aware optimization literature.

pith-pipeline@v0.9.0 · 5457 in / 1359 out tokens · 30228 ms · 2026-05-17T22:35:28.115340+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We posit this failure stems from data corruption creating sharp minima in the loss landscape... SAM seeks flatter minima... reward surface visualizations confirm that SAM finds smoother solutions"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "SAM Optimization Process... two-step minimax procedure... neighborhood radius ρ"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

[1]

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv:2004.07219. Kostrikov, I.; Nair, A.; and Levine, S. 2021. Offline Reinforcement Learning with Implicit Q-Learning. arXiv:2110.06169. Levine, S.; Kumar, A.; Tucker, G.; and Fu, J. 2020. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. arXiv:2005.01643. L...

[2]

arXiv:2205.07015

Cliff Diving: Exploring Reward Surfaces in Reinforcement Learning Environments. arXiv:2205.07015. Wen, K.; Ma, T.; and Li, Z. 2023. How Does Sharpness-Aware Minimization Minimize Sharpness? arXiv:2211.05729. Xu, J.; Yang, R.; Qiu, S.; Luo, F.; Fang, M.; Wang, B.; and Han, L. 2025. Tackling Data Corruption in Offline Reinforcement Learning via Sequence M...
