RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo

Andreas Bulling; Andrés Bruhn; Jenny Schmalfuss; Lukas Mehl; Madlen Bartsch; Margret Keuper; Shashank Agnihotri; Victor Oei

arxiv: 2505.09368 · v2 · submitted 2025-05-14 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-22 15:41 UTC · model grok-4.3

keywords: robustness benchmark, optical flow, scene flow, stereo vision, image corruptions, computer vision, model evaluation, dataset

The pith

RobustSpring applies 20 consistent image corruptions to the Spring dataset to benchmark robustness in optical flow, scene flow, and stereo models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing benchmarks for optical flow, scene flow, and stereo prioritize accuracy on clean data while leaving resilience to real-world issues like noise, blur, or rain largely unmeasured. RobustSpring addresses this by corrupting the high-resolution Spring dataset with 20 corruption types applied in a time-, stereo-, and depth-consistent manner, producing 20,000 corrupted images. It introduces a new corruption robustness metric that supports two-axis evaluation alongside the original Spring accuracy scores. Benchmarking a selection of models reveals that robustness differs sharply across corruption types, and the authors report experiments showing that evaluations on RobustSpring indicate real-world robustness.

Core claim

RobustSpring establishes a dataset and benchmark that applies 20 image corruptions—including noise, blur, color changes, quality degradations, and weather distortions—in a time-, stereo-, and depth-consistent manner to the Spring dataset, yielding 20,000 corrupted images and a dedicated corruption robustness metric; when integrated with the Spring benchmark this enables joint accuracy-robustness evaluation, with experiments showing that scores on RobustSpring predict real-world resilience.

What carries the argument

The RobustSpring dataset with its 20 time-stereo-depth-consistent corruptions and the associated corruption robustness metric for two-axis model evaluation.
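
To make the consistency requirement concrete, here is a minimal sketch; it is not the paper's implementation. It assumes a simple fog model (the standard atmospheric scattering equation, chosen here only for illustration) and hypothetical helpers `apply_fog` and `corrupt_sequence`. Transmission depends only on per-pixel depth, and the same fog parameters are reused for both stereo views and every frame, so pixels at equal depth are attenuated identically wherever they appear.

```python
import math

def apply_fog(image, depth, beta=0.05, airlight=0.8):
    """Blend each pixel toward a constant airlight according to its depth.

    image: 2D list of grayscale intensities in [0, 1]
    depth: 2D list of per-pixel depths (same shape, arbitrary units)
    beta, airlight: hypothetical fog density and fog brightness
    """
    fogged = []
    for img_row, depth_row in zip(image, depth):
        row = []
        for intensity, d in zip(img_row, depth_row):
            t = math.exp(-beta * d)  # transmission: 1 at the camera, near 0 far away
            row.append(intensity * t + airlight * (1.0 - t))
        fogged.append(row)
    return fogged

def corrupt_sequence(frames, beta=0.05, airlight=0.8):
    """Apply identical fog parameters to every frame and both stereo views."""
    return [
        {
            "left": apply_fog(f["left"], f["depth_left"], beta, airlight),
            "right": apply_fog(f["right"], f["depth_right"], beta, airlight),
        }
        for f in frames
    ]

# Toy 1x2 stereo frame: one near pixel and one far pixel per view.
frames = [{
    "left": [[0.2, 0.2]], "depth_left": [[1.0, 50.0]],
    "right": [[0.2, 0.2]], "depth_right": [[1.0, 50.0]],
}]
fogged = corrupt_sequence(frames)
```

In this toy setup the far pixel is pulled much closer to the airlight value than the near one, and both views receive the same result because they share depths and parameters. A per-view or per-frame random draw of `beta` would break exactly the consistency the benchmark insists on.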

If this is right

Models can now be ranked and compared jointly on accuracy and robustness for optical flow, scene flow, and stereo.
Robustness varies widely across different corruption types, guiding targeted improvements.
Integration with the existing Spring benchmark allows simultaneous tracking of both performance axes.
Development can shift toward models that maintain accuracy under realistic perturbations.
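
One way to read a two-axis evaluation is as a Pareto comparison. The sketch below assumes that both axes are "lower is better" and uses made-up model names and scores; none of these values come from the paper, and `dominates` and `pareto_front` are illustrative helpers.

```python
def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one.

    Scores are (accuracy_error, robustness_score); lower is better on both
    axes (an assumption made for this sketch).
    """
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(scores):
    """Return the sorted names of models that no other model dominates."""
    return sorted(
        name
        for name, s in scores.items()
        if not any(dominates(o, s) for other, o in scores.items() if other != name)
    )

# Hypothetical scores, for illustration only.
scores = {
    "model_a": (1.0, 5.0),  # most accurate, moderately robust
    "model_b": (2.0, 3.0),  # less accurate, most robust
    "model_c": (2.5, 6.0),  # worse than model_a on both axes
}
front = pareto_front(scores)
```

Here `model_a` and `model_b` both stay on the front because each wins on one axis, while `model_c` drops out. A single-axis leaderboard would hide that trade-off.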

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The consistency requirement across frames and views could be adopted when creating robustness tests for other dense prediction tasks such as depth estimation or segmentation.
The metric opens a path to training objectives that directly optimize for the reported robustness score.
In deployed systems like autonomous driving, selecting models by RobustSpring scores may reduce failure rates under adverse weather.
Extending the set of corruptions or learning them from real data distributions would test whether the current fixed suite is sufficient.

Load-bearing premise

The 20 chosen image corruptions, when applied consistently, accurately simulate challenging real-world conditions and the new metric validly quantifies model resilience.

What would settle it

The claim would be undermined if models that rank high on RobustSpring accuracy-robustness scores showed no corresponding improvement when tested on independently collected real-world sequences containing similar noise, rain, or blur.

Original abstract

Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that robustness varies widely by corruption type, and experimentally show that evaluations on RobustSpring indicate real-world robustness. RobustSpring is a new computer vision benchmark to treat robustness as a first-class citizen, fostering models that are accurate and resilient. It is available at https://spring-benchmark.org.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

RobustSpring introduces a useful new benchmark for corruption robustness in optical flow and stereo, though its real-world predictive power still needs demonstration.

This paper's key move is to release RobustSpring, a benchmark that corrupts the Spring dataset with 20 types of image degradations while keeping them consistent over time, stereo pairs, and depth maps. This creates a large collection of test cases that can be used alongside the standard accuracy measures.

The work does a good job addressing the lack of robustness testing in flow and stereo. Applying corruptions in a structured way avoids breaking the ground truth, and tying it back to the original benchmark lets people evaluate both accuracy and robustness together. Their initial results on a few models illustrate that some corruptions hurt more than others, which helps show that robustness is corruption-specific rather than a single property.

A soft spot is the jump to real-world relevance. The abstract claims that RobustSpring evaluations indicate real-world robustness, but without any quantitative check against real adverse data, like rain in driving datasets, that part rests on the assumption that the synthetic corruptions match reality closely enough. More experiments linking the two would make the case stronger.

This is for computer vision researchers focused on making models work outside the lab. A reader looking for new evaluation tools will find it practical and easy to integrate with existing setups. Anyone developing algorithms for applications like robotics or autonomous systems where conditions are unpredictable would benefit from having this kind of test available.

It should go to peer review. The benchmark itself is a useful addition even if the real-world claim needs more backing to fully convince.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces RobustSpring, a benchmark dataset derived from the Spring dataset by applying 20 image corruptions (noise, blur, color changes, quality degradations, weather distortions) in a time-, stereo-, and depth-consistent manner, yielding 20,000 corrupted images. It proposes a new corruption robustness metric for comparing model resilience in optical flow, scene flow, and stereo tasks. The benchmark is integrated with the original Spring benchmark for two-axis (accuracy and robustness) evaluations. The authors benchmark a selection of models, note wide variation in robustness across corruption types, and claim that RobustSpring evaluations indicate real-world robustness.

Significance. If validated against real-world data, RobustSpring would provide a valuable new resource for the field by elevating robustness evaluation to a first-class criterion alongside accuracy for optical flow, scene flow, and stereo. The time-/stereo-/depth-consistent corruption application is a technical strength that avoids common inconsistencies in prior robustness benchmarks. Public release of the dataset and integration with the existing Spring benchmark supports reproducibility and adoption. The significance is currently tempered by the absence of quantitative evidence linking benchmark performance to deployment conditions.

major comments (2)

[§4.3] §4.3 (Experiments and Results): The central claim that 'evaluations on RobustSpring indicate real-world robustness' is unsupported. No rank correlation, transfer learning experiments, or comparison to real captured corruptions (e.g., KITTI rain sequences or nuScenes adverse weather) is reported to establish that performance under the 20 synthetic corruptions predicts resilience outside the benchmark. This link is load-bearing for the paper's motivation and conclusions.
[§3.2] §3.2 (Robustness Metric): The new corruption robustness metric is introduced at a conceptual level but lacks an explicit mathematical definition, including the precise aggregation formula across the 20 corruptions, normalization procedure, and handling of per-model baselines. Without this, it is difficult to assess whether the metric validly quantifies resilience or is reproducible.
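
The rank correlation requested in the first major comment could be computed along the following lines. This is a minimal sketch of Spearman's rank correlation in plain Python; the per-model numbers are invented for illustration and are not results from the paper or from any real dataset.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-model scores: synthetic-benchmark robustness vs. error on
# real adverse-weather footage (both "lower is better").
synthetic = [0.8, 1.5, 2.1, 3.0]
real = [1.1, 1.9, 1.7, 4.2]
rho = spearman(synthetic, real)
```

With these toy numbers only the two middle models swap places, giving a rho of 0.8. A value near 1 across many models and corruption types would be the kind of evidence the referee is asking for; a value near 0 would undercut the real-world claim.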

minor comments (2)

[Figure 2] Figure 2: The example corrupted images would be clearer if accompanied by quantitative severity metrics (e.g., PSNR or SSIM values) for each corruption type to allow readers to gauge intensity.
[Related Work] Related Work: The discussion of prior robustness benchmarks (e.g., ImageNet-C, Cityscapes-C) could explicitly contrast the multi-task consistent corruption approach used here versus single-task or inconsistent applications in earlier work.
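
The severity figures suggested in the first minor comment are straightforward to produce. As a sketch, PSNR between a clean image and its corrupted counterpart can be computed as below; the images are toy values, and intensities are assumed to lie in [0, 1].

```python
import math

def psnr(reference, corrupted, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_errors = [
        (r - c) ** 2
        for ref_row, cor_row in zip(reference, corrupted)
        for r, c in zip(ref_row, cor_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images have no noise to measure
    return 10.0 * math.log10(max_value ** 2 / mse)

# Toy example: a uniform offset of 0.1 on a black 2x2 image.
clean = [[0.0, 0.0], [0.0, 0.0]]
noisy = [[0.1, 0.1], [0.1, 0.1]]
severity_db = psnr(clean, noisy)
```

A uniform error of 0.1 corresponds to roughly 20 dB. Reporting one such number per corruption type would let readers compare intensities across the suite, although PSNR is known to track perceived severity only loosely.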

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we plan to incorporate to strengthen the paper.

Referee: [§4.3] §4.3 (Experiments and Results): The central claim that 'evaluations on RobustSpring indicate real-world robustness' is unsupported. No rank correlation, transfer learning experiments, or comparison to real captured corruptions (e.g., KITTI rain sequences or nuScenes adverse weather) is reported to establish that performance under the 20 synthetic corruptions predicts resilience outside the benchmark. This link is load-bearing for the paper's motivation and conclusions.

Authors: We appreciate the referee's emphasis on validating the connection to real-world conditions. The manuscript's claim is grounded in the observation that the 20 corruptions were selected and applied to mimic common real-world degradations (noise, weather, etc.), with results showing substantial variation in model robustness that aligns with known sensitivities in the literature. However, we acknowledge that direct quantitative evidence such as rank correlations or comparisons to real captured sequences is not currently reported. To address this, we will add a new analysis subsection in the revised §4.3 that includes rank correlation where feasible with available real-world data and a clearer discussion of the benchmark's role as a proxy. This revision will better support the conclusions.

revision: yes

Referee: [§3.2] §3.2 (Robustness Metric): The new corruption robustness metric is introduced at a conceptual level but lacks an explicit mathematical definition, including the precise aggregation formula across the 20 corruptions, normalization procedure, and handling of per-model baselines. Without this, it is difficult to assess whether the metric validly quantifies resilience or is reproducible.

Authors: We thank the referee for identifying this gap in presentation. Section 3.2 describes the metric at a high level as quantifying resilience via performance degradation under the consistent corruptions. We agree that an explicit mathematical formulation is necessary for full reproducibility. In the revised manuscript, we will expand §3.2 to include the precise formula: the robustness score as the normalized average of relative error increases across the 20 corruptions (with details on aggregation as mean, normalization relative to clean performance, and baseline handling per model). This will make the metric fully specified and easier to implement.

revision: yes
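
The verbal formula in this simulated response can be written out as follows. This is a sketch of that description only (mean of per-corruption error increases, each normalized by the model's own clean error); it should not be read as the paper's actual definition, and the function name and example errors are hypothetical.

```python
def relative_robustness(clean_error, corrupted_errors):
    """Mean relative error increase over a set of corruptions.

    clean_error: the model's error on uncorrupted data (must be positive)
    corrupted_errors: dict mapping corruption name -> error under that corruption
    """
    if clean_error <= 0:
        raise ValueError("clean_error must be positive to normalize against it")
    if not corrupted_errors:
        raise ValueError("at least one corruption is required")
    increases = [
        (error - clean_error) / clean_error  # 0.0 means no degradation
        for error in corrupted_errors.values()
    ]
    return sum(increases) / len(increases)

# Hypothetical errors for one model under three corruptions.
score = relative_robustness(2.0, {"noise": 3.0, "blur": 2.0, "rain": 5.0})
```

The three relative increases here are 0.5, 0.0, and 1.5, so the score is their mean, 2/3. Note that this formulation needs ground truth to compute each error, and that a model with a very small clean error is penalized heavily for modest absolute degradation, which is one reason the normalization choice deserves the explicit treatment the referee asks for.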

Circularity Check

0 steps flagged

No significant circularity: benchmark creation is self-contained

The paper introduces RobustSpring by applying 20 known image corruptions (noise, blur, weather, etc.) in a time-/stereo-/depth-consistent manner to the existing Spring dataset, then defines a corruption robustness metric and benchmarks models on the resulting 20,000 images. No derivation chain, equations, or predictions reduce to fitted parameters or self-referential inputs by construction. The claim that evaluations indicate real-world robustness is presented as an experimental observation from model benchmarking rather than a tautological result. The work is a dataset and evaluation framework without load-bearing self-citations or ansatzes that collapse into the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen corruptions and their consistent application reflect real-world perturbations for these vision tasks.

axioms (1)

domain assumption: Image corruptions are applied in a time-, stereo-, and depth-consistent manner.
This consistency is required for the corruptions to be meaningful for optical flow, scene flow, and stereo evaluation.

pith-pipeline@v0.9.0 · 5775 in / 1178 out tokens · 38928 ms · 2026-05-22T15:41:23.406428+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We propose a corruption robustness metric based on Lipschitz continuity... R^c_M = M[f(I), f(I^c)]"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "RobustSpring applies 20 different image corruptions... in a time-, stereo-, and depth-consistent manner"
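
The metric in the first quoted passage compares a model's prediction on a clean input with its prediction on the corrupted version of the same input. A minimal sketch of that idea follows. The choice of mean endpoint error as the distance M, the `toy_model` stand-in, and all input values are assumptions made for illustration; the quoted passage is truncated and does not specify them.

```python
import math

def endpoint_error(flow_a, flow_b):
    """Mean Euclidean distance between two flow fields given as lists of (u, v)."""
    distances = [
        math.hypot(ua - ub, va - vb)
        for (ua, va), (ub, vb) in zip(flow_a, flow_b)
    ]
    return sum(distances) / len(distances)

def corruption_robustness(model, clean_input, corrupted_input, metric=endpoint_error):
    """Distance between the model's predictions on clean and corrupted input."""
    return metric(model(clean_input), model(corrupted_input))

def toy_model(frame_pair):
    """Hypothetical stand-in for a flow network: horizontal flow from intensity change."""
    first, second = frame_pair
    return [(b - a, 0.0) for a, b in zip(first, second)]

# Two-pixel frame pairs; the corrupted pair differs only in its second frame.
clean = ([0.0, 1.0], [1.0, 3.0])
corrupted = ([0.0, 1.0], [4.0, 7.0])
r = corruption_robustness(toy_model, clean, corrupted)
```

In this toy case the two predictions differ by 3 and 4 pixels, so the score is 3.5. Because only the model's own outputs are compared, this form needs no ground truth for the corrupted data, and it measures stability of the prediction rather than its accuracy.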

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.