RobustSpring: Benchmarking Robustness to Image Corruptions for Optical Flow, Scene Flow and Stereo
Pith reviewed 2026-05-22 15:41 UTC · model grok-4.3
The pith
RobustSpring applies 20 consistent image corruptions to the Spring dataset to benchmark robustness in optical flow, scene flow, and stereo models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RobustSpring establishes a dataset and benchmark that applies 20 image corruptions—including noise, blur, color changes, quality degradations, and weather distortions—in a time-, stereo-, and depth-consistent manner to the Spring dataset, yielding 20,000 corrupted images and a dedicated corruption robustness metric; when integrated with the Spring benchmark this enables joint accuracy-robustness evaluation, with experiments showing that scores on RobustSpring predict real-world resilience.
What carries the argument
The RobustSpring dataset with its 20 time-stereo-depth-consistent corruptions and the associated corruption robustness metric for two-axis model evaluation.
If this is right
- Models can now be ranked and compared jointly on accuracy and robustness for optical flow, scene flow, and stereo.
- Robustness varies widely across different corruption types, guiding targeted improvements.
- Integration with the existing Spring benchmark allows simultaneous tracking of both performance axes.
- Development can shift toward models that maintain accuracy under realistic perturbations.
Where Pith is reading between the lines
- The consistency requirement across frames and views could be adopted when creating robustness tests for other dense prediction tasks such as depth estimation or segmentation.
- The metric opens a path to training objectives that directly optimize for the reported robustness score.
- In deployed systems like autonomous driving, selecting models by RobustSpring scores may reduce failure rates under adverse weather.
- Extending the set of corruptions or learning them from real data distributions would test whether the current fixed suite is sufficient.
Load-bearing premise
The 20 chosen image corruptions, when applied consistently, accurately simulate challenging real-world conditions and the new metric validly quantifies model resilience.
What would settle it
Models that rank high on RobustSpring accuracy-robustness scores would show no corresponding improvement when tested on independently collected real-world sequences containing similar noise, rain, or blur.
read the original abstract
Standard benchmarks for optical flow, scene flow, and stereo vision algorithms generally focus on model accuracy rather than robustness to image corruptions like noise or rain. Hence, the resilience of models to such real-world perturbations is largely unquantified. To address this, we present RobustSpring, a comprehensive dataset and benchmark for evaluating robustness to image corruptions for optical flow, scene flow, and stereo models. RobustSpring applies 20 different image corruptions, including noise, blur, color changes, quality degradations, and weather distortions, in a time-, stereo-, and depth-consistent manner to the high-resolution Spring dataset, creating a suite of 20,000 corrupted images that reflect challenging conditions. RobustSpring enables comparisons of model robustness via a new corruption robustness metric. Integration with the Spring benchmark enables two-axis evaluations of both accuracy and robustness. We benchmark a curated selection of initial models, observing that robustness varies widely by corruption type, and experimentally show that evaluations on RobustSpring indicate real-world robustness. RobustSpring is a new computer vision benchmark to treat robustness as a first-class citizen, fostering models that are accurate and resilient. It is available at https://spring-benchmark.org.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RobustSpring, a benchmark dataset derived from the Spring dataset by applying 20 image corruptions (noise, blur, color changes, quality degradations, weather distortions) in a time-, stereo-, and depth-consistent manner, yielding 20,000 corrupted images. It proposes a new corruption robustness metric for comparing model resilience in optical flow, scene flow, and stereo tasks. The benchmark is integrated with the original Spring benchmark for two-axis (accuracy and robustness) evaluations. The authors benchmark a selection of models, note wide variation in robustness across corruption types, and claim that RobustSpring evaluations indicate real-world robustness.
Significance. If validated against real-world data, RobustSpring would provide a valuable new resource for the field by elevating robustness evaluation to a first-class criterion alongside accuracy for optical flow, scene flow, and stereo. The time-/stereo-/depth-consistent corruption application is a technical strength that avoids common inconsistencies in prior robustness benchmarks. Public release of the dataset and integration with the existing Spring benchmark supports reproducibility and adoption. The significance is currently tempered by the absence of quantitative evidence linking benchmark performance to deployment conditions.
major comments (2)
- [§4.3] §4.3 (Experiments and Results): The central claim that 'evaluations on RobustSpring indicate real-world robustness' is unsupported. No rank correlation, transfer learning experiments, or comparison to real captured corruptions (e.g., KITTI rain sequences or nuScenes adverse weather) is reported to establish that performance under the 20 synthetic corruptions predicts resilience outside the benchmark. This link is load-bearing for the paper's motivation and conclusions.
- [§3.2] §3.2 (Robustness Metric): The new corruption robustness metric is introduced at a conceptual level but lacks an explicit mathematical definition, including the precise aggregation formula across the 20 corruptions, normalization procedure, and handling of per-model baselines. Without this, it is difficult to assess whether the metric validly quantifies resilience or is reproducible.
minor comments (2)
- [Figure 2] Figure 2: The example corrupted images would be clearer if accompanied by quantitative severity metrics (e.g., PSNR or SSIM values) for each corruption type to allow readers to gauge intensity.
- [Related Work] Related Work: The discussion of prior robustness benchmarks (e.g., ImageNet-C, Cityscapes-C) could explicitly contrast the multi-task consistent corruption approach used here versus single-task or inconsistent applications in earlier work.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions we plan to incorporate to strengthen the paper.
read point-by-point responses
-
Referee: [§4.3] §4.3 (Experiments and Results): The central claim that 'evaluations on RobustSpring indicate real-world robustness' is unsupported. No rank correlation, transfer learning experiments, or comparison to real captured corruptions (e.g., KITTI rain sequences or nuScenes adverse weather) is reported to establish that performance under the 20 synthetic corruptions predicts resilience outside the benchmark. This link is load-bearing for the paper's motivation and conclusions.
Authors: We appreciate the referee's emphasis on validating the connection to real-world conditions. The manuscript's claim is grounded in the observation that the 20 corruptions were selected and applied to mimic common real-world degradations (noise, weather, etc.), with results showing substantial variation in model robustness that aligns with known sensitivities in the literature. However, we acknowledge that direct quantitative evidence such as rank correlations or comparisons to real captured sequences is not currently reported. To address this, we will add a new analysis subsection in the revised §4.3 that includes rank correlation where feasible with available real-world data and a clearer discussion of the benchmark's role as a proxy. This revision will better support the conclusions. revision: yes
-
Referee: [§3.2] §3.2 (Robustness Metric): The new corruption robustness metric is introduced at a conceptual level but lacks an explicit mathematical definition, including the precise aggregation formula across the 20 corruptions, normalization procedure, and handling of per-model baselines. Without this, it is difficult to assess whether the metric validly quantifies resilience or is reproducible.
Authors: We thank the referee for identifying this gap in presentation. Section 3.2 describes the metric at a high level as quantifying resilience via performance degradation under the consistent corruptions. We agree that an explicit mathematical formulation is necessary for full reproducibility. In the revised manuscript, we will expand §3.2 to include the precise formula: the robustness score as the normalized average of relative error increases across the 20 corruptions (with details on aggregation as mean, normalization relative to clean performance, and baseline handling per model). This will make the metric fully specified and easier to implement. revision: yes
Circularity Check
No significant circularity: benchmark creation is self-contained
full rationale
The paper introduces RobustSpring by applying 20 known image corruptions (noise, blur, weather, etc.) in a time-/stereo-/depth-consistent manner to the existing Spring dataset, then defines a corruption robustness metric and benchmarks models on the resulting 20,000 images. No derivation chain, equations, or predictions reduce to fitted parameters or self-referential inputs by construction. The claim that evaluations indicate real-world robustness is presented as an experimental observation from model benchmarking rather than a tautological result. The work is a dataset and evaluation framework without load-bearing self-citations or ansatzes that collapse into the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Image corruptions are applied in a time-, stereo-, and depth-consistent manner.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose a corruption robustness metric based on Lipschitz continuity... Rc_M = M[f(I), f(Ic)]
-
IndisputableMonolith/Foundation/DimensionForcing.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
RobustSpring applies 20 different image corruptions... in a time-, stereo-, and depth-consistent manner
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.