
arxiv: 2605.10603 · v3 · pith:TOM435F3new · submitted 2026-05-11 · 💻 cs.CV

Segment Anything with Robust Uncertainty-Accuracy Correlation

Pith reviewed 2026-05-21 08:08 UTC · model grok-4.3

classification 💻 cs.CV
keywords segment anything · uncertainty estimation · domain shift · robust segmentation · pixel-wise uncertainty · adversarial training · zero-shot generalization · calibration

The pith

RUAC adds an uncertainty head to SAM and trains it against joint style-deformation attacks so uncertainty reliably flags pixel errors even after domain shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Mask-level Confidence Confusion in SAM, where a single IoU-based mask score does not reflect pixel-wise reliability near boundaries once the input domain changes. The authors model those changes as combined appearance shifts and non-rigid deformations, then attach a lightweight uncertainty head that is trained with a collaborative style-deformation attack. An additional Uncertainty-Accuracy Alignment step forces the uncertainty values to stay high exactly where the segmentation is wrong. Experiments across 23 zero-shot domains show both higher segmentation quality and stronger correlation between reported uncertainty and actual accuracy. A reader would care because trustworthy uncertainty lets downstream systems know when to trust or discard the model's output under real-world variation.
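To make Mask-level Confidence Confusion concrete, here is a minimal sketch on a toy example. It assumes binary masks stored as lists of rows; the helper names `iou` and `error_map` are illustrative and do not come from the paper. A single mask-level score can look acceptable while every error sits on one boundary, which only a pixel-wise map reveals.

```python
def iou(pred, gt):
    """Mask-level IoU between two binary masks given as lists of rows."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    return inter / union if union else 1.0


def error_map(pred, gt):
    """Pixel-wise error indicator: 1 where prediction and ground truth differ."""
    return [[int(p != g) for p, g in zip(pr, gr)] for pr, gr in zip(pred, gt)]


# Toy 6x6 ground truth: a 4x4 square object (rows 1-4, columns 1-4).
gt = [[1 if 1 <= r <= 4 and 1 <= c <= 4 else 0 for c in range(6)] for r in range(6)]
# Toy prediction that misses the right-hand boundary column of the object.
pred = [[1 if 1 <= r <= 4 and 1 <= c <= 3 else 0 for c in range(6)] for r in range(6)]

score = iou(pred, gt)          # one scalar for the whole mask
errors = error_map(pred, gt)   # where the mask is actually wrong
error_columns = {c for row in errors for c, e in enumerate(row) if e}
```

In this toy case the scalar score says nothing about where the mask fails, while `error_columns` shows that all wrong pixels lie on a single boundary column.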

Core claim

RUAC adds a lightweight uncertainty head to the Segment Anything Model, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations, resulting in improved segmentation quality and stronger uncertainty-accuracy correlation across 23 zero-shot domains.

What carries the argument

The collaborative style-deformation attack that jointly perturbs texture and geometry, together with Uncertainty-Accuracy Alignment that enforces consistent correlation between uncertainty values and segmentation errors.
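The idea of perturbing texture and geometry together can be sketched as follows. This is not the paper's attack: the paper learns its perturbations adversarially with dedicated networks, whereas this sketch applies fixed, hand-chosen parameters. The function names and the `gain`, `bias`, and `offsets` parameters are assumptions made for illustration only.

```python
def style_shift(img, gain, bias):
    """Appearance change: rescale contrast around the image mean, then shift brightness."""
    flat = [v for row in img for v in row]
    mean = sum(flat) / len(flat)
    return [[(v - mean) * gain + mean + bias for v in row] for row in img]


def warp(img, offsets):
    """Geometry change: nearest-neighbour backward warp by a per-pixel (dy, dx) offset field."""
    h, w = len(img), len(img[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            dy, dx = offsets[r][c]
            rr = min(max(r + dy, 0), h - 1)  # clamp to image bounds
            cc = min(max(c + dx, 0), w - 1)
            row.append(img[rr][cc])
        out.append(row)
    return out


def perturb(img, mask, gain, bias, offsets):
    """Apply both perturbations; the mask is warped with the same field so labels stay aligned."""
    return warp(style_shift(img, gain, bias), offsets), warp(mask, offsets)


# Toy 2x2 image and mask; every pixel samples from its right-hand neighbour.
img = [[0.0, 1.0], [0.0, 1.0]]
mask = [[0, 1], [0, 1]]
offsets = [[(0, 1), (0, 1)], [(0, 1), (0, 1)]]
new_img, new_mask = perturb(img, mask, gain=2.0, bias=0.1, offsets=offsets)
```

Warping the mask with the same offset field as the image is the design choice that matters here: a geometric perturbation only yields a valid training pair if the labels move with the pixels.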

If this is right

  • Segmentation quality rises in zero-shot domains that differ in both appearance and shape from the training data.
  • Uncertainty maps more accurately mark pixels near erroneous boundaries rather than giving uniform mask-level scores.
  • The correlation between uncertainty and accuracy remains high even when inputs undergo simultaneous texture and geometry changes.
  • Downstream tasks can use the uncertainty values to reject or re-query unreliable regions with greater confidence.
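The last point, rejecting unreliable regions, can be sketched as a simple thresholding rule. The function name, the threshold values, and the toy per-pixel numbers below are all illustrative assumptions, not values from the paper.

```python
def selective_accuracy(correct, uncertainty, threshold):
    """Accuracy and coverage over pixels whose uncertainty is at or below the threshold."""
    kept = [c for c, u in zip(correct, uncertainty) if u <= threshold]
    coverage = len(kept) / len(correct)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return accuracy, coverage


# Made-up per-pixel outcomes (1 = correct) and uncertainty values.
correct = [1, 1, 1, 0, 1, 0, 1, 1]
uncertainty = [0.1, 0.2, 0.1, 0.9, 0.3, 0.8, 0.2, 0.7]

acc_all, cov_all = selective_accuracy(correct, uncertainty, threshold=1.0)  # keep everything
acc_sel, cov_sel = selective_accuracy(correct, uncertainty, threshold=0.5)  # reject uncertain pixels
```

When uncertainty tracks errors well, lowering the threshold trades coverage for accuracy. In this toy data, rejecting pixels above 0.5 removes both wrong pixels at the cost of also discarding one correct pixel.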

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same uncertainty-head approach could be attached to other large segmentation or detection models that currently rely on single scalar scores.
  • Real deployment in medical imaging or robotics might gain from using the improved uncertainty to trigger human review only on high-uncertainty regions.
  • Further tests with deformation patterns absent from the current attack would clarify how broadly the joint-training idea generalizes.

Load-bearing premise

The collaborative style-deformation attack that jointly perturbs texture and geometry sufficiently models real out-of-domain appearance shifts and non-rigid deformations.

What would settle it

Measuring the uncertainty-accuracy correlation on a fresh collection of real images that contain appearance and deformation shifts not generated by the style-deformation attack; if the correlation drops to the level of the original SAM, the central claim does not hold.
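One way such a measurement could be run is to score how well uncertainty ranks erroneous pixels above correct ones. The sketch below uses AUROC for this, which is an assumption here: this page does not state which correlation metric the paper reports. The toy values are invented for illustration.

```python
def auroc(scores, labels):
    """Probability that a random positive (label 1) outranks a random negative; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both erroneous and correct pixels")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up example: 1 marks a wrongly segmented pixel.
errors = [0, 0, 1, 0, 1, 0, 0, 1]
faithful = [0.1, 0.2, 0.9, 0.1, 0.8, 0.3, 0.2, 0.7]  # high exactly where errors occur
uninformative = [0.5] * 8                             # same value everywhere

auroc_faithful = auroc(faithful, errors)
auroc_flat = auroc(uninformative, errors)
```

A value near 1.0 means uncertainty reliably flags errors, and 0.5 means it carries no ranking information. The test proposed above amounts to checking whether this kind of score, computed on fresh real images, stays well above the original SAM's.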

Figures

Figures reproduced from arXiv: 2605.10603 by Hongyou Zhou, Ling Shao, Marc Toussaint, Zihan Ye.

Figure 1: Overview of Robust Uncertainty-Accuracy Correlation (RUAC). Both panels share the same out-of-domain test images spanning four categories (Objects, Scenes, Scientific, Egocentric). (a) Vanilla SAM2 trained on a fixed source domain produces unexplainable segmentation and confused confidence maps, exhibiting Mask-level Confidence Confusion (MCC). (b) RUAC introduces a dynamic training domain via Adversarial …

Figure 2: Overview of the RUAC Framework. We formulate adversarial training as a minimax game between attackers and the segmentation model. The Style Adversarial Network (ψs) constructs an object graph from ground-truth masks and visual features, then refines per-object style statistics via GCN layers to generate semantically coherent styled images. The Deformation Network (ψd) predicts a dense offset field from SAM…

Figure 3: Qualitative segmentation comparison across diverse domains. From top to bottom: Scene (Cityscapes), Scientific (IBD aerial building imagery), Object (mixed industrial objects), and Egocentric (hand-object interaction). Columns show input, ground truth, and predictions from SAM2, Bayes-SAM2, and RUAC (ours). RUAC produces more complete and accurate masks, particularly for challenging cases such as fine-grai…

Figure 4: Confidence and uncertainty visualization for three object instances from the first row of …

Figure 5: Visualization of adversarial perturbations (zoomed to central region). (a) Original input image. (b) Without GCN, style perturbation is applied globally, causing uniform darkening across the entire scene. (c) With GCN capturing inter-object correlations, perturbations become object-aware, enabling more targeted attacks while preserving global semantics (e.g., the central elephant). (d) Geometric deformatio…
Original abstract

Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://hongyouzhou.github.io/ruac/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RUAC as an extension to SAM that adds a lightweight uncertainty head. This head is trained via a collaborative style-deformation attack that jointly perturbs texture and geometry, together with an Uncertainty-Accuracy Alignment loss intended to ensure uncertainty maps consistently flag erroneous pixels under adversarial perturbations. The central claim is that RUAC improves segmentation quality and produces more faithful pixel-wise uncertainty estimates exhibiting stronger uncertainty-accuracy correlation across 23 zero-shot domains, thereby mitigating Mask-level Confidence Confusion (MCC) under appearance shifts and non-rigid deformations.

Significance. If the reported gains in correlation and segmentation quality hold under scrutiny, the work offers a practical route to more reliable uncertainty quantification for SAM in out-of-domain settings. The motivation contrasting texture-biased network shortcuts with shape-centric human vision is well-grounded. Credit is due for the public project page and the explicit focus on zero-shot evaluation across a broad set of domains.

major comments (2)
  1. [§ Experiments] § Experiments (results on 23 domains): the central claim of improved uncertainty-accuracy correlation and segmentation quality across all 23 zero-shot domains requires quantitative support (e.g., correlation coefficients, IoU deltas, error bars, and ablation tables). If these metrics are present only in supplementary material or figures without clear statistical testing against baselines, the generalization statement remains under-supported.
  2. [Method] Method section describing the collaborative style-deformation attack: the training distribution is defined by joint texture and geometry perturbations. The manuscript does not report any overlap metric, qualitative comparison, or ablation that verifies this synthetic distribution sufficiently covers the actual appearance and deformation shifts present in the 23 test domains (e.g., local elastic warps or sensor-specific artifacts). This assumption is load-bearing for the claim that the alignment loss yields generally stronger correlation rather than attack-specific behavior.
minor comments (2)
  1. [Abstract] Abstract: the statement of improvements across 23 domains would be strengthened by a single sentence citing the key quantitative gains (e.g., average correlation increase or IoU lift) rather than leaving all numbers to the body.
  2. [Introduction] Notation: ensure MCC and RUAC are defined on first use and used consistently; the current abbreviation list appears incomplete for readers unfamiliar with the prior SAM literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered the major comments and provide point-by-point responses below. We believe the revisions will strengthen the presentation of our results and methodological justifications.

Point-by-point responses
  1. Referee: [§ Experiments] § Experiments (results on 23 domains): the central claim of improved uncertainty-accuracy correlation and segmentation quality across all 23 zero-shot domains requires quantitative support (e.g., correlation coefficients, IoU deltas, error bars, and ablation tables). If these metrics are present only in supplementary material or figures without clear statistical testing against baselines, the generalization statement remains under-supported.

    Authors: We thank the referee for this observation. While the main results on correlation and segmentation quality across the 23 domains are presented in Section 4 with supporting figures, we acknowledge that more explicit statistical testing would enhance the presentation. In the revised manuscript, we will include a dedicated table with correlation coefficients, IoU deltas, standard errors, and results of statistical significance tests against baselines. We will also bring key ablation results from the supplementary material into the main text for better visibility. revision: yes

  2. Referee: [Method] Method section describing the collaborative style-deformation attack: the training distribution is defined by joint texture and geometry perturbations. The manuscript does not report any overlap metric, qualitative comparison, or ablation that verifies this synthetic distribution sufficiently covers the actual appearance and deformation shifts present in the 23 test domains (e.g., local elastic warps or sensor-specific artifacts). This assumption is load-bearing for the claim that the alignment loss yields generally stronger correlation rather than attack-specific behavior.

    Authors: We agree that additional verification would strengthen the claims. The attack is motivated by the need to cover joint appearance and deformation shifts observed in the test domains. In the revised manuscript, we will add qualitative comparisons of the perturbations with examples from the 23 domains, as well as an ablation analyzing the coverage through metrics such as distribution similarity and deformation statistics. revision: yes
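The paired significance testing promised in the first response could look like the following sketch: an exact two-sided Wilcoxon signed-rank test over per-dataset scores. The choice of this particular test is an assumption for illustration, and the numbers are made-up toy values, not results from the paper. Ranks are doubled internally so that tie-averaged ranks stay integers.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (W_plus, p_value). Zero differences are dropped; tied absolute
    differences receive average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # doubled ranks
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks2[order[k]] = (i + 1) + (j + 1)  # 2 * average of ranks i+1..j+1
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Null distribution: each rank joins the positive sum with probability 1/2.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    # Two-sided p-value: outcomes at least as far from the null mean as observed.
    dev = abs(2 * w_plus2 - total2)
    extreme = sum(c for s, c in enumerate(counts) if abs(2 * s - total2) >= dev)
    return w_plus2 / 2, extreme / 2 ** n


# Made-up per-dataset scores for two methods (toy values only).
scores_a = [0.61, 0.58, 0.75, 0.72, 0.59, 0.73]
scores_b = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62]
w_plus, p_value = wilcoxon_signed_rank(scores_a, scores_b)
```

A non-parametric paired test fits this setting because per-dataset score differences need not be normally distributed. With only six pairs, the smallest achievable two-sided p-value is 2/64, so a 23-dataset comparison has far more resolving power.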

Circularity Check

0 steps flagged

No circularity: new components and empirical validation


The paper introduces a lightweight uncertainty head, a collaborative style-deformation attack for training, and an Uncertainty-Accuracy Alignment loss. These are presented as novel additions to SAM rather than quantities derived from previously fitted parameters or self-referential equations. Claims of improved segmentation quality and stronger uncertainty-accuracy correlation are supported by empirical results across 23 zero-shot domains, not by construction from the training distribution itself. No self-citations, uniqueness theorems, or ansatzes that reduce the central result to its inputs appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the modeling choice that appearance shifts plus non-rigid deformations capture the main stresses on calibration; no explicit free parameters or invented entities are named in the abstract.

axioms (1)
  • Domain assumption: out-of-domain variation can be modeled as appearance shifts and non-rigid deformations that jointly stress calibration.
    Invoked to justify the collaborative style-deformation attack in the motivation section of the abstract.

pith-pipeline@v0.9.0 · 5688 in / 1101 out tokens · 36853 ms · 2026-05-21T08:08:51.486226+00:00 · methodology


