Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox

Allan Kazakov; Duygu Cakir; Hilal Kurt İrfanoğlu; Yavuz İrfanoğlu

arxiv: 2605.23747 · v1 · pith:HUSUTGCWnew · submitted 2026-05-22 · 💻 cs.CV

Pith reviewed 2026-05-25 04:34 UTC · model grok-4.3

classification 💻 cs.CV

keywords dense material segmentation · vision transformer · SegFormer · generalization paradox · Apple DMS dataset · stabilized training · mIoU · out-of-distribution performance

The pith

Stabilized SegFormer training reaches 0.4572 mIoU on the original Apple DMS split while exposing how easier repartitions mislead on real-world performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper revives the Apple Dense Material Segmentation benchmark by testing modern Vision Transformer models on the task of pixel-wise classification of physical surface properties. It shows that standard training fails because of high-variance gradients on amorphous texture fields, so the authors introduce a stabilized recipe with High-Fidelity Logit Projection, Query Entropy Regularization, and physics-compliant augmentations. Their optimized SegFormer-B5 model sets a new record of 0.4572 mIoU on the original dataset split, beating the prior convolutional baseline. At the same time the work identifies a Generalization Paradox in which an 80/10/10 repartition lifts the score to 0.5276 mIoU yet produces distributional homogenization that harms out-of-distribution behavior in practice. Readers care because the result indicates that material perception still requires careful benchmark design to produce physically grounded models rather than inflated numbers.

Core claim

By applying a stabilized training recipe of High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific physics-compliant augmentation pipeline, the optimized SegFormer-B5 architecture attains 0.4572 mIoU on the original Apple DMS split and thereby surpasses the prior convolutional baseline; the same models reach 0.5276 mIoU under an 80/10/10 repartition, yet expert qualitative analysis shows that repartition induces distributional homogenization that degrades real-world out-of-distribution performance.
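
Both headline numbers are mean Intersection-over-Union (mIoU) scores. As a reminder of what that metric measures, here is a minimal sketch that computes mIoU from a confusion matrix on toy labels. The function names and the toy data are illustrative assumptions; the paper's evaluation code is not reproduced on this page, and its handling of ignored or absent classes may differ.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Rows index the ground-truth class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average per-class IoU = TP / (TP + FP + FN).

    Classes that appear in neither labels nor predictions are skipped
    (an assumption; other conventions exist).
    """
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # predicted c, true class differs
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy pixel labels for three hypothetical material classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
miou = mean_iou(cm)  # per-class IoU: 1/3, 2/3, 1/2
print(round(miou, 4))
```

Because every class contributes equally to the average, rare materials weigh as much as common ones, which is one reason the score can shift noticeably when a repartition changes which classes appear in the test set.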

What carries the argument

The stabilized training recipe of High-Fidelity Logit Projection, Query Entropy Regularization, and physics-compliant augmentations that counters high-variance gradients on amorphous texture fields.

If this is right

Material segmentation on texture fields requires domain-specific stabilization techniques beyond standard Vision Transformer training.
The original dataset split supplies a stricter and more trustworthy benchmark than repartitioned versions.
Releasing the recovered dataset index allows the community to avoid homogenized splits in future work.
Physically grounded AI for surface properties benefits from explicit attention to gradient stability and augmentation compliance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Stabilization methods of this form may transfer to other dense prediction tasks that involve high-variance natural textures.
Dataset repartitioning decisions should be cross-checked with qualitative out-of-distribution probes rather than metric gains alone.
Combining the released training framework with larger foundation models could test whether the same paradox appears at greater scale.
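
One way to cross-check a repartition beyond metric gains is to measure how closely the test split's class distribution mirrors the training split's. The sketch below builds a seeded random 80/10/10 split and compares class histograms with the Jensen-Shannon divergence. It is an illustrative proxy only, not the authors' protocol: the helper names are hypothetical, and assigning a single dominant class per image is a simplifying assumption.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple


def split_80_10_10(ids: Sequence[str], seed: int = 0
                   ) -> Tuple[List[str], List[str], List[str]]:
    """Seeded random train/val/test split in 80/10/10 proportions."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def class_histogram(ids: Sequence[str], labels: Dict[str, int],
                    num_classes: int) -> List[float]:
    """Normalized class frequencies over the given image ids."""
    counts = [0] * num_classes
    for i in ids:
        counts[labels[i]] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    def kl(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical dataset: 100 images, each tagged with a dominant class 0..3.
ids = [f"img_{k:03d}" for k in range(100)]
labels = {name: k % 4 for k, name in enumerate(ids)}

train, val, test = split_80_10_10(ids, seed=0)
divergence = js_divergence(class_histogram(train, labels, 4),
                           class_histogram(test, labels, 4))
print(len(train), len(val), len(test), round(divergence, 4))
```

A divergence near zero says only that train and test share class frequencies. It does not capture scene, lighting, or texture similarity, so it would complement a qualitative probe rather than replace it.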

Load-bearing premise

Expert qualitative analysis can reliably detect when an 80/10/10 repartition has induced distributional homogenization that degrades real-world out-of-distribution performance.

What would settle it

A quantitative test on an external collection of real-world material photographs not present in the original dataset, where the model trained on the 80/10/10 split underperforms the model trained on the original split.
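
Such a test could be reported with an uncertainty estimate rather than a single number. The sketch below runs a paired bootstrap over per-image scores from two models evaluated on the same external images. The scores are made-up placeholders, not results from the paper, and the function name is hypothetical. Note also that averaging per-image IoU is not identical to dataset-level mIoU, so this is an approximation of the comparison described above.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(scores_a: Sequence[float], scores_b: Sequence[float],
                          n_resamples: int = 2000, seed: int = 0
                          ) -> Tuple[float, float, float]:
    """Return (mean difference, 2.5th percentile, 97.5th percentile) of a - b.

    Images are resampled with replacement; each image keeps its pair of scores.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# Placeholder per-image scores on a hypothetical external photo set.
original_split_model = [0.50, 0.60, 0.55, 0.70]
repartition_model = [0.40, 0.45, 0.50, 0.60]

mean_diff, ci_lo, ci_hi = paired_bootstrap_diff(original_split_model,
                                                repartition_model)
print(round(mean_diff, 3), ci_lo > 0)
```

An interval that excludes zero would support the claim that the repartition-trained model underperforms on external data; an interval that straddles zero would leave the paradox unsettled.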

Figures

Figures reproduced from arXiv: 2605.23747 by Allan Kazakov, Duygu Cakir, Hilal Kurt İrfanoğlu, Yavuz İrfanoğlu.

Original abstract

Material segmentation, the pixel-wise classification of physical surface properties, remains a challenging problem in computer vision, requiring physicochemical understanding distinct from object-centric parsing. Despite the introduction of the rigorous Apple Dense Material Segmentation (DMS) dataset, the benchmark has suffered from attrition and stagnation, increasingly overshadowed by geometry-biased foundation models. In this paper, we revive the Apple-DMS benchmark to establish a modern Vision Transformer baseline. We conduct an exhaustive evaluation of SegFormer and Mask2Former architectures, revealing that standard training paradigms fail on amorphous texture fields due to high-variance gradients. To address this, we introduce a stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline. Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split, significantly surpassing the prior convolutional baseline. Furthermore, we identify a critical "Generalization Paradox": while re-partitioning the dataset into a data-rich 80/10/10 split inflates the metric to 0.5276 mIoU, expert qualitative analysis reveals this induces distributional homogenization, severely degrading real-world, out-of-distribution performance. By releasing our recovered dataset index and robust training framework, we demonstrate that material perception is far from solved and urge the community to leverage the rigorous original split to drive genuine progress in physically grounded artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They report a concrete new SOTA of 0.4572 mIoU on the original DMS split via a stabilized SegFormer recipe, but the generalization paradox claim depends on thin expert qualitative review without quantitative OOD support.

The main thing to know is that the paper gets SegFormer-B5 to 0.4572 mIoU on the original Apple DMS split, beating the old convolutional baseline, by adding High-Fidelity Logit Projection, Query Entropy Regularization, and physics-compliant augmentations to handle high-variance gradients on texture fields. They also flag a generalization paradox where the 80/10/10 repartition lifts the score to 0.5276 but supposedly hurts real-world performance due to distributional homogenization, based on expert qualitative analysis. They release the dataset index and training framework to support this.

Referee Report

2 major / 0 minor

Summary. The paper revives the Apple Dense Material Segmentation (DMS) benchmark by exhaustively evaluating SegFormer and Mask2Former architectures on the original dataset split. It identifies failure of standard training due to high-variance gradients on amorphous textures and introduces a stabilized recipe (High-Fidelity Logit Projection, Query Entropy Regularization, physics-compliant augmentations). The optimized SegFormer-B5 reports a new SOTA of 0.4572 mIoU on the original split (vs. prior convolutional baseline), while an 80/10/10 repartition yields 0.5276 mIoU but is argued—via expert qualitative analysis—to induce distributional homogenization that harms real-world OOD performance. The authors release the recovered dataset index and training framework.

Significance. If the reported mIoU numbers hold under the released framework, the work provides a concrete modern ViT baseline for material segmentation and usefully flags risks of repartitioning small domain-specific datasets. Releasing the dataset index and training code is a clear strength that supports reproducibility. However, the significance of the 'Generalization Paradox' claim is limited by its reliance on unquantified qualitative review without OOD metrics or protocol, reducing its force as a call to prefer the original split.

major comments (2)

[Abstract] Abstract (Generalization Paradox paragraph): The assertion that the 80/10/10 repartition 'induces distributional homogenization' that 'severely degrad[es] real-world, out-of-distribution performance' rests exclusively on unspecified 'expert qualitative analysis' with no evaluation protocol, inter-rater agreement, quantitative proxy (e.g., external OOD set performance), or reproducibility criteria. This is load-bearing for the paper's recommendation to avoid the repartitioned split.
[Abstract] Abstract and § on training: The claim that 'standard training paradigms fail on amorphous texture fields due to high-variance gradients' is stated without supporting evidence such as gradient-norm statistics, variance measurements across runs, or ablation tables isolating the contribution of High-Fidelity Logit Projection and Query Entropy Regularization. The reported 0.4572 mIoU therefore cannot be attributed to specific components of the stabilized recipe.
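
The first comment asks for inter-rater agreement. A standard statistic for two raters is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a minimal illustration; the rating labels and the two raters are hypothetical, and nothing here reflects how the paper's expert review was actually conducted.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters on the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[k] * count_b[k] for k in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical judgments of whether a prediction is acceptable on an OOD image.
expert_1 = ["ok", "ok", "bad", "bad"]
expert_2 = ["ok", "bad", "bad", "bad"]
kappa = cohens_kappa(expert_1, expert_2)  # p_o = 0.75, p_e = 0.5
print(kappa)
```

Reporting a value like this alongside the assessment criteria would let readers judge how far the qualitative conclusion depends on a single reviewer.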

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and planned revisions to improve the manuscript.

Referee: [Abstract] Abstract (Generalization Paradox paragraph): The assertion that the 80/10/10 repartition 'induces distributional homogenization' that 'severely degrad[es] real-world, out-of-distribution performance' rests exclusively on unspecified 'expert qualitative analysis' with no evaluation protocol, inter-rater agreement, quantitative proxy (e.g., external OOD set performance), or reproducibility criteria. This is load-bearing for the paper's recommendation to avoid the repartitioned split.

Authors: We recognize the referee's concern regarding the reliance on qualitative analysis for the Generalization Paradox. While the expert review provides important context on real-world applicability that in-distribution metrics alone cannot capture, we agree to enhance the description. In the revised manuscript, we will elaborate on the qualitative evaluation protocol, including the assessment criteria and illustrative cases of OOD degradation. We stand by the recommendation for the original split, as the repartition leads to inflated metrics that do not reflect true generalization. We will not add new quantitative OOD experiments at this stage. revision: partial
Referee: [Abstract] Abstract and § on training: The claim that 'standard training paradigms fail on amorphous texture fields due to high-variance gradients' is stated without supporting evidence such as gradient-norm statistics, variance measurements across runs, or ablation tables isolating the contribution of High-Fidelity Logit Projection and Query Entropy Regularization. The reported 0.4572 mIoU therefore cannot be attributed to specific components of the stabilized recipe.

Authors: The full paper contains ablation studies demonstrating the effectiveness of the stabilized recipe. To strengthen the attribution, we will include additional figures and tables showing gradient norm statistics during training and variance across multiple runs with standard vs. stabilized training. We will also provide more detailed ablations isolating each component's contribution to the final 0.4572 mIoU performance. revision: yes
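
The gradient-norm statistics promised in the second response could be summarized per run along the following lines. This is a sketch under stated assumptions: the per-step norms are invented placeholder values rather than measurements from the paper, and the summary function is hypothetical. In practice the norms would be logged by the training loop.

```python
import math
import statistics
from typing import Dict, Sequence


def grad_norm_summary(norms: Sequence[float]) -> Dict[str, float]:
    """Mean, population variance, and coefficient of variation of logged norms."""
    mean = statistics.fmean(norms)
    var = statistics.pvariance(norms)
    cv = math.sqrt(var) / mean if mean > 0 else float("inf")
    return {"mean": mean, "variance": var, "cv": cv}


# Placeholder per-step gradient norms for two hypothetical training runs.
standard_run = [1.0, 9.0, 2.0, 8.0]
stabilized_run = [4.0, 5.0, 5.0, 6.0]

standard = grad_norm_summary(standard_run)
stabilized = grad_norm_summary(stabilized_run)
print(standard["variance"], stabilized["variance"])
```

The coefficient of variation is included because it stays comparable when two recipes differ in overall gradient scale, for example through different loss weighting.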

Circularity Check

0 steps flagged

No circularity: empirical mIoU results and qualitative observation are self-contained

The paper reports direct empirical measurements (0.4572 mIoU on original split, 0.5276 mIoU on 80/10/10 split) from training SegFormer-B5 and Mask2Former with a stabilized recipe on the released DMS dataset index. No equations, fitted parameters, or self-citations reduce these numbers to inputs defined by the authors themselves. The Generalization Paradox rests on expert qualitative analysis rather than any self-definitional loop, uniqueness theorem, or ansatz smuggled via citation. The derivation chain consists of standard model evaluation and observation; it does not collapse by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard computer-vision training assumptions plus the domain-specific premise that the original DMS split better reflects real-world material distributions; no new physical constants or invented entities are introduced.

free parameters (1)

training recipe hyperparameters
Learning rates, regularization weights, and augmentation parameters are chosen to stabilize training; exact values not stated in abstract.

axioms (1)

domain assumption: Standard training paradigms fail on amorphous texture fields due to high-variance gradients
Invoked to justify the need for the stabilized recipe.

pith-pipeline@v0.9.0 · 5817 in / 1368 out tokens · 39398 ms · 2026-05-25T04:34:23.894707+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Our optimized SegFormer-B5 achieves a new State-of-the-Art (SOTA) of 0.4572 mIoU on the original dataset split... Generalization Paradox... expert qualitative analysis reveals this induces distributional homogenization
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

stabilized training recipe featuring High-Fidelity Logit Projection, Query Entropy Regularization, and a domain-specific, physics-compliant augmentation pipeline

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 9 internal anchors

[1]

In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T

Upchurch, P., Niu, R.: A Dense Material Segmentation Dataset for Indoor and Outdoor Scene Parsing. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) ECCV 2022. LNCS, vol. 13668, pp. 450–466. Springer, Cham (2022). https://link.springer.com/chapter/10.1007/978-3-031-20074-8_26

work page doi:10.1007/978-3-031-20074-8_26 2022
[2]

In: Rogowitz, B.E., Pappas, T.N

Adelson, E.H.: On Seeing Stuff: The Perception of Materials by Humans and Machines. In: Rogowitz, B.E., Pappas, T.N. (eds.) Human Vision and Electronic Imaging VI. Proc. SPIE, vol. 4299, pp. 1–12. SPIE, Bellingham (2001). https://doi.org/10.1117/12.429489

work page doi:10.1117/12.429489 2001
[3]

Sharan, L., Rosenholtz, R., Adelson, E.H.: Recognizing Materials Using Perceptually Inspired Features. Int. J. Comput. Vis. 103(3), 348–371 (2013). https://doi.org/10.1007/s11263-013-0609-0

work page doi:10.1007/s11263-013-0609-0 2013
[4]

Describing Textures in the Wild

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing Textures in the Wild. In: Proc. CVPR, pp. 3606–3613. IEEE (2014). https://arxiv.org/abs/1311.3618

work page internal anchor Pith review Pith/arXiv arXiv 2014
[5]

ACM Trans

Bell, S., Upchurch, P., Snavely, N., Bala, K.: OpenSurfaces: A Richly Annotated Catalog of Surface Appearance. ACM Trans. Graph. (Proc. SIGGRAPH) 32(4), 111:1–111:17 (2013). https://doi.org/10.1145/2461912.2462002

work page doi:10.1145/2461912.2462002 2013
[6]

Material Recognition in the Wild with the Materials in Context Database

Bell, S., Upchurch, P., Snavely, N., Bala, K.: Material Recognition in the Wild with the Materials in Context Database. In: Proc. CVPR, pp. 3479–3487. IEEE (2015). https://arxiv.org/abs/1412.0623

work page internal anchor Pith review Pith/arXiv arXiv 2015
[7]

In: Proc

Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and Stuff Classes in Context. In: Proc. CVPR, pp. 1209–1218. IEEE (2018). https://arxiv.org/abs/1612.03716

work page 2018
[8]

Semantic Understanding of Scenes through the ADE20K Dataset

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene Parsing Through ADE20K Dataset. In: Proc. CVPR, pp. 633–641. IEEE (2017). https://arxiv.org/abs/1608.05442

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

Fully Convolutional Networks for Semantic Segmentation

Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. In: Proc. CVPR, pp. 3431–3440. IEEE (2015). https://arxiv.org/abs/1411.4038

work page internal anchor Pith review Pith/arXiv arXiv 2015
[10]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587 (2017). https://arxiv.org/abs/1706.05587

work page internal anchor Pith review Pith/arXiv arXiv 2017
[11]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proc. ICLR. OpenReview.net (2021). https://arxiv.org/abs/2010.11929

work page internal anchor Pith review Pith/arXiv arXiv 2021
[12]

In: Proc

Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In: Proc. NeurIPS, vol. 34, pp. 12077–12090 (2021). https://arxiv.org/abs/2105.15203

work page arXiv 2021
[13]

In: Proc

Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-Attention Mask Transformer for Universal Image Segmentation. In: Proc. CVPR, pp. 1290–

work page
[14]

https://arxiv.org/abs/2112.01527

IEEE (2022). https://arxiv.org/abs/2112.01527

work page arXiv 2022
[15]

Segment Anything

Kirillov, A., et al.: Segment Anything. In: Proc. ICCV, pp. 4015–4026. IEEE (2023). https://arxiv.org/abs/2304.02643

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

SAM 2: Segment Anything in Images and Videos

Ravi, N., et al.: SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:2408.00714 (2024). https://arxiv.org/abs/2408.00714

work page internal anchor Pith review Pith/arXiv arXiv 2024
[17]

SAM 3: Segment Anything with Concepts

Feichtenhofer, C., et al.: SAM 3: Segment Anything with Concepts. arXiv preprint arXiv:2511.16719 (2025). https://arxiv.org/abs/2511.16719

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Ji, W., et al.: Segment Anything Is Not Always Perfect: An Investigation of SAM on Different Real-World Applications. Mach. Intell. Res. 21(4), 617–630 (2024). https://arxiv.org/abs/2304.05750

work page arXiv 2024
