SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving
Pith reviewed 2026-06-29 13:43 UTC · model grok-4.3
The pith
A SAM-based pipeline converts bounding-box labels into dense semantic masks for the ZOD autonomous-driving dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU.
What carries the argument
SAM-based annotation pipeline that converts bounding boxes into semantic masks, followed by manual curation of a 2,300-frame subset.
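The box-to-mask step can be pictured with a minimal sketch. This is not the authors' code: `predict_mask` is a hypothetical callable standing in for a box-prompted SAM call, `rectangle_stub` simply fills the box so the example runs without model weights, and the class ids and the large-boxes-first paint order are assumptions made for illustration.

```python
import numpy as np

def rectangle_stub(image, box):
    """Hypothetical stand-in for a box-prompted SAM call: fills the box."""
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask

def box_area(item):
    """Area of the box in a (class_id, box) pair."""
    _, (x0, y0, x1, y1) = item
    return (x1 - x0) * (y1 - y0)

def boxes_to_semantic_mask(image, boxes, predict_mask=rectangle_stub):
    """Composite per-box binary masks into one label map (0 = background).

    boxes: list of (class_id, (x0, y0, x1, y1)) in pixel coordinates.
    """
    semantic = np.zeros(image.shape[:2], dtype=np.uint8)
    # Assumption: paint large boxes first so smaller objects end up on top.
    for class_id, box in sorted(boxes, key=box_area, reverse=True):
        mask = predict_mask(image, box)
        semantic[mask] = class_id
    return semantic

# Made-up example: class 1 = vehicle, class 2 = pedestrian.
image = np.zeros((6, 8, 3), dtype=np.uint8)
boxes = [(1, (0, 0, 6, 4)), (2, (2, 1, 4, 3))]
label_map = boxes_to_semantic_mask(image, boxes)
```

Passing the mask predictor in as a callable keeps the compositing logic testable on its own; a real SAM predictor would replace the stub.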
If this is right
- The generated annotations support training and evaluation of CLFT and DeepLabV3+ models across weather conditions, with CLFT-Hybrid reaching up to 48.1% mIoU.
- Specialized models can be trained to improve performance on rare classes such as pedestrians, cyclists, and signs.
- The pipeline produces usable annotations on the Iseauto platform, reaching 77.5% mIoU.
- SAM-derived features enable bidirectional transfer learning between different sensor configurations without major performance loss.
- Public release of the code and annotations allows other researchers to reproduce and extend the work on ZOD.
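For readers unfamiliar with the metric behind the 48.1% and 77.5% figures, mIoU averages the per-class intersection-over-union. The sketch below computes it from a confusion matrix on a toy example; the function names are illustrative, and skipping classes absent from both prediction and target is an assumption, since the paper's exact averaging convention is not stated here.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows are ground-truth classes, columns are predicted classes."""
    idx = target.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou(cm):
    """Mean of per-class IoU over classes that occur at least once."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    present = union > 0  # assumption: skip classes absent from both maps
    return float((tp[present] / union[present]).mean())

# Toy 2x3 label maps with two classes.
target = np.array([[0, 0, 1], [0, 1, 1]])
pred = np.array([[0, 0, 1], [0, 0, 1]])
cm = confusion_matrix(pred, target, num_classes=2)
score = mean_iou(cm)  # class 0: 3/4, class 1: 2/3
```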
Where Pith is reading between the lines
- Bounding-box-only datasets from other driving projects could be turned into segmentation resources with similar pipelines, expanding available training data.
- The class-imbalance techniques tested here could be applied to improve detection of safety-critical objects in other imbalanced segmentation tasks.
- If the curation step can be further automated, the approach could scale to full datasets without the 36% acceptance filter.
- The cross-platform transfer results point to a practical route for adapting segmentation models when vehicle sensor rigs change.
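The imbalance the paper describes (rare classes under 1% of pixels) can be measured directly from label maps. The sketch below counts per-class pixel fractions and derives inverse-frequency weights. The weighting is a generic technique shown only for illustration: the abstract says the authors explore specialized models for rare classes, and it does not say they reweight a loss.

```python
import numpy as np

def class_pixel_fractions(label_maps, num_classes):
    """Fraction of all pixels belonging to each class."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for label_map in label_maps:
        counts += np.bincount(label_map.ravel(), minlength=num_classes)
    return counts / counts.sum()

def inverse_frequency_weights(fractions, eps=1e-6):
    """Normalized weights that grow as a class gets rarer (illustrative)."""
    raw = 1.0 / (fractions + eps)
    return raw / raw.sum()

# Made-up label map: one rare-class pixel out of 200.
lm = np.zeros((10, 20), dtype=np.int64)
lm[0, 0] = 1
fractions = class_pixel_fractions([lm], num_classes=2)
weights = inverse_frequency_weights(fractions)
```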
Load-bearing premise
The manual curation of the 2,300-frame subset at a 36% acceptance rate produces a reliable baseline without introducing selection bias that affects the reported mIoU scores.
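A quick arithmetic check shows what the acceptance rate implies. Assuming the rate is defined as accepted frames divided by manually reviewed frames (the source does not spell this out), the reviewed pool is far smaller than the 100,000 processed frames.

```python
accepted_frames = 2_300      # from the abstract
acceptance_rate = 0.36       # from the abstract
processed_frames = 100_000   # "over 100,000", used here as a lower bound

# Assumption: acceptance rate = accepted / manually reviewed.
implied_reviewed = accepted_frames / acceptance_rate

# Share of the processed frames that ended up in the curated subset.
share_of_processed = accepted_frames / processed_frames
```

Under that assumption roughly 6,400 frames were reviewed, and the curated subset is at most about 2.3% of what was processed, which is why the selection-bias question matters.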
What would settle it
Retraining and evaluating the same models on an independently created pixel-level annotation set for a fresh random sample of ZOD frames and obtaining mIoU well below 48.1% would show the curated subset is not representative.
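One way to make "well below" concrete is a bootstrap interval on the gap between the two evaluation sets. The sketch below is an assumption-laden illustration, not a procedure from the paper: it uses mean per-frame IoU as a proxy for dataset-level mIoU, and every score in the example is made up.

```python
import numpy as np

def bootstrap_gap_ci(curated_iou, fresh_iou, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for mean(fresh) - mean(curated) over per-frame scores."""
    rng = np.random.default_rng(seed)
    curated = np.asarray(curated_iou, dtype=float)
    fresh = np.asarray(fresh_iou, dtype=float)
    gaps = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(curated, size=curated.size, replace=True)
        f = rng.choice(fresh, size=fresh.size, replace=True)
        gaps[i] = f.mean() - c.mean()
    lo, hi = np.quantile(gaps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Made-up per-frame scores for illustration only.
curated_scores = [0.50, 0.47, 0.49, 0.46, 0.48]
fresh_scores = [0.30, 0.35, 0.28, 0.33, 0.31]
lo, hi = bootstrap_gap_ci(curated_scores, fresh_scores)
```

An interval that lies entirely below zero would indicate that the fresh sample scores lower than the curated subset by more than resampling noise.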
Original abstract
Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a SAM-based annotation pipeline that generates dense semantic segmentation labels from bounding boxes in the ZOD dataset. The authors process over 100,000 frames, manually curate a 2,300-frame subset at a 36% acceptance rate, and use it to train and evaluate CLFT-Hybrid and DeepLabV3+ models, reporting up to 48.1% mIoU. The work also addresses class imbalance for rare classes such as pedestrians, validates the pipeline on the Iseauto platform with 77.5% mIoU, demonstrates transfer learning across sensors, and releases all code and annotations.
Significance. If the generated annotations prove reliable without significant selection bias, this contribution would be valuable for enabling semantic segmentation research on the multi-modal ZOD dataset, which currently lacks pixel-level labels. The bidirectional transfer learning results and focus on rare classes are relevant to autonomous driving applications. The public release of code and annotations supports reproducibility and is a notable strength.
Major comments (2)
- [Abstract and curation description] The manual curation of the 2,300-frame subset with a 36% acceptance rate risks introducing selection bias, as only frames where SAM-generated masks appear acceptable are retained. Since all mIoU results (including the 48.1% peak with CLFT-Hybrid), weather breakdowns, and rare-class experiments are reported exclusively on this filtered subset, the performance metrics may not generalize to the full ZOD distribution or uncurated data.
- [Validation of annotations] The manuscript lacks details on how the quality of the SAM-derived annotations was validated beyond the manual acceptance criterion. Additional quantitative validation, such as comparison to a small set of human-annotated ground truth or inter-annotator agreement, would strengthen the claim that the pipeline produces a reliable baseline.
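The inter-annotator agreement the referee asks for could be reported with Cohen's kappa on the accept/reject decisions of two reviewers. The sketch below is a plain implementation of the standard formula; the reviewer decisions are made up, and nothing in the source says the curation used two independent reviewers.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    observed = np.mean(a == b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(np.mean(a == lab) * np.mean(b == lab)
                   for lab in np.union1d(a, b))
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return float((observed - expected) / (1.0 - expected))

# Made-up decisions on ten frames: 1 = accept, 0 = reject.
reviewer_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
reviewer_b = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
kappa = cohens_kappa(reviewer_a, reviewer_b)
```

Here the reviewers agree on 8 of 10 frames, chance agreement is 0.52, and kappa comes out near 0.58.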
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our SAM-based annotation pipeline for ZOD. We address the two major concerns below with proposed revisions to improve clarity and transparency while preserving the pilot-study scope of the work.
- Referee: [Abstract and curation description] The manual curation of the 2,300-frame subset with a 36% acceptance rate risks introducing selection bias, as only frames where SAM-generated masks appear acceptable are retained. Since all mIoU results (including the 48.1% peak with CLFT-Hybrid), weather breakdowns, and rare-class experiments are reported exclusively on this filtered subset, the performance metrics may not generalize to the full ZOD distribution or uncurated data.
Authors: We agree this is a valid concern: the reported metrics are computed exclusively on the manually curated high-quality subset. This choice was intentional for a pilot study aimed at establishing a reliable baseline rather than claiming generalization to the full uncurated ZOD distribution. In the revised manuscript we will (1) add an explicit limitations paragraph discussing potential selection bias and its implications, (2) state that future work should evaluate the pipeline on uncurated frames, and (3) provide additional detail on the acceptance criteria used during curation. These changes will be made without altering the core claims or results.
Revision: yes
- Referee: [Validation of annotations] The manuscript lacks details on how the quality of the SAM-derived annotations was validated beyond the manual acceptance criterion. Additional quantitative validation, such as comparison to a small set of human-annotated ground truth or inter-annotator agreement, would strengthen the claim that the pipeline produces a reliable baseline.
Authors: The validation in the current manuscript rests on the manual curation process performed by domain experts, which yielded the 36% acceptance rate. We will expand the methods section with a more detailed description of the curation protocol, including the specific visual criteria applied and the number of reviewers involved where applicable. However, because ZOD itself provides no pixel-level ground truth, creating an independent human-annotated reference set for quantitative comparison (e.g., mIoU against manual labels) lies outside the scope and resources of this pilot study; we will explicitly note this limitation. No new quantitative validation data will be added, but the existing manual process will be documented more thoroughly.
Revision: partial
Circularity Check
No circularity; empirical pipeline with released data
The paper describes an annotation pipeline that uses SAM to convert bounding boxes to masks, followed by manual curation of a 2,300-frame subset and empirical training and evaluation of segmentation models (CLFT, DeepLabV3+), with mIoU reported on that data. No equations, derivations, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The results rest on an external data release and standard model training, which keeps the work self-contained.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The Segment Anything Model can be effectively prompted with bounding boxes to produce accurate semantic masks in road scenes without additional training.