
arxiv: 2605.28136 · v1 · pith:G2OWHDTXnew · submitted 2026-05-27 · 💻 cs.CV · cs.RO

SAM-Enhanced Segmentation on Road Datasets: Balancing Critical Classes in Autonomous Driving

Pith reviewed 2026-06-29 13:43 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords semantic segmentation · autonomous driving · Segment Anything Model · Zenseact Open Dataset · annotation pipeline · class imbalance · multi-modal data · transfer learning

The pith

A SAM-based pipeline converts bounding-box labels into dense semantic masks for the ZOD autonomous-driving dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a Segment Anything Model pipeline can turn existing bounding-box annotations into pixel-level semantic segmentation labels for the Zenseact Open Dataset, which previously had none. This unlocks training and testing of segmentation models on its rich multi-sensor data across weather conditions. The authors process over 100,000 frames, manually curate a 2,300-frame subset, and report up to 48.1% mIoU with transformer models while testing ways to handle rare classes that occupy under 1% of pixels. The same pipeline is shown to work on a different vehicle platform at 77.5% mIoU and to support transfer learning between sensor setups. A reader would care because it removes a major barrier to using large existing driving datasets for full scene understanding without starting from scratch on manual pixel labels.

Core claim

Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning.
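To make the class-imbalance figure concrete, the following is a minimal sketch of how a per-class pixel share could be measured over a set of label maps. The class ids, the toy frame, and the function name `class_pixel_fractions` are illustrative assumptions and are not taken from the paper or its released code.

```python
from collections import Counter


def class_pixel_fractions(label_maps):
    """Return {class_id: share of all pixels} over a list of 2D label maps."""
    counts = Counter()
    for label_map in label_maps:
        for row in label_map:
            counts.update(row)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical class ids: 0 = background, 1 = vehicle, 2 = pedestrian.
frame = [[0] * 20 for _ in range(10)]   # 10 x 20 = 200 pixels
for x in range(19):
    frame[5][x] = 1                      # 19 vehicle pixels
frame[2][3] = 2                          # a single pedestrian pixel

fractions = class_pixel_fractions([frame])
# Classes below the 1% threshold mentioned in the abstract.
rare = sorted(cls for cls, share in fractions.items() if share < 0.01)
print(fractions, rare)
```

In this toy frame only the pedestrian class falls under the 1% threshold, which is the kind of class the paper's specialized models are meant to target.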

What carries the argument

SAM-based annotation pipeline that converts bounding boxes into semantic masks, followed by manual curation of a 2,300-frame subset.
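The box-to-mask step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `predict_mask` stands in for a box-prompted segmentation call (a SAM wrapper in practice), the stub predictor simply fills the box, and painting larger boxes first so that small objects stay visible is an assumed overlap policy that the source does not describe.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in pixels, x1/y1 exclusive
Mask = List[List[bool]]           # mask[y][x] is True where the object is


def box_area(box: Box) -> int:
    x0, y0, x1, y1 = box
    return max(x1 - x0, 0) * max(y1 - y0, 0)


def boxes_to_semantic_mask(
    height: int,
    width: int,
    labelled_boxes: Sequence[Tuple[int, Box]],
    predict_mask: Callable[[Box], Mask],
    background: int = 0,
) -> List[List[int]]:
    """Merge one predicted mask per (class_id, box) into a single label map."""
    semantic = [[background] * width for _ in range(height)]
    # Assumed policy: paint large boxes first so small objects end up on top.
    ordered = sorted(labelled_boxes, key=lambda item: box_area(item[1]), reverse=True)
    for class_id, box in ordered:
        mask = predict_mask(box)
        x0, y0, x1, y1 = box
        # Only keep mask pixels inside the prompting box, clipped to the image.
        for y in range(max(y0, 0), min(y1, height)):
            for x in range(max(x0, 0), min(x1, width)):
                if mask[y][x]:
                    semantic[y][x] = class_id
    return semantic


def make_box_fill_predictor(height: int, width: int) -> Callable[[Box], Mask]:
    """Stub predictor for the example: marks every pixel of the box as object."""
    def predict(box: Box) -> Mask:
        x0, y0, x1, y1 = box
        return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
                for y in range(height)]
    return predict


# Hypothetical class ids: 1 = vehicle, 2 = pedestrian.
semantic = boxes_to_semantic_mask(
    6, 8,
    [(2, (2, 1, 4, 3)), (1, (0, 0, 6, 4))],
    make_box_fill_predictor(6, 8),
)
for row in semantic:
    print(row)
```

Passing the predictor in as a callable keeps the merging logic testable without model weights; a real predictor would return tighter, object-shaped masks than the stub does.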

If this is right

  • The generated annotations support training and evaluation of CLFT-Hybrid and DeepLabV3+ models to 48.1% mIoU across weather conditions.
  • Specialized models can be trained to improve performance on rare classes such as pedestrians, cyclists, and signs.
  • The pipeline produces usable annotations on the Iseauto platform, reaching 77.5% mIoU.
  • SAM-derived features enable bidirectional transfer learning between different sensor configurations without major performance loss.
  • Public release of the code and annotations allows other researchers to reproduce and extend the work on ZOD.
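For readers unfamiliar with the metric behind the 48.1% and 77.5% figures, here is a minimal sketch of mean intersection-over-union on a single pair of label maps. The toy maps and class ids are made up, and the paper's exact protocol (for example, whether IoU is accumulated over the whole dataset or which classes are averaged) is not stated in this summary, so treat the averaging choice below as an assumption.

```python
def mean_iou(pred, target, num_classes):
    """Return (mIoU, per-class IoU) for two equally sized 2D label maps."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                inter[p] += 1
                union[p] += 1
            else:
                # A mismatched pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    # Assumed convention: average only over classes that appear at all.
    per_class = {c: inter[c] / union[c] for c in range(num_classes) if union[c] > 0}
    return sum(per_class.values()) / len(per_class), per_class


# Hypothetical class ids: 0 = background, 1 = vehicle, 2 = pedestrian.
target = [[0, 0, 1, 1],
          [0, 0, 1, 2]]
pred = [[0, 0, 1, 1],
        [0, 1, 1, 1]]

miou, per_class = mean_iou(pred, target, num_classes=3)
print(miou, per_class)
```

In this toy case the single missed pedestrian pixel gives that class an IoU of 0 and pulls the mean down to 0.45, even though six of eight pixels are correct. That is why rare classes weigh so heavily on reported mIoU.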

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Bounding-box-only datasets from other driving projects could be turned into segmentation resources with similar pipelines, expanding available training data.
  • The class-imbalance techniques tested here could be applied to improve detection of safety-critical objects in other imbalanced segmentation tasks.
  • If the curation step can be further automated, the approach could scale to full datasets without the 36% acceptance filter.
  • The cross-platform transfer results point to a practical route for adapting segmentation models when vehicle sensor rigs change.

Load-bearing premise

The manual curation of the 2,300-frame subset at a 36% acceptance rate produces a reliable baseline without introducing selection bias that affects the reported mIoU scores.

What would settle it

Retraining and evaluating the same models on an independently created pixel-level annotation set for a fresh random sample of ZOD frames and obtaining mIoU well below 48.1% would show the curated subset is not representative.

Figures

Figures reproduced from arXiv: 2605.28136 by Junyi Gu, Mauro Bellone, Raivo Sell, Toomas Tahves.

Figure 1: SAM-based preprocessing pipeline visualization showing the progression from raw ZOD bounding boxes to dense ...
Figure 2: Data preprocessing pipeline for converting ZOD bound ...
Figure 3: LiDAR annotation generation pipeline illustrating the creation of LiDAR-native segmentation masks using SAM ...
Figure 4: Qualitative comparison of segmentation results on ZOD frame 000404. (a) Ground truth segmentation with zoomed ...
Figure 5: Iseauto platform equipped with Velodyne lidars, 4K ...
Figure 6: Validation Mean IoU trajectories demonstrating accelerated convergence through bidirectional transfer learning between ...
original abstract

Dense semantic segmentation is essential for autonomous driving, yet many multi-modal datasets lack pixel-level annotations. The Zenseact Open Dataset (ZOD) provides rich multi-sensor data but only bounding-box labels, limiting its use for segmentation research. Our primary contribution is a Segment Anything Model (SAM)-based annotation pipeline that produces dense, pixel-level annotations for ZOD by converting bounding boxes into semantic masks. In this pilot study, we process over 100,000 frames and manually curate a 2,300-frame subset (36% acceptance rate) to establish a reliable baseline. Using these annotations, we evaluate transformer-based CLFT and CNN-based DeepLabV3+ architectures across diverse weather conditions, achieving up to 48.1% mIoU with CLFT-Hybrid. To address extreme class imbalance, where pedestrians, cyclists, and signs constitute less than 1% of pixels, we explore specialized models targeting rare classes. We further validate the pipeline on the Iseauto autonomous-vehicle platform, achieving 77.5% mIoU, and show that SAM-derived representations transfer effectively across sensor configurations via bidirectional transfer learning. All code and annotations are released to support reproducible research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents a SAM-based annotation pipeline to generate dense semantic segmentation labels from bounding boxes in the ZOD dataset. The authors process over 100,000 frames, manually curate a 2,300-frame subset at a 36% acceptance rate, and use it to train and evaluate CLFT-Hybrid and DeepLabV3+ models, reporting up to 48.1% mIoU. The work also addresses class imbalance for rare classes like pedestrians, validates the pipeline on the Iseauto platform with 77.5% mIoU, demonstrates transfer learning across sensors, and releases all code and annotations.

Significance. If the generated annotations prove reliable without significant selection bias, this contribution would be valuable for enabling semantic segmentation research on the multi-modal ZOD dataset, which currently lacks pixel-level labels. The bidirectional transfer learning results and focus on rare classes are relevant to autonomous driving applications. The public release of code and annotations supports reproducibility and is a notable strength.

major comments (2)
  1. [Abstract and curation description] The manual curation of the 2,300-frame subset with a 36% acceptance rate risks introducing selection bias, as only frames where SAM-generated masks appear acceptable are retained. Since all mIoU results (including the 48.1% peak with CLFT-Hybrid), weather breakdowns, and rare-class experiments are reported exclusively on this filtered subset, the performance metrics may not generalize to the full ZOD distribution or uncurated data.
  2. [Validation of annotations] The manuscript lacks details on how the quality of the SAM-derived annotations was validated beyond the manual acceptance criterion. Additional quantitative validation, such as comparison to a small set of human-annotated ground truth or inter-annotator agreement, would strengthen the claim that the pipeline produces a reliable baseline.
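As an illustration of what an inter-annotator agreement check could look like, the sketch below computes Cohen's kappa over accept/reject decisions from two reviewers. The referee does not name a specific statistic, so the choice of kappa is an assumption, and the reviewer decisions are invented for the example.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(ratings_a)
    # Observed agreement: share of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    labels = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions on ten frames: 1 = accept mask, 0 = reject.
reviewer_a = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
reviewer_b = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3))
```

Here the reviewers agree on 8 of 10 frames, but chance agreement is 0.52, so kappa comes out near 0.58. Raw agreement alone would overstate how consistent the acceptance criterion is.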

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our SAM-based annotation pipeline for ZOD. We address the two major concerns below with proposed revisions to improve clarity and transparency while preserving the pilot-study scope of the work.

point-by-point responses
  1. Referee: [Abstract and curation description] The manual curation of the 2,300-frame subset with a 36% acceptance rate risks introducing selection bias, as only frames where SAM-generated masks appear acceptable are retained. Since all mIoU results (including the 48.1% peak with CLFT-Hybrid), weather breakdowns, and rare-class experiments are reported exclusively on this filtered subset, the performance metrics may not generalize to the full ZOD distribution or uncurated data.

    Authors: We agree this is a valid concern: the reported metrics are computed exclusively on the manually curated high-quality subset. This choice was intentional for a pilot study aimed at establishing a reliable baseline rather than claiming generalization to the full uncurated ZOD distribution. In the revised manuscript we will (1) add an explicit limitations paragraph discussing potential selection bias and its implications, (2) state that future work should evaluate the pipeline on uncurated frames, and (3) provide additional detail on the acceptance criteria used during curation. These changes will be made without altering the core claims or results.

    revision: yes

  2. Referee: [Validation of annotations] The manuscript lacks details on how the quality of the SAM-derived annotations was validated beyond the manual acceptance criterion. Additional quantitative validation, such as comparison to a small set of human-annotated ground truth or inter-annotator agreement, would strengthen the claim that the pipeline produces a reliable baseline.

    Authors: The validation in the current manuscript rests on the manual curation process performed by domain experts, which yielded the 36% acceptance rate. We will expand the methods section with a more detailed description of the curation protocol, including the specific visual criteria applied and the number of reviewers involved where applicable. However, because ZOD itself provides no pixel-level ground truth, creating an independent human-annotated reference set for quantitative comparison (e.g., mIoU against manual labels) lies outside the scope and resources of this pilot study; we will explicitly note this limitation. No new quantitative validation data will be added, but the existing manual process will be documented more thoroughly.

    revision: partial

Circularity Check

0 steps flagged

No circularity; empirical pipeline with released data

full rationale

The paper describes an annotation pipeline using SAM to convert bounding boxes to masks, followed by manual curation of a 2,300-frame subset and empirical training/evaluation of segmentation models (CLFT, DeepLabV3+) reporting mIoU on that data. No equations, derivations, or fitted parameters are presented that reduce to their own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The results rest on an external data release and standard model training, so the work is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the quality and applicability of the pre-trained SAM model to this domain and the validity of the manual curation process for creating a reliable dataset.

axioms (1)
  • domain assumption: The Segment Anything Model can be effectively prompted with bounding boxes to produce accurate semantic masks in road scenes without additional training.
    This is invoked in the description of the annotation pipeline.


