arxiv: 2605.16774 · v1 · pith:BXZ7AUJMnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

Pith reviewed 2026-05-19 21:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: marine debris · object detection · ASV vision · dataset benchmark · YOLOv11 · surface tracking · aluminum cans · autonomous systems

The pith

Training on a dataset tailored to aluminum cans on water surfaces is reported to boost YOLOv11 detection performance twelvefold over generic training sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CANSURF, a dataset of roughly 7,300 raw images of surface-level cans captured from an ASV perspective, expanded with ten augmentations to about 57,000 images. It demonstrates that detectors trained on this specialized data achieve much higher performance in identifying small reflective debris under challenging water conditions like glare and ripples. By benchmarking various YOLO models and tracking combinations, the work shows specific pipelines excel at different aspects of the task, such as stable tracking or far-field detection. This fills a gap since no prior open dataset focuses on this exact viewpoint and target for marine cleanup applications.

Core claim

The authors create and release CANSURF, consisting of annotated ASV-view images of aluminum cans on water, and show that training YOLOv11 on it boosts performance 12x compared to generic datasets. They find that YOLOv11 combined with ByteTrack gives the most stable tracks, while YOLOv11 with SAHI is better for detecting the maximum number of cans in single-can pickup scenarios. The dataset addresses the lack of prior open data for this specific marine debris detection from surface level.
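The "most stable tracks" result is stated in terms of identity switches. As a rough illustration of what that count measures, the sketch below tallies how often a ground-truth can changes its assigned track ID across frames. It is not the paper's evaluation code, which is not shown on this page. It assumes detections have already been matched to ground truth, and the function name, input format, and example IDs are all hypothetical.

```python
from typing import Dict, Hashable, List


def count_identity_switches(frames: List[Dict[Hashable, Hashable]]) -> int:
    """Count identity switches from per-frame {ground_truth_id: track_id} matches.

    A switch is counted when a ground-truth object is matched to a track ID
    that differs from the last track ID it was matched to. Frames where the
    object is missed do not count as switches by themselves.
    """
    last_track: Dict[Hashable, Hashable] = {}
    switches = 0
    for matches in frames:
        for gt_id, track_id in matches.items():
            if gt_id in last_track and last_track[gt_id] != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


# Hypothetical four-frame sequence with two cans.
example_frames = [
    {"can_a": 1, "can_b": 2},
    {"can_a": 1, "can_b": 2},
    {"can_a": 3},              # can_a changes from track 1 to 3; can_b is missed
    {"can_a": 3, "can_b": 2},  # can_b is re-acquired under the same ID
]
print(count_identity_switches(example_frames))  # prints 1
```

Full tracking metrics such as MOTA or IDF1 also account for misses and false positives; this sketch isolates the switch count only.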

What carries the argument

The CANSURF dataset of surface-level can images from ASV viewpoint with bounding box annotations and multiple augmentation types, used to train and evaluate detection and tracking pipelines.

If this is right

  • YOLOv11 models achieve higher accuracy in detecting cans when trained on CANSURF rather than generic image collections.
  • Using ByteTrack with YOLOv11 results in fewer identity switches during tracking of multiple cans.
  • SAHI integration with YOLOv11 increases recall for distant cans but may reduce precision in closer views.
  • Single-can pickup operations benefit more from the SAHI-enhanced detector for maximizing detections.
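The far-field recall gain attributed to SAHI comes from running the detector on overlapping tiles rather than on the whole frame, so small cans occupy a larger share of each input. A minimal sketch of the tiling step follows. It does not use the SAHI library, and the frame size, tile size, and overlap are illustrative assumptions rather than settings taken from the paper.

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def slice_windows(width: int, height: int, tile: int, overlap_px: int) -> List[Window]:
    """Return overlapping tile windows that together cover the whole frame."""
    if not 0 <= overlap_px < tile:
        raise ValueError("overlap_px must satisfy 0 <= overlap_px < tile")
    stride = tile - overlap_px

    def starts(length: int) -> List[int]:
        # A frame no larger than one tile needs a single window.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, stride))
        positions.append(length - tile)  # last tile sits flush with the border
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


# Hypothetical 1920x1080 frame, 640-pixel tiles, 128 pixels of overlap.
windows = slice_windows(1920, 1080, 640, 128)
print(len(windows))  # prints 8
```

Each tile would then be passed to the detector and the per-tile boxes shifted back to frame coordinates and merged. That merge is also where duplicate or spurious boxes can appear, which is consistent with the precision cost noted above.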

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This dataset could be extended to include other types of floating debris for broader cleanup applications.
  • Real-world ASV deployments could test these models to validate performance beyond augmented data.
  • Integration with robotic grasping systems might enable end-to-end autonomous debris collection using these detection methods.

Load-bearing premise

The collected raw images and the ten augmentation types produce a training distribution that is sufficiently representative of real ASV operating conditions including glare, ripples, and partial submersion.

What would settle it

The claim would be falsified by a field test in which an ASV-mounted camera records new videos of cans under varying water conditions and a model trained only on CANSURF shows no significant improvement in detection rate or tracking stability over one trained on generic datasets.
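One way to make "no significant improvement in detection rate" concrete is a two-proportion z-test on cans detected versus cans present. The sketch below is one possible check, not a procedure from the paper, and the counts are invented for illustration. It also assumes independent observations, which is optimistic for consecutive video frames.

```python
import math
from typing import Tuple


def two_proportion_z_test(hits_a: int, total_a: int, hits_b: int, total_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two detection rates."""
    rate_a = hits_a / total_a
    rate_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        return 0.0, 1.0  # both rates are identical at 0 or 1
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical field counts: detected cans out of cans present.
z, p = two_proportion_z_test(hits_a=180, total_a=200, hits_b=150, total_b=200)
print(f"z = {z:.2f}, p = {p:.5f}")
```

A per-video or per-session comparison would be a more conservative design, since it avoids treating correlated frames as separate trials.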

Figures

Figures reproduced from arXiv: 2605.16774 by Abdullah Moosa, Mostafa Elemam, Zahra F. Rahmatullah, Zaid Aljundi.

Figure 1: Example tracking results from the top-performing …
Figure 2: Examples of different image augmentations applied to …
Figure 3: Pipeline of workflow. 1. Object Detection The core of the vision pipeline is a robust object detection model capable of accurately identifying floating cans. To select the optimal architecture for this task, a systematic benchmarking process was conducted on a curated dataset consisting of 900 images where cans occupy less than 5% of the image frame. The goal of these tests is to identify a model that not …
Figure 4: By prompting the model with various synonyms for …
Figure 4: YOLO-World’s multi-class confusion matrix.
Original abstract

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy under, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11 + SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the CANSURF dataset of ~7.3k raw ASV-view images of surface-level aluminum cans, expanded via ten augmentation types to ~57k images, and benchmarks YOLOv11-based detection and tracking pipelines (including combinations with ByteTrack and SAHI). It claims that training YOLOv11 on CANSURF yields a 12x performance boost over generic datasets, with YOLOv11+ByteTrack providing stable tracks and YOLOv11+SAHI improving far-field recall, filling a gap as the first open dataset for this specific viewpoint and target.

Significance. If the reported gains are reproducible and the dataset distribution matches real ASV conditions, this work provides a practical resource for marine debris detection in autonomous clean-up missions. The contribution lies in domain-specific data collection and straightforward benchmarking rather than novel algorithms; the absence of prior open datasets for aluminum cans on water from surface level makes the release potentially useful for reproducible evaluation in robotics and CV applications.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'Training YOLOv11 on CANSURF boosts performance 12x over generic datasets' is presented without any supporting quantitative metrics (e.g., mAP, precision, recall values), baseline numbers from the generic datasets, error bars, or statistical tests. This absence leaves the headline result unverified and load-bearing for the paper's assertion of the dataset's value.
  2. [Experiments] Experiments / Benchmark section: No distribution statistics, failure-case analysis, or external real-world test set is provided to substantiate that the ~7.3k raw frames plus the ten augmentation types produce a training distribution representative of actual ASV conditions (glare, ripples, partial submersion). Without this, the measured gains risk being an in-distribution artifact rather than evidence of practical utility.
minor comments (2)
  1. [Abstract] The abstract mentions 'stronger multi-object accuracy under' but the sentence appears truncated; clarify the exact condition or metric being referenced.
  2. [Dataset] Annotation quality and inter-annotator agreement are not discussed; adding a brief description of the annotation protocol would improve reproducibility.
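For readers less familiar with the metrics the first major comment asks for, the sketch below computes precision and recall for one image at a single IoU threshold using greedy matching. It is a simplified illustration with made-up boxes, not the paper's evaluation protocol. mAP additionally averages precision over recall levels, and for mAP@0.5:0.95 over several IoU thresholds.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    predictions: List[Tuple[Box, float]], truths: List[Box], iou_threshold: float = 0.5
) -> Tuple[float, float]:
    """Greedily match predictions (highest score first) to unmatched ground-truth boxes."""
    matched = set()
    true_positives = 0
    for box, _score in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            overlap = iou(box, truth)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


# Hypothetical image: two annotated cans, one good detection, one false alarm.
truth_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
predicted = [((1, 1, 10, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(precision_recall(predicted, truth_boxes))  # prints (0.5, 0.5)
```

Reporting these values for both the CANSURF-trained and generic-trained models would show which metric the 12x ratio refers to and how small the baseline is.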

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'Training YOLOv11 on CANSURF boosts performance 12x over generic datasets' is presented without any supporting quantitative metrics (e.g., mAP, precision, recall values), baseline numbers from the generic datasets, error bars, or statistical tests. This absence leaves the headline result unverified and load-bearing for the paper's assertion of the dataset's value.

    Authors: We agree that the abstract would benefit from explicit quantitative support for the reported performance gain. In the revised manuscript, we will update the abstract to include the specific metrics underlying the 12x claim, such as the mAP@0.5 and mAP@0.5:0.95 values for YOLOv11 trained on CANSURF versus the generic baselines, along with the corresponding precision and recall figures. These numbers are already detailed in the experiments section and will now be referenced directly in the abstract for immediate verifiability. revision: yes

  2. Referee: [Experiments] Experiments / Benchmark section: No distribution statistics, failure-case analysis, or external real-world test set is provided to substantiate that the ~7.3k raw frames plus the ten augmentation types produce a training distribution representative of actual ASV conditions (glare, ripples, partial submersion). Without this, the measured gains risk being an in-distribution artifact rather than evidence of practical utility.

    Authors: We acknowledge the value of additional evidence for dataset representativeness. In the revision, we will add distribution statistics for the raw frames (e.g., histograms and breakdowns across lighting conditions, ripple intensity, and submersion levels) and a new failure-case analysis subsection that examines detection errors under challenging ASV conditions and how the augmentations mitigate them. Our current test split is drawn from temporally held-out ASV video sequences to approximate real deployment; we will explicitly discuss this as a limitation and note that a fully independent external test set collected on different platforms or dates is not available in the present work. revision: partial
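The "temporally held-out" split mentioned in this response can be pictured as splitting by whole video sequence rather than by individual frame, so that near-duplicate neighboring frames never land on both sides. The sketch below is an assumed reconstruction of that idea, not the authors' code. The sequence IDs, the chronological ordering of the input, and the held-out fraction are all hypothetical.

```python
from typing import List, Tuple

Frame = Tuple[str, int]  # (sequence_id, frame_index)


def split_by_sequence(frames: List[Frame], held_out_fraction: float) -> Tuple[List[Frame], List[Frame]]:
    """Hold out the latest sequences entirely, assuming input is in chronological order."""
    # dict.fromkeys keeps first-seen order, so this lists sequences chronologically.
    sequence_order = list(dict.fromkeys(seq for seq, _ in frames))
    n_test = max(1, round(len(sequence_order) * held_out_fraction))
    test_sequences = set(sequence_order[-n_test:])
    train = [frame for frame in frames if frame[0] not in test_sequences]
    test = [frame for frame in frames if frame[0] in test_sequences]
    return train, test


# Hypothetical dataset: four videos with three extracted frames each.
all_frames = [(f"v{s}", i) for s in range(1, 5) for i in range(3)]
train_frames, test_frames = split_by_sequence(all_frames, held_out_fraction=0.25)
print(len(train_frames), len(test_frames))  # prints 9 3
```

Under this scheme, augmented copies would be generated only after the split, from training frames, so that no augmented variant of a test frame leaks into training.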

Circularity Check

0 steps flagged

No circularity: empirical dataset collection and external benchmarking

Full rationale

The paper introduces a new surface-can dataset from ~7.3k raw ASV video frames plus ten augmentations, then benchmarks standard off-the-shelf detectors (YOLOv11, ByteTrack, SAHI) and reports measured performance gains against generic external datasets. No equations, derivations, fitted parameters, or self-citation chains appear in the abstract or described content. All load-bearing claims rest on new data collection and reproducible evaluation against independent baselines rather than any reduction to prior fitted inputs or author-specific uniqueness theorems. This is a standard empirical contribution whose central result (12x boost) is externally falsifiable and does not reduce by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmarking paper. No mathematical derivations, fitted parameters, or postulated entities are introduced; the contributions consist of data curation and experimental comparisons using existing detector architectures.

pith-pipeline@v0.9.0 · 5766 in / 1110 out tokens · 37890 ms · 2026-05-19T21:49:35.822820+00:00 · methodology

