CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

Abdullah Moosa; Mostafa Elemam; Zahra F. Rahmatullah; Zaid Aljundi

arxiv: 2605.16774 · v1 · pith:BXZ7AUJMnew · submitted 2026-05-16 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-19 21:49 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords marine debris · object detection · ASV vision · dataset benchmark · YOLOv11 · surface tracking · aluminum cans · autonomous systems

The pith

A dataset tailored to aluminum cans on water surfaces yields a reported 12x gain in detection performance over generic training sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CANSURF, a dataset of roughly 7,300 raw images of surface-level cans captured from an ASV perspective, expanded with ten augmentations to about 57,000 images. It demonstrates that detectors trained on this specialized data achieve much higher performance in identifying small reflective debris under challenging water conditions like glare and ripples. By benchmarking various YOLO models and tracking combinations, the work shows specific pipelines excel at different aspects of the task, such as stable tracking or far-field detection. This fills a gap since no prior open dataset focuses on this exact viewpoint and target for marine cleanup applications.

Core claim

The authors create and release CANSURF, consisting of annotated ASV-view images of aluminum cans on water, and show that training YOLOv11 on it boosts performance 12x compared to generic datasets. They find that YOLOv11 combined with ByteTrack gives the most stable tracks, while YOLOv11 with SAHI is better for detecting the maximum number of cans in single-can pickup scenarios. The dataset addresses the lack of prior open data for this specific marine debris detection from surface level.
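The far-field benefit attributed to SAHI comes from running the detector on overlapping crops, so that a small can fills more of each input. A minimal sketch of only the tiling step is shown here in plain Python. The function name `slice_windows`, the 640-pixel tile, and the 0.2 overlap are illustrative assumptions, not the paper's settings and not SAHI's API.

```python
def slice_windows(img_w, img_h, tile=640, overlap=0.2):
    """Return (x1, y1, x2, y2) crop windows that cover the image with overlapping tiles.

    Hypothetical helper for illustration; tile size and overlap are assumed values.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, round(tile * (1 - overlap)))

    def starts(length):
        # One window is enough when the image is smaller than a tile.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # last tile sits flush with the border
        return positions

    return [
        (x, y, min(x + tile, img_w), min(y + tile, img_h))
        for y in starts(img_h)
        for x in starts(img_w)
    ]


windows = slice_windows(1920, 1080)
print(len(windows), windows[0], windows[-1])
```

Each window would be passed to the detector separately and the per-window boxes shifted back by the window origin before merging. More windows means more chances to hit a distant can, and also more chances for a false positive on glare, which is consistent with the recall and precision trade-off the paper reports.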

What carries the argument

The CANSURF dataset of surface-level can images from ASV viewpoint with bounding box annotations and multiple augmentation types, used to train and evaluate detection and tracking pipelines.

If this is right

YOLOv11 models achieve higher accuracy in detecting cans when trained on CANSURF rather than generic image collections.
Using ByteTrack with YOLOv11 results in fewer identity switches during tracking of multiple cans.
SAHI integration with YOLOv11 increases recall for distant cans but may reduce precision in closer views.
Single-can pickup operations benefit more from the SAHI-enhanced detector for maximizing detections.
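Track stability in these claims is expressed as identity switches. The sketch below shows one simplified way to count them, assuming the matching between ground-truth cans and predicted tracks has already been done per frame. The function name and the input layout are hypothetical, and standard multi-object tracking toolkits apply additional matching rules that are not reproduced here.

```python
def count_id_switches(frames):
    """Count identity switches over a sequence.

    frames: list of dicts, one per frame, mapping a ground-truth object id
    to the predicted track id matched to it (or None when unmatched).
    Simplified illustration; the input format is an assumption.
    """
    last_track = {}
    switches = 0
    for assignment in frames:
        for gt_id, track_id in assignment.items():
            if track_id is None:
                continue  # a missed frame is not a switch by itself
            previous = last_track.get(gt_id)
            if previous is not None and previous != track_id:
                switches += 1
            last_track[gt_id] = track_id
    return switches


example = [
    {"can_a": 1, "can_b": 2},
    {"can_a": 1, "can_b": None},
    {"can_a": 3, "can_b": 2},
    {"can_a": 3, "can_b": 4},
]
print(count_id_switches(example))  # 2
```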

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This dataset could be extended to include other types of floating debris for broader cleanup applications.
Real-world ASV deployments could test these models to validate performance beyond augmented data.
Integration with robotic grasping systems might enable end-to-end autonomous debris collection using these detection methods.

Load-bearing premise

The collected raw images and the ten augmentation types produce a training distribution that is sufficiently representative of real ASV operating conditions including glare, ripples, and partial submersion.

What would settle it

A field test would settle it: an ASV equipped with a camera records new videos of cans in water under varying conditions. If a model trained only on CANSURF shows no significant improvement in detection rate or tracking stability over one trained on generic datasets, the claim is falsified.

Figures

Figures reproduced from arXiv: 2605.16774 by Abdullah Moosa, Mostafa Elemam, Zahra F. Rahmatullah, Zaid Aljundi.

**Figure 2.** Examples of different image augmentations applied to …

**Figure 3.** Pipeline of workflow. 1. Object Detection: The core of the vision pipeline is a robust object detection model capable of accurately identifying floating cans. To select the optimal architecture for this task, a systematic benchmarking process was conducted on a curated dataset consisting of 900 images where cans occupy less than 5% of the image frame. The goal of these tests is to identify a model that not …

**Figure 4.** By prompting the model with various synonyms for …

**Figure 4.** YOLO-World’s multi-class confusion matrix.

Original abstract

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy under, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CANSURF supplies the first open surface-view aluminum can dataset for ASV debris work, but the 12x boost claim sits on thin evidence about real-world conditions.

The main takeaway is a new dataset of roughly 7.3k raw frames of cans on water, pulled from ASV videos and expanded with ten augmentations to about 57k images. No prior open set targets this exact viewpoint and object, so the collection itself addresses a clear gap for marine cleanup robots that need to spot small reflective targets at distance under glare and ripples. The benchmarks compare standard detectors and trackers on this data, with YOLOv11 plus ByteTrack noted for stable tracks and YOLOv11 plus SAHI for better far-field recall.

That practical orientation is the paper's strength. It ties the experiments directly to single-can pickup missions rather than generic object detection. The authors keep the focus narrow and report concrete pipeline choices instead of broad claims about new architectures. The 12x performance lift over generic datasets is the headline number, and if the training distribution really matches deployment, it would show why domain data matters here.

The soft spots are in the verification. The abstract states the gain but gives no precision-recall numbers, no error bars, and no breakdown of how the generic baseline was constructed or tested. It is also unclear whether the raw videos captured enough extreme glare, ripple, or partial submersion cases, or if the augmentations added realistic variation instead of artifacts. Without distribution stats or an external real-world test set, the measured improvement could be an in-distribution effect. Annotation quality and inter-annotator agreement are not discussed either.

This paper is for groups building ASV vision systems for environmental monitoring or debris collection. A reader who needs a starting point for surface-can detection would find the data and the basic benchmarks useful even before the experiments are tightened. It deserves peer review because the dataset fills the stated gap and the work is grounded in a real mission profile rather than abstract benchmarks. Send it out with requests for the missing metrics and a clearer account of how the augmentations were validated against actual operating conditions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the CANSURF dataset of ~7.3k raw ASV-view images of surface-level aluminum cans, expanded via ten augmentation types to ~57k images, and benchmarks YOLOv11-based detection and tracking pipelines (including combinations with ByteTrack and SAHI). It claims that training YOLOv11 on CANSURF yields a 12x performance boost over generic datasets, with YOLOv11+ByteTrack providing stable tracks and YOLOv11+SAHI improving far-field recall, filling a gap as the first open dataset for this specific viewpoint and target.

Significance. If the reported gains are reproducible and the dataset distribution matches real ASV conditions, this work provides a practical resource for marine debris detection in autonomous clean-up missions. The contribution lies in domain-specific data collection and straightforward benchmarking rather than novel algorithms; the absence of prior open datasets for aluminum cans on water from surface level makes the release potentially useful for reproducible evaluation in robotics and CV applications.

major comments (2)

[Abstract] Abstract: The central claim that 'Training YOLOv11 on CANSURF boosts performance 12x over generic datasets' is presented without any supporting quantitative metrics (e.g., mAP, precision, recall values), baseline numbers from the generic datasets, error bars, or statistical tests. This absence leaves the headline result unverified and load-bearing for the paper's assertion of the dataset's value.
[Experiments] Experiments / Benchmark section: No distribution statistics, failure-case analysis, or external real-world test set is provided to substantiate that the ~7.3k raw frames plus the ten augmentation types produce a training distribution representative of actual ASV conditions (glare, ripples, partial submersion). Without this, the measured gains risk being an in-distribution artifact rather than evidence of practical utility.
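The first major comment asks for baseline numbers and error bars behind the 12x figure. One way such a ratio could be reported is sketched below: a paired bootstrap over per-image scores that returns the fold change with a percentile interval. The function name, the per-image score format, and the sample values are all hypothetical; nothing here reproduces the paper's data or its evaluation protocol.

```python
import random


def fold_change_ci(specialised, generic, n_boot=2000, seed=0):
    """Return (fold_change, lower, upper) from paired per-image scores.

    specialised / generic: scores for the same test images from the two models.
    Illustrative sketch only; a 95% percentile bootstrap interval is assumed.
    """
    if len(specialised) != len(generic) or not specialised:
        raise ValueError("need two non-empty score lists of equal length")
    if sum(generic) == 0:
        raise ValueError("generic baseline scores sum to zero")
    rng = random.Random(seed)
    n = len(specialised)
    point = sum(specialised) / sum(generic)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample images with replacement
        denom = sum(generic[i] for i in idx)
        if denom == 0:
            continue  # skip resamples where the baseline scored nothing
        ratios.append(sum(specialised[i] for i in idx) / denom)
    if not ratios:
        raise ValueError("no usable bootstrap resamples")
    ratios.sort()
    lower = ratios[int(0.025 * len(ratios))]
    upper = ratios[min(len(ratios) - 1, int(0.975 * len(ratios)))]
    return point, lower, upper


# Made-up scores purely to show the call; not results from the paper.
print(fold_change_ci([0.9, 0.8, 0.7, 0.95], [0.1, 0.05, 0.1, 0.05], n_boot=500))
```

A wide interval around the point estimate would signal that a headline multiplier depends heavily on a few test images, which is the kind of uncertainty the referee is asking the authors to expose.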

minor comments (2)

[Abstract] The abstract mentions 'stronger multi-object accuracy under' but the sentence appears truncated; clarify the exact condition or metric being referenced.
[Dataset] Annotation quality and inter-annotator agreement are not discussed; adding a brief description of the annotation protocol would improve reproducibility.
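For the annotation comment, a common proxy for inter-annotator agreement on bounding boxes is the share of boxes that two annotators drew with sufficient overlap. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel format and an IoU threshold of 0.5; both choices and the helper names are assumptions for illustration, not a protocol described in the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def agreement_rate(boxes_a, boxes_b, thr=0.5):
    """Fraction of boxes matched one-to-one between two annotators (greedy)."""
    if not boxes_a and not boxes_b:
        return 1.0
    unmatched_b = list(boxes_b)
    matched = 0
    for box in boxes_a:
        best = max(unmatched_b, key=lambda other: iou(box, other), default=None)
        if best is not None and iou(box, best) >= thr:
            matched += 1
            unmatched_b.remove(best)  # each box may be matched only once
    # Dividing by the larger count penalises boxes drawn by only one annotator.
    return matched / max(len(boxes_a), len(boxes_b))


print(agreement_rate([(0, 0, 10, 10), (20, 20, 30, 30)], [(1, 0, 11, 10)]))  # 0.5
```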

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

Referee: [Abstract] Abstract: The central claim that 'Training YOLOv11 on CANSURF boosts performance 12x over generic datasets' is presented without any supporting quantitative metrics (e.g., mAP, precision, recall values), baseline numbers from the generic datasets, error bars, or statistical tests. This absence leaves the headline result unverified and load-bearing for the paper's assertion of the dataset's value.

Authors: We agree that the abstract would benefit from explicit quantitative support for the reported performance gain. In the revised manuscript, we will update the abstract to include the specific metrics underlying the 12x claim, such as the mAP@0.5 and mAP@0.5:0.95 values for YOLOv11 trained on CANSURF versus the generic baselines, along with the corresponding precision and recall figures. These numbers are already detailed in the experiments section and will now be referenced directly in the abstract for immediate verifiability.

revision: yes

Referee: [Experiments] Experiments / Benchmark section: No distribution statistics, failure-case analysis, or external real-world test set is provided to substantiate that the ~7.3k raw frames plus the ten augmentation types produce a training distribution representative of actual ASV conditions (glare, ripples, partial submersion). Without this, the measured gains risk being an in-distribution artifact rather than evidence of practical utility.

Authors: We acknowledge the value of additional evidence for dataset representativeness. In the revision, we will add distribution statistics for the raw frames (e.g., histograms and breakdowns across lighting conditions, ripple intensity, and submersion levels) and a new failure-case analysis subsection that examines detection errors under challenging ASV conditions and how the augmentations mitigate them. Our current test split is drawn from temporally held-out ASV video sequences to approximate real deployment; we will explicitly discuss this as a limitation and note that a fully independent external test set collected on different platforms or dates is not available in the present work.

revision: partial
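The rebuttal's "temporally held-out ASV video sequences" implies splitting by whole video rather than by frame, so that near-duplicate neighbouring frames (and their augmented copies) never straddle train and test. A minimal sketch of that idea follows. The function name, the `(sequence_id, frame_path)` layout, and the assumption that sorted sequence ids follow recording order are all hypothetical.

```python
from collections import defaultdict


def split_by_sequence(frames, test_fraction=0.2):
    """Hold out the last video sequences entirely for testing.

    frames: list of (sequence_id, frame_path) pairs.
    Assumes sorted sequence ids reflect recording order (an illustration only).
    """
    by_seq = defaultdict(list)
    for seq_id, frame_path in frames:
        by_seq[seq_id].append(frame_path)
    seq_ids = sorted(by_seq)
    n_test = max(1, round(len(seq_ids) * test_fraction))
    test_ids = set(seq_ids[-n_test:])
    train = [(s, f) for s in seq_ids if s not in test_ids for f in by_seq[s]]
    test = [(s, f) for s in seq_ids if s in test_ids for f in by_seq[s]]
    return train, test


frames = [(f"v{i}", f"v{i}_{j}.jpg") for i in range(1, 6) for j in range(2)]
train, test = split_by_sequence(frames)
print(len(train), len(test))  # 8 2
```

Augmentation would then be applied after the split, to the training portion only, so that no augmented variant of a test frame can appear in training.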

Circularity Check

0 steps flagged

No circularity: empirical dataset collection and external benchmarking

The paper introduces a new surface-can dataset from ~7.3k raw ASV video frames plus ten augmentations, then benchmarks standard off-the-shelf detectors (YOLOv11, ByteTrack, SAHI) and reports measured performance gains against generic external datasets. No equations, derivations, fitted parameters, or self-citation chains appear in the abstract or described content. All load-bearing claims rest on new data collection and reproducible evaluation against independent baselines rather than any reduction to prior fitted inputs or author-specific uniqueness theorems. This is a standard empirical contribution whose central result (12x boost) is externally falsifiable and does not reduce by construction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical dataset and benchmarking paper. No mathematical derivations, fitted parameters, or postulated entities are introduced; the contributions consist of data curation and experimental comparisons using existing detector architectures.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

[1] NOAA Marine Debris Program, “Marine debris handling guidelines,” https://marinedebris.noaa.gov/marine-debris-handling-guidelines, Apr. 2020, accessed 17 Aug 2025.

[2] Ocean Conservancy, “2020 international coastal cleanup: By the numbers,” https://oceanconservancy.org/wp-content/uploads/2021/09/ByTheNumbers.pdf, 2020, lists beverage cans among top ten collected items; accessed 17 Aug 2025.

[3] K. Kikaki, I. Kakogeorgiou, P. Mikeli, D. E. Raitsos, and K. Karantzalos, “Marida: A benchmark for marine debris detection from sentinel-2 remote sensing data,” PLOS ONE, vol. 17, no. 1, p. e0262247, 2022.

[4] M. S. Fulton, J. Hong, and J. Sattar, “Trash-icra19: A bounding box labeled dataset of underwater trash,” Data Repository for the University of Minnesota (DRUM), 2020, underwater debris dataset; accessed 17 Aug 2025.

[5] J. Hong, M. S. Fulton, and J. Sattar, “Trashcan 1.0: An instance-segmentation labeled dataset of trash observations,” Data Repository for the University of Minnesota (DRUM), 2020, underwater instance-segmentation dataset; accessed 17 Aug 2025.

[6] F. C. Akyon, S. O. Altinuc, and A. Temizel, “Slicing aided hyper inference and fine-tuning for small object detection,” arXiv, no. 2202.06934, 2022.

[7] Z. Shao, H. Lyu, Y. Yin, T. Cheng, X. Gao, W. Zhang, Q. Jing, Y. Zhao, and L. Zhang, “Multi-scale object detection model for autonomous ship navigation in maritime environment,” https://www.mdpi.com/2077-1312/10/11/1783, 2022.

[8] L. F. W. Batista, S. Khazem, M. Adibi, S. Hutchinson, and C. Pradalier, “Potato: A dataset for analyzing polarimetric traces of afloat trash objects,” https://arxiv.org/abs/2409.12659, 2024.

[9] H. Lee, S. Byeon, J. H. Kim, J.-K. Shin, and Y. Park, “Construction of a real-time detection for floating plastics in a stream using video cameras and deep learning,” https://www.mdpi.com/1424-8220/25/7/2225, 2025.

[10] Y. Zhang, P. Sun, Y. Jiang, D. Yu, F. Weng, Z. Yuan, P. Luo, W. Liu, and X. Wang, “Bytetrack: Multi-object tracking by associating every detection box,” 2022.

[11] label, “o 20bg 2 dataset,” https://universe.roboflow.com/label-mz0kf/o 20bg 2, Apr. 2024, visited on 2025-08-24.

[12] Class, “Canettes dataset,” https://universe.roboflow.com/class-iqy5c/canettes-wjjyb, Nov. 2022, visited on 2025-08-24.

[13] P. Hidayatullah, N. Syakrani, M. R. Sholahuddin, T. Gelar, and R. Tubagus, “Yolov8 to yolo11: A comprehensive architecture in-depth comparative review,” https://arxiv.org/abs/2501.13400, 2025.

[14] W. Cao, X. Yao, Z. Xu, Y. Liu, Y. Pan, and Z. Ming, “A survey of zero-shot object detection,” https://www.sciopen.com/article/10.26599/BDMA.2024.9020098, pp. 726–750, 2025.

[15] T. Cheng, L. Song, Y. Ge, W. Liu, X. Wang, and Y. Shan, “Yolo-world: Real-time open-vocabulary object detection,” https://arxiv.org/abs/2401.17270, 2024.

[16] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” https://arxiv.org/abs/2303.05499, 2024.

[17] Vast.ai. (2025) Vast.ai: Gpu rental marketplace and cloud compute service. https://vast.ai
