ReefNet: A Large-Scale Dataset and Benchmark for Fine-Grained Coral Reef Recognition

Abdulwahab Felemban, Yahia Battach, Faizan Farooq Khan, Yuqian Fu, Xuhui Liu, Yesmeen M. Khattab, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny

arxiv: 2510.16822 · v3 · submitted 2025-10-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 06:20 UTC · model grok-4.3

keywords: coral reef, fine-grained recognition, large-scale dataset, benchmark, zero-shot learning, cross-domain adaptation, vision-language models, biodiversity monitoring

The pith

ReefNet dataset with expert-verified labels shows vision models degrade sharply in zero-shot coral recognition

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aggregates imagery from 76 sources plus one new reef site into a dataset of roughly 925,000 genus-level hard coral annotations mapped to standard taxonomy. It applies expert verification and filtering to produce a high-confidence benchmark covering 39 classes with 92 percent agreement, designed for testing under label noise and imbalance. Experiments with vision-language models, multimodal large language models, and vision backbones document large performance losses in zero-shot and extremely few-shot regimes, plus ongoing shortfalls after adaptation when sources shift or classes are rare. This addresses the need for scalable monitoring tools as reefs decline from climate pressures.

Core claim

ReefNet supplies a taxonomically consistent collection of nearly 925,000 point-annotated hard coral images drawn from heterogeneous CoralNet sources and an additional Red Sea site. The resulting high-confidence subset, verified to 92 percent expert agreement over 39 classes, supports systematic benchmarks that expose substantial degradation for current models in zero-shot and very few-shot settings, with adaptation helping but leaving persistent gaps under cross-source transfer and on long-tail genera.

What carries the argument

The high-confidence benchmark subset of 39 hard-coral classes at 92 percent expert agreement, paired with evaluation protocols for zero-shot, few-shot, within-source, and cross-source transfer to the Al-Wajh site.

If this is right

In-domain supervision produces large gains on coral genus identification tasks.
Cross-source transfer to new reef sites remains difficult even after adaptation.
Long-tail genera continue to show lower accuracy in every training regime tested.
General-purpose multimodal models need substantial domain-specific data to approach reliable performance on biodiversity tasks.
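
The long-tail point above is easiest to see when accuracy is reported per class and then averaged two ways. The following is a minimal sketch, not the paper's evaluation code: the genus names and predictions are made-up toy values, and the function names are hypothetical.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Return {class: accuracy}, computed over the true examples of each class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / n for c, n in total.items()}

def micro_macro(y_true, y_pred):
    """Micro accuracy weights every example equally; macro weights every class equally."""
    per_class = per_class_accuracy(y_true, y_pred)
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    macro = sum(per_class.values()) / len(per_class)
    return micro, macro

# Toy labels (illustrative only): one common genus and one rare genus.
y_true = ["Porites"] * 8 + ["Galaxea"] * 2
y_pred = ["Porites"] * 8 + ["Porites", "Galaxea"]
micro, macro = micro_macro(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

In this toy case the micro average stays high because the common class dominates, while the macro average drops because the rare class is only half right. A benchmark with strong class imbalance would presumably want both numbers side by side.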

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Conservation teams could develop specialized training pipelines based on this dataset rather than relying on off-the-shelf models.
The same aggregation and verification approach might scale to other marine or terrestrial biodiversity monitoring problems.
Incorporating explicit taxonomic hierarchy into model architectures could narrow gaps on rare classes.

Load-bearing premise

Expert verification and targeted filtering of images from many sources produces labels that stay reliable despite noise and imbalance.

What would settle it

A new model reaching within 15 percent of fully supervised accuracy on the 39-class benchmark in a pure zero-shot regime, or maintaining performance after cross-source transfer without extra data, would undermine the reported degradation and gaps.
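
The "within 15 percent" criterion can be read as either 15 percentage points or 15 percent relative to the supervised score, and the text does not say which. The sketch below, with a hypothetical function name and made-up accuracies, shows that the two readings can disagree on the same pair of numbers.

```python
def within_gap(zero_shot_acc, supervised_acc, gap=0.15, relative=False):
    """Check whether a zero-shot accuracy is 'within gap' of a supervised one.

    relative=False: the difference must be at most `gap` (percentage points as a fraction).
    relative=True:  zero-shot must reach at least (1 - gap) of the supervised accuracy.
    Both readings are assumptions; the source does not specify which is meant.
    """
    if relative:
        return zero_shot_acc >= supervised_acc * (1 - gap)
    return supervised_acc - zero_shot_acc <= gap

# Made-up accuracies, not results from the paper.
print(within_gap(0.66, 0.80))                 # absolute reading
print(within_gap(0.66, 0.80, relative=True))  # relative reading
```

Stating the reading explicitly alongside the threshold would make the falsification test unambiguous.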

Original abstract

Coral reefs are rapidly declining under anthropogenic pressures (e.g., climate change), creating an urgent need for scalable and automated monitoring. Progress in data-driven coral analysis, however, is constrained by the scarcity of large-scale datasets with fine-grained labels that are taxonomically consistent across sites and studies. To address this gap, we introduce ReefNet, a large-scale public coral reef image dataset with point-level annotations mapped to the World Register of Marine Species (WoRMS) taxonomy. ReefNet aggregates imagery from 76 curated CoralNet sources and an additional reef site from Al-Wajh (Red Sea), totaling approximately 925K genus-level hard coral annotations. Through expert-driven verification and targeted filtering, we derive a high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes, enabling reliable evaluation under realistic label noise and strong class imbalance. Beyond dataset construction, we establish a comprehensive benchmark spanning zero-shot, cross-domain few-shot adaptation, within-source evaluation, and cross-source transfer to the Al-Wajh dataset. Experiments with state-of-the-art vision-language models (VLMs), multimodal large language models (MLLMs), and vision-only backbones reveal substantial degradation in zero-shot and extremely few-shot regimes, while adaptation with in-domain supervision yields large gains yet still leaves a persistent gap under cross-source shift and on long-tail genera. These results highlight fundamental challenges in applying general-purpose multimodal models to biodiversity monitoring and underscore the importance of large-scale, taxonomically grounded, high-quality datasets. ReefNet serves as both a benchmark and a training resource for advancing fine-grained coral reef understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ReefNet gives the field a large, standardized coral dataset with useful benchmarks on model gaps, though the 92% agreement claim needs more protocol details to fully land.

ReefNet stands out mainly for pulling together roughly 925K hard-coral annotations from 76 CoralNet sources plus a new Al-Wajh site, mapping everything to WoRMS taxonomy, and releasing a 39-class high-confidence subset. The benchmarks then test VLMs, MLLMs, and standard vision backbones across zero-shot, few-shot, within-source, cross-source, and long-tail settings, showing clear drops in performance under shift and on rare genera.

That combination is the practical contribution here. The scale and the multi-regime evaluation give people working on reef monitoring a concrete starting point they can actually use for training or testing. It also makes explicit that general-purpose models still need substantial in-domain data to close the gaps, which matches what field practitioners run into. The paper does a straightforward job of crediting prior CoralNet work while adding the new site and the consistent taxonomy layer.

On the soft side, the 92% expert agreement figure is presented without enough supporting information. There is no breakdown of how many experts reviewed each image, what sampling was used for verification, how disagreements were settled, or whether agreement varies by genus or source. That matters because the dataset mixes heterogeneous imagery and carries strong imbalance; without those details the benchmark reliability is harder to judge. The rest of the experimental setup looks standard and reproducible enough on the surface.

This paper is for the marine AI and biodiversity monitoring crowd, plus anyone building or adapting fine-grained recognition datasets. Readers who need a sizable, taxonomically grounded coral resource or who want to see quantified transfer gaps will get direct value. It has enough substance on the data side to deserve a full referee rather than a quick desk rejection.

Referee Report

1 major / 2 minor

Summary. The paper introduces ReefNet, a large-scale public coral reef dataset aggregating ~925K genus-level hard coral annotations from 76 curated CoralNet sources plus an Al-Wajh (Red Sea) site, with labels mapped to WoRMS taxonomy. Through expert-driven verification and targeted filtering, it derives a high-confidence benchmark subset achieving 92% expert agreement over 39 hard-coral classes. The work establishes benchmarks for zero-shot, cross-domain few-shot adaptation, within-source evaluation, and cross-source transfer to Al-Wajh, using VLMs, MLLMs, and vision backbones; results show substantial degradation in zero-shot/extremely few-shot regimes, large gains from in-domain supervision, yet persistent gaps under cross-source shift and on long-tail genera.

Significance. If the dataset construction and label reliability claims hold, ReefNet would provide a valuable, taxonomically grounded resource addressing the scarcity of large-scale fine-grained coral datasets for biodiversity monitoring. The empirical benchmarks highlight real challenges in applying current multimodal and vision models to ecological tasks with label noise, imbalance, and domain shift, offering concrete directions for future work in this application area.

major comments (1)

Abstract and dataset construction section: The central claim of a 'high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes' obtained via 'expert-driven verification and targeted filtering' is load-bearing for the paper's positioning as a reliable benchmark. However, the manuscript provides no information on the number of experts per image, sampling strategy for the verification subset, exact agreement metric (pairwise, majority, or kappa), resolution of disagreements, or per-class/per-genus agreement breakdowns. Without these, it is impossible to assess robustness to the heterogeneous sources, realistic label noise, and strong class imbalance noted elsewhere in the paper.
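
To make the distinction between the agreement metrics named above concrete, here is a minimal sketch of raw pairwise agreement, Cohen's kappa for two annotators, and a strict-majority vote. It does not describe the authors' protocol, which the manuscript does not specify; the labels "A" and "B" are toy values and the function names are hypothetical.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement for two annotators over the same items."""
    n = len(a)
    p_o = percent_agreement(a, b)
    count_a, count_b = Counter(a), Counter(b)
    # Expected agreement if both annotators labeled independently at their own rates.
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if p_e == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

def majority_label(votes):
    """Label chosen by a strict majority of experts, or None if there is none."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None

# Toy annotations: 10 items, imbalanced toward class "A".
expert_1 = ["A"] * 7 + ["A", "B", "B"]
expert_2 = ["A"] * 7 + ["B", "A", "B"]
print(percent_agreement(expert_1, expert_2))
print(cohens_kappa(expert_1, expert_2))
```

On these toy labels raw agreement is 0.8 while kappa is much lower, because most of the agreement is already expected by chance when one class dominates. That gap is why the choice of metric matters for a headline figure on an imbalanced dataset.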

minor comments (2)

Experiments section: The description of train/test splits, exact filtering criteria applied to the 76 CoralNet sources, and how the Al-Wajh site was integrated should be expanded for reproducibility.
Table/figure captions: Ensure all performance tables explicitly state the number of shots, domains, and whether results are averaged over multiple runs with standard deviations.
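
The reporting requested in the minor comments could look roughly like the sketch below: a seeded k-shot sampler, so that each run is reproducible, and a mean and standard deviation over runs. This is an illustration under assumed names (`sample_k_shot`, `summarize`) and made-up scores, not the paper's split procedure.

```python
import random
import statistics
from collections import defaultdict

def sample_k_shot(labels, k, seed):
    """Pick up to k example indices per class, reproducibly for a given seed."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):  # fixed class order keeps sampling deterministic
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)

def summarize(run_scores):
    """Return (mean, sample standard deviation) over independent runs."""
    mean = statistics.mean(run_scores)
    std = statistics.stdev(run_scores) if len(run_scores) > 1 else 0.0
    return mean, std

# Toy labels: class "b" has a single example, so it contributes fewer than k shots.
labels = ["a"] * 5 + ["b"]
support = sample_k_shot(labels, k=2, seed=0)
mean, std = summarize([0.5, 0.7])  # made-up accuracies from two runs
print(support, f"{mean:.2f} ± {std:.2f}")
```

Rare classes that cannot supply k examples are a natural thing to flag in a caption, since they change what "k-shot" means for long-tail genera.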

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback on our manuscript. We address the major comment on the transparency of the expert verification process for the high-confidence benchmark subset below. We will incorporate additional details into the revised manuscript to strengthen the presentation of our dataset construction.

Referee: Abstract and dataset construction section: The central claim of a 'high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes' obtained via 'expert-driven verification and targeted filtering' is load-bearing for the paper's positioning as a reliable benchmark. However, the manuscript provides no information on the number of experts per image, sampling strategy for the verification subset, exact agreement metric (pairwise, majority, or kappa), resolution of disagreements, or per-class/per-genus agreement breakdowns. Without these, it is impossible to assess robustness to the heterogeneous sources, realistic label noise, and strong class imbalance noted elsewhere in the paper.

Authors: We agree with the referee that the current manuscript lacks sufficient detail on the expert verification process, which is necessary to fully evaluate the reliability and robustness of the high-confidence benchmark subset. In the revised manuscript, we will expand the dataset construction section with a dedicated description of this process. This will include the number of experts per image, the sampling strategy for the verification subset, the exact agreement metric used, the method for resolving disagreements, and per-class/per-genus agreement breakdowns. These additions will allow better assessment of the benchmark's handling of heterogeneous sources, label noise, and class imbalance.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivations or self-referential reductions

The paper constructs ReefNet by aggregating imagery from 76 external CoralNet sources plus one new site, applies expert verification to produce a 39-class benchmark subset, and reports empirical results on VLMs and vision models. No equations, fitted parameters, or predictions appear in the provided text. The 92% expert agreement is presented as a direct measurement on the filtered data rather than a quantity derived from or equivalent to any author-defined prior inputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims rest on newly collected and labeled data plus standard benchmark experiments, making the work self-contained against external sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on the external WoRMS taxonomy and expert verification as background; no free parameters, new physical entities, or ad-hoc mathematical constructs are introduced.

axioms (1)

domain assumption: The World Register of Marine Species (WoRMS) supplies a stable, cross-study taxonomy suitable for mapping coral labels from heterogeneous sources.
Invoked to achieve taxonomic consistency across the 76 CoralNet sources and the new Al-Wajh site.
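
One way to picture the taxonomic mapping this axiom supports is a lookup that normalizes source-specific label strings to a single genus name. The sketch below is purely illustrative: the table is hand-made for the example and is drawn neither from WoRMS nor from the paper, and `to_genus` is a hypothetical helper.

```python
# Hypothetical hand-made table; NOT taken from WoRMS or from the ReefNet paper.
GENUS_LOOKUP = {
    "porites": "Porites",
    "por": "Porites",
    "porites lobata": "Porites",
    "acropora spp.": "Acropora",
}

def to_genus(source_label):
    """Map a raw source label to a genus name, or None if it is not in the table."""
    key = " ".join(source_label.strip().lower().split())  # normalize case and spacing
    return GENUS_LOOKUP.get(key)

print(to_genus("  Porites  Lobata "))
print(to_genus("unknown label"))
```

Returning None for unmapped labels, rather than guessing, keeps ambiguous source labels visible so they can be reviewed or filtered.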

pith-pipeline@v0.9.0 · 5885 in / 1304 out tokens · 59655 ms · 2026-05-18T06:20:25.753983+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Through expert-driven verification and targeted filtering, we derive a high-confidence benchmark subset with 92% expert agreement over 39 hard-coral label classes

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: two benchmarking configurations: within-source and cross-source

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.