pith. machine review for the scientific record.

arxiv: 2605.00877 · v2 · submitted 2026-04-25 · 💻 cs.MM · cs.AI · cs.CL · cs.CV · cs.LG

Recognition: unknown

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:48 UTC · model grok-4.3

classification 💻 cs.MM · cs.AI · cs.CL · cs.CV · cs.LG
keywords ocean data · multimodal corpus · foundation models · marine AI · data quality control · knowledge graph · sonar imagery

The pith

OceanPile assembles fragmented ocean data into a unified multimodal corpus that improves AI performance on marine tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to overcome the data bottleneck in applying multimodal large language models to ocean science by creating OceanPile, a large-scale corpus integrating sonar data, underwater imagery, marine science visuals, and scientific text. It applies a multi-stage quality control process and a hierarchical Ocean Concept Knowledge Graph to synthesize high-quality instruction data, and it adds a manually curated benchmark. A sympathetic reader would care because ocean data remain highly fragmented, noisy, and weakly labeled, limiting what AI can contribute to the study of climate regulation and marine biodiversity. Experimental validation shows that models trained on this corpus achieve significant performance gains.

Core claim

OceanPile comprises three components: OceanCorpus, a unified collection of sonar, imagery, visuals, and text from authoritative sources; OceanInstruction, synthesized via a pipeline guided by the hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation set. A multi-stage quality control process ensures scientific validity and cross-modal alignment, and models trained on the resulting data exhibit significant performance improvements.
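
The paper's schema and filtering code are not shown in the material above, so the following is a minimal sketch, assuming a single unified record type and a quality-control chain in which each stage can only remove records. Every field name and stage here is a hypothetical illustration, not OceanPile's actual pipeline.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class OceanRecord:
    """One corpus entry; all field names are illustrative, not the paper's schema."""
    modality: str                  # "sonar", "underwater_image", "science_visual", or "text"
    source: str                    # provenance: textbook, paper, database, or web page
    payload: str                   # cleaned text, or a path/URI to image or sonar data
    caption: Optional[str] = None  # paired text used for cross-modal alignment
    concepts: list[str] = field(default_factory=list)  # linked knowledge-graph nodes

def dedupe(records: list[OceanRecord]) -> list[OceanRecord]:
    """Stage 1: drop exact duplicates keyed on (modality, payload)."""
    seen: set[tuple[str, str]] = set()
    kept = []
    for r in records:
        if (r.modality, r.payload) not in seen:
            seen.add((r.modality, r.payload))
            kept.append(r)
    return kept

def has_alignment(r: OceanRecord) -> bool:
    """Stage 2: non-text records must carry a caption to be usable cross-modally."""
    return r.modality == "text" or bool(r.caption)

def is_on_topic(r: OceanRecord) -> bool:
    """Stage 3: keep only records linked to at least one ocean concept."""
    return bool(r.concepts)

def run_qc(records: list[OceanRecord]) -> list[OceanRecord]:
    """Chain the stages; each pass only shrinks the corpus, never rewrites it."""
    return [r for r in dedupe(records) if has_alignment(r) and is_on_topic(r)]
```

One virtue of structuring QC as a chain is auditability: counting records before and after each stage yields exactly the noise-reduction statistics the referee report below asks for.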

What carries the argument

The OceanPile corpus rests on four pieces: the OceanCorpus integration of heterogeneous sources, OceanInstruction synthesis guided by the hierarchical Ocean Concept Knowledge Graph, the OceanBenchmark evaluation set, and a multi-stage quality control process that enforces alignment across modalities.
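
The abstract says the instruction data is synthesized "via a novel pipeline guided by" the knowledge graph without specifying the mechanism. One plausible reading, sketched below with an invented toy graph and invented templates, is that the pipeline walks parent-child edges so that question coverage follows the concept hierarchy rather than raw data frequency.

```python
# Toy hierarchical concept graph (parent -> children); the real Ocean Concept
# Knowledge Graph is not reproduced in the paper excerpts shown here.
OCEAN_KG = {
    "marine biology": ["coral reef", "plankton"],
    "coral reef": ["coral bleaching"],
    "ocean acoustics": ["sonar imaging"],
}

# Hypothetical instruction templates spanning text and image/sonar modalities.
TEMPLATES = [
    "Explain {concept} and how it relates to {parent}.",
    "What does the attached image or sonar scan suggest about {concept}?",
]

def synthesize_instructions(kg: dict[str, list[str]]) -> list[str]:
    """Instantiate every template on every parent -> child edge of the graph."""
    out: list[str] = []
    for parent, children in kg.items():
        for concept in children:
            out.extend(t.format(concept=concept, parent=parent) for t in TEMPLATES)
    return out

for line in synthesize_instructions(OCEAN_KG)[:4]:
    print(line)
```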

If this is right

  • Models trained on OceanPile demonstrate significant performance improvements on ocean-related tasks.
  • Domain-specific multimodal foundation models become feasible for marine science applications.
  • The public datasets accelerate research in marine artificial intelligence.
  • Unified schemas and semantic alignment across ocean modalities reduce the previous data fragmentation barrier.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the alignment approach succeeds, analogous corpora could be built for other data-scarce domains such as atmospheric or geological science.
  • Downstream systems using these models might improve predictions of ocean-driven climate effects and biodiversity shifts.
  • The knowledge-graph-guided instruction synthesis could be tested for transfer to other specialized scientific fields.

Load-bearing premise

That the multi-stage quality control and knowledge graph produce data that is scientifically accurate and properly aligned across modalities without introducing noise or misalignment that harms model training.

What would settle it

Train comparable models on OceanPile versus general-domain data and test both on OceanBenchmark; if the OceanPile-trained models show no improvement, or even a decline, the corpus's claimed value is falsified.
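
The paper does not state which statistical procedure backs the word "significant", so here is one standard choice, a paired bootstrap over per-item benchmark scores, as an illustrative sketch of the test described above.

```python
import random

def paired_bootstrap_p(scores_a: list[float], scores_b: list[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Fraction of resamples of the same benchmark items on which model A
    fails to beat model B; a small value supports a genuine improvement."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n, failures = len(scores_a), 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) <= sum(scores_b[i] for i in idx):
            failures += 1
    return failures / n_resamples

# scores_a: per-item accuracy (0 or 1) of the OceanPile-trained model on
# OceanBenchmark; scores_b: the general-domain baseline on the same items.
# A result near or above 0.05 would leave the claimed gains unsupported.
```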

Figures

Figures reproduced from arXiv: 2605.00877 by Daxiong Ji, Guozhou Zheng, Huajun Chen, Ningyu Zhang, Tingwei Wu, Yida Xue, Zhao Wang, Zhe Ma.

Figure 1. An overview of OceanPile, which comprises three components: OceanCorpus, OceanInstruction, and OceanBenchmark. view at source ↗
Figure 2. A comprehensive overview of the framework. view at source ↗
Figure 3. Case analysis. view at source ↗
Original abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces OceanPile, a large-scale multimodal corpus for ocean foundation models. It consists of OceanCorpus (a unified integration of sonar data, underwater imagery, marine visuals, and scientific text from diverse sources), OceanInstruction (a high-quality instruction dataset synthesized via a pipeline guided by a hierarchical Ocean Concept Knowledge Graph), and OceanBenchmark (a manually curated evaluation set). The work describes a multi-stage quality control process to ensure scientific validity and cross-modal alignment, and claims that experimental validation shows significant performance improvements for models trained on the data. All components are publicly released.

Significance. If the quality controls prove effective and the performance claims are substantiated with rigorous evidence, OceanPile could meaningfully advance marine AI by addressing the fragmentation, noise, and weak labeling that have constrained MLLM applications in ocean science. The public release of the corpus, instruction data, and benchmark is a clear strength that supports reproducibility and community progress in this underexplored domain.

major comments (2)
  1. [Experimental validation] Experimental validation section: the assertion that 'Experimental validation demonstrates significant performance improvements for models trained on our data' is presented without any quantitative metrics, baselines, error bars, statistical tests, or experimental details. This directly undermines evaluation of the central claim that the corpus yields meaningful gains rather than artifacts of data selection or post-hoc choices.
  2. [Methods / OceanInstruction synthesis] Multi-stage quality control process and Ocean Concept Knowledge Graph (methods section describing OceanInstruction synthesis): no quantitative metrics are reported for cross-modal alignment (e.g., retrieval scores), noise reduction statistics, expert agreement rates, or pre/post-filtering validity checks. Without these, it is impossible to confirm that the pipeline produces scientifically valid, well-aligned data rather than retaining residual misalignment or domain bias that could explain any downstream results.
minor comments (1)
  1. [Abstract] The abstract summarizes the components and claims but omits any concrete performance numbers or alignment statistics; adding one or two key quantitative highlights would improve the summary's informativeness without altering length substantially.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that the current manuscript requires additional quantitative evidence to support the central claims regarding data quality and performance gains. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of results and methods.

Point-by-point responses
  1. Referee: [Experimental validation] Experimental validation section: the assertion that 'Experimental validation demonstrates significant performance improvements for models trained on our data' is presented without any quantitative metrics, baselines, error bars, statistical tests, or experimental details. This directly undermines evaluation of the central claim that the corpus yields meaningful gains rather than artifacts of data selection or post-hoc choices.

    Authors: We acknowledge that the experimental validation section in the submitted manuscript lacks the required quantitative details, baselines, error bars, statistical tests, and experimental protocols. This was an oversight in the initial submission. In the revised manuscript, we will expand this section to include specific performance metrics (e.g., accuracy, F1 scores, or domain-specific measures), comparisons to relevant baselines, error bars from multiple runs, and statistical significance tests. We will also provide full details on the models trained, training hyperparameters, evaluation splits, and any controls for data selection effects to allow rigorous assessment of the claimed improvements. revision: yes

  2. Referee: [Methods / OceanInstruction synthesis] Multi-stage quality control process and Ocean Concept Knowledge Graph (methods section describing OceanInstruction synthesis): no quantitative metrics are reported for cross-modal alignment (e.g., retrieval scores), noise reduction statistics, expert agreement rates, or pre/post-filtering validity checks. Without these, it is impossible to confirm that the pipeline produces scientifically valid, well-aligned data rather than retaining residual misalignment or domain bias that could explain any downstream results.

    Authors: We agree that the methods section describing the OceanInstruction synthesis pipeline and multi-stage quality control currently provides only qualitative descriptions without supporting quantitative metrics. In the revised version, we will add specific quantitative evidence, including cross-modal retrieval scores (e.g., recall@K), noise reduction statistics before and after filtering, inter-annotator or expert agreement rates (e.g., Cohen's kappa), and pre/post-filtering validity checks (each sketched in code after this list). These additions will substantiate the effectiveness of the Ocean Concept Knowledge Graph-guided pipeline in achieving scientific validity and cross-modal alignment. revision: yes
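
Both responses promise numbers without formulas; as a hedged sketch of what the three promised computations typically look like (none of this code is from the paper, and the function names are invented):

```python
import statistics
from collections import Counter

def recall_at_k(ranked: list[list[int]], gold: list[int], k: int = 5) -> float:
    """Cross-modal retrieval: for each query (say, an image), `ranked` lists
    candidate text indices best-first; score a hit if the gold pair is in the top k."""
    hits = sum(1 for cand, g in zip(ranked, gold) if g in cand[:k])
    return hits / len(gold)

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Chance-corrected agreement between two expert annotators."""
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_exp = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)

def mean_and_sem(run_scores: list[float]) -> tuple[float, float]:
    """Aggregate one benchmark metric over several training seeds: the mean plus
    standard error, i.e. the 'error bars from multiple runs' of response 1."""
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, sem
```

Reporting these alongside pre- and post-filtering corpus sizes would answer both major comments directly.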

Circularity Check

0 steps flagged

No significant circularity: data curation paper with no derivation chain

full rationale

This is a dataset construction and release paper whose central contribution is the assembly of OceanPile (OceanCorpus + OceanInstruction + OceanBenchmark) from external sources, followed by a described multi-stage QC pipeline and knowledge-graph-guided synthesis. No equations, fitted parameters, predictions, or mathematical derivations appear in the provided text. Claims of performance improvements are presented as experimental outcomes on the released data rather than reductions to prior fitted values or self-citations. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling are present. The paper is therefore self-contained against external benchmarks and sources; the absence of any derivation chain precludes circularity by the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required for the central claim of dataset introduction; the work relies on standard data aggregation practices and public sources.

pith-pipeline@v0.9.0 · 5555 in / 1152 out tokens · 31437 ms · 2026-05-09T20:48:03.077746+00:00 · methodology

