arxiv: 2606.11268 · v1 · pith:ZIVYHV34new · submitted 2026-06-09 · 💻 cs.LG

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

Pith reviewed 2026-06-27 13:57 UTC · model grok-4.3

classification 💻 cs.LG
keywords foundation model · time series forecasting · aquatic ecosystems · lake dynamics · irregular sampling · multivariate time series · ecological modeling

The pith

LakeFM is a foundation model pre-trained on irregular lake time series that generalizes across new lakes and matches or beats existing forecasters while producing physically consistent outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LakeFM as a single model pre-trained on large collections of both simulated and observed lake data that arrive with irregular timing, multiple variables, and varying depths. It shows that this pre-training produces representations covering broad lake characteristics and yields forecasting results that are competitive with or better than specialized time-series models on lakes not seen in training. The work argues that the resulting predictions remain consistent with known physical lake behavior. A reader would care because existing methods require regular sampling and lake-specific tuning, limiting their use across the many lakes that differ in variables, depths, and observation patterns.
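
To make the data shape concrete, the sketch below shows one way irregular, multi-variable, multi-depth observations might be held without forcing them onto a regular time-depth grid. It is an illustration only and not LakeFM's tokenizer: the `Obs` record, the `to_tokens` helper, and the variable names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Obs:
    """One measurement; field names and units are assumptions for illustration."""
    time: float     # e.g. days since an arbitrary reference date
    depth: float    # metres below the surface
    variable: str   # e.g. "temp" or "do"
    value: float


def to_tokens(observations: List[Obs],
              variables: List[str]) -> List[Tuple[float, float, int, float]]:
    """Flatten observations into (time, depth, variable_index, value) tuples.

    Nothing is resampled or imputed: a variable that was not measured at a
    given time or depth simply has no tuple.
    """
    var_index = {name: i for i, name in enumerate(variables)}
    tokens = [(o.time, o.depth, var_index[o.variable], o.value)
              for o in observations]
    # Tuples sort lexicographically: by time, then depth, then variable index.
    return sorted(tokens)


obs = [
    Obs(2.5, 0.0, "temp", 21.3),
    Obs(0.0, 5.0, "do", 7.9),
    Obs(0.0, 0.0, "temp", 22.1),
]
tokens = to_tokens(obs, ["temp", "do"])
```

A long format like this lets lakes with different variables and depth sets share one representation, which is the kind of heterogeneity the summary describes.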

Core claim

LakeFM, pre-trained on mixed simulated and observed ecological datasets, learns representations that span broader lake-level characteristics and achieves competitive or often superior forecasting performance compared with existing time-series foundation and non-foundation models while producing physically plausible predictions consistent with real-world lake dynamics.

What carries the argument

LakeFM, the foundation model pre-trained to process irregular multivariate multi-depth time series from aquatic systems.

If this is right

  • One model can be applied to lakes with heterogeneous variables and sampling schedules instead of training separate models for each lake.
  • Forecasts remain consistent with physical lake processes, supporting their use in water-quality monitoring.
  • Representations learned during pre-training capture lake-level characteristics that generalize beyond the training distribution.
  • Competitive or superior performance holds against both foundation and non-foundation time-series baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pre-training approach could be tested on other irregularly sampled environmental time series such as river or reservoir networks.
  • If the representations prove robust, they might support downstream tasks like anomaly detection or long-term scenario simulation without lake-specific retraining.
  • Integration with additional data sources that also arrive irregularly could extend coverage to lakes lacking direct observations.

Load-bearing premise

Pre-training on a mix of simulated and observed lakes produces representations that transfer to entirely new lakes despite differences in variables, depths, and observation patterns.
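
Testing this premise requires that evaluation lakes contribute no data at all to training. The sketch below shows a lake-level split under that assumption; `split_by_lake` and its parameters are hypothetical and do not describe the paper's actual protocol, which the material here does not specify.

```python
import random
from typing import Iterable, Set, Tuple


def split_by_lake(lake_ids: Iterable[str],
                  test_fraction: float = 0.2,
                  seed: int = 0) -> Tuple[Set[str], Set[str]]:
    """Split at the lake level so held-out lakes are entirely unseen.

    `lake_ids` may contain repeats (one id per record); only the unique
    lakes are partitioned.
    """
    unique = sorted(set(lake_ids))
    if len(unique) < 2:
        raise ValueError("need at least two lakes to form a split")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Keep at least one lake on each side of the split.
    n_test = min(len(unique) - 1, max(1, round(len(unique) * test_fraction)))
    return set(unique[n_test:]), set(unique[:n_test])


lakes = ["ME", "MO", "FI", "WI", "BARC", "ME", "FI"]
train_lakes, test_lakes = split_by_lake(lakes, test_fraction=0.2, seed=0)
```

Splitting by record instead of by lake would leak lake-specific patterns into training and overstate transfer to new lakes.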

What would settle it

A test on held-out lakes with irregular multi-depth sampling in which LakeFM forecasts are no better than a simple baseline model or violate known physical constraints such as temperature stratification.
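
A minimal sketch of such a test follows. It assumes a persistence forecast as the simple baseline and checks stratification through water density, using a widely used empirical freshwater density approximation; all function names are hypothetical, and the paper's own physical-consistency tests may differ.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)


def persistence_forecast(last_value: float, horizon: int) -> List[float]:
    """Simple baseline: repeat the last observed value over the horizon."""
    return [last_value] * horizon


def water_density(temp_c: float) -> float:
    """Approximate freshwater density (kg/m^3) from temperature in deg C.

    Empirical approximation (assumed here, not taken from the paper);
    density peaks near 4 deg C.
    """
    return 1000.0 * (1.0 - (temp_c + 288.9414) * (temp_c - 3.9863) ** 2
                     / (508929.2 * (temp_c + 68.12963)))


def is_stably_stratified(temps_top_to_bottom: Sequence[float],
                         tol: float = 1e-6) -> bool:
    """True if density never decreases with depth (within tolerance)."""
    rho = [water_density(t) for t in temps_top_to_bottom]
    return all(deeper >= shallower - tol
               for shallower, deeper in zip(rho, rho[1:]))


def claim_is_challenged(model_mse: float, baseline_mse: float,
                        predicted_profiles: Sequence[Sequence[float]]) -> bool:
    """True if the model is no better than the baseline or any predicted
    temperature profile is density-unstable."""
    no_better = model_mse >= baseline_mse
    unstable = any(not is_stably_stratified(p) for p in predicted_profiles)
    return no_better or unstable
```

Checking density instead of raw temperature keeps the test valid in winter, when colder water can sit stably above water near 4 °C.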

Figures

Figures reproduced from arXiv: 2606.11268 by Aanish Pradhan, Abhilash Neog, Anuj Karpatne, Arka Daw, Bennett J. McAfee, Cayelan C. Carey, Emma Marchisin, Kazi Sajeed Mehrab, Medha Sawhney, Paul Hanson, Robert Ladwig, Sepideh Fatemi.

Figure 1: Overview of LakeFM. Tokenization and embedding of irregular multi-variable, multi-depth time-series data shown on the left. Overall model architecture showing decoupled static and dynamic representation learning with joint forecasting and contrastive objectives in the middle, with the decoder shown on the right.
Figure 2: Overall lake-wise prediction performance (MSE)
Figure 3: An example of time-series forecasts of chlorophyll
Figure 4: Overall lake-wise prediction performance (MSE)
Figure 5: Visualizing DO forecasts under masked and no
Figure 7: Static LakeFM embeddings of observed lakes categorized by location (State) and hydrologic regime
Figure 6: Comparing physical consistency of LakeFM &
Figure 9: LakeFM representation for 900 unseen simulation lakes, each corresponding to a different cyanobacteria value. Embeddings of Simulated Lakes. We investigate whether LakeFM’s embeddings of simulated lakes encode information of process-based parameters used to generate the simulations
Figure 10: Model and training ablations evaluated using MSE
Figure 11: PCA of lake similarity with clusters obtained by
Figure 12: Comparison of LakeFM and iTransformer on the Beer-Lambert Law and Vertical stratification tests, evaluated across
Figure 13: Lake Embedding trajectories comparing the com
Figure 15: Dynamic Embedding-based trajectories for simulation lakes sampled from low, intermediate, and high
Figure 16: Variate Masking - No masking vs Masked prediction Plots for Lake PRLA
Figure 17: Variate Masking - No masking vs Masked prediction Plots for Lake BARC
Figure 18: Variate Masking - No masking vs Masked Prediction Plots for Lake CB
Figure 19: Per-lake performance deltas visualized as heatmaps. For each lake, we show
Figure 20: Per-lake performance deltas, on passing a single variate as input, visualized as heatmaps. For each lake, we show
Figure 21: Depth Masking - Prediction performance visualization in the shallow region under no masking, masking the shallow
Figure 22: Depth Masking - Prediction performance visualization under no masking, masking the shallow layers and masking
Figure 23: Prediction plots for four lakes (FI, ME, MO, WI) for water temperature at depth 0.0m.
Original abstract

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have recently been applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces LakeFM, a foundation model for aquatic ecosystems pre-trained on large-scale datasets of both simulated and observed lakes. It targets irregular multivariate multi-depth time series data, claiming to learn meaningful representations across broader lake-level characteristics and to achieve competitive or superior forecasting performance relative to existing time-series foundation and non-foundation models while producing physically plausible predictions consistent with real-world lake dynamics.

Significance. If the empirical claims hold after detailed verification, the work would address an important gap in ecological time-series modeling by enabling generalization across lakes with heterogeneous variables, depths, and sampling patterns. The combination of simulated and observed data for pre-training is a potentially valuable design choice for robustness. However, the absence of any architecture details, equations, dataset composition, evaluation protocol, baselines, or quantitative results in the supplied material means the significance cannot be assessed beyond the level of a promising but unverified proposal.

major comments (1)
  1. No methods, architecture, data splits, baselines, or error metrics are supplied anywhere in the manuscript. This renders the central claim of competitive or superior forecasting performance unverifiable and prevents any technical assessment of whether pre-training on mixed simulated/observed lakes actually enables the stated generalization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for methodological transparency. The central concern is that the supplied manuscript lacks sufficient technical details to assess the claims. We address this point below and commit to a substantial revision.

Point-by-point responses
  1. Referee: No methods, architecture, data splits, baselines, or error metrics are supplied anywhere in the manuscript. This renders the central claim of competitive or superior forecasting performance unverifiable and prevents any technical assessment of whether pre-training on mixed simulated/observed lakes actually enables the stated generalization.

    Authors: We agree that the version provided to the referee does not contain the required technical details. The abstract references an 'extensive empirical evaluation,' but the body of the supplied material omits the LakeFM architecture (including equations and model components), dataset composition and splits for simulated versus observed lakes, the precise evaluation protocol, chosen baselines, and quantitative error metrics. This omission prevents verification of the performance claims and the generalization argument. In the revised manuscript we will add a dedicated Methods section with full architectural specifications, pre-training and fine-tuning procedures, data characteristics, evaluation metrics, baseline implementations, and all quantitative results. We will also include ablation studies on the mixed simulated/observed pre-training strategy to directly support the generalization claims.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes a standard pre-training pipeline on mixed simulated/observed lake data followed by empirical forecasting evaluation on held-out lakes. The abstract and provided context contain no equations, no fitted parameters renamed as predictions, and no self-citation chains that reduce the central claim to its own inputs by construction. The generalization claim is presented as an empirical outcome rather than a definitional or fitted tautology, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5711 in / 1091 out tokens · 16993 ms · 2026-06-27T13:57:13.188580+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 15 canonical work pages · 4 internal anchors

  1. [1]

    Abdul Fatir Ansari, Oleksandr Shchur, Jaris Küken, Andreas Auer, Boran Han, Pedro Mercado, Syama Sundar Rangapuram, Huibin Shen, Lorenzo Stella, Xiyuan Zhang, et al. 2025. Chronos-2: From univariate to universal forecasting. arXiv preprint arXiv:2510.15821 (2025)

  2. [2]

    Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebastian Pineda Arango, Shubham Kapoor, et al. 2024. Chronos: Learning the language of time series. arXiv preprint arXiv:2403.07815 (2024)

  3. [3]

    August Beer and P Beer. 1852. Determination of the absorption of red light in colored liquids. Annalen der Physik und Chemie 86, 5 (1852), 78–88

  4. [4]

    Yuqi Chen, Kan Ren, Yansen Wang, Yuchen Fang, Weiwei Sun, and Dongsheng Li. 2023. Contiformer: Continuous-time transformer for irregular time series modeling. Advances in Neural Information Processing Systems 36 (2023), 47143–47175

  5. [5]

    Ben Cohen, Emaad Khwaja, Kan Wang, Charles Masson, Elise Ramé, Youssef Doubli, and Othmane Abou-Amal. 2024. Toto: Time series optimized transformer for observability. arXiv preprint arXiv:2407.07874 (2024)

  6. [6]

    Jessica Corman, Jacob Zwart, Jennifer Klug, Denise Bruesewitz, Elvira de Eyto, Marcus Klaus, Lesley Knoll, James Rusak, Michael Vanni, Maria Belen Alfonso, et al. 2023. High-frequency dissolved oxygen, water temperature, wind speed, and radiation data; stream and in-lake nutrient concentration data; and daily metabolism and nutrient loading estimates for ...

  7. [7]

    Arka Daw, Anuj Karpatne, William D Watkins, Jordan S Read, and Vipin Kumar

  8. [8]

    In Knowledge guided machine learning

    Physics-guided neural networks (pgnn): An application in lake temperature modeling. In Knowledge guided machine learning. Chapman and Hall/CRC, 353–372

  9. [9]

    Wenjie Du, David Côté, and Yan Liu. 2023. Saits: Self-attention-based imputation for time series. Expert Systems with Applications 219 (2023), 119619

  10. [10]

    Mononito Goswami, Konrad Szafer, Arjun Choudhry, Yifu Cai, Shuo Li, and Artur Dubrawski. 2024. Moment: A family of open time-series foundation models. arXiv preprint arXiv:2402.03885 (2024). KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea Neog, et al

  11. [11]

    P. C. Hanson, R. Ladwig, C. Buelo, E. A. Albright, A. D. Delany, and C. C. Carey

  12. [12]

    doi:10.1029/2023JG007620

    Legacy Phosphorus and Ecosystem Memory Control Future Water Quality in a Eutrophic Lake. Journal of Geophysical Research: Biogeosciences 128, 12 (2023), e2023JG007620. doi:10.1029/2023JG007620

  13. [13]

    M. R. Hipsey, L. C. Bruce, C. Boon, B. Busch, C. C. Carey, D. P. Hamilton, P. C. Hanson, J. S. Read, E. de Sousa, M. Weber, and L. A. Winslow. 2019. A General Lake Model (GLM 3.0) for linking with high-frequency sensor data from the Global Lake Ecological Observatory Network (GLEON). Geoscientific Model Development 12, 1 (2019), 473–523

  14. [14]

    Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan Read, Jacob Zwart, Michael Steinbach, and Vipin Kumar. 2019. Physics guided RNNs for modeling dynamical systems: A case study in simulating lake temperature profiles. (2019), 558–566

  15. [15]

    Robert Ladwig, Arka Daw, Elen A Albright, Cal Buelo, Anuj Karpatne, Michael Frederick Meyer, Abhilash Neog, Paul C Hanson, and Hilary A Dugan

  16. [16]

    Modular Compositional Learning Improves 1D Hydrodynamic Lake Model Performance by Merging Process-Based Modeling With Deep Learning. Journal of Advances in Modeling Earth Systems 16, 1 (2024), e2023MS003953

  17. [17]

    OC Langman, PC Hanson, SR Carpenter, and YH Hu. 2010. Control of dissolved oxygen in northern temperate lakes over scales ranging from minutes to days. Aquatic Biology 9, 2 (2010), 193–202

  18. [18]

    Boyuan Li, Zhen Liu, Yicheng Luo, and Qianli Ma. 2026. Learning Recursive Multi-Scale Representations for Irregular Multivariate Time Series Forecasting. arXiv preprint arXiv:2602.21498 (2026)

  19. [19]

    Boyuan Li, Yicheng Luo, Zhen Liu, Junhao Zheng, Jianming Lv, and Qianli Ma

  20. [20]

    Hyperimts: Hypergraph neural network for irregular multivariate time series forecasting. arXiv preprint arXiv:2505.17431 (2025)

  21. [21]

    Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. 2023. itransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625 (2023)

  22. [22]

    Bennett J McAfee, Aanish Pradhan, Abhilash Neog, Sepideh Fatemi, Robert T Hensley, Mary E Lofton, Anuj Karpatne, Cayelan C Carey, and Paul C Hanson

  23. [23]

    LakeBeD-US: a benchmark dataset for lake water quality time series and vertical profiles. Earth System Science Data 17, 7 (2025), 3141–3165

  24. [24]

    Abhilash Neog, Arka Daw, Sepideh Fatemi, Medha Sawhney, Aanish Pradhan, Mary E Lofton, Bennett J McAfee, Adrienne Breef-Pilz, Heather L Wander, Dexter W Howard, et al. 2026. Investigating a Model-Agnostic and Imputation-Free Approach for Irregularly-Sampled Multivariate Time-Series Modeling. Transactions on Machine Learning Research (2026)

  25. [25]

    Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. 2022. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730 (2022)

  26. [26]

    Harshavardhan Prabhakar Kamarthi and B Aditya Prakash. 2024. Large Pre-trained time series models for cross-domain Time series analysis tasks. Advances in Neural Information Processing Systems 37 (2024), 56190–56214

  27. [27]

    McAfee, Abhilash Neog, Sepideh Fatemi, Mary E

    Aanish Pradhan, Bennett J. McAfee, Abhilash Neog, Sepideh Fatemi, Mary E. Lofton, Cayelan C. Carey, Anuj Karpatne, and Paul C. Hanson. 2024. LakeBeD-US: Computer Science Edition - a benchmark dataset for lake water quality time series and vertical profiles. doi:10.57967/hf/3771

  28. [28]

    Xiaoming Shi, Shiyu Wang, Yuqi Nie, Dianqi Li, Zhou Ye, Qingsong Wen, and Ming Jin. 2024. Time-moe: Billion-scale time series foundation models with mixture of experts. arXiv preprint arXiv:2409.16040 (2024)

  29. [29]

    Satya Narayan Shukla and Benjamin M Marlin. 2021. Multi-time attention networks for irregularly sampled time series.arXiv preprint arXiv:2101.10318 (2021)

  30. [30]

    Peter A Staehr, Darren Bade, Matthew C Van de Bogert, Gregory R Koch, Craig Williamson, Paul Hanson, Jonathan J Cole, and Tim Kratz. 2010. Lake metabolism and the diel oxygen technique: state of the science. Limnology and Oceanography: Methods 8, 11 (2010), 628–644

  31. [31]

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. 2024. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 568 (2024), 127063

  32. [32]

    Yusuke Tashiro, Jiaming Song, Yang Song, and Stefano Ermon. 2021. Csdi: Conditional score-based diffusion models for probabilistic time series imputation. Advances in neural information processing systems 34 (2021), 24804–24816

  33. [33]

    Jared D Willard, Jordan S Read, Alison P Appling, Samantha K Oliver, Xiaowei Jia, and Vipin Kumar. 2021. Predicting water temperature dynamics of unmonitored lakes with meta-transfer learning. Water Resources Research 57, 7 (2021), e2021WR029579

  34. [34]

    Jared D Willard, Jordan S Read, Simon Topp, Gretchen JA Hansen, and Vipin Kumar. 2022. Daily surface temperatures for 185,549 lakes in the conterminous United States estimated using deep learning (1980–2020). Limnology and Oceanography Letters 7, 4 (2022), 287–301

  35. [35]

    G Woo, C Liu, A Kumar, C Xiong, S Savarese, and D Sahoo. 2024. Unified training of universal time series forecasting transformers. arXiv preprint arXiv:2402.02592 (2024)

  36. [36]

    Youlong Xia, Kenneth Mitchell, Michael Ek, Justin Sheffield, Brian Cosgrove, Eric Wood, Lifeng Luo, Charles Alonge, Helin Wei, Jesse Meng, Ben Livneh, Dennis Lettenmaier, Victor Koren, Qingyun Duan, Kingtse Mo, Yun Fan, and David Mocko. 2012. Continental-scale water and energy flux analysis and validation for the North American Land Data Assimilation Syst...

  37. [37]

    doi:10.1029/2011JD016048

    Intercomparison and application of model products. Journal of Geophysical Research: Atmospheres 117, D3 (2012). doi:10.1029/2011JD016048

  38. [38]

    Runlong Yu, Chonghao Qiu, Robert Ladwig, Paul Hanson, Yiqun Xie, and Xiaowei Jia. 2025. Physics-Guided Foundation Model for Scientific Discovery: An Application to Aquatic Science. arXiv preprint arXiv:2502.06084 (2025). A Dataset Details W...

  39. [39]

    lakes and include both high- and low-frequency measurements

    The data span 21 U.S. lakes and include both high- and low-frequency measurements. In this work, we utilize only the low-frequency measurements. The dataset features 17 variables organized into three categories: (1) static attributes, such as lake morphology and geographic location; (2) one-dimensional (1D) variables that vary over time (e.g., Secchi dept...