DINO-Explorer: Active Underwater Discovery via Ego-Motion Compensated Semantic Predictive Coding
Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3
The pith
Ego-motion compensated DINOv3 predictions let AUVs flag underwater novelty online and cut telemetry bandwidth by 48 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DINO-Explorer shows that operating inside the latent space of a frozen DINOv3 foundation model with an action-conditioned recurrent predictor and an efference-copy optical-flow compensation module yields a continuous semantic surprise signal suitable for asynchronous event triage. At the chosen operating point the signal keeps 78.8 percent of post-discovery human-reviewer consensus events, achieves a 56.8 percent trigger confirmation rate, suppresses 45.5 percent of false positives relative to an uncompensated baseline, and reaches a peak F1 score of 62.2 percent while reducing telemetry bandwidth by 48.2 percent.
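The ratios quoted above can be related to event-level counts. The following is a minimal sketch, not the paper's evaluation code: it assumes retention behaves like recall over reviewer-consensus events, the trigger confirmation rate behaves like precision over triggers, and false-positive suppression is a relative reduction against the baseline. All function names are hypothetical, and the paper's exact definitions are not given on this page.

```python
def retention(events_hit: int, events_total: int) -> float:
    """Fraction of consensus events covered by at least one trigger (assumed definition)."""
    if events_total <= 0:
        raise ValueError("events_total must be positive")
    return events_hit / events_total


def confirmation_rate(triggers_confirmed: int, triggers_total: int) -> float:
    """Fraction of triggers that reviewers confirmed (assumed definition)."""
    if triggers_total <= 0:
        raise ValueError("triggers_total must be positive")
    return triggers_confirmed / triggers_total


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline: float, value: float) -> float:
    """Relative drop of `value` against a non-zero `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - value) / baseline
```

Under these assumed definitions, a precision of 0.568 and a recall of 0.788 would give an F1 of about 0.660 rather than 0.622, so the reported peak F1 is presumably computed under a different matching protocol or evaluation set; this page does not say which.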
What carries the argument
The ego-motion compensated semantic surprise signal, formed by short-horizon recurrent predictions over DINOv3 latents and globally pooled optical flow that removes self-induced visual changes.
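One way to picture this signal is as a prediction error in latent space that is discounted when the camera itself is moving. The sketch below is illustrative only: the cosine-distance error, the divisive discount, and the `gain` parameter are assumptions made for the example, and the paper's actual formulation is not given on this page.

```python
import math
from typing import Sequence, Tuple

FlowVector = Tuple[float, float]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; treated as maximal (1.0) if either vector is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def pooled_flow_magnitude(flow: Sequence[FlowVector]) -> float:
    """Magnitude of the mean flow vector over the whole image (global pooling)."""
    if not flow:
        return 0.0
    mean_u = sum(u for u, _ in flow) / len(flow)
    mean_v = sum(v for _, v in flow) / len(flow)
    return math.hypot(mean_u, mean_v)


def compensated_surprise(
    predicted: Sequence[float],
    observed: Sequence[float],
    flow: Sequence[FlowVector],
    gain: float = 1.0,
) -> float:
    """Latent prediction error, divisively discounted by pooled ego-motion.

    The divisive form and `gain` are assumptions for this sketch.
    """
    raw = cosine_distance(predicted, observed)
    ego_motion = pooled_flow_magnitude(flow)
    return raw / (1.0 + gain * ego_motion)
```

In this sketch the flow vectors are averaged before the magnitude is taken, so a coherent camera translation contributes fully to the discount, while local motion in mixed directions largely cancels out.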
If this is right
- Transmission concentrates around human-verified novelty events rather than uniform logging.
- Ego-motion conditioning reduces false triggers by 45.5 percent compared with the uncompensated baseline surprise signal.
- The method dominates the validated peak F1 versus telemetry-bandwidth frontier in replay ablation studies.
- AUVs can shift from exhaustive passive recording to selective active monitoring under strict bandwidth limits.
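The shift from uniform logging to selective transmission can be illustrated with a simple threshold trigger. This is a hedged sketch, assuming equal-sized frames, a fixed threshold, and a symmetric context window around each trigger; `select_frames`, `bandwidth_reduction`, and the `context` parameter are hypothetical names, not the paper's triage policy.

```python
from typing import List, Sequence


def select_frames(
    surprise: Sequence[float], threshold: float, context: int = 1
) -> List[bool]:
    """Mark frames to transmit.

    A frame is kept if its surprise exceeds `threshold`, or if it lies
    within `context` frames of such a trigger.
    """
    n = len(surprise)
    keep = [False] * n
    for i, value in enumerate(surprise):
        if value > threshold:
            start = max(0, i - context)
            stop = min(n, i + context + 1)
            for j in range(start, stop):
                keep[j] = True
    return keep


def bandwidth_reduction(keep: Sequence[bool]) -> float:
    """Fraction of frames not sent, relative to sending every frame.

    Assumes all frames cost the same to transmit.
    """
    if not keep:
        return 0.0
    return 1.0 - sum(keep) / len(keep)
```

With ten frames, a single trigger, and one frame of context on each side, three frames are sent and the reduction is 0.7. A real system would also need to account for variable frame sizes and for the cost of transmitting the trigger metadata itself.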
Where Pith is reading between the lines
- The same compensated predictive-coding pattern could be tested on other mobile platforms where self-motion creates visual clutter, such as aerial drones surveying dynamic terrain.
- Pairing the frozen DINOv3 backbone with lightweight domain-specific adapters might improve detection of particular marine phenomena without retraining the entire model.
- Longer AUV missions become feasible if reduced data volume allows smaller onboard storage or lower-power transmitters.
Load-bearing premise
That prediction error in DINOv3 latent space reliably marks mission-relevant scientific phenomena, and that optical-flow compensation cleanly separates vehicle motion from genuine environmental novelty without discarding important non-semantic events.
What would settle it
A deployment trial in which the system misses a known high-value transient event later confirmed by human review of the full video, or in which compensated false-positive rates remain comparable to the uncompensated baseline across varied lighting and current conditions.
Original abstract
Marine ecosystem degradation necessitates continuous, scientifically selective underwater monitoring. However, most autonomous underwater vehicles (AUVs) operate as passive data loggers, capturing exhaustive video for offline review and frequently missing transient events of high scientific value. Transitioning to active perception requires a causal, online signal that highlights significant phenomena while suppressing maneuver-induced visual changes. We propose DINO-Explorer, a novelty-aware perception framework driven by a continuous semantic surprise signal. Operating within the latent space of a frozen DINOv3 foundation model, it leverages a lightweight, action-conditioned recurrent predictor to anticipate short-horizon semantic evolution. An efference-copy-inspired module utilizes globally pooled optical flow to discount self-induced visual changes without suppressing genuine environmental novelty. We evaluate this signal on the downstream task of asynchronous event triage under variant telemetry constraints. Results demonstrate that DINO-Explorer provides a robust, bandwidth-efficient attention mechanism. At a fixed operating point, the system retains 78.8% of post-discovery human-reviewer consensus events with a 56.8% trigger confirmation rate, effectively surfacing mission-relevant phenomena. Crucially, ego-motion conditioning suppresses 45.5% of false positives relative to an uncompensated surprise signal baseline. In a replay-side Pareto ablation study, DINO-Explorer robustly dominates the validated peak F1 versus telemetry bandwidth frontier, reducing telemetry bandwidth by 48.2% at the selected operating point while maintaining a 62.2% peak F1 score, successfully concentrating data transmission around human-verified novelty events.
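The abstract describes a lightweight, action-conditioned recurrent predictor over frozen DINOv3 latents, but this page does not give its architecture. As a rough illustration only, the sketch below uses a single Elman-style step in which the hidden state is updated from the previous hidden state, the current latent, and the vehicle action, and the next latent is read out linearly. The Elman form, the weight names, and the tiny dimensions are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(matrix: Matrix, vector: Sequence[float]) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def predictor_step(
    hidden: Sequence[float],
    latent: Sequence[float],
    action: Sequence[float],
    w_hidden: Matrix,
    w_latent: Matrix,
    w_action: Matrix,
    w_out: Matrix,
) -> Tuple[Vector, Vector]:
    """One recurrent step (assumed Elman form).

    h' = tanh(W_h h + W_z z + W_a a);  z_hat = W_out h'
    Returns the new hidden state and the predicted next latent.
    """
    pre_activation = [
        h + z + a
        for h, z, a in zip(
            matvec(w_hidden, hidden),
            matvec(w_latent, latent),
            matvec(w_action, action),
        )
    ]
    new_hidden = [math.tanh(x) for x in pre_activation]
    return new_hidden, matvec(w_out, new_hidden)
```

Feeding the action into the state update is what makes the prediction action-conditioned: a commanded turn can shift the predicted latent before the visual change arrives, so the change is not counted as surprise.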
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DINO-Explorer, a novelty-aware perception framework for AUVs that generates a continuous semantic surprise signal in the latent space of a frozen DINOv3 model. It uses a lightweight action-conditioned recurrent predictor for short-horizon anticipation and an efference-copy-inspired module with globally pooled optical flow to compensate for self-motion without suppressing genuine novelty. The system is evaluated on the downstream task of asynchronous event triage under telemetry constraints, reporting 78.8% retention of human-reviewer consensus events, 45.5% false-positive suppression relative to an uncompensated baseline, 48.2% bandwidth reduction, and 62.2% peak F1 at a selected operating point in a replay-side Pareto ablation study.
Significance. If the online causal performance holds, this would represent a meaningful advance in active underwater perception by enabling selective, bandwidth-efficient data transmission that concentrates around scientifically relevant events. The work effectively adapts a foundation model (DINOv3) for robotics via predictive coding and introduces a practical ego-motion compensation technique; the concrete metrics on retention, false-positive reduction, and bandwidth savings provide a clear benchmark for future comparisons.
major comments (2)
- [Evaluation] Evaluation section (replay-side Pareto ablation study): The central claim requires a causal, online signal that runs on AUVs to triage events under live telemetry constraints. However, the provided results come exclusively from a replay-side Pareto ablation study on pre-recorded data, reporting metrics such as 78.8% retention and 48.2% bandwidth reduction. This setup does not test whether the lightweight action-conditioned recurrent predictor and globally pooled optical-flow efference-copy module sustain the required frame-rate latency on embedded hardware, nor does it expose the surprise signal to live sensor noise, variable currents, or packet loss that could alter trigger confirmation rates.
- [Abstract] Abstract and methods description: The efference-copy-inspired module is claimed to discount self-induced visual changes without suppressing genuine environmental novelty, yet no quantitative analysis or ablation isolates its effect on non-semantic events or demonstrates robustness when optical-flow estimation is noisy. This assumption is load-bearing for the claim that the system surfaces mission-relevant phenomena while maintaining high retention.
minor comments (2)
- [Abstract] The abstract refers to 'variant telemetry constraints' and 'post-discovery human-reviewer consensus events' without specifying the exact constraints tested or the review protocol used to establish consensus; adding these details would improve reproducibility.
- [Evaluation] The paper would benefit from explicit reporting of the full Pareto frontier (not just the selected operating point) and any statistical measures (e.g., variance across sequences) to support the claim of robust dominance.
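Reporting the full frontier, as the second minor comment requests, amounts to listing every operating point that no other point beats on both axes. A minimal sketch follows, assuming each operating point is a pair of (bandwidth fraction, F1), that lower bandwidth and higher F1 are both better, and that the points are supplied by the evaluator; the function name and the example values in the usage note are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (bandwidth_fraction, f1)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing bandwidth.

    A point is dominated if another distinct point uses no more bandwidth
    and achieves at least the same F1.
    """
    frontier: List[Point] = []
    best_f1 = float("-inf")
    # Sort by bandwidth ascending; break ties by F1 descending so the
    # best point at each bandwidth level is seen first.
    for bandwidth, f1 in sorted(set(points), key=lambda p: (p[0], -p[1])):
        if f1 > best_f1:
            frontier.append((bandwidth, f1))
            best_f1 = f1
    return frontier
```

For the made-up points (0.5, 0.6), (0.3, 0.5), (0.7, 0.55), (0.3, 0.4), and (0.9, 0.7), the frontier is (0.3, 0.5), (0.5, 0.6), (0.9, 0.7). Computing it per sequence would also give the variance that the comment asks for.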
Simulated Author's Rebuttal
We thank the referee for the constructive review and for highlighting important aspects of our evaluation and the ego-motion compensation analysis. We respond to each major comment below, clarifying the scope of our current results and outlining targeted revisions.
- Referee: [Evaluation] Evaluation section (replay-side Pareto ablation study): The central claim requires a causal, online signal that runs on AUVs to triage events under live telemetry constraints. However, the provided results come exclusively from a replay-side Pareto ablation study on pre-recorded data, reporting metrics such as 78.8% retention and 48.2% bandwidth reduction. This setup does not test whether the lightweight action-conditioned recurrent predictor and globally pooled optical-flow efference-copy module sustain the required frame-rate latency on embedded hardware, nor does it expose the surprise signal to live sensor noise, variable currents, or packet loss that could alter trigger confirmation rates.
  Authors: We agree that direct validation on embedded hardware under live conditions would strengthen the practicality claims. Our evaluation employs a replay-side simulation of the exact causal online pipeline on pre-recorded data to enable controlled Pareto ablations across telemetry budgets and isolate the surprise signal's contribution. This design choice supports rigorous comparison without the variability of field trials. The recurrent predictor and global optical-flow pooling are deliberately lightweight to target real-time operation. In revision we will add a dedicated subsection on computational complexity, estimated frame-rate latency on representative AUV hardware, and qualitative discussion of robustness to sensor noise, currents, and packet loss. New live hardware experiments remain outside the scope of this revision.
  revision: partial
- Referee: [Abstract] Abstract and methods description: The efference-copy-inspired module is claimed to discount self-induced visual changes without suppressing genuine environmental novelty, yet no quantitative analysis or ablation isolates its effect on non-semantic events or demonstrates robustness when optical-flow estimation is noisy. This assumption is load-bearing for the claim that the system surfaces mission-relevant phenomena while maintaining high retention.
  Authors: The quantitative benefit of the efference-copy module is shown by the 45.5% false-positive reduction relative to the uncompensated baseline at comparable retention (78.8%). This direct comparison isolates the module's role in suppressing maneuver-induced triggers. Global pooling of optical flow provides averaging that confers partial robustness to local estimation noise. We will add an explicit ablation study in the methods and results sections that isolates the module on non-semantic events and includes sensitivity analysis under controlled optical-flow noise levels.
  revision: yes
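The authors' remark that global pooling averages out local estimation noise can be checked numerically in an idealized setting. The sketch below assumes independent, zero-mean Gaussian noise on each flow vector, which is an assumption of this example and not something the paper states; correlated or biased flow errors (for instance from turbidity) would not average out in the same way. The function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

FlowVector = Tuple[float, float]


def noisy_flow(
    true_flow: FlowVector, sigma: float, count: int, rng: random.Random
) -> List[FlowVector]:
    """Sample `count` per-pixel flow estimates with i.i.d. Gaussian noise."""
    return [
        (true_flow[0] + rng.gauss(0.0, sigma), true_flow[1] + rng.gauss(0.0, sigma))
        for _ in range(count)
    ]


def pool(flow: List[FlowVector]) -> FlowVector:
    """Global mean pooling of flow vectors."""
    n = len(flow)
    return (sum(u for u, _ in flow) / n, sum(v for _, v in flow) / n)


def pooled_error(
    true_flow: FlowVector, sigma: float, count: int, seed: int = 0
) -> float:
    """Distance between the pooled estimate and the true ego-motion flow."""
    rng = random.Random(seed)  # seeded for repeatability
    u, v = pool(noisy_flow(true_flow, sigma, count, rng))
    return math.hypot(u - true_flow[0], v - true_flow[1])
```

Under the i.i.d. assumption the per-component standard deviation of the pooled estimate shrinks as sigma divided by the square root of the number of vectors, so calling `pooled_error` with a growing `count` should show the error falling. The promised sensitivity analysis would need to test how far real flow errors depart from this ideal.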
Circularity Check
No significant circularity; empirical triage metrics are independent of model definition
The paper defines DINO-Explorer via explicit components (frozen DINOv3 latents, action-conditioned recurrent predictor, globally pooled optical-flow efference copy) and then measures its output signal on a separate downstream triage task against human-reviewer consensus labels. Retention (78.8%), F1 (62.2%), bandwidth reduction (48.2%), and false-positive suppression (45.5%) are reported as measured outcomes under telemetry constraints rather than quantities that reduce by construction to the predictor equations or to any fitted parameter. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the provided text; the central claim therefore remains externally falsifiable and does not collapse into its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- triage operating point
axioms (1)
- domain assumption: Frozen DINOv3 model produces semantically meaningful latent features for underwater imagery without fine-tuning
invented entities (1)
- efference-copy-inspired ego-motion compensation module (no independent evidence)
Appendix A: Qualitative Examples of Surprise Events
This appendix provides qualitative examples for three non-biological surprise categories: habitat transitions, turbidity bursts, and illumination changes. Each ...