WATCH: Wide-Area Archaeological Site Tracking for Change Detection
Pith reviewed 2026-05-12 01:20 UTC · model grok-4.3
The pith
Unsupervised scoring of satellite image embeddings localizes archaeological site disturbances to the month without needing event labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WATCH performs month-level change-event localization on PlanetScope mosaics (4.7 m/px, 2017-2024) through three complementary scorers: Temporal Embedding Distance (TED), which quantifies month-to-month deviations from a local temporal reference using foundation-model embeddings; Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and a Weakly Supervised (WS) temporal localization model. On 1,943 Afghan sites, TED with SatMAE embeddings achieves 55 percent exact-month recall, while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5 percent recall at three-month tolerance; unsupervised TED and SSCD consistently beat the weakly supervised alternative.
What carries the argument
Temporal Embedding Distance (TED), a training-free scorer that measures deviations between monthly satellite patch embeddings and a local temporal reference built from the same site.
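To make the idea concrete, here is a minimal sketch of a TED-style scorer. It is not the authors' implementation: the page only says that TED compares each month with a local temporal reference built from the same site. The trailing window of 6 months, the element-wise median as the reference, the cosine distance, and the name `ted_scores` are all assumptions chosen for illustration.

```python
import numpy as np

def ted_scores(embeddings: np.ndarray, window: int = 6) -> np.ndarray:
    """Cosine distance of each month to the median of the preceding months.

    embeddings: array of shape (T, D), one row per monthly patch embedding.
    window: number of trailing months used as the reference (assumed value).
    Returns shape (T,); month 0 has no history and is left as NaN.
    """
    # L2-normalize so the dot product equals cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = np.full(len(z), np.nan)
    for t in range(1, len(z)):
        # Assumed reference: element-wise median over the trailing window.
        ref = np.median(z[max(0, t - window):t], axis=0)
        ref /= np.linalg.norm(ref)
        scores[t] = 1.0 - float(z[t] @ ref)
    return scores

# Synthetic example: 24 months of 8-D embeddings with an abrupt change at month 15.
rng = np.random.default_rng(0)
before = np.ones(8)
after = np.array([1.0, -1.0] * 4)  # orthogonal to `before`
series = np.stack([
    (before if t < 15 else after) + 0.01 * rng.normal(size=8)
    for t in range(24)
])
scores = ted_scores(series, window=6)
print(np.round(scores[13:19], 3))
```

With a median reference, the score stays high for a few months after the change until the window fills with post-change embeddings. That behavior fits a confirmation-oriented detector, though how the paper turns scores into a single predicted month is not stated on this page.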
If this is right
- Unsupervised TED and SSCD can be applied to new regions without collecting event-month labels.
- Different foundation-model embeddings produce distinct temporal biases, allowing selection for early-warning or post-event confirmation needs.
- Handcrafted spectral and texture features remain competitive for exact-month detection when weak labels are available.
- The same pipeline supports scalable monitoring of cultural heritage across wide areas once the initial reference period is established.
Where Pith is reading between the lines
- Combining TED and SSCD could reduce both missed early signals and late confirmations in operational systems.
- The cross-regional results suggest the method may transfer to monitoring other sparse, wide-area phenomena such as illegal construction or environmental damage.
- If integrated with near-real-time satellite feeds, the framework could generate alerts for site managers months before conventional inspection.
- The observed embedding-specific biases indicate that pre-training data domain (e.g., general vs. remote-sensing) influences whether detection leans early or late.
Load-bearing premise
The sparse event-month labels for the 1,943 Afghan sites are accurate and representative of actual disturbance timing, and visual changes visible at 4.7 m/px reliably match the recorded events.
What would settle it
Independent field verification or higher-resolution imagery for a random subset of the sites that shows the exact-month recall falling substantially below 55 percent or the three-month recall below 90 percent.
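For readers checking such figures, the recall-at-tolerance metric can be sketched as below. This assumes one predicted month and one labeled month per site, both expressed as integer month indices, and counts a site as a hit when the two differ by at most m months. The function name and the toy numbers are hypothetical; the paper's exact protocol may differ.

```python
import numpy as np

def recall_at_tolerance(predicted, labeled, m: int) -> float:
    """Fraction of sites whose predicted month is within m months of the label."""
    predicted = np.asarray(predicted)
    labeled = np.asarray(labeled)
    return float(np.mean(np.abs(predicted - labeled) <= m))

# Toy data (not from the paper): month indices for four sites.
predicted_months = [10, 14, 30, 41]
labeled_months = [10, 12, 33, 50]
exact_recall = recall_at_tolerance(predicted_months, labeled_months, m=0)
three_month_recall = recall_at_tolerance(predicted_months, labeled_months, m=3)
print(exact_recall, three_month_recall)
```

Re-running the same function against field-verified months for a random subset of sites would give the comparison described above.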
Original abstract
Monitoring archaeological sites at scale is vital for protecting cultural heritage, yet pinpointing when disturbances occur remains difficult because visual cues are subtle and ground-truth data are sparse. We introduce WATCH, a framework for month-level change-event localization over PlanetScope satellite mosaics (2017-2024, 4.7 m/px) that supports three complementary scoring approaches: (i) Temporal Embedding Distance (TED), a training-free method that scores month-to-month deviations from a local temporal reference; (ii) Self-Supervised Change Detection (SSCD), an ensemble of reconstruction, forecasting, and latent-novelty signals; and (iii) a Weakly Supervised (WS) temporal localization model trained with sparse event-month labels. We benchmark WATCH on 1,943 archaeological sites in Afghanistan using embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, and Satlas-Pretrain) alongside a handcrafted spectral and texture baseline, and assess cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt. The unsupervised approaches (TED, SSCD) consistently outperform the weakly supervised alternative. TED with SatMAE achieves the highest exact-month recall (55% at m=0), while TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% within a three-month tolerance (m=3). Handcrafted features remain competitive for exact-month detection under weak supervision. Our directional margin analysis reveals systematic temporal biases: SSCD paired with GeoRSCLIP or Prithvi-EO-2.0 exhibits the strongest early-warning profile, detecting anomalies before the recorded event, while TED favors confirmation-oriented detection after a change has materialized. These results show that satellite imagery combined with foundation-model embeddings enables scalable, decision-relevant heritage monitoring. Code: https://github.com/microsoft/WATCH
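The directional margin analysis mentioned in the abstract can be illustrated with a signed offset between the predicted and recorded months. The sketch below is one plausible reading, not the authors' definition. It assumes integer month indices, treats negative offsets as early detections and positive offsets as late ones, and uses a hypothetical function name and toy data.

```python
import numpy as np

def directional_profile(predicted, labeled) -> dict:
    """Summarize whether detections tend to precede or follow the recorded event."""
    offset = np.asarray(predicted) - np.asarray(labeled)  # < 0 means early
    return {
        "early": float(np.mean(offset < 0)),
        "exact": float(np.mean(offset == 0)),
        "late": float(np.mean(offset > 0)),
        "median_offset": float(np.median(offset)),
    }

# Toy data (not from the paper): offsets are [-2, 0, 2, 0].
profile = directional_profile([8, 12, 35, 50], [10, 12, 33, 50])
print(profile)
```

Under this reading, a scorer with a large "early" share would suit early-warning use, while a large "late" share would suit post-event confirmation.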
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the WATCH framework for month-level change-event localization at archaeological sites using PlanetScope satellite mosaics (4.7 m/px, 2017-2024). It defines three scoring approaches: Temporal Embedding Distance (TED, training-free), Self-Supervised Change Detection (SSCD, an ensemble of reconstruction, forecasting, and novelty signals), and Weakly Supervised (WS) temporal localization. It benchmarks them on 1,943 Afghan sites with embeddings from six foundation models (CLIP, GeoRSCLIP, SatMAE, Prithvi-EO-2.0, DINOv3, Satlas-Pretrain) plus a handcrafted baseline. The central claims are that the unsupervised methods (TED, SSCD) outperform WS, that TED+SatMAE reaches 55% exact-month recall (m=0), and that TED with GeoRSCLIP, CLIP, or Satlas-Pretrain reaches 92.5% recall within m=3. The manuscript also reports cross-regional tests and a directional margin analysis of temporal biases.
Significance. If the results hold, the work demonstrates a practical, scalable approach to heritage monitoring that leverages pre-trained embeddings without requiring dense labels. The public code repository and evaluation across multiple foundation models plus cross-regional generalization on sites in Syria, Turkey, Pakistan, and Egypt are clear strengths that support reproducibility and broader applicability. The directional bias analysis further aids method selection for early-warning versus confirmation tasks.
major comments (2)
- [§4 (Experiments), Table 2] All reported recall figures (e.g., 55% at m=0 for TED+SatMAE and 92.5% at m=3) are computed directly against the provided sparse event-month labels for the 1,943 sites. No validation of label accuracy, inter-annotator agreement, or sensitivity to timing offsets is presented; systematic reporting lags or attribution errors would invalidate the unsupervised-vs-WS comparison and the headline performance numbers.
- [§3.2 (Data and Preprocessing)] The assumption that visual changes at 4.7 m/px resolution reliably correspond to the recorded disturbance events is load-bearing for interpreting all metrics, yet the manuscript provides no quantitative examples of confirmed visual correspondences, analysis of sub-pixel disturbances, or discussion of spectral ambiguity at this resolution.
minor comments (2)
- [Abstract] The handcrafted spectral and texture baseline is mentioned as competitive but its exact feature definitions are not listed; these should be specified in the methods for full reproducibility.
- [§5 (Cross-regional generalization)] The performance variations across Syria, Turkey, Pakistan, and Egypt are summarized but lack per-region tables or error bars, which would clarify generalization strength.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The two major comments identify important gaps in validating label quality and demonstrating visual change correspondence. We address each point below, agree where revisions are warranted, and outline specific changes to strengthen the paper without overstating current results.
- Referee: [§4 (Experiments), Table 2] All reported recall figures (e.g., 55% at m=0 for TED+SatMAE and 92.5% at m=3) are computed directly against the provided sparse event-month labels for the 1,943 sites. No validation of label accuracy, inter-annotator agreement, or sensitivity to timing offsets is presented; systematic reporting lags or attribution errors would invalidate the unsupervised-vs-WS comparison and the headline performance numbers.
  Authors: We agree that the lack of explicit label validation is a limitation for interpreting absolute performance and the unsupervised-vs-WS comparison. The sparse event-month labels are used as provided for all 1,943 sites; no inter-annotator agreement or independent accuracy check was performed in the original work. Because TED is training-free and SSCD is self-supervised, neither optimizes against these labels, so their reported alignment with event timings still provides a meaningful unsupervised signal. The weakly supervised model is more directly affected by label noise. To address sensitivity to timing offsets, we will add a sensitivity analysis in the revised §4 that re-computes recall after shifting all labels by ±1 and ±2 months. We will also add a short discussion of potential reporting lags as a source of uncertainty. These additions will qualify the headline numbers while preserving the relative method ordering.
  Revision: partial
- Referee: [§3.2 (Data and Preprocessing)] The assumption that visual changes at 4.7 m/px resolution reliably correspond to the recorded disturbance events is load-bearing for interpreting all metrics, yet the manuscript provides no quantitative examples of confirmed visual correspondences, analysis of sub-pixel disturbances, or discussion of spectral ambiguity at this resolution.
  Authors: We concur that explicit evidence of visual correspondence at 4.7 m/px is needed to support metric interpretation. The current manuscript does not include before/after image pairs or quantitative analysis of sub-pixel or spectrally ambiguous cases. In the revision we will add a new subsection (or expanded §3.2) containing representative PlanetScope image pairs for several sites, illustrating visible spatial changes (e.g., new looting pits or structures) that align with the recorded event months. We will also discuss the resolution limits, noting that while many heritage disturbances produce detectable patterns at 4.7 m/px, sub-pixel or purely spectral changes may go undetected and that this constitutes a boundary condition for the approach. These additions will better contextualize both the strengths and the applicability constraints of the results.
  Revision: yes
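The label-shift sensitivity analysis proposed in the first response (re-computing recall after shifting all labels by ±1 and ±2 months) could look roughly like the sketch below. It is an illustration under assumptions, not the authors' planned code: months are integer indices, recall counts a site as a hit when prediction and shifted label differ by at most m months, and the function name and toy data are hypothetical.

```python
import numpy as np

def shifted_label_recall(predicted, labeled, shifts=(-2, -1, 0, 1, 2), m: int = 0) -> dict:
    """Recall at tolerance m after adding each shift (in months) to every label."""
    predicted = np.asarray(predicted)
    labeled = np.asarray(labeled)
    return {
        s: float(np.mean(np.abs(predicted - (labeled + s)) <= m))
        for s in shifts
    }

# Toy data (not from the paper): month indices for four sites.
result = shifted_label_recall([10, 13, 34, 50], [10, 12, 33, 52], m=0)
for shift, recall in result.items():
    print(f"shift {shift:+d} months: recall {recall:.2f}")
```

If recall peaks at a nonzero shift, as it does in this toy data at +1 month, that would hint at a systematic lag between the recorded event months and the visible change.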
Circularity Check
No circularity: purely empirical benchmarking of pre-trained embeddings and change-detection heuristics on held-out sites
The paper introduces three scoring approaches (TED, SSCD, WS) and reports recall figures obtained by applying them to PlanetScope imagery of 1,943 labeled Afghan sites plus cross-regional test sets. No equations, fitted parameters, or self-citation chains are used to derive the headline performance numbers; the results are direct empirical measurements against the provided event-month labels. The methods themselves are either training-free (TED), self-supervised (SSCD), or explicitly trained on the sparse labels (WS), with no step that reduces a claimed prediction back to its own inputs by construction. Self-citations, if present, are not load-bearing for the central claims.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Foundation model embeddings from CLIP, SatMAE, etc., preserve temporal change signals relevant to archaeological disturbances.
- domain assumption: Sparse event-month labels accurately reflect the timing of actual site disturbances.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · echoes
  echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: TED scores the deviation of month t from a robust reference computed over recent history... $s^{\mathrm{temp}}_{i,t} = d(z'_{i,t}, B_{i,t})$
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.