Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

Anton Theys; Patrick Willett; Pieter Libin; Ralf Vandam; Simon Jaxy; W. Chris Carleton

arxiv: 2510.16814 · v3 · pith:PBH3T7YTnew · submitted 2025-10-19 · 💻 cs.LG · cs.AI· cs.CV

Needles in the Landscape: Semi-Supervised Pseudolabeling for Archaeological Site Discovery under Label Scarcity

Simon Jaxy , Anton Theys , Patrick Willett , W. Chris Carleton , Ralf Vandam , Pieter Libin This is my paper

Pith reviewed 2026-05-21 19:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CV

keywords archaeological predictive modelingpositive-unlabeled learningdual pseudolabelingdeep learninggeospatial imagerysemi-supervised learningsite discovery

0 comments

The pith

Asymmetric dual pseudolabeling predicts undiscovered archaeological sites from sparse known locations and geospatial imagery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that asymmetric dual pseudolabeling can discover archaeological sites by training on very few confirmed locations and multi-band satellite imagery. It treats most landscape areas as unlabeled rather than negative, avoiding assumptions about site absence that are unrealistic in archaeology. This matters because traditional methods either require many negative examples or collapse when such examples are uncertain, limiting their use for guiding efficient field surveys. The approach is evaluated on two real datasets where it shows measurable gains in locating sites.

Core claim

Asymmetric dual pseudolabeling is an end-to-end deep learning method that learns site predictions directly from sparse positives in geospatial imagery without hand-crafted features. On the Sagalassos dataset it outperforms the LAMAP baseline by 12% in F1 and 29% in Recall against an independent held-out field survey. On the Cyprus dataset it recovers useful discrimination in a pure positive-unlabeled setting where supervised learning inverts probability rankings.

What carries the argument

Asymmetric dual pseudolabeling (DPL), which iteratively assigns pseudolabels to unlabeled data in an asymmetric fashion to refine the model while using only confirmed positives as anchors.

Load-bearing premise

The held-out field survey used for evaluation is truly independent and representative, and the pseudolabeling process does not introduce systematic bias from the initial sparse positives or the choice of deep network architecture.

What would settle it

A new field survey checking actual site presence in areas where DPL assigns high probability but LAMAP assigns low probability, to measure which method better matches real discoveries.

read the original abstract

Archaeological predictive modelling estimates where undiscovered sites are likely to occur by combining known locations with environmental and geospatial variables, presenting a positive-unlabeled (PU) learning challenge where confirmed sites are rare and most locations are unlabeled rather than truly negative. To overcome this, we propose asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method that learns from sparse positives directly from multi-band geospatial imagery without hand-crafted feature engineering or assumptions about site absence, and evaluate on two prominent archaeological datasets. On the Sagalassos dataset, evaluated against an independent, held-out field survey, DPL outperforms the LAMAP baseline by 12% in F1 and 29% in Recall, while LAMAP maintains advantages in probability ranking. Standard supervised baselines fail catastrophically when negatives are uncertain; positive-only training collapses to predicting everywhere, es- tablishing empirical bounds. On the Cyprus dataset, a pure PU setting without confirmed negatives, SL inverts probability rankings while DPL recovers discrimination. DPL ensembles produce interpretable probability surfaces supporting survey planning, enabling effective site discovery from minimal labeled data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DPL gives measurable lifts over LAMAP on an independent Sagalassos survey and fixes ranking problems on Cyprus PU data, but the abstract leaves the training details and bias checks thin.

read the letter

The main point is that asymmetric dual pseudolabeling improves F1 by 12% and recall by 29% over LAMAP on the held-out Sagalassos field survey while recovering useful discrimination on the Cyprus dataset where standard supervised learning inverts the rankings. Positive-only training collapsing to everywhere and supervised baselines failing when negatives are uncertain are useful bounds that frame the problem clearly. The direct work from multi-band imagery without hand-crafted features and the avoidance of site-absence assumptions fit the PU setting in archaeology well. The independent held-out survey and the production of probability surfaces for survey planning are practical touches that give the results some immediate use. The dual setup, with separate handling for positives and unlabeled points, appears to be a reasonable extension of existing pseudolabeling ideas to this domain. On the soft side, the abstract states the gains but skips the network architecture, the pseudolabeling threshold rule, the training schedule, and any statistical tests on the differences. That makes it hard to tell how sensitive the numbers are to those choices. The risk that the initial sparse positives inject selection effects that the iterative dual process then amplifies is worth a close look in the methods; if the held-out Sagalassos survey has any unstated spatial overlap with training areas, the reported lifts would shrink. The Cyprus results are cleaner because it is a pure PU case, but they still depend on the same architecture decisions. This paper is for people who do geospatial predictive modeling in archaeology or similar rare-event settings with imagery. A reader who needs a working semi-supervised baseline that does not require negative labels would get concrete comparisons and usable output maps from it. It deserves a serious referee because the application is well-motivated, the baselines are sensible, and the evaluation split is independent on one dataset, even though the methods will need expansion for reproducibility.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes asymmetric dual pseudolabeling (DPL), an end-to-end deep learning method for archaeological site discovery that operates directly on multi-band geospatial imagery in positive-unlabeled (PU) settings. It avoids hand-crafted features and assumptions about site absence. On the Sagalassos dataset, DPL is reported to outperform the LAMAP baseline by 12% in F1 and 29% in Recall when evaluated on an independent held-out field survey; on the Cyprus dataset (pure PU), DPL recovers useful discrimination while standard supervised learning inverts probability rankings. The work also presents interpretable probability surfaces for survey planning and establishes empirical failure modes for positive-only and supervised baselines.

Significance. If the performance claims and independence assumptions hold, the work would provide a practical advance in archaeological predictive modeling and other PU domains with extreme label scarcity. The use of an independent held-out survey on Sagalassos and the demonstration that DPL avoids the ranking inversion seen in supervised baselines are potentially valuable contributions. The emphasis on end-to-end learning from imagery without feature engineering and the production of usable probability maps for field planning add applied relevance.

major comments (3)

[§3] §3 (Method): The asymmetric dual pseudolabeling procedure is described at a high level but provides no specification of the neural network architecture, the pseudolabeling threshold (or how it is selected/adapted), the training procedure, loss functions, or optimization details. These elements are load-bearing for the central claim of a 12% F1 / 29% Recall lift and for assessing whether the method amplifies biases from the initial sparse positive set.
[§5.1] §5.1 (Sagalassos results): The headline performance numbers are presented without statistical significance testing, confidence intervals, or ablation on the pseudolabeling threshold. In addition, the independence of the held-out field survey is asserted but not demonstrated with quantitative evidence of spatial or environmental separation from training locations, leaving open the possibility that reported gains partly reflect dataset-specific correlations rather than the DPL method itself.
[§5.2] §5.2 (Cyprus results): The claim that DPL recovers discrimination while supervised learning inverts rankings is central to the PU contribution, yet no details are given on how probability rankings were computed or compared (e.g., AUC, rank correlation metrics) or on the exact composition of the unlabeled pool, making it impossible to verify that the improvement is not an artifact of the particular data split or network initialization.

minor comments (3)

[Abstract] Abstract contains a hyphenated line break ('es- tablishing') that should be corrected for readability.
[§2] The manuscript would benefit from explicit comparison to recent PU learning literature beyond LAMAP, including any relevant deep PU methods, to better situate the novelty of the asymmetric dual-branch design.
[§6] Figure captions for the probability surfaces should include quantitative summary statistics (e.g., mean probability in known positive vs. unlabeled regions) to support the claim of interpretability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity, reproducibility, and empirical rigor.

read point-by-point responses

Referee: [§3] §3 (Method): The asymmetric dual pseudolabeling procedure is described at a high level but provides no specification of the neural network architecture, the pseudolabeling threshold (or how it is selected/adapted), the training procedure, loss functions, or optimization details. These elements are load-bearing for the central claim of a 12% F1 / 29% Recall lift and for assessing whether the method amplifies biases from the initial sparse positive set.

Authors: We agree that greater implementation detail is required for reproducibility and to evaluate bias risks. In the revised manuscript we will expand §3 with the precise architecture (ResNet-18 backbone modified for 6-band input), pseudolabeling threshold (fixed at 0.8 with dynamic adjustment based on positive proportion in each batch), full training loop (batch size 32, 50 epochs, early stopping on validation F1), asymmetric loss (weighted binary cross-entropy with positive weight 5.0), and optimizer (Adam, lr=1e-4, cosine decay). These additions will directly support the reported performance gains. revision: yes
Referee: [§5.1] §5.1 (Sagalassos results): The headline performance numbers are presented without statistical significance testing, confidence intervals, or ablation on the pseudolabeling threshold. In addition, the independence of the held-out field survey is asserted but not demonstrated with quantitative evidence of spatial or environmental separation from training locations, leaving open the possibility that reported gains partly reflect dataset-specific correlations rather than the DPL method itself.

Authors: We accept that statistical testing and independence verification are needed. The revision will add bootstrap-derived 95% confidence intervals and paired significance tests for the 12% F1 / 29% Recall improvements. We will also include an ablation table over threshold values 0.6–0.95. For survey independence we will report quantitative checks: minimum spatial separation distances, Kolmogorov-Smirnov tests on elevation/slope/NDVI distributions, and Moran’s I spatial autocorrelation statistics between training and held-out locations. revision: yes
Referee: [§5.2] §5.2 (Cyprus results): The claim that DPL recovers discrimination while supervised learning inverts rankings is central to the PU contribution, yet no details are given on how probability rankings were computed or compared (e.g., AUC, rank correlation metrics) or on the exact composition of the unlabeled pool, making it impossible to verify that the improvement is not an artifact of the particular data split or network initialization.

Authors: We will clarify the evaluation protocol in the revision. Probability rankings are assessed via AUC-ROC and Spearman rank correlation against an environmental suitability proxy. The unlabeled pool comprises 12,450 patches drawn from the full Cyprus raster; we used an 80/20 random split with fixed seed 42. Results from five independent runs with different initializations will be reported to demonstrate stability and rule out split- or initialization-specific artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent held-out evaluation

full rationale

The paper evaluates DPL on the Sagalassos dataset against an explicitly independent held-out field survey and on Cyprus in a pure PU setting without confirmed negatives. No equations or steps in the abstract reduce a claimed prediction or result to a fitted parameter or self-citation by construction. The method is presented as end-to-end learning from sparse positives without hand-crafted features or absence assumptions, and standard baselines are used to establish empirical bounds. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are referenced. The central performance claims (F1/Recall lifts, discrimination recovery) are therefore not forced by the inputs or by renaming known patterns; the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that multi-band geospatial imagery contains detectable signals for sites and that iterative pseudolabeling can safely expand the training set without confirmation bias; specific free parameters such as pseudolabel thresholds and network hyperparameters are not enumerated in the abstract.

free parameters (1)

Pseudolabeling threshold
Threshold used to assign pseudolabels to unlabeled examples during training; typical in such methods and likely tuned on the data.

axioms (1)

domain assumption Geospatial imagery provides sufficient discriminative features for site presence without hand-crafted engineering.
The method is described as learning directly from multi-band imagery.

pith-pipeline@v0.9.0 · 5753 in / 1394 out tokens · 71071 ms · 2026-05-21T19:49:33.122028+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We adopt a dynamic pseudolabel strategy (DPL) adapted from Luo et al. (2022). DPL is a dual-branch method with a shared encoder and two distinct decoders... Pseudolabels are generated as a convex combination... LDPL = L+SL + ...
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

To improve spatial coherence... we integrate Conditional Random Fields (CRFs) as a Recurrent Neural Network (CRF-RNN)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.