CLIP the Landscape: Automated Tagging of Crowdsourced Landscape Images

Ilya Ilyankou; James Haworth; Natchapon Jongwiriyanurak; Tao Cheng

arxiv: 2506.12214 · v1 · submitted 2025-06-13 · 💻 cs.CV

Pith reviewed 2026-05-19 08:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords: CLIP, landscape images, geographical tagging, multi-modal classification, crowdsourced data, Geograph, image tagging, GeoAI

The pith

Combining location and title embeddings with image features improves exact-match accuracy on tagging landscape photos with 49 geographical context labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a CLIP-based classifier that assigns geographical tags to crowdsourced landscape images from the Geograph archive covering the British Isles. It shows that feeding location and title embeddings alongside image features into a simple classification head raises performance over image-only baselines on a Kaggle task that demands exact matches across all relevant tags from a set of 49. A reader might care because such tags could help fill gaps in spatial data where traditional mapping is absent, aiding downstream GeoAI work in remote areas. The method keeps training lightweight by relying on frozen pre-trained CLIP embeddings and a modest head, runnable on a laptop. The authors release the pipeline and test it on the competition subset of the larger eight-million-image collection.

Core claim

The authors establish that a multi-modal setup fusing CLIP-derived image embeddings with separate embeddings for photo location and title produces higher exact-match accuracy on the 49-tag multi-label task than image embeddings alone, while remaining computationally light enough to train on ordinary hardware.

What carries the argument

Multi-modal fusion of CLIP image embeddings with location and title text embeddings passed through a classification head for multi-label prediction of geographical context tags.
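A minimal sketch of what such a fused forward pass could look like, assuming concatenation as the fusion step and a single linear layer with a per-tag sigmoid as the head. The embedding widths (512, 512, 64), the 0.5 threshold, and the function names are illustrative assumptions, not details taken from the paper; random arrays stand in for the frozen CLIP outputs.

```python
import numpy as np

NUM_TAGS = 49  # size of the tag vocabulary stated in the paper


def fuse(image_emb, title_emb, location_emb):
    """Concatenate the per-photo embeddings along the feature axis."""
    return np.concatenate([image_emb, title_emb, location_emb], axis=1)


def predict_tags(fused, weights, bias, threshold=0.5):
    """Linear head followed by an independent sigmoid per tag (multi-label)."""
    logits = fused @ weights + bias
    probs = 1.0 / (1.0 + np.exp(-logits))
    return (probs >= threshold).astype(int)


rng = np.random.default_rng(0)
batch = 4
# Hypothetical embedding widths; real ones depend on the CLIP variant used.
image_emb = rng.normal(size=(batch, 512))
title_emb = rng.normal(size=(batch, 512))
location_emb = rng.normal(size=(batch, 64))

fused = fuse(image_emb, title_emb, location_emb)
weights = rng.normal(scale=0.01, size=(fused.shape[1], NUM_TAGS))  # untrained
bias = np.zeros(NUM_TAGS)
tags = predict_tags(fused, weights, bias)
print(fused.shape, tags.shape)
```

Each row of `tags` is a 49-element 0/1 vector, so one photo can carry several tags at once.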

If this is right

Tags generated this way can serve as input to build location embedders for GeoAI applications.
The approach can enrich spatial understanding in regions that lack points of interest or street-level imagery.
A lightweight training pipeline using frozen CLIP embeddings makes the method accessible without heavy compute.
The released code allows direct application to the broader Geograph archive beyond the Kaggle subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar multi-modal tagging could extend to other large crowdsourced photo collections that pair images with location metadata.
Adding time-of-day or weather embeddings might further refine tag predictions in variable landscapes.
The tags could help bootstrap training data for models that infer location purely from visual cues in data-sparse zones.

Load-bearing premise

The Kaggle competition subset of Geograph images is representative enough that accuracy gains will hold for the full eight million images or for landscape photos outside the British Isles.

What would settle it

Running the same multi-modal versus image-only comparison on the full remaining Geograph collection or on an equivalent set of landscape photos from a non-British region and finding that the accuracy lift disappears.

Figures

Figures reproduced from arXiv: 2506.12214 by Ilya Ilyankou, James Haworth, Natchapon Jongwiriyanurak, Tao Cheng.

**Figure 3.** Tag distribution across the training set images.

**Figure 4.** Distribution of title lengths

**Figure 2.** National grid cells in Britain containing at least one

Original abstract

We present a CLIP-based, multi-modal, multi-label classifier for predicting geographical context tags from landscape photos in the Geograph dataset, a crowdsourced image archive spanning the British Isles, including remote regions lacking POIs and street-level imagery. Our approach addresses a Kaggle competition (https://www.kaggle.com/competitions/predict-geographic-context-from-landscape-photos) task based on a subset of Geograph's 8M images, with strict evaluation: exact match accuracy is required across 49 possible tags. We show that combining location and title embeddings with image features improves accuracy over using image embeddings alone. We release a lightweight pipeline (https://github.com/SpaceTimeLab/ClipTheLandscape) that trains on a modest laptop, using pre-trained CLIP image and text embeddings and a simple classification head. Predicted tags can support downstream tasks such as building location embedders for GeoAI applications, enriching spatial understanding in data-sparse regions.
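To make the strictness of the metric concrete, here is a small sketch of exact-match accuracy as it is usually defined for multi-label tasks: a photo counts as correct only when its predicted tag set equals the true tag set. The function name and the toy data are illustrative; the competition's own scoring code is not reproduced here.

```python
def exact_match_accuracy(true_tags, predicted_tags):
    """Fraction of images whose predicted tag set equals the true tag set."""
    if len(true_tags) != len(predicted_tags):
        raise ValueError("inputs must have the same length")
    if not true_tags:
        return 0.0
    hits = sum(set(t) == set(p) for t, p in zip(true_tags, predicted_tags))
    return hits / len(true_tags)


# Toy example: the second photo gets one extra tag, so it scores zero
# even though one of its two predicted tags is right.
truth = [{"Coastal", "Islands"}, {"Coastal"}, {"Islands"}]
preds = [{"Coastal", "Islands"}, {"Coastal", "Islands"}, {"Islands"}]
score = exact_match_accuracy(truth, preds)
print(score)
```

Under this metric a single spurious or missing tag costs the whole image, which is why gains are harder to earn than under per-tag metrics.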

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a clear accuracy lift from fusing location and title embeddings with CLIP image features on the Geograph Kaggle subset and ships a lightweight reproducible pipeline.

The main point is that adding location and title embeddings to CLIP image features raises exact-match accuracy on the 49-tag Kaggle Geograph task, and the authors release a simple pipeline that trains on a laptop using pre-trained embeddings plus a basic classification head. This is a direct empirical application rather than a new framework, but it targets a real gap in crowdsourced landscape data for remote British Isles areas that lack POIs or street imagery.

The work does well by keeping the setup accessible and open: the GitHub code lets others run or adapt it quickly, and the focus on downstream GeoAI uses like location embedders is practical. The core claim holds up on the evaluated subset with no load-bearing fitting or circularity issues, and the stress-test confirms the ablation is internally valid.

Soft spots are mostly about scope. The 49 fixed tags and UK-only data mean generalization to the full 8 million images or non-UK regions rests on an untested assumption about representativeness, which affects motivation more than the reported results. The abstract flags the multi-modal gain but leaves numbers and full ablations for the manuscript to detail.

This is useful for applied researchers in GeoAI or computer vision working with crowdsourced geographic imagery who need a quick baseline or tool. It is not a theoretical advance, so it will not change how most labs think about embeddings, but the reproducibility and focused evaluation make it worth referee time. I would send it to peer review so the details on the accuracy numbers and any generalization tests can be checked and strengthened.

Referee Report

1 major / 3 minor

Summary. The manuscript presents a CLIP-based multi-modal multi-label classifier that assigns geographical context tags, drawn from a set of 49, to landscape images from the Geograph crowdsourced dataset. The central empirical claim is that fusing pre-trained location and title embeddings with CLIP image features produces higher exact-match accuracy than image features alone on the Kaggle competition subset; a lightweight training pipeline using a simple classification head on frozen embeddings is released for reproducibility.

Significance. If the reported accuracy gains are robust and clearly quantified, the work offers a practical, low-resource method for enriching metadata in large-scale crowdsourced geo-image archives, especially in remote areas without POI or street-level coverage. The explicit release of the training pipeline is a strength that supports reproducibility and downstream use in GeoAI tasks.

major comments (1)

Results section: The central claim of accuracy improvement from multi-modal fusion requires explicit reporting of the exact-match accuracy values for the image-only baseline versus the combined model, together with any standard deviations, number of runs, or statistical tests; without these numbers the magnitude and reliability of the gain cannot be assessed from the text alone.

minor comments (3)

Abstract: The footnote referencing the Kaggle competition would benefit from a brief parenthetical note on the exact evaluation protocol (exact-match across 49 tags) to make the task constraints immediately clear to readers.
Methods: The description of how the three embedding streams are fused before the classification head (concatenation, weighted sum, etc.) should be stated explicitly, even if the implementation is simple, to aid exact reproduction.
Discussion: The limitation that results are obtained on a British-Isles Kaggle subset and may not generalize to the full 8 M images or non-UK regions is mentioned but could be expanded with a short forward-looking sentence on planned validation.
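The two fusion options named in the Methods comment differ in a way that affects reproduction, as the sketch below illustrates: concatenation accepts streams of different widths and grows the feature dimension, while a weighted sum needs identically shaped streams and keeps the dimension fixed. Both functions and the weights are hypothetical illustrations and do not describe the authors' implementation.

```python
import numpy as np


def fuse_concat(streams):
    """Stack streams side by side; widths may differ."""
    return np.concatenate(streams, axis=1)


def fuse_weighted_sum(streams, weights):
    """Convex combination of streams; all streams must share one shape."""
    if len({s.shape for s in streams}) != 1:
        raise ValueError("weighted sum needs streams of identical shape")
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalise so the weights sum to 1
    return sum(wi * s for wi, s in zip(w, streams))


# Three toy streams of shape (2, 8) filled with 1.0, 2.0 and 3.0.
streams = [np.full((2, 8), value) for value in (1.0, 2.0, 3.0)]
concatenated = fuse_concat(streams)
summed = fuse_weighted_sum(streams, weights=(1, 1, 2))
print(concatenated.shape, summed.shape)
```

The head's input width therefore depends on the fusion choice, which is one reason the operator needs to be stated explicitly.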

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below and will incorporate the requested details into the revised manuscript.

Referee: Results section: The central claim of accuracy improvement from multi-modal fusion requires explicit reporting of the exact-match accuracy values for the image-only baseline versus the combined model, together with any standard deviations, number of runs, or statistical tests; without these numbers the magnitude and reliability of the gain cannot be assessed from the text alone.

Authors: We agree that the specific numerical values should be stated explicitly to allow readers to assess the magnitude of the improvement. The current manuscript states that multi-modal fusion improves accuracy over image embeddings alone but does not report the precise exact-match percentages in the main text. In the revised version we will add these values directly in the Results section (or in a new table) for both the image-only baseline and the location+title+image model. Because the CLIP embeddings are frozen and the classification head is a simple linear layer trained to convergence on a fixed dataset split, each configuration was run once; we will note this explicitly and will not report standard deviations or statistical tests, as no repeated trials were performed. revision: yes
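Even with a single training run per configuration, uncertainty over the test images can still be estimated. One common option, sketched here as an illustration rather than something the authors propose, is a paired bootstrap: resample test images with replacement and recompute the accuracy difference between the two models on the same resample. The per-image correctness flags below are synthetic.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(b) - accuracy(a) on shared resampled images."""
    n = len(correct_a)
    if n == 0 or n != len(correct_b):
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_b[i] - correct_a[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[max(int((1 - alpha / 2) * n_resamples) - 1, 0)]
    return lower, upper


# Synthetic flags: 1 means the image was an exact match for that model.
image_only = [1 if i % 2 == 0 else 0 for i in range(200)]
fused = [1 if (i % 2 == 0 or i % 5 == 0) else 0 for i in range(200)]
low, high = paired_bootstrap_ci(image_only, fused)
print(low, high)
```

An interval that excludes zero would suggest the lift is not an artefact of which images landed in the test split, though it says nothing about variance from training itself.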

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation of multi-modal fusion

The manuscript reports an empirical ML experiment: pre-trained CLIP image and text embeddings are extracted, concatenated with location/title features, and passed through a simple trainable classification head to predict 49 tags on the Kaggle Geograph subset. The claimed accuracy improvement is measured directly on held-out competition data via standard train/validation splits and exact-match metric. No derivation chain, equations, or fitted parameters are presented that reduce the reported result to the inputs by construction; the pipeline is released for external reproduction. The work is therefore self-contained against the external Kaggle benchmark and does not rely on self-citation load-bearing premises or ansatz smuggling.
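The trainable part of such a pipeline can be very small. The sketch below trains a linear multi-label head on fixed feature vectors with binary cross-entropy and plain gradient descent. The loss, optimiser, learning rate, and toy dimensions are assumptions chosen for illustration; random features stand in for the frozen, fused embeddings.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over all (image, tag) pairs."""
    p = np.clip(probs, eps, 1 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


def train_linear_head(features, targets, lr=1.0, epochs=300):
    """Fit weights and bias by full-batch gradient descent; features stay fixed."""
    n, d = features.shape
    k = targets.shape[1]
    w, b = np.zeros((d, k)), np.zeros(k)
    losses = []
    for _ in range(epochs):
        probs = sigmoid(features @ w + b)
        losses.append(bce_loss(probs, targets))
        grad_logits = (probs - targets) / (n * k)  # d(mean BCE)/d(logits)
        w -= lr * features.T @ grad_logits
        b -= lr * grad_logits.sum(axis=0)
    return w, b, losses


rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16))      # stand-in for frozen fused embeddings
true_w = rng.normal(size=(16, 5))         # hidden rule that generates toy labels
targets = (features @ true_w > 0).astype(float)

w, b, losses = train_linear_head(features, targets)
print(losses[0], losses[-1])
```

Because the encoders are never updated, the embeddings can be computed once and cached, which is what keeps this kind of training feasible on a laptop.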

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard assumptions of pre-trained CLIP transfer and the representativeness of the competition subset; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Pre-trained CLIP embeddings transfer effectively to geographical context tagging without domain-specific fine-tuning of the backbone.
Invoked by the decision to use frozen CLIP image and text encoders plus a simple head.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: We employ CLIP [12], a vision-language model... freeze both encoders and use their pre-trained weights... multimodal fusion... linear layer and a multi-layer perceptron (MLP)

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: exact match accuracy is required across 49 possible tags... combining location and title embeddings with image features improves accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

[1] Aitor Àvila Callau, María Yolanda Pérez Albert, Joan Jurado Rota, and David Serrano Giné. 2019. Landscape characterization using photographs from crowdsourced platforms: content analysis of social media photographs. Open Geosciences 11, 1 (Jan. 2019), 558–571. https://doi.org/10.1515/geo-2019-0046

[2] Weiming Huang, Jing Wang, and Gao Cong. 2024. Zero-shot urban function inference with street view images through prompting a pretrained vision-language model. International Journal of Geographical Information Science 38, 7 (July 2024), 1414–1442. https://doi.org/10.1080/13658816.2024.2347322

[3] Barry Hunter. 2024. Predict Geographic Context from Landscape Photos. https://kaggle.com/competitions/predict-geographic-context-from-landscape-photos. Kaggle

[4] Ilya Ilyankou, Aldo Lipani, Stefano Cavazzi, Xiaowei Gao, and James Haworth. 2024. Do Sentence Transformers Learn Quasi-Geospatial Concepts from General Text? GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, 2024, Glasgow, Scotland

[5] Natchapon Jongwiriyanurak, Zichao Zeng, June Moh Goo, James Haworth, Xinglei Wang, Kerkritt Sriroongvikrai, Nicola Christie, Ilya Ilyankou, Meihui Wang, and Huanfa Chen. 2025. V-RoAst: Visual Road Assessment. Can VLM be a Road Safety Assessor Using the iRAP Standard? arXiv:2408.10872 [cs.CV]

[6] Natchapon Jongwiriyanurak, Zichao Zeng, Meihui Wang, James Haworth, Garavig Tanaksaranond, and Jan Boehm. 2023. Framework for Motorcycle Risk Assessment Using Onboard Panoramic Camera. In 12th International Conference on Geographic Information Science (GIScience 2023)

[7] Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. 2024. SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery. http://arxiv.org/abs/2311.17179

[8] Xiang Li, Congcong Wen, Yuan Hu, and Nan Zhou. 2023. RS-CLIP: Zero shot remote sensing scene classification via contrastive vision-language supervision. International Journal of Applied Earth Observation and Geoinformation 124 (Nov. 2023), 103497. https://doi.org/10.1016/j.jag.2023.103497

[9] Jianzhe Lin, Tianze Yu, and Z. Jane Wang. 2022. Rethinking Crowdsourcing Annotation: Partial Annotation With Salient Labels for Multilabel Aerial Image Classification. IEEE Transactions on Geoscience and Remote Sensing 60 (2022), 1–12. https://doi.org/10.1109/TGRS.2022.3191735

[10] Xinyi Liu, James Haworth, and Meihui Wang. 2023. A New Approach to Assessing Perceived Walkability: Combining Street View Imagery with Multimodal Contrastive Learning Model. In Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Spatial Big Data and AI for Industrial Applications. ACM, Hamburg Germany, 16–21. https://doi.org/10.1145/3615888.3627811

[11] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Lab... 2024. doi:10.48550/arxiv.2304.07193

[12] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. https://doi.org/10.48550/arXiv.2103.00020

[13] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. http://arxiv.org/abs/1908.10084

[14] Xinglei Wang, Tao Cheng, Stephen Law, Zichao Zeng, Lu Yin, and Junyuan Liu. 2025. Multi-modal contrastive learning of urban space representations from POI data. Computers, Environment and Urban Systems 120 (Sept. 2025), 102299. https://doi.org/10.1016/j.compenvurbsys.2025.102299

[15] Xinchen Wang, Alesja Gilvear, Yijing Li, and Ilya Ilyankou. 2025. Can CLIP See Safe Streets? Comparing Human and VLM Perceptions of Walkability and Safety. Presented at the Walking the X-min City workshop, AGILE 2025, Dresden

[16] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. 2018. mixup: Beyond Empirical Risk Minimization. https://doi.org/10.48550/arXiv.1710.09412 arXiv:1710.09412 [cs].

Received 29 May 2025

Group, Tag, Description
Topography. Coastal: Landforms that occur where the land meets the sea. Islands: Areas of land completely surround...
