Habitat Classification from Ground-Level Imagery Using Deep Neural Networks
Pith reviewed 2026-05-19 06:26 UTC · model grok-4.3
The pith
Vision transformers classify 18 UK habitats from ground photos as accurately as experienced experts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Vision transformers consistently outperform convolutional neural network baselines on classification of 18 habitat types from UK Countryside Survey ground-level imagery, reaching 91% top-3 accuracy and 0.66 Matthews correlation coefficient. Supervised contrastive learning further improves discrimination among similar habitats by producing a more separable embedding space. The strongest model achieves accuracy on par with experienced ecological experts when classifying the same images.
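The two headline metrics can be made concrete with a short sketch. The code below is illustrative only and is not taken from the paper: the function names `top_k_accuracy` and `multiclass_mcc` are hypothetical, class labels are assumed to be integer indices, and the MCC uses the standard multiclass generalisation computed from label counts.

```python
import math

def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, true in zip(scores, labels):
        # Rank class indices by descending score and keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if true in ranked[:k]:
            hits += 1
    return hits / len(labels)

def multiclass_mcc(y_true, y_pred, n_classes):
    """Multiclass Matthews correlation coefficient from integer label lists."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = [0] * n_classes
    pred_counts = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        true_counts[t] += 1
        pred_counts[p] += 1
    # Covariance-style terms of the multiclass MCC formula.
    cov_tp = correct * n - sum(t * p for t, p in zip(true_counts, pred_counts))
    cov_tt = n * n - sum(t * t for t in true_counts)
    cov_pp = n * n - sum(p * p for p in pred_counts)
    denom = math.sqrt(cov_tt * cov_pp)
    # Conventionally 0 when either marginal has no variation.
    return cov_tp / denom if denom else 0.0

# Toy data (invented for illustration): 3 samples, 4 classes.
toy_scores = [[0.1, 0.2, 0.3, 0.4], [0.7, 0.1, 0.15, 0.05], [0.3, 0.2, 0.4, 0.1]]
toy_labels = [0, 0, 3]
toy_top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

Top-3 accuracy rewards a model for ranking the true habitat among its three best guesses, whereas MCC is computed from the single top prediction and accounts for class imbalance. That difference is one reason the two reported figures (91% and 0.66) are not on the same scale.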
What carries the argument
Vision transformers combined with supervised contrastive learning, which builds a discriminative embedding space to separate visually similar habitats such as Improved Grassland and Neutral Grassland.
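As a rough illustration of how supervised contrastive learning shapes an embedding space, the sketch below implements the commonly used supervised contrastive loss on a small batch. It is a minimal sketch under stated assumptions, not the paper's training code: the function names, the temperature of 0.1, and the toy embeddings are all hypothetical, and embeddings are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor i, samples sharing its label are positives; every other
    sample in the batch (positive or negative) appears in the denominator.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0

# Toy embeddings (invented): two tight clusters in 2-D.
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
loss_clustered = supcon_loss(toy_embeddings, [0, 0, 1, 1])  # labels match clusters
loss_mixed = supcon_loss(toy_embeddings, [0, 1, 0, 1])      # labels cut across clusters
```

The loss is small when samples of the same class sit close together and far from other classes, and large when same-class samples are scattered. Minimising it pulls visually similar classes apart in the embedding.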
If this is right
- Expert field surveys could be supplemented or partially replaced by automated image analysis for routine habitat monitoring.
- National-scale biodiversity assessments could become more frequent and less expensive by processing large numbers of ground photos.
- Misclassification rates between similar habitats drop when contrastive learning is used instead of standard supervised training.
- Models that reach expert-level performance on ground imagery enable integration of AI outputs directly into conservation planning workflows.
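The claim about misclassification between similar habitats can be checked with a simple pairwise confusion rate. The following is a hedged sketch: the function name `pairwise_confusion_rate` and the toy labels are hypothetical, and the paper may quantify confusion differently (for example from a full confusion matrix).

```python
def pairwise_confusion_rate(y_true, y_pred, class_a, class_b):
    """Share of samples truly in class_a or class_b that were predicted as the other one."""
    relevant = 0
    swapped = 0
    for t, p in zip(y_true, y_pred):
        if t in (class_a, class_b):
            relevant += 1
            # Count only swaps within the pair, not errors into other classes.
            if p in (class_a, class_b) and p != t:
                swapped += 1
    return swapped / relevant if relevant else 0.0

# Toy labels (invented for illustration).
toy_true = ["Improved Grassland", "Improved Grassland",
            "Neutral Grassland", "Neutral Grassland", "Woodland"]
toy_pred = ["Neutral Grassland", "Improved Grassland",
            "Neutral Grassland", "Improved Grassland", "Woodland"]
toy_rate = pairwise_confusion_rate(
    toy_true, toy_pred, "Improved Grassland", "Neutral Grassland"
)
```

Computing this rate once for the standard supervised model and once for the contrastive model, on the same test images, would show directly whether the contrastive variant reduces swaps within a given pair.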
Where Pith is reading between the lines
- The same pipeline could be tested on ground-level images from other countries to check whether the learned distinctions transfer across different ecosystems.
- Pairing the model's interpretable attention maps with expert review might reveal which visual cues humans and networks both rely on for habitat decisions.
- Mobile apps that run the model locally could let land managers or volunteers collect and classify habitat data in the field without waiting for specialist input.
Load-bearing premise
The labeled UK Countryside Survey ground-level images represent the full range of real-world variation in the 18 habitat classes so that patterns learned by the model apply to new photos.
What would settle it
Applying the best trained model to a fresh collection of ground-level habitat photographs labeled independently by multiple experts and measuring whether its accuracy stays within the range of expert-to-expert agreement.
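One possible way to operationalise this test is sketched below. It is an assumption-laden illustration rather than a protocol from the paper: the function names are hypothetical, agreement is measured as simple percent agreement, and "within the expert range" is interpreted as the model's mean agreement with the experts being at least the lowest expert-to-expert agreement.

```python
from itertools import combinations

def agreement(a, b):
    """Fraction of images on which two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def expert_agreement_range(expert_labels):
    """Lowest and highest pairwise agreement among the experts."""
    rates = [agreement(a, b) for a, b in combinations(expert_labels, 2)]
    return min(rates), max(rates)

def model_vs_experts(model_labels, expert_labels):
    """Compare the model's mean agreement with experts to the expert-to-expert floor."""
    low, high = expert_agreement_range(expert_labels)
    model_rates = [agreement(model_labels, e) for e in expert_labels]
    mean_model = sum(model_rates) / len(model_rates)
    return mean_model >= low, mean_model, (low, high)

# Toy labels (invented): 3 experts, 4 images, integer habitat codes.
toy_experts = [[0, 1, 2, 3], [0, 1, 2, 2], [0, 1, 1, 3]]
ok, mean_rate, expert_range = model_vs_experts([0, 1, 2, 3], toy_experts)
```

Percent agreement does not correct for chance, so a chance-corrected statistic such as Cohen's or Fleiss' kappa would be a natural alternative under the same design.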
Original abstract
Habitat assessment at local scales -- critical for enhancing biodiversity and guiding conservation priorities -- often relies on expert field surveys that can be costly, motivating the exploration of AI-driven tools to automate and refine this process. While most AI-driven habitat mapping depends on remote sensing, it is often constrained by sensor availability, weather, and coarse resolution. In contrast, ground-level imagery captures essential structural and compositional cues invisible from above and remains underexplored for robust, fine-grained habitat classification. This study addresses this gap by applying state-of-the-art deep neural network architectures to ground-level habitat imagery. Leveraging data from the UK Countryside Survey covering 18 broad habitat types, we evaluate two families of models - convolutional neural networks (CNNs) and vision transformers (ViTs) - under both supervised and supervised contrastive learning paradigms. Our results demonstrate that ViTs consistently outperform state-of-the-art CNN baselines on key classification metrics (Top-3 accuracy = 91%, MCC = 0.66) and offer more interpretable scene understanding tailored to ground-level images. Moreover, supervised contrastive learning significantly reduces misclassification rates among visually similar habitats (e.g., Improved vs. Neutral Grassland), driven by a more discriminative embedding space. Finally, our best model performs on par with experienced ecological experts in habitat classification from images, underscoring the promise of expert-level automated assessment. By integrating advanced AI with ecological expertise, this research establishes a scalable, cost-effective framework for ground-level habitat monitoring to accelerate biodiversity conservation and inform land-use decisions at a national scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates CNN and Vision Transformer architectures, including supervised contrastive learning, for classifying 18 broad habitat types from ground-level images in the UK Countryside Survey dataset. It reports concrete metrics for the best ViT+contrastive model (Top-3 accuracy 91%, MCC 0.66), notes improved discrimination among visually similar classes, and claims that this performance is on par with experienced ecological experts, positioning the approach as a scalable tool for automated habitat monitoring.
Significance. If the expert-parity claim is substantiated, the work would demonstrate a practical advance in applying modern vision models to fine-grained ecological classification tasks that depend on cues invisible to remote sensing. The explicit demonstration that contrastive learning reduces confusion between similar habitats (e.g., Improved vs. Neutral Grassland) is a concrete strength that could be leveraged in other fine-grained ecological datasets.
major comments (2)
- [Abstract and Results] The central claim that the best model 'performs on par with experienced ecological experts' (Abstract and Results) is not supported by any quantitative expert baseline. No per-expert accuracy, MCC, inter-rater agreement (Fleiss' kappa or equivalent), number of experts, or evaluation protocol (isolated images vs. additional context) is reported. Because the model is trained to reproduce the same expert labels, this omission directly limits interpretation of the headline metrics as 'expert-level'.
- [Methods] The Methods section provides no description of train/validation/test splits, cross-validation procedure, statistical significance testing of performance differences across models, or any analysis of label noise or inter-expert disagreement. These omissions make it impossible to assess whether the reported Top-3 accuracy and MCC are robust or generalizable under real-world variation.
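The inter-rater statistic named in the first major comment can be sketched as follows. This is an illustrative implementation of Fleiss' kappa under stated assumptions: the function name and input layout are hypothetical, every image is assumed to be labelled by the same number of experts, and categories are integer codes.

```python
def fleiss_kappa(ratings, n_categories):
    """Fleiss' kappa for a list of per-image label lists (equal raters per image)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c] = number of raters who assigned category c to image i.
    counts = [[row.count(c) for c in range(n_categories)] for row in ratings]
    # Observed agreement per image, then averaged.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[c] for row in counts) / (n_items * n_raters)
        for c in range(n_categories)
    ]
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        return float("nan")  # undefined when only one category is ever used
    return (p_bar - p_exp) / (1.0 - p_exp)

# Toy ratings (invented): 3 images, 3 experts each, perfect agreement.
kappa_perfect = fleiss_kappa([[0, 0, 0], [1, 1, 1], [2, 2, 2]], 3)
```

Values near 1 indicate agreement well above chance, values near 0 indicate chance-level agreement, and negative values indicate systematic disagreement. Reporting this figure for the experts would give the model's scores a reference ceiling.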
minor comments (2)
- [Abstract] The abstract states that ViTs 'offer more interpretable scene understanding' but does not specify the interpretability method (e.g., attention maps, Grad-CAM) or show supporting figures; a brief clarification or reference to a figure would help.
- A table or figure comparing all models on the full set of metrics (accuracy, Top-3, MCC, per-class F1) would improve readability and allow direct assessment of the contrastive-learning gains.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important areas for improving clarity and rigor, particularly around the expert comparison claim and methodological transparency. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
- Referee: [Abstract and Results] The central claim that the best model 'performs on par with experienced ecological experts' (Abstract and Results) is not supported by any quantitative expert baseline. No per-expert accuracy, MCC, inter-rater agreement (Fleiss' kappa or equivalent), number of experts, or evaluation protocol (isolated images vs. additional context) is reported. Because the model is trained to reproduce the same expert labels, this omission directly limits interpretation of the headline metrics as 'expert-level'.
  Authors: We agree that the manuscript does not provide a quantitative expert baseline or inter-rater metrics to support the 'on par with experienced ecological experts' claim. The model was trained and evaluated using the same expert-provided labels from the UK Countryside Survey, but no separate blinded comparison against multiple experts (with reported agreement statistics or protocol details) was performed. To correct this, we will revise the abstract and results sections to remove the expert-parity phrasing. The updated text will instead highlight the achieved metrics (Top-3 accuracy 91%, MCC 0.66) on the expert-annotated dataset and position the work as a promising scalable complement to expert surveys without claiming direct equivalence.
  revision: yes
- Referee: [Methods] Methods section provides no description of train/validation/test splits, cross-validation procedure, statistical significance testing of performance differences across models, or any analysis of label noise or inter-expert disagreement. These omissions make it impossible to assess whether the reported Top-3 accuracy and MCC are robust or generalizable under real-world variation.
  Authors: We acknowledge these omissions limit the ability to evaluate robustness. We will expand the Methods section to explicitly describe the train/validation/test split ratios and any stratification by habitat class or survey year. We will clarify whether a single split or cross-validation was used and, if the latter, detail the number of folds and aggregation method. Statistical significance testing (e.g., paired bootstrap or McNemar tests with p-values) for differences between CNN and ViT models will be added. For label noise and inter-expert disagreement, we will report any available dataset metadata on label provenance and include a brief discussion of potential label variability as a limitation, along with any post-hoc analysis of confusion patterns that may reflect such noise.
  revision: yes
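The McNemar test mentioned in the response can be sketched with the exact binomial form, which needs only the standard library. This is a hypothetical illustration, not the authors' planned analysis: the function name and toy predictions are invented, and both models are assumed to be scored on the same test images.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the images where the two models disagree in correctness."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy predictions (invented): model A is right on 8 images that model B gets wrong.
toy_truth = [0] * 10
toy_a = [0] * 10
toy_b = [1] * 8 + [0] * 2
b_count, c_count, p_value = mcnemar_exact(toy_truth, toy_a, toy_b)
```

Because the test uses only discordant images, it suits comparing a CNN and a ViT evaluated on one shared test split. A paired bootstrap over test images would be the natural alternative for metrics such as MCC that are not per-image.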
Circularity Check
No circularity: standard empirical evaluation on external dataset
The paper applies off-the-shelf CNN and ViT architectures (with optional contrastive learning) to a pre-existing UK Countryside Survey dataset of ground-level images labeled by ecological experts. Reported results consist of standard test-set metrics (Top-3 accuracy, MCC) and a direct comparison against expert performance on the same images. No equations, fitted parameters, self-referential predictions, or uniqueness theorems appear; the central claims are data-driven empirical outcomes rather than derivations that collapse to their own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model architecture and training hyperparameters
axioms (1)
- domain assumption: Ground-level imagery contains essential structural and compositional cues invisible from above that enable robust classification of the 18 habitat types.
Table 5: Comparison of Calinski–Harabasz (CH) Index and Davies–Bouldin (DB) Index for supervised learning (...
- Woodland and cropland receive consistently strong predictions from both humans and the model. This likely reflects their limited within-class variability (two woodland sub-classes, one cropland class) and distinctive visual features, in contrast to the five visually similar grassland sub-classes.
- For rarer habitats such as Littoral Sediment and Inland Rock, the ex...