From Articles to Canopies: Knowledge-Driven Pseudo-Labelling for Tree Species Classification using LLM Experts
Pith reviewed 2026-05-10 08:14 UTC · model grok-4.3
The pith
A semi-supervised method uses LLMs to extract ecological cohabitation priors from articles and folds them into canopy-graph pseudo-labelling, raising hyperspectral tree-species classification accuracy by 5.6 percent over the best reference method.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LLM-derived cohabitation likelihoods, encoded as a matrix and inserted into biologically inspired pseudo-labelling over a precomputed canopy graph, allow accurate species classification from multi-sensor forest data at low training cost. The reported evidence is a 5.6 percent accuracy gain over the best reference method and an expert evaluation of the priors that finds differences no larger than 15 percent.
What carries the argument
The LLM-generated cohabitation matrix, which supplies species co-occurrence likelihoods that guide pseudo-labelling across the nodes of a canopy graph constructed from hyperspectral and laser-scanning observations.
If this is right
- The method reduces dependence on large sets of manually labeled pixels while still respecting spectral mixing and class imbalance.
- Structural information from laser scanning and spectral information from hyperspectral imaging are jointly used to define both the graph and the pseudo-labels.
- Domain knowledge enters the model automatically rather than through repeated manual expert annotation for each new area.
- Classification decisions become constrained by documented species co-occurrence patterns, limiting biologically implausible label assignments.
Where Pith is reading between the lines
- The same pipeline could be applied to other remote-sensing tasks where literature contains reliable interaction or co-occurrence rules.
- As language models become more reliable at distilling ecological facts, the quality of the resulting priors and therefore the classification accuracy would be expected to rise.
- Operational forest-inventory systems could adopt the approach to lower the cost of ground-truth collection while maintaining ecological consistency.
Load-bearing premise
The cohabitation likelihoods extracted by the LLMs from articles accurately reflect real ecological interactions and can be integrated into the canopy-graph pseudo-labelling without introducing new systematic errors.
What would settle it
Replace the LLM cohabitation matrix with a uniform or random matrix and rerun the full pipeline on the same forest dataset; the 5.6 percent gain should disappear if the priors are the load-bearing component.
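The control matrices for such a test could be built along the following lines. This is a minimal sketch, not the authors' code: the function names, the toy species count, and the matrix values are all hypothetical, and the shuffled control assumes the cohabitation matrix is symmetric, which the paper's abstract does not state.

```python
import random


def uniform_control(n_species, value=0.5):
    """Give every species pair the same likelihood, so the prior is uninformative."""
    return [[value] * n_species for _ in range(n_species)]


def shuffled_control(matrix, seed=0):
    """Permute the off-diagonal entries of a symmetric matrix.

    The set of values and the symmetry are preserved, but the link between
    a value and its species pair is broken.
    """
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    values = [matrix[i][j] for i, j in pairs]
    random.Random(seed).shuffle(values)
    out = [row[:] for row in matrix]  # copy, so the input stays untouched
    for (i, j), v in zip(pairs, values):
        out[i][j] = v
        out[j][i] = v
    return out


# Toy 3-species matrix with made-up values (illustration only).
llm_matrix = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
]
controls = {
    "uniform": uniform_control(3),
    "shuffled": shuffled_control(llm_matrix, seed=1),
}
```

Each control would then be passed through the unchanged pipeline in place of the LLM matrix. The shuffled variant is the stricter of the two, because it keeps the overall distribution of likelihoods and removes only their assignment to specific species pairs.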
Original abstract
Hyperspectral tree species classification is challenging due to limited and imbalanced class labels, spectral mixing (overlapping light signatures from multiple species), and ecological heterogeneity (variability among ecological systems). Addressing these challenges requires methods that integrate biological and structural characteristics of vegetation, such as canopy architecture and interspecific interactions, rather than relying solely on spectral signatures. This paper presents a biologically informed, semi-supervised deep learning method that integrates multi-sensor Earth observation data, specifically hyperspectral imaging (HSI) and airborne laser scanning (ALS), with expert, ecological knowledge. The approach relies on biologically inspired pseudo-labelling over a precomputed canopy graph, yielding accurate classification at low training cost. In addition, ecological priors on species cohabitation are automatically derived from reliable sources using large language models (LLMs) and encoded as a cohabitation matrix with likelihoods of species occurring together. These priors are incorporated into the pseudo-labelling strategy, effectively introducing expert knowledge into the model. Experiments on a real-world forest dataset demonstrate 5.6% improvement over the best reference method. Expert evaluation of cohabitation priors reveals high accuracy with differences no larger than 15%.
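One way to read the closing claim about prior quality is as a bound on the entry-wise gap between the LLM-derived matrix and an expert-provided one. The sketch below assumes that reading, with likelihoods on a 0 to 1 scale and the 15% figure taken as an absolute difference of 0.15; the abstract does not specify the metric, and the matrices shown are made-up values.

```python
def max_abs_difference(llm, expert):
    """Largest absolute gap between corresponding entries of two equally sized matrices."""
    return max(
        abs(a - b)
        for row_llm, row_exp in zip(llm, expert)
        for a, b in zip(row_llm, row_exp)
    )


# Hypothetical 2-species matrices (illustration only).
llm = [[1.0, 0.70], [0.70, 1.0]]
expert = [[1.0, 0.60], [0.60, 1.0]]

gap = max_abs_difference(llm, expert)
within_tolerance = gap <= 0.15  # assumed reading of "no larger than 15%"
```

If the paper instead reports a relative difference, the comparison would need to divide by the expert value, which changes the result for small likelihoods.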
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a semi-supervised deep learning framework for hyperspectral tree species classification that fuses HSI and ALS data over a precomputed canopy graph. It augments the graph-based pseudo-labelling step with a cohabitation matrix whose entries are likelihoods automatically extracted by LLMs from scientific literature; these priors encode expert ecological knowledge on interspecific interactions. On a real-world forest dataset the method yields a 5.6 % accuracy gain over the strongest baseline, while an expert review finds that the LLM-derived priors deviate by at most 15 % from reference values.
Significance. If the reported improvement can be attributed to the LLM priors rather than to the canopy graph or multi-sensor fusion alone, the work would demonstrate a practical route for injecting domain knowledge into remote-sensing pipelines with limited labels. Such an approach could scale to other ecological classification tasks where literature-derived priors are available.
major comments (2)
- [Experiments] Experiments section: the headline 5.6 % improvement is presented without an ablation that holds the canopy graph, HSI+ALS features, and semi-supervised training fixed while removing or uniformizing the cohabitation matrix. Because the graph already encodes canopy architecture and interspecific interactions from ALS, it is impossible to determine whether the reported delta is driven by the LLM priors or by the remainder of the pipeline.
- [Experiments] The manuscript supplies no dataset statistics (number of plots, pixels per class, train/test split), no description of the reference methods, and no statistical significance tests for the 5.6 % gain. These omissions prevent verification that the data support the central performance claim.
minor comments (2)
- [Method] The precise mechanism by which the cohabitation matrix modulates the pseudo-label assignment (e.g., as an additive term, a constraint, or a re-weighting) is described only at a high level; an explicit equation or algorithmic step would improve reproducibility.
- [Abstract] The abstract states the 5.6 % improvement and the ≤15 % prior error but omits dataset details, class counts, and baseline descriptions; the same information should appear in the main text with a dedicated table.
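To make the first minor comment concrete, here is one of the candidate mechanisms it names, a multiplicative re-weighting, written out in code. This is a hypothetical illustration and should not be taken as the paper's method: the function names, the species, the likelihood values, and the confidence threshold are all assumptions introduced for the example.

```python
def reweight_with_cohabitation(probs, neighbour_labels, cohab):
    """Scale each class probability by its mean cohabitation likelihood
    with the labels of neighbouring graph nodes, then renormalise."""
    if not neighbour_labels:
        return dict(probs)
    weighted = {}
    for species, p in probs.items():
        compat = sum(cohab[species][t] for t in neighbour_labels) / len(neighbour_labels)
        weighted[species] = p * compat
    total = sum(weighted.values())
    if total == 0:
        return dict(probs)  # prior rules out everything; fall back to the classifier
    return {s: w / total for s, w in weighted.items()}


def pseudo_label(probs, threshold):
    """Return the top class if it clears the confidence threshold, else None."""
    best = max(probs, key=probs.get)
    return best if probs[best] >= threshold else None


# Hypothetical symmetric cohabitation likelihoods (illustration only).
cohab = {
    "pine": {"pine": 1.0, "oak": 0.6, "alder": 0.1},
    "oak": {"pine": 0.6, "oak": 1.0, "alder": 0.4},
    "alder": {"pine": 0.1, "oak": 0.4, "alder": 1.0},
}
probs = {"pine": 0.45, "oak": 0.45, "alder": 0.10}  # classifier output for one node
neighbours = ["pine", "pine", "oak"]                # labels of adjacent canopy nodes

before = pseudo_label(probs, threshold=0.5)
adjusted = reweight_with_cohabitation(probs, neighbours, cohab)
after = pseudo_label(adjusted, threshold=0.5)
```

In this toy case the classifier alone is undecided between pine and oak, so no pseudo-label is assigned. After re-weighting by the neighbourhood, pine clears the threshold. An additive term or a hard constraint would behave differently, which is why the report asks for an explicit equation.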
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to strengthen the experimental validation and reporting.
- Referee: [Experiments] Experiments section: the headline 5.6 % improvement is presented without an ablation that holds the canopy graph, HSI+ALS features, and semi-supervised training fixed while removing or uniformizing the cohabitation matrix. Because the graph already encodes canopy architecture and interspecific interactions from ALS, it is impossible to determine whether the reported delta is driven by the LLM priors or by the remainder of the pipeline.
  Authors: We agree that an ablation isolating the LLM-derived cohabitation priors is required to attribute the performance gain. In the revised manuscript we will add an ablation that keeps the canopy graph, HSI+ALS fusion, and semi-supervised pseudo-labelling pipeline fixed while replacing the cohabitation matrix with a uniform matrix (all entries 0.5) or removing it entirely. The resulting accuracy difference will be reported to quantify the specific contribution of the literature-derived priors.
  Revision: yes
- Referee: [Experiments] The manuscript supplies no dataset statistics (number of plots, pixels per class, train/test split), no description of the reference methods, and no statistical significance tests for the 5.6 % gain. These omissions prevent verification that the data support the central performance claim.
  Authors: We acknowledge that these details are missing from the current version. In the revision we will add: (i) full dataset statistics (number of plots, pixel counts per class, and explicit train/test split ratios); (ii) expanded descriptions of all baseline methods; and (iii) statistical significance tests (e.g., McNemar’s test with p-values or bootstrap confidence intervals) for the reported accuracy improvements.
  Revision: yes
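The two significance checks named in the second response can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names and parameters are hypothetical, the inputs are per-sample predictions from two classifiers on the same test set, and samples are treated as independent, which neighbouring pixels of one crown may not be.

```python
import random
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers."""
    only_a = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    only_b = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)


def bootstrap_accuracy_gain(y_true, pred_a, pred_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy(A) - accuracy(B), paired resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        acc_a = sum(pred_a[i] == y_true[i] for i in idx) / n
        acc_b = sum(pred_b[i] == y_true[i] for i in idx) / n
        gains.append(acc_a - acc_b)
    gains.sort()
    lower = gains[int(alpha / 2 * n_boot)]
    upper = gains[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

For spatially clustered data, resampling whole crowns or plots rather than individual pixels may give a more honest interval; that choice depends on dataset details the manuscript has yet to report.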
Circularity Check
No circularity detected; empirical improvement reported from external dataset validation without self-referential derivations
The paper describes a semi-supervised classification pipeline that incorporates LLM-derived cohabitation priors into pseudo-labelling on a precomputed canopy graph from ALS data. No equations, parameter fittings, or derivation steps are present in the provided text that would reduce the reported 5.6% accuracy gain or the cohabitation matrix to tautological inputs by construction. The improvement is claimed from experiments on a real-world forest dataset, and the priors receive separate expert validation (differences ≤15%). This is an empirical integration of external knowledge sources rather than a closed mathematical chain. No self-citations, ansatzes, or renamings of known results are invoked in a load-bearing way that collapses the central claim. The method remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-extracted cohabitation likelihoods from literature are sufficiently accurate to serve as useful priors for classification.
- domain assumption: The precomputed canopy graph from ALS data captures biologically relevant neighborhood structure for species interactions.