Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Defei Yang; Eve Bohnett; Fangchao Dong; Fang Qiu; Hao Chen; Li An

arxiv: 2604.06124 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Hao Chen , Fang Qiu , Fangchao Dong , Defei Yang , Eve Bohnett , Li An This is my paper

Pith reviewed 2026-05-10 19:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords vision-language modelsthermal imagerydrone imageryspecies recognitionhabitat interpretationmultimodal adaptationprojector alignmentecological monitoring

0 comments

The pith

A lightweight projector-based adaptation transfers RGB-pretrained vision-language models to thermal drone imagery for species recognition and habitat interpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how a simple alignment technique can adapt large vision-language models trained on everyday RGB photos to process thermal infrared images from drones. It develops a real drone dataset and tests the adapted models on recognizing deer, rhinos, and elephants under both closed and open prompting, while also extracting surrounding habitat details when RGB and thermal views are combined. Readers would care because the method avoids building new models from scratch and instead reuses powerful existing ones for specialized scientific imaging. If the approach holds, it makes advanced AI practical for ecological monitoring tasks that rely on non-visible wavelengths.

Core claim

The authors propose a lightweight multimodal adaptation framework that uses multimodal projector alignment to bridge RGB-pretrained visual representations to thermal radiometric inputs. On a drone-collected thermal dataset, this enables strong species recognition and instance enumeration, with Qwen3-VL-8B-Instruct under open-set prompting reaching F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant along with high within-1 enumeration accuracies. The same adapted models generate habitat-context information including land-cover characteristics, key landscape features, and visible human disturbance when thermal imagery is paired with simultaneously collected RGB imagery.

What carries the argument

Multimodal projector alignment, which adapts the visual processing pathway of RGB-pretrained VLMs to accept and interpret thermal infrared inputs.

If this is right

Qwen3-VL-8B-Instruct with open-set prompting delivers the highest F1 scores and enumeration accuracy among tested models on the drone thermal data.
Combining thermal and RGB imagery from the same drone passes enables the models to produce habitat-context descriptions.
The lightweight nature of the projector adaptation makes transfer practical without full model retraining or large new thermal datasets.
The framework supports both object-level species tasks and broader ecological context interpretation in real monitoring scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment technique could be tested on other non-RGB modalities such as radar or multispectral data for wildlife applications.
Conservation programs could adopt existing VLMs more quickly by reusing RGB pretraining rather than collecting and labeling massive thermal datasets.
Performance on multi-temporal or varying-altitude drone sequences would test whether the adaptation generalizes beyond single-pass imagery.

Load-bearing premise

The projector alignment sufficiently bridges the RGB-to-thermal representation gap for reliable species recognition and habitat interpretation on the collected drone dataset without major information loss or overfitting.

What would settle it

Substantially lower F1 scores or enumeration accuracy on a new, independent drone thermal dataset collected under different conditions or locations would show the adaptation does not reliably close the domain gap.

read the original abstract

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Projector adaptation gets solid F1 on their new thermal drone dataset for animals and habitat context, but missing split details and ablations leave the generalization claim shaky.

read the letter

The main point is that a lightweight projector lets RGB-pretrained VLMs handle thermal drone imagery for species recognition and some habitat interpretation, with reported F1 scores above 0.9 on deer, rhino, and elephant using models like Qwen3-VL-8B. They built a real drone-collected thermal dataset and showed the approach works for both closed-set and open-set prompting, plus pulling land-cover and disturbance details when RGB is paired in.

Referee Report

2 major / 3 minor

Summary. The paper proposes a lightweight multimodal adaptation framework that uses projector alignment to transfer RGB-pretrained vision-language models (VLMs) to thermal drone imagery. It introduces a custom drone-collected thermal dataset and fine-tunes three models (InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct) for closed- and open-set species recognition (deer, rhino, elephant) and instance enumeration, reporting F1 scores up to 0.968 and high within-1 accuracies. The work also shows that combining thermal and RGB imagery enables the models to generate habitat-context descriptions such as land-cover and human disturbance.

Significance. If the empirical results hold under proper validation, the approach demonstrates a practical, low-parameter route for extending existing VLMs to thermal imagery without full retraining. This could meaningfully expand VLM utility in ecological monitoring applications such as drone-based wildlife surveys. The use of a real-world drone dataset and the extension to habitat interpretation are positive aspects, though the absence of controls for generalization currently limits the strength of the central claim.

major comments (2)

[Experimental setup and evaluation protocol] The manuscript provides no description of the train/test split methodology (instance-level, flight-level, or location-level) or any cross-validation across habitats, times of day, or sensor calibrations. This detail is load-bearing for the claim that projector alignment bridges the RGB-to-thermal gap without major information loss or overfitting, as the reported F1 scores (0.935/0.915/0.968) and enumeration accuracies could otherwise arise from dataset-specific thermal signatures.
[Results and benchmarking] No error bars, ablation studies on the projector, or comparisons to stronger baselines (e.g., full fine-tuning or other domain-adaptation methods) are reported. Without these, it is not possible to isolate the contribution of the lightweight adaptation or assess robustness of the closed- versus open-set prompting results.

minor comments (3)

The total dataset size, number of images per species, and full training protocol (hyperparameters, epochs, learning rates) should be stated explicitly to support reproducibility.
Figure and table captions could be expanded to clarify prompting templates and exact evaluation metrics used for habitat-context generation.
The paper would benefit from a short related-work subsection contrasting the projector approach with prior thermal adaptation techniques in computer vision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for strengthening the experimental description and results analysis. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.

read point-by-point responses

Referee: [Experimental setup and evaluation protocol] The manuscript provides no description of the train/test split methodology (instance-level, flight-level, or location-level) or any cross-validation across habitats, times of day, or sensor calibrations. This detail is load-bearing for the claim that projector alignment bridges the RGB-to-thermal gap without major information loss or overfitting, as the reported F1 scores (0.935/0.915/0.968) and enumeration accuracies could otherwise arise from dataset-specific thermal signatures.

Authors: We agree that explicit details on the data partitioning are necessary to substantiate generalization. The dataset was split at the location level, with distinct geographic sites assigned to training, validation, and test sets to avoid leakage from similar habitats, flight paths, or times of day. No instance or flight overlap occurred across splits. We additionally performed stratified cross-validation by time of day and sensor calibration subsets. We will add a dedicated subsection describing the split methodology, the rationale for location-level partitioning, and the cross-validation results to demonstrate that performance is not driven by dataset-specific signatures. revision: yes
Referee: [Results and benchmarking] No error bars, ablation studies on the projector, or comparisons to stronger baselines (e.g., full fine-tuning or other domain-adaptation methods) are reported. Without these, it is not possible to isolate the contribution of the lightweight adaptation or assess robustness of the closed- versus open-set prompting results.

Authors: We acknowledge the absence of these elements in the current version. In the revision we will report mean F1 scores and enumeration accuracies with standard deviations computed over three random seeds. We will include ablations that isolate the projector alignment component (with vs. without alignment, and frozen vs. trainable projector) and direct comparisons to full fine-tuning of the vision encoder on one model as well as to LoRA-based adaptation. These additions will quantify the efficiency-accuracy trade-off of the lightweight approach and provide a clearer view of closed-set versus open-set robustness. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical adaptation and benchmarking

full rationale

The manuscript describes collection of a drone thermal dataset, fine-tuning of a multimodal projector to align RGB-pretrained VLMs with thermal inputs, and subsequent benchmarking of species recognition and habitat interpretation performance. No equations, derivations, uniqueness theorems, or ansatzes are presented that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Reported F1 scores and enumeration accuracies are obtained via standard model adaptation and evaluation on the dataset; absent any explicit reduction of the central claim to its own inputs by construction, the evaluation chain remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no equations, derivations, or explicit assumptions listed. No free parameters, axioms, or invented entities can be identified from the given text.

pith-pipeline@v0.9.0 · 5577 in / 1129 out tokens · 58051 ms · 2026-05-10T19:22:17.085092+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

[1]

Qwen2.5-VL Technical Report

Drones in ecology: ten years back and forth. BioScience 75 (8), 664–680. https://doi.org/10.1093/biosci/biaf069 Bai, J., et al., 2025a. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. https://arxiv.org/abs/2502.13923 Bai, J., et al., 2025b. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. https://arxiv.org/abs/2511.21631 Bohnett, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/biosci/biaf069
[2]

https://doi.org/10.3390/land14071461 16 Cao, Z., Zhang, J., Zhang, R.,

work page doi:10.3390/land14071461
[3]

https://doi.org/10.3390/drones9070470 Dong, S., Wang, L., Du, B., Meng, X.,

work page doi:10.3390/drones9070470
[4]

ChangeCLIP: remote sensing change detection with multimodal vision–language representation learning. ISPRS J. Photogramm. Remote Sens. 208, 53–69. https://doi.org/10.1016/j.isprsjprs.2024.01.004 dos Santos, G.N., Cassano, C.R., Alves-Ferreira, G., Veríssimo, L.F., Giné, G.A.F.,

work page doi:10.1016/j.isprsjprs.2024.01.004 2024
[5]

Do thermal drones outperform traditional surveys in detecting and estimating population density of sloths? Perspect. Ecol. Conserv. https://doi.org/10.1016/j.pecon.2025.10.002 Feng, Y ., Snoussi, H., Teng, J., Liu, J., Wang, Y ., Cherouat, A., Wang, T.,

work page doi:10.1016/j.pecon.2025.10.002 2025
[6]

arXiv preprint arXiv:2601.08408

Edge-optimized multimodal learning for UA V video understanding via BLIP-2. arXiv preprint arXiv:2601.08408. https://doi.org/10.48550/arXiv.2601.08408 Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., et al.,

work page doi:10.48550/arxiv.2601.08408
[7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394. https://doi.org/10.48550/arXiv.2306.13394 Gazagne, E., Goldblatt, C., Nguyễn, V .T., Brotcorne, F., Hambuckers, A.,

work page internal anchor Pith review doi:10.48550/arxiv.2306.13394
[8]

Utilizing thermal imaging drones to investigate sleeping site selection in an arboreal primate. Curr. Zool. 71 (5), 560–572. https://doi.org/10.1093/cz/zoae082 Guo, H., Su, X., Wu, C., Du, B., Zhang, L., Li, D.,

work page doi:10.1093/cz/zoae082
[9]

Francis and M

Remote sensing ChatGPT: solving remote sensing tasks with ChatGPT and visual models. In: Proc. IEEE IGARSS, pp. 11474–11478. https://doi.org/10.1109/IGARSS53475.2024.10640736 Hambrecht, L., Brown, R.P., Piel, A.K., Wich, S.A.,

work page doi:10.1109/igarss53475.2024.10640736 2024
[10]

Detecting ‘poachers’ with drones: factors influencing detection the probability of detection with TIR and RGB imaging in miombo woodlands, Tanzania. Biol. Conserv. 233, 109–117. https://doi.org/10.1016/j.biocon.2019.02.017 He, A., Li, X., Wu, X., Su, C., Chen, J., Xu, S., Guo, X.,

work page doi:10.1016/j.biocon.2019.02.017 2019
[11]

ALSS-YOLO: an adaptive lightweight channel split and shuffling network for TIR wildlife detection in UA V imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 17308–17326. https://doi.org/10.1109/JSTARS.2024.3461172 Hu, Y ., Yuan, J., Wen, C., Lu, X., Liu, Y ., Li, X.,

work page doi:10.1109/jstars.2024.3461172 2024
[12]

RSGPT: a remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 224, 272–286. https://doi.org/10.1016/j.isprsjprs.2025.03.028 Khanal, S., Fulton, J., Shearer, S.,

work page doi:10.1016/j.isprsjprs.2025.03.028 2025
[13]

An overview of current and potential applications of thermal remote sensing in precision agriculture. Comput. Electron. Agric. 139, 22–32. https://doi.org/10.1016/j.compag.2017.05.001 Krishnan, B.S., Jones, L.R., Elmore, J.A., Samiappan, S., Evans, K.O., Pfeiffer, M.B., et al.,

work page doi:10.1016/j.compag.2017.05.001 2017
[14]

Fusion of visible and thermal images improves automated detection and classification of animals for drone surveys . Sci. Rep. 13, 10385. https://doi.org/10.1038/s41598-023-37295-7 17 Lahiri, B.B., Bagavathiappan, S., Jayakumar, T., Philip, J.,

work page doi:10.1038/s41598-023-37295-7
[15]

Infrared Phys

Medical applications of infrared thermography: a review. Infrared Phys. Technol. 55 (4), 221–235. https://doi.org/10.1016/j.infrared.2012.03.007 Li, J., Li, D., Savarese, S., Hoi, S.,

work page doi:10.1016/j.infrared.2012.03.007 2012
[16]

Remoteclip: A vision language foundation model for remote sensing,

RemoteCLIP: a vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–16. https://doi.org/10.1109/TGRS.2024.3390838 Liu, H., Li, C., Wu, Q., Lee, Y .J.,

work page doi:10.1109/tgrs.2024.3390838 2024
[17]

Deer survey from drone thermal imagery using enhanced faster R-CNN based on ResNets and FPN. Ecol. Inform. 79, 102383. https://doi.org/10.1016/j.ecoinf.2023.102383 McCarthy, E.D., Martin, J.M., Boer, M.M., Welbergen, J.A.,

work page doi:10.1016/j.ecoinf.2023.102383 2023
[18]

Remote Sens

Drone-based thermal remote sensing provides an effective new tool for monitoring the abundance of roosting fruit bats. Remote Sens. Ecol. Conserv. 7 (3), 461–474. https://doi.org/10.1002/rse2.202 Meade, J., McCarthy, E.D., Yabsley, S.H., Grady, S.C., Martin, J.M., Welbergen, J.A.,

work page doi:10.1002/rse2.202
[19]

https://doi.org/10.3390/rs17030518 Mi, L., Dai, X., Castillo-Navarro, J., Tuia, D.,

work page doi:10.3390/rs17030518
[20]

IEEE Trans

Knowledge-aware text–image retrieval for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–13. https://doi.org/10.1109/TGRS.2024.3486977 Moshtaghi, M., Khajavi, S.H., Pajarinen, J.,

work page doi:10.1109/tgrs.2024.3486977 2024
[21]

arXiv preprint arXiv:2503.19654 , year=

RGB-Th-Bench: a dense benchmark for visual-thermal understanding of vision language models. arXiv preprint arXiv:2503.19654. https://doi.org/10.48550/arXiv.2503.19654 Norris, E.B.B., Edwards, W., Laurance, S.G.,

work page doi:10.48550/arxiv.2503.19654
[22]

https://doi.org/10.1007/s10531-026-03285-0 Pedrazzi, L., Naik, H., Sandbrook, C., Lurgi, M., Fürtbauer, I., King, A.J.,

work page doi:10.1007/s10531-026-03285-0
[23]

Advancing animal behaviour research using drone technology. Anim. Behav. 222, 123147. https://doi.org/10.1016/j.anbehav.2025.123147 Pinel-Ramos, E.J., Aureli, F., Wich, S., Rodrigues de Melo, F., Rezende, C., Brandão, F., et al.,

work page doi:10.1016/j.anbehav.2025.123147 2025
[24]

https://doi.org/10.3390/drones9090622 Povlsen, P., Linder, A.C., Larsen, H.L., Durdevic, P ., Arroyo, D.O., Bruhn, D., Pagh, S.,

work page doi:10.3390/drones9090622
[25]

https://doi.org/10.3390/drones7010005 Povlsen, P., Bruhn, D., Durdevic, P., Arroyo, D.O., Pertoldi, C.,

work page doi:10.3390/drones7010005
[26]

https://doi.org/10.3390/drones8010002 18 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al.,

work page doi:10.3390/drones8010002
[27]

Developing a new method using thermal drones for population surveys of the world's rarest great ape species, Pongo tapanuliensis. Glob. Ecol. Conserv. 58, e03463. https://doi.org/10.1016/j.gecco.2025.e03463 Rietz, J., van Beeck Calkoen, S.T., Ferry, N., Schlüter, J., Wehner, H., Schindlatz, K.H., et al.,

work page doi:10.1016/j.gecco.2025.e03463 2025
[28]

Transbound

Drone-based thermal imaging in the detection of wildlife carcasses and disease management. Transbound. Emerg. Dis. 2023, 5517000. https://doi.org/10.1155/2023/5517000 Santangeli, A., Chen, Y ., Kluen, E., Chirumamilla, R., Tiainen, J., Loehr, J.,

work page doi:10.1155/2023/5517000 2023
[29]

Integrating drone-borne thermal imaging with artificial intelligence to locate bird nests on agricultural land . Sci. Rep. 10, 10993. https://doi.org/10.1038/s41598-020-67898-3 Still, C., Powell, R., Aubrecht, D., Kim, Y ., Helliker, B., Roberts, D., et al.,

work page doi:10.1038/s41598-020-67898-3
[30]

Ecosphere 10 (6), e02768

Thermal imaging in plant and ecosystem ecology: applications and challenges. Ecosphere 10 (6), e02768. https://doi.org/10.1002/ecs2.2768 Wagner, B., Garnick, S.W., Ryan, M.F., Isaac, J.L., Begg, A., Nitschke, C.R.,

work page doi:10.1002/ecs2.2768
[31]

Thermal drone surveys to detect arboreal fauna: Improving population estimates and threatened species monitoring . Ecol. Appl. 35 (6), e70091. https://doi.org/10.1002/eap.70091 Weng, X., Pang, C., Xia, G.S.,

work page doi:10.1002/eap.70091
[32]

IEEE Geosci

Vision-language modeling meets remote sensing: models, datasets, and perspectives. IEEE Geosci. Remote Sens. Mag. 13(3). https://doi.org/10.1109/MGRS.2025.3572702 Xu, Z., Wang, T., Skidmore, A.K., Lamprey, R.,

work page doi:10.1109/mgrs.2025.3572702 2025
[33]

A review of deep learning techniques for detecting animals in aerial and satellite images. Int. J. Appl. Earth Obs. Geoinf. 128, 103732. https://doi.org/10.1016/j.jag.2024.103732 Yang, C., Li, Z., Zhang, L.,

work page doi:10.1016/j.jag.2024.103732 2024
[34]

IEEE Trans

Bootstrapping interactive image–text alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 62, 1–12. https://doi.org/10.1109/TGRS.2024.3359316 Yeom, S.,

work page doi:10.1109/tgrs.2024.3359316 2024
[35]

https://doi.org/10.3390/drones8020053 Zhang, H., Wang, C., Turvey, S.T., Sun, Z., Tan, Z., Yang, Q., et al.,

work page doi:10.3390/drones8020053
[36]

Thermal infrared imaging from drones can detect individuals and nocturnal behavior of the world’s rarest primate. Glob. Ecol. Conserv. 23, e01101. https://doi.org/10.1016/j.gecco.2020.e01101 Zhou, Y ., Li, J., Ou, C., Yan, D., Zhang, H., Xue, X.,

work page doi:10.1016/j.gecco.2020.e01101 2020
[37]

https://doi.org/10.3390/drones9080557 Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., et al.,

work page doi:10.3390/drones9080557
[38]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3: exploring advanced training and test- time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479. https://doi.org/10.48550/arXiv.2504.10479

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.10479

[1] [1]

Qwen2.5-VL Technical Report

Drones in ecology: ten years back and forth. BioScience 75 (8), 664–680. https://doi.org/10.1093/biosci/biaf069 Bai, J., et al., 2025a. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. https://arxiv.org/abs/2502.13923 Bai, J., et al., 2025b. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. https://arxiv.org/abs/2511.21631 Bohnett, ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1093/biosci/biaf069

[2] [2]

https://doi.org/10.3390/land14071461 16 Cao, Z., Zhang, J., Zhang, R.,

work page doi:10.3390/land14071461

[3] [3]

https://doi.org/10.3390/drones9070470 Dong, S., Wang, L., Du, B., Meng, X.,

work page doi:10.3390/drones9070470

[4] [4]

ChangeCLIP: remote sensing change detection with multimodal vision–language representation learning. ISPRS J. Photogramm. Remote Sens. 208, 53–69. https://doi.org/10.1016/j.isprsjprs.2024.01.004 dos Santos, G.N., Cassano, C.R., Alves-Ferreira, G., Veríssimo, L.F., Giné, G.A.F.,

work page doi:10.1016/j.isprsjprs.2024.01.004 2024

[5] [5]

Do thermal drones outperform traditional surveys in detecting and estimating population density of sloths? Perspect. Ecol. Conserv. https://doi.org/10.1016/j.pecon.2025.10.002 Feng, Y ., Snoussi, H., Teng, J., Liu, J., Wang, Y ., Cherouat, A., Wang, T.,

work page doi:10.1016/j.pecon.2025.10.002 2025

[6] [6]

arXiv preprint arXiv:2601.08408

Edge-optimized multimodal learning for UA V video understanding via BLIP-2. arXiv preprint arXiv:2601.08408. https://doi.org/10.48550/arXiv.2601.08408 Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., et al.,

work page doi:10.48550/arxiv.2601.08408

[7] [7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394. https://doi.org/10.48550/arXiv.2306.13394 Gazagne, E., Goldblatt, C., Nguyễn, V .T., Brotcorne, F., Hambuckers, A.,

work page internal anchor Pith review doi:10.48550/arxiv.2306.13394

[8] [8]

Utilizing thermal imaging drones to investigate sleeping site selection in an arboreal primate. Curr. Zool. 71 (5), 560–572. https://doi.org/10.1093/cz/zoae082 Guo, H., Su, X., Wu, C., Du, B., Zhang, L., Li, D.,

work page doi:10.1093/cz/zoae082

[9] [9]

Francis and M

Remote sensing ChatGPT: solving remote sensing tasks with ChatGPT and visual models. In: Proc. IEEE IGARSS, pp. 11474–11478. https://doi.org/10.1109/IGARSS53475.2024.10640736 Hambrecht, L., Brown, R.P., Piel, A.K., Wich, S.A.,

work page doi:10.1109/igarss53475.2024.10640736 2024

[10] [10]

Detecting ‘poachers’ with drones: factors influencing detection the probability of detection with TIR and RGB imaging in miombo woodlands, Tanzania. Biol. Conserv. 233, 109–117. https://doi.org/10.1016/j.biocon.2019.02.017 He, A., Li, X., Wu, X., Su, C., Chen, J., Xu, S., Guo, X.,

work page doi:10.1016/j.biocon.2019.02.017 2019

[11] [11]

ALSS-YOLO: an adaptive lightweight channel split and shuffling network for TIR wildlife detection in UA V imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 17308–17326. https://doi.org/10.1109/JSTARS.2024.3461172 Hu, Y ., Yuan, J., Wen, C., Lu, X., Liu, Y ., Li, X.,

work page doi:10.1109/jstars.2024.3461172 2024

[12] [12]

RSGPT: a remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 224, 272–286. https://doi.org/10.1016/j.isprsjprs.2025.03.028 Khanal, S., Fulton, J., Shearer, S.,

work page doi:10.1016/j.isprsjprs.2025.03.028 2025

[13] [13]

An overview of current and potential applications of thermal remote sensing in precision agriculture. Comput. Electron. Agric. 139, 22–32. https://doi.org/10.1016/j.compag.2017.05.001 Krishnan, B.S., Jones, L.R., Elmore, J.A., Samiappan, S., Evans, K.O., Pfeiffer, M.B., et al.,

work page doi:10.1016/j.compag.2017.05.001 2017

[14] [14]

Fusion of visible and thermal images improves automated detection and classification of animals for drone surveys . Sci. Rep. 13, 10385. https://doi.org/10.1038/s41598-023-37295-7 17 Lahiri, B.B., Bagavathiappan, S., Jayakumar, T., Philip, J.,

work page doi:10.1038/s41598-023-37295-7

[15] [15]

Infrared Phys

Medical applications of infrared thermography: a review. Infrared Phys. Technol. 55 (4), 221–235. https://doi.org/10.1016/j.infrared.2012.03.007 Li, J., Li, D., Savarese, S., Hoi, S.,

work page doi:10.1016/j.infrared.2012.03.007 2012

[16] [16]

Remoteclip: A vision language foundation model for remote sensing,

RemoteCLIP: a vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–16. https://doi.org/10.1109/TGRS.2024.3390838 Liu, H., Li, C., Wu, Q., Lee, Y .J.,

work page doi:10.1109/tgrs.2024.3390838 2024

[17] [17]

Deer survey from drone thermal imagery using enhanced faster R-CNN based on ResNets and FPN. Ecol. Inform. 79, 102383. https://doi.org/10.1016/j.ecoinf.2023.102383 McCarthy, E.D., Martin, J.M., Boer, M.M., Welbergen, J.A.,

work page doi:10.1016/j.ecoinf.2023.102383 2023

[18] [18]

Remote Sens

Drone-based thermal remote sensing provides an effective new tool for monitoring the abundance of roosting fruit bats. Remote Sens. Ecol. Conserv. 7 (3), 461–474. https://doi.org/10.1002/rse2.202 Meade, J., McCarthy, E.D., Yabsley, S.H., Grady, S.C., Martin, J.M., Welbergen, J.A.,

work page doi:10.1002/rse2.202

[19] [19]

https://doi.org/10.3390/rs17030518 Mi, L., Dai, X., Castillo-Navarro, J., Tuia, D.,

work page doi:10.3390/rs17030518

[20] [20]

IEEE Trans

Knowledge-aware text–image retrieval for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–13. https://doi.org/10.1109/TGRS.2024.3486977 Moshtaghi, M., Khajavi, S.H., Pajarinen, J.,

work page doi:10.1109/tgrs.2024.3486977 2024

[21] [21]

arXiv preprint arXiv:2503.19654 , year=

RGB-Th-Bench: a dense benchmark for visual-thermal understanding of vision language models. arXiv preprint arXiv:2503.19654. https://doi.org/10.48550/arXiv.2503.19654 Norris, E.B.B., Edwards, W., Laurance, S.G.,

work page doi:10.48550/arxiv.2503.19654

[22] [22]

https://doi.org/10.1007/s10531-026-03285-0 Pedrazzi, L., Naik, H., Sandbrook, C., Lurgi, M., Fürtbauer, I., King, A.J.,

work page doi:10.1007/s10531-026-03285-0

[23] [23]

Advancing animal behaviour research using drone technology. Anim. Behav. 222, 123147. https://doi.org/10.1016/j.anbehav.2025.123147 Pinel-Ramos, E.J., Aureli, F., Wich, S., Rodrigues de Melo, F., Rezende, C., Brandão, F., et al.,

work page doi:10.1016/j.anbehav.2025.123147 2025

[24] [24]

https://doi.org/10.3390/drones9090622 Povlsen, P., Linder, A.C., Larsen, H.L., Durdevic, P ., Arroyo, D.O., Bruhn, D., Pagh, S.,

work page doi:10.3390/drones9090622

[25] [25]

https://doi.org/10.3390/drones7010005 Povlsen, P., Bruhn, D., Durdevic, P., Arroyo, D.O., Pertoldi, C.,

work page doi:10.3390/drones7010005

[26] [26]

https://doi.org/10.3390/drones8010002 18 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al.,

work page doi:10.3390/drones8010002

[27] [27]

Developing a new method using thermal drones for population surveys of the world's rarest great ape species, Pongo tapanuliensis. Glob. Ecol. Conserv. 58, e03463. https://doi.org/10.1016/j.gecco.2025.e03463 Rietz, J., van Beeck Calkoen, S.T., Ferry, N., Schlüter, J., Wehner, H., Schindlatz, K.H., et al.,

work page doi:10.1016/j.gecco.2025.e03463 2025

[28] [28]

Transbound

Drone-based thermal imaging in the detection of wildlife carcasses and disease management. Transbound. Emerg. Dis. 2023, 5517000. https://doi.org/10.1155/2023/5517000 Santangeli, A., Chen, Y ., Kluen, E., Chirumamilla, R., Tiainen, J., Loehr, J.,

work page doi:10.1155/2023/5517000 2023

[29] [29]

Integrating drone-borne thermal imaging with artificial intelligence to locate bird nests on agricultural land . Sci. Rep. 10, 10993. https://doi.org/10.1038/s41598-020-67898-3 Still, C., Powell, R., Aubrecht, D., Kim, Y ., Helliker, B., Roberts, D., et al.,

work page doi:10.1038/s41598-020-67898-3

[30] [30]

Ecosphere 10 (6), e02768

Thermal imaging in plant and ecosystem ecology: applications and challenges. Ecosphere 10 (6), e02768. https://doi.org/10.1002/ecs2.2768 Wagner, B., Garnick, S.W., Ryan, M.F., Isaac, J.L., Begg, A., Nitschke, C.R.,

work page doi:10.1002/ecs2.2768

[31] [31]

Thermal drone surveys to detect arboreal fauna: Improving population estimates and threatened species monitoring . Ecol. Appl. 35 (6), e70091. https://doi.org/10.1002/eap.70091 Weng, X., Pang, C., Xia, G.S.,

work page doi:10.1002/eap.70091

[32] [32]

IEEE Geosci

Vision-language modeling meets remote sensing: models, datasets, and perspectives. IEEE Geosci. Remote Sens. Mag. 13(3). https://doi.org/10.1109/MGRS.2025.3572702 Xu, Z., Wang, T., Skidmore, A.K., Lamprey, R.,

work page doi:10.1109/mgrs.2025.3572702 2025

[33] [33]

A review of deep learning techniques for detecting animals in aerial and satellite images. Int. J. Appl. Earth Obs. Geoinf. 128, 103732. https://doi.org/10.1016/j.jag.2024.103732 Yang, C., Li, Z., Zhang, L.,

work page doi:10.1016/j.jag.2024.103732 2024

[34] [34]

IEEE Trans

Bootstrapping interactive image–text alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 62, 1–12. https://doi.org/10.1109/TGRS.2024.3359316 Yeom, S.,

work page doi:10.1109/tgrs.2024.3359316 2024

[35] [35]

https://doi.org/10.3390/drones8020053 Zhang, H., Wang, C., Turvey, S.T., Sun, Z., Tan, Z., Yang, Q., et al.,

work page doi:10.3390/drones8020053

[36] [36]

Thermal infrared imaging from drones can detect individuals and nocturnal behavior of the world’s rarest primate. Glob. Ecol. Conserv. 23, e01101. https://doi.org/10.1016/j.gecco.2020.e01101 Zhou, Y ., Li, J., Ou, C., Yan, D., Zhang, H., Xue, X.,

work page doi:10.1016/j.gecco.2020.e01101 2020

[37] [37]

https://doi.org/10.3390/drones9080557 Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., et al.,

work page doi:10.3390/drones9080557

[38] [38]

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

InternVL3: exploring advanced training and test- time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479. https://doi.org/10.48550/arXiv.2504.10479

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.10479