pith. sign in

arxiv: 2604.06124 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Lightweight Multimodal Adaptation of Vision Language Models for Species Recognition and Habitat Context Interpretation in Drone Thermal Imagery

Pith reviewed 2026-05-10 19:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language modelsthermal imagerydrone imageryspecies recognitionhabitat interpretationmultimodal adaptationprojector alignmentecological monitoring
0
0 comments X

The pith

A lightweight projector-based adaptation transfers RGB-pretrained vision-language models to thermal drone imagery for species recognition and habitat interpretation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates how a simple alignment technique can adapt large vision-language models trained on everyday RGB photos to process thermal infrared images from drones. It develops a real drone dataset and tests the adapted models on recognizing deer, rhinos, and elephants under both closed and open prompting, while also extracting surrounding habitat details when RGB and thermal views are combined. Readers would care because the method avoids building new models from scratch and instead reuses powerful existing ones for specialized scientific imaging. If the approach holds, it makes advanced AI practical for ecological monitoring tasks that rely on non-visible wavelengths.

Core claim

The authors propose a lightweight multimodal adaptation framework that uses multimodal projector alignment to bridge RGB-pretrained visual representations to thermal radiometric inputs. On a drone-collected thermal dataset, this enables strong species recognition and instance enumeration, with Qwen3-VL-8B-Instruct under open-set prompting reaching F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant along with high within-1 enumeration accuracies. The same adapted models generate habitat-context information including land-cover characteristics, key landscape features, and visible human disturbance when thermal imagery is paired with simultaneously collected RGB imagery.

What carries the argument

Multimodal projector alignment, which adapts the visual processing pathway of RGB-pretrained VLMs to accept and interpret thermal infrared inputs.

If this is right

  • Qwen3-VL-8B-Instruct with open-set prompting delivers the highest F1 scores and enumeration accuracy among tested models on the drone thermal data.
  • Combining thermal and RGB imagery from the same drone passes enables the models to produce habitat-context descriptions.
  • The lightweight nature of the projector adaptation makes transfer practical without full model retraining or large new thermal datasets.
  • The framework supports both object-level species tasks and broader ecological context interpretation in real monitoring scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment technique could be tested on other non-RGB modalities such as radar or multispectral data for wildlife applications.
  • Conservation programs could adopt existing VLMs more quickly by reusing RGB pretraining rather than collecting and labeling massive thermal datasets.
  • Performance on multi-temporal or varying-altitude drone sequences would test whether the adaptation generalizes beyond single-pass imagery.

Load-bearing premise

The projector alignment sufficiently bridges the RGB-to-thermal representation gap for reliable species recognition and habitat interpretation on the collected drone dataset without major information loss or overfitting.

What would settle it

Substantially lower F1 scores or enumeration accuracy on a new, independent drone thermal dataset collected under different conditions or locations would show the adaptation does not reliably close the domain gap.

read the original abstract

This study proposes a lightweight multimodal adaptation framework to bridge the representation gap between RGB-pretrained VLMs and thermal infrared imagery, and demonstrates its practical utility using a real drone-collected dataset. A thermal dataset was developed from drone-collected imagery and was used to fine-tune VLMs through multimodal projector alignment, enabling the transfer of information from RGB-based visual representations to thermal radiometric inputs. Three representative models, including InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, and Qwen3-VL-8B-Instruct, were benchmarked under both closed-set and open-set prompting conditions for species recognition and instance enumeration. Among the tested models, Qwen3-VL-8B-Instruct with open-set prompting achieved the best overall performance, with F1 scores of 0.935 for deer, 0.915 for rhino, and 0.968 for elephant, and within-1 enumeration accuracies of 0.779, 0.982, and 1.000, respectively. In addition, combining thermal imagery with simultaneously collected RGB imagery enabled the model to generate habitat-context information, including land-cover characteristics, key landscape features, and visible human disturbance. Overall, the findings demonstrate that lightweight projector-based adaptation provides an effective and practical route for transferring RGB-pretrained VLMs to thermal drone imagery, expanding their utility from object-level recognition to habitat-context interpretation in ecological monitoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes a lightweight multimodal adaptation framework that uses projector alignment to transfer RGB-pretrained vision-language models (VLMs) to thermal drone imagery. It introduces a custom drone-collected thermal dataset and fine-tunes three models (InternVL3-8B-Instruct, Qwen2.5-VL-7B-Instruct, Qwen3-VL-8B-Instruct) for closed- and open-set species recognition (deer, rhino, elephant) and instance enumeration, reporting F1 scores up to 0.968 and high within-1 accuracies. The work also shows that combining thermal and RGB imagery enables the models to generate habitat-context descriptions such as land-cover and human disturbance.

Significance. If the empirical results hold under proper validation, the approach demonstrates a practical, low-parameter route for extending existing VLMs to thermal imagery without full retraining. This could meaningfully expand VLM utility in ecological monitoring applications such as drone-based wildlife surveys. The use of a real-world drone dataset and the extension to habitat interpretation are positive aspects, though the absence of controls for generalization currently limits the strength of the central claim.

major comments (2)
  1. [Experimental setup and evaluation protocol] The manuscript provides no description of the train/test split methodology (instance-level, flight-level, or location-level) or any cross-validation across habitats, times of day, or sensor calibrations. This detail is load-bearing for the claim that projector alignment bridges the RGB-to-thermal gap without major information loss or overfitting, as the reported F1 scores (0.935/0.915/0.968) and enumeration accuracies could otherwise arise from dataset-specific thermal signatures.
  2. [Results and benchmarking] No error bars, ablation studies on the projector, or comparisons to stronger baselines (e.g., full fine-tuning or other domain-adaptation methods) are reported. Without these, it is not possible to isolate the contribution of the lightweight adaptation or assess robustness of the closed- versus open-set prompting results.
minor comments (3)
  1. The total dataset size, number of images per species, and full training protocol (hyperparameters, epochs, learning rates) should be stated explicitly to support reproducibility.
  2. Figure and table captions could be expanded to clarify prompting templates and exact evaluation metrics used for habitat-context generation.
  3. The paper would benefit from a short related-work subsection contrasting the projector approach with prior thermal adaptation techniques in computer vision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for strengthening the experimental description and results analysis. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.

read point-by-point responses
  1. Referee: [Experimental setup and evaluation protocol] The manuscript provides no description of the train/test split methodology (instance-level, flight-level, or location-level) or any cross-validation across habitats, times of day, or sensor calibrations. This detail is load-bearing for the claim that projector alignment bridges the RGB-to-thermal gap without major information loss or overfitting, as the reported F1 scores (0.935/0.915/0.968) and enumeration accuracies could otherwise arise from dataset-specific thermal signatures.

    Authors: We agree that explicit details on the data partitioning are necessary to substantiate generalization. The dataset was split at the location level, with distinct geographic sites assigned to training, validation, and test sets to avoid leakage from similar habitats, flight paths, or times of day. No instance or flight overlap occurred across splits. We additionally performed stratified cross-validation by time of day and sensor calibration subsets. We will add a dedicated subsection describing the split methodology, the rationale for location-level partitioning, and the cross-validation results to demonstrate that performance is not driven by dataset-specific signatures. revision: yes

  2. Referee: [Results and benchmarking] No error bars, ablation studies on the projector, or comparisons to stronger baselines (e.g., full fine-tuning or other domain-adaptation methods) are reported. Without these, it is not possible to isolate the contribution of the lightweight adaptation or assess robustness of the closed- versus open-set prompting results.

    Authors: We acknowledge the absence of these elements in the current version. In the revision we will report mean F1 scores and enumeration accuracies with standard deviations computed over three random seeds. We will include ablations that isolate the projector alignment component (with vs. without alignment, and frozen vs. trainable projector) and direct comparisons to full fine-tuning of the vision encoder on one model as well as to LoRA-based adaptation. These additions will quantify the efficiency-accuracy trade-off of the lightweight approach and provide a clearer view of closed-set versus open-set robustness. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical adaptation and benchmarking

full rationale

The manuscript describes collection of a drone thermal dataset, fine-tuning of a multimodal projector to align RGB-pretrained VLMs with thermal inputs, and subsequent benchmarking of species recognition and habitat interpretation performance. No equations, derivations, uniqueness theorems, or ansatzes are presented that reduce to self-definition, fitted inputs renamed as predictions, or self-citation chains. Reported F1 scores and enumeration accuracies are obtained via standard model adaptation and evaluation on the dataset; absent any explicit reduction of the central claim to its own inputs by construction, the evaluation chain remains self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no equations, derivations, or explicit assumptions listed. No free parameters, axioms, or invented entities can be identified from the given text.

pith-pipeline@v0.9.0 · 5577 in / 1129 out tokens · 58051 ms · 2026-05-10T19:22:17.085092+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 3 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Drones in ecology: ten years back and forth. BioScience 75 (8), 664–680. https://doi.org/10.1093/biosci/biaf069 Bai, J., et al., 2025a. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923. https://arxiv.org/abs/2502.13923 Bai, J., et al., 2025b. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631. https://arxiv.org/abs/2511.21631 Bohnett, ...

  2. [2]

    https://doi.org/10.3390/land14071461 16 Cao, Z., Zhang, J., Zhang, R.,

  3. [3]

    https://doi.org/10.3390/drones9070470 Dong, S., Wang, L., Du, B., Meng, X.,

  4. [4]

    ChangeCLIP: remote sensing change detection with multimodal vision–language representation learning. ISPRS J. Photogramm. Remote Sens. 208, 53–69. https://doi.org/10.1016/j.isprsjprs.2024.01.004 dos Santos, G.N., Cassano, C.R., Alves-Ferreira, G., Veríssimo, L.F., Giné, G.A.F.,

  5. [5]

    Do thermal drones outperform traditional surveys in detecting and estimating population density of sloths? Perspect. Ecol. Conserv. https://doi.org/10.1016/j.pecon.2025.10.002 Feng, Y ., Snoussi, H., Teng, J., Liu, J., Wang, Y ., Cherouat, A., Wang, T.,

  6. [6]

    arXiv preprint arXiv:2601.08408

    Edge-optimized multimodal learning for UA V video understanding via BLIP-2. arXiv preprint arXiv:2601.08408. https://doi.org/10.48550/arXiv.2601.08408 Fu, C., Chen, P., Shen, Y ., Qin, Y ., Zhang, M., Lin, X., et al.,

  7. [7]

    MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

    MME: a comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394. https://doi.org/10.48550/arXiv.2306.13394 Gazagne, E., Goldblatt, C., Nguyễn, V .T., Brotcorne, F., Hambuckers, A.,

  8. [8]

    Utilizing thermal imaging drones to investigate sleeping site selection in an arboreal primate. Curr. Zool. 71 (5), 560–572. https://doi.org/10.1093/cz/zoae082 Guo, H., Su, X., Wu, C., Du, B., Zhang, L., Li, D.,

  9. [9]

    Francis and M

    Remote sensing ChatGPT: solving remote sensing tasks with ChatGPT and visual models. In: Proc. IEEE IGARSS, pp. 11474–11478. https://doi.org/10.1109/IGARSS53475.2024.10640736 Hambrecht, L., Brown, R.P., Piel, A.K., Wich, S.A.,

  10. [10]

    Detecting ‘poachers’ with drones: factors influencing detection the probability of detection with TIR and RGB imaging in miombo woodlands, Tanzania. Biol. Conserv. 233, 109–117. https://doi.org/10.1016/j.biocon.2019.02.017 He, A., Li, X., Wu, X., Su, C., Chen, J., Xu, S., Guo, X.,

  11. [11]

    ALSS-YOLO: an adaptive lightweight channel split and shuffling network for TIR wildlife detection in UA V imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 17, 17308–17326. https://doi.org/10.1109/JSTARS.2024.3461172 Hu, Y ., Yuan, J., Wen, C., Lu, X., Liu, Y ., Li, X.,

  12. [12]

    RSGPT: a remote sensing vision language model and benchmark. ISPRS J. Photogramm. Remote Sens. 224, 272–286. https://doi.org/10.1016/j.isprsjprs.2025.03.028 Khanal, S., Fulton, J., Shearer, S.,

  13. [13]

    An overview of current and potential applications of thermal remote sensing in precision agriculture. Comput. Electron. Agric. 139, 22–32. https://doi.org/10.1016/j.compag.2017.05.001 Krishnan, B.S., Jones, L.R., Elmore, J.A., Samiappan, S., Evans, K.O., Pfeiffer, M.B., et al.,

  14. [14]

    Fusion of visible and thermal images improves automated detection and classification of animals for drone surveys . Sci. Rep. 13, 10385. https://doi.org/10.1038/s41598-023-37295-7 17 Lahiri, B.B., Bagavathiappan, S., Jayakumar, T., Philip, J.,

  15. [15]

    Infrared Phys

    Medical applications of infrared thermography: a review. Infrared Phys. Technol. 55 (4), 221–235. https://doi.org/10.1016/j.infrared.2012.03.007 Li, J., Li, D., Savarese, S., Hoi, S.,

  16. [16]

    Remoteclip: A vision language foundation model for remote sensing,

    RemoteCLIP: a vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–16. https://doi.org/10.1109/TGRS.2024.3390838 Liu, H., Li, C., Wu, Q., Lee, Y .J.,

  17. [17]

    Deer survey from drone thermal imagery using enhanced faster R-CNN based on ResNets and FPN. Ecol. Inform. 79, 102383. https://doi.org/10.1016/j.ecoinf.2023.102383 McCarthy, E.D., Martin, J.M., Boer, M.M., Welbergen, J.A.,

  18. [18]

    Remote Sens

    Drone-based thermal remote sensing provides an effective new tool for monitoring the abundance of roosting fruit bats. Remote Sens. Ecol. Conserv. 7 (3), 461–474. https://doi.org/10.1002/rse2.202 Meade, J., McCarthy, E.D., Yabsley, S.H., Grady, S.C., Martin, J.M., Welbergen, J.A.,

  19. [19]

    https://doi.org/10.3390/rs17030518 Mi, L., Dai, X., Castillo-Navarro, J., Tuia, D.,

  20. [20]

    IEEE Trans

    Knowledge-aware text–image retrieval for remote sensing. IEEE Trans. Geosci. Remote Sens. 62, 1–13. https://doi.org/10.1109/TGRS.2024.3486977 Moshtaghi, M., Khajavi, S.H., Pajarinen, J.,

  21. [21]

    arXiv preprint arXiv:2503.19654 , year=

    RGB-Th-Bench: a dense benchmark for visual-thermal understanding of vision language models. arXiv preprint arXiv:2503.19654. https://doi.org/10.48550/arXiv.2503.19654 Norris, E.B.B., Edwards, W., Laurance, S.G.,

  22. [22]

    https://doi.org/10.1007/s10531-026-03285-0 Pedrazzi, L., Naik, H., Sandbrook, C., Lurgi, M., Fürtbauer, I., King, A.J.,

  23. [23]

    Advancing animal behaviour research using drone technology. Anim. Behav. 222, 123147. https://doi.org/10.1016/j.anbehav.2025.123147 Pinel-Ramos, E.J., Aureli, F., Wich, S., Rodrigues de Melo, F., Rezende, C., Brandão, F., et al.,

  24. [24]

    https://doi.org/10.3390/drones9090622 Povlsen, P., Linder, A.C., Larsen, H.L., Durdevic, P ., Arroyo, D.O., Bruhn, D., Pagh, S.,

  25. [25]

    https://doi.org/10.3390/drones7010005 Povlsen, P., Bruhn, D., Durdevic, P., Arroyo, D.O., Pertoldi, C.,

  26. [26]

    https://doi.org/10.3390/drones8010002 18 Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al.,

  27. [27]

    Developing a new method using thermal drones for population surveys of the world's rarest great ape species, Pongo tapanuliensis. Glob. Ecol. Conserv. 58, e03463. https://doi.org/10.1016/j.gecco.2025.e03463 Rietz, J., van Beeck Calkoen, S.T., Ferry, N., Schlüter, J., Wehner, H., Schindlatz, K.H., et al.,

  28. [28]

    Transbound

    Drone-based thermal imaging in the detection of wildlife carcasses and disease management. Transbound. Emerg. Dis. 2023, 5517000. https://doi.org/10.1155/2023/5517000 Santangeli, A., Chen, Y ., Kluen, E., Chirumamilla, R., Tiainen, J., Loehr, J.,

  29. [29]

    Integrating drone-borne thermal imaging with artificial intelligence to locate bird nests on agricultural land . Sci. Rep. 10, 10993. https://doi.org/10.1038/s41598-020-67898-3 Still, C., Powell, R., Aubrecht, D., Kim, Y ., Helliker, B., Roberts, D., et al.,

  30. [30]

    Ecosphere 10 (6), e02768

    Thermal imaging in plant and ecosystem ecology: applications and challenges. Ecosphere 10 (6), e02768. https://doi.org/10.1002/ecs2.2768 Wagner, B., Garnick, S.W., Ryan, M.F., Isaac, J.L., Begg, A., Nitschke, C.R.,

  31. [31]

    Thermal drone surveys to detect arboreal fauna: Improving population estimates and threatened species monitoring . Ecol. Appl. 35 (6), e70091. https://doi.org/10.1002/eap.70091 Weng, X., Pang, C., Xia, G.S.,

  32. [32]

    IEEE Geosci

    Vision-language modeling meets remote sensing: models, datasets, and perspectives. IEEE Geosci. Remote Sens. Mag. 13(3). https://doi.org/10.1109/MGRS.2025.3572702 Xu, Z., Wang, T., Skidmore, A.K., Lamprey, R.,

  33. [33]

    A review of deep learning techniques for detecting animals in aerial and satellite images. Int. J. Appl. Earth Obs. Geoinf. 128, 103732. https://doi.org/10.1016/j.jag.2024.103732 Yang, C., Li, Z., Zhang, L.,

  34. [34]

    IEEE Trans

    Bootstrapping interactive image–text alignment for remote sensing image captioning. IEEE Trans. Geosci. Remote Sens. 62, 1–12. https://doi.org/10.1109/TGRS.2024.3359316 Yeom, S.,

  35. [35]

    https://doi.org/10.3390/drones8020053 Zhang, H., Wang, C., Turvey, S.T., Sun, Z., Tan, Z., Yang, Q., et al.,

  36. [36]

    Thermal infrared imaging from drones can detect individuals and nocturnal behavior of the world’s rarest primate. Glob. Ecol. Conserv. 23, e01101. https://doi.org/10.1016/j.gecco.2020.e01101 Zhou, Y ., Li, J., Ou, C., Yan, D., Zhang, H., Xue, X.,

  37. [37]

    https://doi.org/10.3390/drones9080557 Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., et al.,

  38. [38]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    InternVL3: exploring advanced training and test- time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479. https://doi.org/10.48550/arXiv.2504.10479