arxiv: 2605.17042 · v1 · pith:M4PUITFNnew · submitted 2026-05-16 · 💻 cs.CV

Thermal-Only Crowd Counting with Deployment-Time Privacy Protection

Pith reviewed 2026-05-20 15:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords crowd counting · thermal imaging · privacy protection · diffusion models · RGB-T fusion · single-step denoising · depth conditioning

The pith

Thermal-only crowd counting matches RGB-T fusion accuracy by using single-step depth-to-RGB diffusion features during training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework for counting people in crowds that operates on thermal images alone once deployed. During training it borrows structural features from a depth-to-RGB diffusion model to make the thermal signal more informative. The authors find that running only a single LCM (latent consistency model) denoising step keeps those features tightly linked to the depth input, while additional steps loosen the connection and add errors. This design reaches accuracy comparable to methods that fuse RGB and thermal data at every stage, yet removes the need for continuous visible-light capture, which raises privacy concerns in public surveillance.

Core claim

Single-step LCM denoising within depth-to-RGB diffusion models produces features most faithful to the conditioning depth signal and therefore most useful for strengthening thermal representations, yielding a thermal-only inference pipeline that delivers competitive performance against RGB-T fusion methods on the RGBT-CC and DroneRGBT datasets.

What carries the argument

Single-step LCM denoising acting as a cross-modal bridge that extracts structural features from depth conditioning to augment thermal inputs for counting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same diffusion-bridge idea could be tested on other privacy-sensitive thermal tasks such as person detection or activity recognition.
  • Checking whether the single-step preference holds across different diffusion architectures would clarify how general the observation is.
  • The method opens a route to train once with auxiliary depth or RGB data and then run lightweight thermal-only models in the field.

Load-bearing premise

The depth-to-RGB diffusion model can extract features that genuinely improve thermal counting accuracy without the denoising process itself introducing errors that lower performance.

What would settle it

An experiment in which multi-step denoising or a version without any diffusion features produces equal or higher counting accuracy on RGBT-CC or DroneRGBT would undermine the claim that single-step LCM is required.
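Such a comparison comes down to scoring every variant with the same counting-error metrics. The simulated rebuttal on this page names MAE and RMSE; the sketch below computes both, assuming per-image predicted and ground-truth counts are available as plain lists. The function name `counting_errors` is hypothetical and is not taken from the paper's code.

```python
import math
from typing import Sequence, Tuple


def counting_errors(predicted: Sequence[float],
                    ground_truth: Sequence[float]) -> Tuple[float, float]:
    """Return (MAE, RMSE) over per-image crowd counts.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(predicted) != len(ground_truth) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Signed per-image error between predicted and true counts.
    diffs = [p - g for p, g in zip(predicted, ground_truth)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

Lower is better for both values. The settling experiment would report them for the single-step, multi-step, and no-diffusion variants on the same test split.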

Figures

Figures reproduced from arXiv: 2605.17042 by Bowen Deng, Chun Pong Lau, Chun Tong Lei, Michael P. Pound, Xiaopeng Hong, Yifei Qian, Zhongliang Guo.

Figure 1: Misalignment issues between RGB and thermal images in RGBT…
Figure 2: The overall framework of the proposed TDCount. We leverage ControlNet-based depth-to-RGB translation to extract complementary features F_TD, which are integrated with thermal features F_T through a simple feature enhancement module, and aligned via a prototype alignment loss (L_PA) for robust crowd density estimation.
Figure 3: Visualization results of our thermal-only method versus traditional RGB-T approaches. Our method produces accurate density maps using only thermal…
Figure 4: Qualitative comparison demonstrating the effectiveness of incorporating ControlNet-derived features. The yellow boxes highlight challenging regions…
Figure 5: Post-convergence analysis demonstrating counting performance sta…

Original abstract

While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes what it describes as the first thermal-only crowd counting framework designed for privacy protection in surveillance applications. It trains a thermal encoder using a depth-to-RGB diffusion model with LCM denoising as a cross-modal bridge that injects structural features. The key technical assertion is that single-step LCM denoising produces the features most faithful to the depth conditioning signal, while multi-step denoising decouples the features from the conditioning input and accumulates errors. The method is claimed to achieve performance competitive with state-of-the-art RGB-T fusion approaches on the RGBT-CC and DroneRGBT datasets while requiring only thermal input at inference time.

Significance. If the performance and ablation claims are substantiated, the work would offer a meaningful advance for privacy-conscious crowd monitoring by eliminating continuous RGB capture. The use of diffusion models specifically as a deployment-time bridge for thermal ambiguity is a novel angle, and the commitment to public code release supports reproducibility.

major comments (2)
  1. [Abstract] The claim of 'competitive performance against state-of-the-art RGB-T fusion methods' is presented without any quantitative numbers, error bars, tables, or ablation details, leaving the central empirical claim unverified and difficult to assess.
  2. [Method and Experiments] The assertion that single-step LCM denoising yields features 'most faithful to the structural content of the depth conditioning signal' while multi-step runs accumulate errors is load-bearing for the reported advantage over plain thermal baselines, yet no direct faithfulness metrics (e.g., SSIM, edge preservation) or controlled ablations isolating step count are described.
minor comments (1)
  1. [Abstract] The abstract could briefly note the specific quantitative metrics used for 'competitive performance' to give readers an immediate sense of the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the work's significance for privacy-preserving surveillance. We address each major comment below and have revised the manuscript to provide stronger empirical support and methodological justification.

  1. Referee: [Abstract] The claim of 'competitive performance against state-of-the-art RGB-T fusion methods' is presented without any quantitative numbers, error bars, tables, or ablation details, leaving the central empirical claim unverified and difficult to assess.

    Authors: We agree that the abstract would benefit from explicit quantitative backing to make the central claim immediately verifiable. In the revised manuscript, we will add specific MAE and RMSE values from the RGBT-CC and DroneRGBT experiments, including direct numerical comparisons to the cited state-of-the-art RGB-T fusion baselines. These figures are already reported in the experimental section and will be concisely incorporated into the abstract without exceeding length limits. revision: yes

  2. Referee: [Method and Experiments] The assertion that single-step LCM denoising yields features 'most faithful to the structural content of the depth conditioning signal' while multi-step runs accumulate errors is load-bearing for the reported advantage over plain thermal baselines, yet no direct faithfulness metrics (e.g., SSIM, edge preservation) or controlled ablations isolating step count are described.

    Authors: We acknowledge that direct quantitative faithfulness metrics and isolated step-count ablations were not included, which would strengthen the justification for preferring single-step denoising. We will add these in the revised manuscript: (i) SSIM and edge-preservation scores measuring feature fidelity to the depth conditioning signal across step counts, and (ii) a controlled ablation table varying denoising steps while holding all other components fixed. These additions will provide direct evidence that single-step LCM denoising best preserves structural content. revision: yes
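One way a fidelity score of this kind could be computed is a simplified, single-window SSIM between a depth map and a feature-derived map, both flattened to equal-length lists. This is a sketch under stated assumptions: the page does not say how features would be projected to a map comparable with depth, the function name `global_ssim` is hypothetical, and standard SSIM is usually averaged over local windows rather than computed globally. The constants `k1 = 0.01` and `k2 = 0.03` are the conventional defaults.

```python
from typing import Sequence


def global_ssim(x: Sequence[float], y: Sequence[float],
                data_range: float = 1.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM over two flattened maps of equal length.

    Illustrative simplification: no sliding window, no Gaussian weighting.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    # Population variances and covariance of the two signals.
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Stabilising constants scaled by the dynamic range of the data.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs score 1, and structurally dissimilar inputs score lower. A step-count ablation would report this value for each number of denoising steps against the same depth input.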

Circularity Check

0 steps flagged

No circularity; novel pipeline with external experimental validation

The paper introduces a new thermal-only crowd counting framework that uses a depth-to-RGB diffusion model (LCM) as a cross-modal bridge to enhance thermal features, with the explicit goal of eliminating RGB at inference for privacy. The assertion that single-step LCM denoising yields features most faithful to the depth signal is presented as an empirical demonstration rather than a definitional necessity or fitted input renamed as a prediction. No equations, self-definitional reductions, load-bearing self-citations, or uniqueness theorems imported from the authors' prior work appear in the described derivation. Performance is validated through direct comparisons on the external RGBT-CC and DroneRGBT datasets against RGB-T fusion baselines, rendering the chain self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the unproven effectiveness of the depth-to-RGB diffusion model as a bridge for thermal data and on the superiority of single-step denoising; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Depth-to-RGB diffusion models can extract discriminative features that enhance thermal representations and mitigate thermal ambiguity.
    Invoked to justify the cross-modal bridge that replaces RGB input at inference.
  • domain assumption: Single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal.
    Stated as the critical empirical finding that enables the method's performance.

pith-pipeline@v0.9.0 · 5722 in / 1363 out tokens · 39730 ms · 2026-05-20T15:49:33.054744+00:00 · methodology
