Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Bangjun Wang; Hao-Yuan Ma; Jie Gao; Li Zhang; Yan Zhang; Yushi Qiu

arxiv: 2606.18566 · v1 · pith:WRHWOAI5new · submitted 2026-06-17 · 💻 cs.CV · cs.AI· cs.GR

Multi-Modal Hyper-Graph Fusion for Low-Light Crowd Counting

Hao-Yuan Ma , Li Zhang , Yushi Qiu , Jie Gao , Yan Zhang , Bangjun Wang This is my paper

Pith reviewed 2026-06-26 21:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.GR

keywords low-light crowd countingmulti-modal fusionhyper-graphdepth cuesedge structureRetinex modelingdeformable attentioncrowd density estimation

0 comments

The pith

A hyper-graph that fuses RGB with depth and edge cues improves crowd counting under low light.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Crowd counting breaks down in darkness because RGB images lose reliable structure. The paper builds three new benchmarks (two synthetic, one real) and adds depth geometry plus Canny edges as stable priors drawn from Retinex-style reflectance modeling. These cues become nodes in a single hyper-graph whose dynamic hyperedges and message passing explicitly link the three modalities at higher order. A deformable rectangular sparse attention layer then focuses computation on the most informative patches. The resulting LCNet reports the strongest numbers against prior methods on all three benchmarks.

Core claim

By representing RGB appearance, depth geometry, and edge structure as nodes inside one hyper-graph and performing dynamic hyperedge construction plus message passing, the Multi-Modal Hyper-Graph Fusion module captures complementary high-order relationships that remain usable when illumination collapses; combined with the Deformable Rectangular Sparse Attention module, this yields a Low-Light Counting Network that outperforms existing state-of-the-art approaches on the new SHA_Dark, SHB_Dark, and LC-Crowd benchmarks.

What carries the argument

Multi-Modal Hyper-Graph Fusion module that places RGB, depth, and Canny-edge features as nodes in a unified hyper-graph and uses dynamic hyperedge construction with message passing to link their high-order relationships.

If this is right

Crowd density maps become more accurate in darkness without requiring brighter sensors.
Computation is concentrated on informative image regions rather than uniform dark areas.
The three new benchmarks supply standardized test data for any future low-light counting work.
High-order cross-modal relationships can be modeled explicitly instead of through late fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hyper-graph construction could be tested on other dense-prediction tasks that suffer from photometric degradation.
If depth estimation itself degrades under low light, an alternative structural prior would be needed to keep the fusion intact.
The benchmarks make it possible to measure whether multi-modal priors close the gap to well-lit performance.
Sparse rectangular attention may transfer to other vision backbones that must allocate compute unevenly across an image.

Load-bearing premise

Depth maps and Canny edges remain reliable geometric and structural signals even when the scene is extremely dark and unevenly lit.

What would settle it

On the LC-Crowd real-world test set the LCNet fails to exceed the best prior method in mean absolute error or mean squared error.

Figures

Figures reproduced from arXiv: 2606.18566 by Bangjun Wang, Hao-Yuan Ma, Jie Gao, Li Zhang, Yan Zhang, Yushi Qiu.

**Figure 1.** Figure 1: Visual examples from the Low-Light Crowd Count [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 3.** Figure 3: The overall architecture of the proposed LCNet for low-light crowd counting. The low-light RGB image is first encoded [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Overview of Deformable Rectangular Sparse At [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 7.** Figure 7: Visualization of anchor distribution in the proposed [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗

**Figure 6.** Figure 6: Visualization of crowd counting results on three [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗

read the original abstract

Crowd counting is a fundamental task in computer vision. However, crowd counting in low-light environments remains largely underexplored, despite its practical importance in the real world. Existing methods mainly focus on well-lit scenes or rely on single-modality Red-Green-Blue (RGB) representations, which often become unreliable under extreme darkness and complex non-uniform illumination. To handle this problem, we construct three new low-light crowd counting benchmarks, which consist of two synthetic datasets, SHA\_Dark and SHB\_Dark, and a real-world benchmark LC-Crowd (Low-light Crowd Dataset). Inspired by Retinex-based physical modeling, we introduce depth and Canny edge cues as complementary geometric and structural priors to enhance the intrinsic reflectance representation under low-light conditions. We propose a Multi-Modal Hyper-Graph Fusion module, which formulates RGB appearance, depth geometry, and edge structure cues as nodes in a unified hyper-graph and explicitly captures their high-order complementary relationships via dynamic hyperedge construction and message passing. Furthermore, to adaptively allocate computation in dense prediction, we propose a Deformable Rectangular Sparse Attention (DRSA) module, which concentrates computation on informative regions through anchor-aware estimation and adaptive rectangular window modeling. Based on these designs, we develop a unified Low-Light Counting Network (LCNet) for robust low-light crowd counting. Extensive experiments on three benchmarks demonstrate that the proposed method achieves the best overall performance against existing state-of-the-art (SOTA) methods. The code is in the supplementary material. The datasets will be made public upon acceptance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New low-light crowd counting benchmarks plus hyper-graph fusion of RGB, depth and edges, but no numbers or ablations shown yet.

read the letter

Two things stand out. The paper releases three new low-light crowd counting benchmarks—synthetic SHA_Dark and SHB_Dark plus the real LC-Crowd set—and builds a network that fuses RGB appearance with depth geometry and Canny edges through a hyper-graph module, plus a deformable rectangular sparse attention block.

The datasets address a clear practical gap. Most counting work stays in well-lit scenes, so adding these test sets is useful on its own. The method takes a Retinex-inspired route to justify the extra modalities, treats them as nodes in a hyper-graph with dynamic hyperedges for higher-order relations, and tries to keep computation focused with the DRSA module. Releasing code and promising to release the data is the right step.

The soft spots are straightforward. The abstract states that the method reaches the best overall performance but gives zero numbers, no error bars, no baseline scores, and no ablation results. Without those, it is impossible to tell whether the fusion actually helps or by how much. The concern about depth and edge reliability in extreme darkness is reasonable to raise; standard estimators and Canny edges lose signal fast when light is scarce, and the paper will need to show those priors remain informative on the new sets rather than adding noise.

The work shows clear engagement with the problem and standard literature on crowd counting and low-light modeling.

This is for computer vision researchers who need test data or ideas for multi-modal fusion in adverse conditions. It deserves a serious referee because the datasets are new and the setting matters, even if the empirical claims require full verification.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces three new low-light crowd counting benchmarks (SHA_Dark, SHB_Dark, and the real-world LC-Crowd) and proposes LCNet, which integrates RGB appearance with depth geometry and Canny edge structure via a Multi-Modal Hyper-Graph Fusion module (dynamic hyperedge construction and message passing) and a Deformable Rectangular Sparse Attention (DRSA) module for adaptive computation. The approach is motivated by Retinex-based physical modeling and claims state-of-the-art performance over existing methods on the three benchmarks, with code in supplementary material and datasets to be released.

Significance. If the empirical results hold with proper validation, the new benchmarks would be a useful contribution to an underexplored setting, and the hyper-graph fusion of complementary modalities could provide a principled way to handle non-uniform illumination. Explicit credit is due for the planned public release of code and datasets.

major comments (2)

[Abstract] Abstract: the central claim that the method 'achieves the best overall performance against existing state-of-the-art (SOTA) methods' is stated without any quantitative metrics, tables, error bars, or ablation results, preventing assessment of effect size or robustness; this is load-bearing for the empirical contribution.
[Abstract] Abstract (Retinex-inspired modeling paragraph): the premise that depth maps and Canny edges supply reliable complementary geometric and structural priors under extreme low-light is not supported by any quantitative validation or ablation on LC-Crowd or the synthetic dark sets; standard monocular depth estimators and gradient detectors are known to degrade with low photon counts, so the hyper-graph construction's gains over RGB-only baselines remain unsubstantiated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript to strengthen the empirical presentation.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the method 'achieves the best overall performance against existing state-of-the-art (SOTA) methods' is stated without any quantitative metrics, tables, error bars, or ablation results, preventing assessment of effect size or robustness; this is load-bearing for the empirical contribution.

Authors: We agree that the abstract would benefit from quantitative support. In the revision, we will add specific metrics (e.g., MAE/MSE improvements on SHA_Dark, SHB_Dark, and LC-Crowd versus prior SOTA) and reference the main results table to allow assessment of effect size. revision: yes
Referee: [Abstract] Abstract (Retinex-inspired modeling paragraph): the premise that depth maps and Canny edges supply reliable complementary geometric and structural priors under extreme low-light is not supported by any quantitative validation or ablation on LC-Crowd or the synthetic dark sets; standard monocular depth estimators and gradient detectors are known to degrade with low photon counts, so the hyper-graph construction's gains over RGB-only baselines remain unsubstantiated.

Authors: The full manuscript contains modality ablation studies and RGB-only baseline comparisons demonstrating performance gains from the hyper-graph fusion on the low-light benchmarks. We will revise the abstract to explicitly reference these ablation results (in the experiments section) that substantiate the utility of the priors in practice, while acknowledging that direct per-modality reliability metrics under low photon counts are not separately reported. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The provided abstract and description contain no equations, derivations, or parameter-fitting steps that could reduce predictions to inputs by construction. The method is described as constructing new benchmarks and proposing modules (hyper-graph fusion, DRSA) inspired by Retinex modeling, with performance claims resting on experimental results rather than any self-referential or fitted-input logic. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming patterns appear. This is the common case of an empirical method paper whose claims are externally falsifiable via the reported benchmarks and code.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level Retinex inspiration.

axioms (1)

domain assumption Retinex-based physical modeling supplies valid intrinsic reflectance priors for low-light scenes
Abstract states the method is inspired by Retinex and uses depth/edge as complementary priors.

pith-pipeline@v0.9.1-grok · 5825 in / 1160 out tokens · 17109 ms · 2026-06-26T21:21:07.709552+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 1 linked inside Pith

[1]

Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang
[2]

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement. InICCV
[3]

I Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, and Sy-Yen Kuo. 2024. Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance.European Conference on Computer Vision (ECCV)(2024)

2024
[4]

Xiao-Han Chen and Jian-Huang Lai. 2019. Detecting abnormal crowd behaviors based on the div-curl characteristics of flow fields.Pattern Recognition88 (2019), 342–355

2019
[6]

Yongqiang Chen, Chenglin Wen, Weifeng Liu, and Wei He. 2023. A depth iterative illumination estimation network for low-light image enhancement based on retinex theory.Scientific Reports13, 1 (2023), 19709

2023
[7]

Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, and Alexander G Hauptmann. 2022. Rethinking Spatial Invariance of Convolutional Networks for Object Counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19638–19648

2022
[8]

Xiao-Meng Duan, Hong-Mei Sun, Zeng-Min Zhang, Ling-Xiao Qin, and Rui- Sheng Jia. 2025. CMFX: Cross-modal fusion network for RGB-X crowd counting. Neural Networks184 (2025), 107070

2025
[9]

Camille Dupont, Luis Tobias, and Bertrand Luvison. 2017. Crowd-11: A dataset for fine grained crowd behaviour analysis. InProceedings of the IEEE conference on computer vision and pattern recognition workshops (ICCVW). 9–16

2017
[10]

Chunle Guo Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. 2020. Zero-reference deep curve estimation for low-light image enhancement. InProceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 1780–1789

2020
[11]

Xiaojie Guo, Yu Li, and Haibin Ling. 2017. LIME: Low-Light Image Enhancement via Illumination Map Estimation.IEEE Transactions on Image Processing26, 2 (2017), 982–993

2017
[12]

Tao Han and etc. 2023. STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning.IEEE/CVF International Conference on Computer Vision (ICCV)(2023)

2023
[13]

Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, and Ming- Hsuan Yang. 2023. Counting Crowds in Bad Weather. InIEEE/CVF International Conference on Computer Vision (ICCV)

2023
[14]

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon
[15]

InProceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Multispectral Pedestrian Detection: Benchmark Dataset and Baselines. InProceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
[16]

Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Andy J Ma, Yaowei Wang, and Wei-Shi Zheng. 2023. Dilateformer: Multi-scale dilated transformer for visual recognition.IEEE transactions on multimedia (TMM)25 (2023), 8906–8919

2023
[17]

Kingma and Jimmy Lei Ba

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. InProceedings of the International Conference on Learning Repre- sentations (ICLR)

2015
[18]

Harold W Kuhn. 1955. The Hungarian Method for the Assignment Problem. Naval research logistics quarterly2, 1-2 (1955), 83–97

1955
[19]

Edwin H Land. 1977. The retinex theory of color vision.Scientific American237, 6 (1977), 108–129

1977
[20]

Hyeonbeen Lee and Jangho Lee. 2024. TinyCount: an efficient crowd counting network for intelligent surveillance.Journal of Real-Time Image Processing21, 4 (2024), 153

2024
[21]

Yuhong Li, Xiaofan Zhang, and Deming Chen. 2018. CSRNet: Dilated Convo- lutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1091–1100. Ma et al

2018
[22]

Dingkang Liang, Wei Xu, and Xiang Bai. 2022. An End-to-End Transformer Model for Crowd Localization. InProceedings of the European Conference on Computer Vision (ECCV). 38–54

2022
[23]

Chengxin Liu, Hao Lu, Zhiguo Cao, and Tongliang Liu. 2023. Point-Query Quadtree for Crowd Counting, Localization, and More. InIEEE/CVF International Conference on Computer Vision (ICCV)

2023
[24]

Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, and Liang Lin
[25]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 4823–4833
[26]

Weizhe Liu, Mathieu Salzmann, and Pascal Fua. 2019. Context-Aware Crowd Counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5099–5108

2019
[27]

Yan-Bo Liu, Guo Cao, Boshan Shi, and Yingxiang Hu. 2024. CCANet: A Collabo- rative Cross-Modal Attention Network for RGB-D Crowd Counting.IEEE Trans. Multim.26 (2024), 154–165

2024
[28]

Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. 2017. LLNet: A Deep Autoencoder Approach to Natural Low-light Image Enhancement.Pattern Recog- nition61 (2017), 650–662

2017
[29]

Chen Change Loy, Shaogang Gong, and Tao Xiang. 2013. From Semi-supervised to Transfer Counting of Crowds. In2013 IEEE International Conference on Com- puter Vision (ICCV). 2256–2263

2013
[30]

Hao-Yuan Ma and Li Zhang. 2024. Multi-head multi-scale pixel localization network for crowd counting with highly dense and small-scale samples. In2024 IEEE International Conference on Multimedia and Expo (ICME). 1–5

2024
[31]

Hao-Yuan Ma, Li Zhang, and Shuai Shi. 2024. VMambaCC: A Visual State Space Model for Crowd Counting.arXiv preprint arXiv:2405.03978(2024)

arXiv 2024
[32]

Hao-Yuan Ma, Li Zhang, and Xiang-Yi Wei. 2024. FGENet: Fine-Grained Ex- traction Network for Congested Crowd Counting. InProceedings of the 30th International Conference on Multimedia Modeling (MMM)

2024
[33]

Baoyang Mu, Feng Shao, Zhengxuan Xie, Hangwei Chen, Zhongjie Zhu, and Qiuping Jiang. 2025. MISF-Net: Modality-invariant and-specific fusion network for RGB-T crowd counting.IEEE Transactions on Multimedia27 (2025), 2593– 2607

2025
[34]

Baoyang Mu, Feng Shao, Zhengxuan Xie, Long Xu, and Qiuping Jiang. 2025. RGBT-Booster: Detail-boosted fusion network for RGB-thermal crowd counting with local contrastive learning.IEEE Internet of Things Journal(2025)

2025
[35]

Zuodong Niu, Huilong Pi, Guoqing Xiao, Shenghong Yang, Zhuo Tang, and Dazheng Liu. 2025. Low-Light Domain Enhancement and Multidomain Pro- gressive Fusion for RGB-T Day–Night Crowd Counting.IEEE Internet of Things Journal12, 20 (2025), 42533–42548

2025
[36]

Yi Pan, Wujie Zhou, Meixin Fang, and Fangfang Qiang. 2024. Graph enhancement and transformer aggregation network for RGB-thermal crowd counting.IEEE Geoscience and Remote Sensing Letters21 (2024), 1–5

2024
[37]

Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Ban- dara, and Vishal M. Patel. 2024. CrowdDiff: Multi-Hypothesis Crowd Density Estimation Using Diffusion Models. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12809–12819

2024
[38]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: To- wards Real-Time Object Detection with Region Proposal Networks. InAdvances in Neural Information Processing Systems (NIPS), Vol. 28

2015
[39]

Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Net- works for Large-Scale Image Recognition.Computer Science(2014)

2014
[40]

Sindagi, Rajeev Yasarla, and Vishal M

Vishwanath A. Sindagi, Rajeev Yasarla, and Vishal M. Patel. 2022. JHU- CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method. IEEE Transactions on Pattern Analysis and Machine Intelligence(2022), 2594–2609

2022
[41]

Qingyu Song, Changan Wang, Zhengkai Jiang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yang Wu. 2021. Rethinking Counting and Localization in Crowds:A Purely Point-Based Framework. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 3365–3374

2021
[42]

Qingyu Song, Changan Wang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Jian Wu, and Jiayi Ma. 2021. To choose or to fuse? scale selection for crowd counting. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2576–2583

2021
[43]

Haihan Tang, Yi Wang, and Lap-Pui Chau. 2022. Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting. In2022 IEEE international symposium on circuits and systems (ISCAS). IEEE, 3299–3303

2022
[44]

Qi Wang, Junyu Gao, Wei Lin, and Xuelong Li. 2021. NWPU-Crowd: A Large- Scale Benchmark for Crowd Counting and Localization.IEEE Transactions on Pattern Analysis and Machine Intelligence43, 6 (2021), 2141–2149

2021
[45]

Shuyu Wang, Weiwei Wu, Yinglin Li, Yuhang Xu, and Yan Lyu. 2024. MIANet: Bridging the gap in crowd density estimation with thermal and RGB interaction. IEEE Transactions on Intelligent Transportation Systems26, 1 (2024), 254–267

2024
[46]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024. Depth Anything V2. InAdvances in Neural Information Processing Systems (NeurIPS)

2024
[47]

Shihui Zhang, Kun Chen, Gangzheng Zhai, He Li, and Shaojie Han. 2025. CMPNet: A cross-modal multi-scale perception network for RGB-T crowd counting.Future Generation Computer Systems164 (2025), 107596

2025
[48]

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. 2016. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 589–597

2016
[49]

Wujie Zhou, Yi Pan, Jingsheng Lei, Lv Ye, and Lu Yu. 2022. DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting.IEEE Transactions on Intelligent Transportation Systems23, 12 (2022), 24540–24549

2022
[50]

Feng Zhu, Xiaogang Wang, and Nenghai Yu. 2016. Crowd tracking by group struc- ture evolution.IEEE Transactions on Circuits and Systems for Video Technology 28, 3 (2016), 772–786

2016
[51]

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable detr: Deformable transformers for end-to-end object detection.arXiv preprint arXiv:2010.04159(2020). A Evaluation Metrics To comprehensively evaluate the counting performance, we adopt three widely used metrics, namely Mean Absolute Error (MAE), Mean Squared Error (MS...

Pith/arXiv arXiv 2020

[1] [1]

Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang

[2] [2]

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement. InICCV

[3] [3]

I Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, and Sy-Yen Kuo. 2024. Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance.European Conference on Computer Vision (ECCV)(2024)

2024

[4] [4]

Xiao-Han Chen and Jian-Huang Lai. 2019. Detecting abnormal crowd behaviors based on the div-curl characteristics of flow fields.Pattern Recognition88 (2019), 342–355

2019

[5] [6]

Yongqiang Chen, Chenglin Wen, Weifeng Liu, and Wei He. 2023. A depth iterative illumination estimation network for low-light image enhancement based on retinex theory.Scientific Reports13, 1 (2023), 19709

2023

[6] [7]

Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, and Alexander G Hauptmann. 2022. Rethinking Spatial Invariance of Convolutional Networks for Object Counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19638–19648

2022

[7] [8]

Xiao-Meng Duan, Hong-Mei Sun, Zeng-Min Zhang, Ling-Xiao Qin, and Rui- Sheng Jia. 2025. CMFX: Cross-modal fusion network for RGB-X crowd counting. Neural Networks184 (2025), 107070

2025

[8] [9]

Camille Dupont, Luis Tobias, and Bertrand Luvison. 2017. Crowd-11: A dataset for fine grained crowd behaviour analysis. InProceedings of the IEEE conference on computer vision and pattern recognition workshops (ICCVW). 9–16

2017

[9] [10]

Chunle Guo Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong. 2020. Zero-reference deep curve estimation for low-light image enhancement. InProceedings of the IEEE conference on computer vision and pattern recognition (CVPR). 1780–1789

2020

[10] [11]

Xiaojie Guo, Yu Li, and Haibin Ling. 2017. LIME: Low-Light Image Enhancement via Illumination Map Estimation.IEEE Transactions on Image Processing26, 2 (2017), 982–993

2017

[11] [12]

Tao Han and etc. 2023. STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning.IEEE/CVF International Conference on Computer Vision (ICCV)(2023)

2023

[12] [13]

Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, and Ming- Hsuan Yang. 2023. Counting Crowds in Bad Weather. InIEEE/CVF International Conference on Computer Vision (ICCV)

2023

[13] [14]

Soonmin Hwang, Jaesik Park, Namil Kim, Yukyung Choi, and In So Kweon

[14] [15]

InProceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

Multispectral Pedestrian Detection: Benchmark Dataset and Baselines. InProceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

[15] [16]

Jiayu Jiao, Yu-Ming Tang, Kun-Yu Lin, Yipeng Gao, Andy J Ma, Yaowei Wang, and Wei-Shi Zheng. 2023. Dilateformer: Multi-scale dilated transformer for visual recognition.IEEE transactions on multimedia (TMM)25 (2023), 8906–8919

2023

[16] [17]

Kingma and Jimmy Lei Ba

Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. InProceedings of the International Conference on Learning Repre- sentations (ICLR)

2015

[17] [18]

Harold W Kuhn. 1955. The Hungarian Method for the Assignment Problem. Naval research logistics quarterly2, 1-2 (1955), 83–97

1955

[18] [19]

Edwin H Land. 1977. The retinex theory of color vision.Scientific American237, 6 (1977), 108–129

1977

[19] [20]

Hyeonbeen Lee and Jangho Lee. 2024. TinyCount: an efficient crowd counting network for intelligent surveillance.Journal of Real-Time Image Processing21, 4 (2024), 153

2024

[20] [21]

Yuhong Li, Xiaofan Zhang, and Deming Chen. 2018. CSRNet: Dilated Convo- lutional Neural Networks for Understanding the Highly Congested Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1091–1100. Ma et al

2018

[21] [22]

Dingkang Liang, Wei Xu, and Xiang Bai. 2022. An End-to-End Transformer Model for Crowd Localization. InProceedings of the European Conference on Computer Vision (ECCV). 38–54

2022

[22] [23]

Chengxin Liu, Hao Lu, Zhiguo Cao, and Tongliang Liu. 2023. Point-Query Quadtree for Crowd Counting, Localization, and More. InIEEE/CVF International Conference on Computer Vision (ICCV)

2023

[23] [24]

Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, and Liang Lin

[24] [25]

InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR). 4823–4833

[25] [26]

Weizhe Liu, Mathieu Salzmann, and Pascal Fua. 2019. Context-Aware Crowd Counting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 5099–5108

2019

[26] [27]

Yan-Bo Liu, Guo Cao, Boshan Shi, and Yingxiang Hu. 2024. CCANet: A Collabo- rative Cross-Modal Attention Network for RGB-D Crowd Counting.IEEE Trans. Multim.26 (2024), 154–165

2024

[27] [28]

Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. 2017. LLNet: A Deep Autoencoder Approach to Natural Low-light Image Enhancement.Pattern Recog- nition61 (2017), 650–662

2017

[28] [29]

Chen Change Loy, Shaogang Gong, and Tao Xiang. 2013. From Semi-supervised to Transfer Counting of Crowds. In2013 IEEE International Conference on Com- puter Vision (ICCV). 2256–2263

2013

[29] [30]

Hao-Yuan Ma and Li Zhang. 2024. Multi-head multi-scale pixel localization network for crowd counting with highly dense and small-scale samples. In2024 IEEE International Conference on Multimedia and Expo (ICME). 1–5

2024

[30] [31]

Hao-Yuan Ma, Li Zhang, and Shuai Shi. 2024. VMambaCC: A Visual State Space Model for Crowd Counting.arXiv preprint arXiv:2405.03978(2024)

arXiv 2024

[31] [32]

Hao-Yuan Ma, Li Zhang, and Xiang-Yi Wei. 2024. FGENet: Fine-Grained Ex- traction Network for Congested Crowd Counting. InProceedings of the 30th International Conference on Multimedia Modeling (MMM)

2024

[32] [33]

Baoyang Mu, Feng Shao, Zhengxuan Xie, Hangwei Chen, Zhongjie Zhu, and Qiuping Jiang. 2025. MISF-Net: Modality-invariant and-specific fusion network for RGB-T crowd counting.IEEE Transactions on Multimedia27 (2025), 2593– 2607

2025

[33] [34]

Baoyang Mu, Feng Shao, Zhengxuan Xie, Long Xu, and Qiuping Jiang. 2025. RGBT-Booster: Detail-boosted fusion network for RGB-thermal crowd counting with local contrastive learning.IEEE Internet of Things Journal(2025)

2025

[34] [35]

Zuodong Niu, Huilong Pi, Guoqing Xiao, Shenghong Yang, Zhuo Tang, and Dazheng Liu. 2025. Low-Light Domain Enhancement and Multidomain Pro- gressive Fusion for RGB-T Day–Night Crowd Counting.IEEE Internet of Things Journal12, 20 (2025), 42533–42548

2025

[35] [36]

Yi Pan, Wujie Zhou, Meixin Fang, and Fangfang Qiang. 2024. Graph enhancement and transformer aggregation network for RGB-thermal crowd counting.IEEE Geoscience and Remote Sensing Letters21 (2024), 1–5

2024

[36] [37]

Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Ban- dara, and Vishal M. Patel. 2024. CrowdDiff: Multi-Hypothesis Crowd Density Estimation Using Diffusion Models. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 12809–12819

2024

[37] [38]

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: To- wards Real-Time Object Detection with Region Proposal Networks. InAdvances in Neural Information Processing Systems (NIPS), Vol. 28

2015

[38] [39]

Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Net- works for Large-Scale Image Recognition.Computer Science(2014)

2014

[39] [40]

Sindagi, Rajeev Yasarla, and Vishal M

Vishwanath A. Sindagi, Rajeev Yasarla, and Vishal M. Patel. 2022. JHU- CROWD++: Large-Scale Crowd Counting Dataset and A Benchmark Method. IEEE Transactions on Pattern Analysis and Machine Intelligence(2022), 2594–2609

2022

[40] [41]

Qingyu Song, Changan Wang, Zhengkai Jiang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yang Wu. 2021. Rethinking Counting and Localization in Crowds:A Purely Point-Based Framework. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 3365–3374

2021

[41] [42]

Qingyu Song, Changan Wang, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Jian Wu, and Jiayi Ma. 2021. To choose or to fuse? scale selection for crowd counting. InProceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2576–2583

2021

[42] [43]

Haihan Tang, Yi Wang, and Lap-Pui Chau. 2022. Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting. In2022 IEEE international symposium on circuits and systems (ISCAS). IEEE, 3299–3303

2022

[43] [44]

Qi Wang, Junyu Gao, Wei Lin, and Xuelong Li. 2021. NWPU-Crowd: A Large- Scale Benchmark for Crowd Counting and Localization.IEEE Transactions on Pattern Analysis and Machine Intelligence43, 6 (2021), 2141–2149

2021

[44] [45]

Shuyu Wang, Weiwei Wu, Yinglin Li, Yuhang Xu, and Yan Lyu. 2024. MIANet: Bridging the gap in crowd density estimation with thermal and RGB interaction. IEEE Transactions on Intelligent Transportation Systems26, 1 (2024), 254–267

2024

[45] [46]

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024. Depth Anything V2. InAdvances in Neural Information Processing Systems (NeurIPS)

2024

[46] [47]

Shihui Zhang, Kun Chen, Gangzheng Zhai, He Li, and Shaojie Han. 2025. CMPNet: A cross-modal multi-scale perception network for RGB-T crowd counting.Future Generation Computer Systems164 (2025), 107596

2025

[47] [48]

Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. 2016. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 589–597

2016

[48] [49]

Wujie Zhou, Yi Pan, Jingsheng Lei, Lv Ye, and Lu Yu. 2022. DEFNet: Dual-branch enhanced feature fusion network for RGB-T crowd counting.IEEE Transactions on Intelligent Transportation Systems23, 12 (2022), 24540–24549

2022

[49] [50]

Feng Zhu, Xiaogang Wang, and Nenghai Yu. 2016. Crowd tracking by group struc- ture evolution.IEEE Transactions on Circuits and Systems for Video Technology 28, 3 (2016), 772–786

2016

[50] [51]

Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. 2020. Deformable detr: Deformable transformers for end-to-end object detection.arXiv preprint arXiv:2010.04159(2020). A Evaluation Metrics To comprehensively evaluate the counting performance, we adopt three widely used metrics, namely Mean Absolute Error (MAE), Mean Squared Error (MS...

Pith/arXiv arXiv 2020