Thermal-Only Crowd Counting with Deployment-Time Privacy Protection
Pith reviewed 2026-05-20 15:49 UTC · model grok-4.3
The pith
Thermal-only crowd counting matches RGB-T fusion accuracy by using single-step depth-to-RGB diffusion features during training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Single-step LCM denoising within depth-to-RGB diffusion models produces features most faithful to the conditioning depth signal and therefore most useful for strengthening thermal representations, yielding a thermal-only inference pipeline that delivers competitive performance against RGB-T fusion methods on the RGBT-CC and DroneRGBT datasets.
What carries the argument
Single-step LCM denoising acting as a cross-modal bridge that extracts structural features from depth conditioning to augment thermal inputs for counting.
Where Pith is reading between the lines
- The same diffusion-bridge idea could be tested on other privacy-sensitive thermal tasks such as person detection or activity recognition.
- Checking whether the single-step preference holds across different diffusion architectures would clarify how general the observation is.
- The method opens a route to train once with auxiliary depth or RGB data and then run lightweight thermal-only models in the field.
Load-bearing premise
The depth-to-RGB diffusion model can extract features that genuinely improve thermal counting accuracy without the denoising process itself introducing errors that lower performance.
What would settle it
An experiment in which multi-step denoising or a version without any diffusion features produces equal or higher counting accuracy on RGBT-CC or DroneRGBT would undermine the claim that single-step LCM is required.
Figures
read the original abstract
While RGB-Thermal crowd counting has shown promise, the paradigm faces critical limitations: RGB data raises privacy concerns in public surveillance, and multi-modal misalignment degrades fusion performance. We propose the first thermal-only framework specifically designed for privacy-conscious crowd counting, eliminating RGB dependency at inference time and substantially reducing the privacy exposure associated with continuous RGB capture in public surveillance deployments. To mitigate thermal ambiguity, we leverage depth-to-RGB diffusion models as a cross-modal bridge, extracting discriminative features that enhance thermal representations. Critically, we demonstrate that single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal, while multi-step approaches progressively decouple features from the conditioning input and accumulate errors that degrade counting accuracy. Experiments on RGBT-CC and DroneRGBT datasets show our method achieves competitive performance against state-of-the-art RGB-T fusion methods, while requiring only thermal input during inference, eliminating the need for continuous RGB capture that constitutes the primary privacy concern in real-world surveillance deployment. The code will be made publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes the first thermal-only crowd counting framework for privacy protection in surveillance applications. It trains a thermal encoder by using a depth-to-RGB diffusion model (LCM) as a cross-modal bridge to inject structural features, with the key technical assertion that single-step LCM denoising produces the most faithful features to the depth conditioning signal while multi-step denoising decouples and injects errors. The method is claimed to achieve competitive performance against state-of-the-art RGB-T fusion approaches on the RGBT-CC and DroneRGBT datasets while requiring only thermal input at inference time.
Significance. If the performance and ablation claims are substantiated, the work would offer a meaningful advance for privacy-conscious crowd monitoring by eliminating continuous RGB capture. The use of diffusion models specifically as a deployment-time bridge for thermal ambiguity is a novel angle, and the commitment to public code release supports reproducibility.
major comments (2)
- [Abstract] Abstract: the claim of 'competitive performance against state-of-the-art RGB-T fusion methods' is presented without any quantitative numbers, error bars, tables, or ablation details, leaving the central empirical claim unverified and difficult to assess.
- [Method and Experiments] Method and Experiments: the assertion that single-step LCM denoising yields features 'most faithful to the structural content of the depth conditioning signal' while multi-step runs accumulate errors is load-bearing for the reported advantage over plain thermal baselines, yet no direct faithfulness metrics (e.g., SSIM, edge preservation) or controlled ablations isolating step count are described.
minor comments (1)
- [Abstract] The abstract could briefly note the specific quantitative metrics used for 'competitive performance' to give readers an immediate sense of the results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of the work's significance for privacy-preserving surveillance. We address each major comment below and have revised the manuscript to provide stronger empirical support and methodological justification.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'competitive performance against state-of-the-art RGB-T fusion methods' is presented without any quantitative numbers, error bars, tables, or ablation details, leaving the central empirical claim unverified and difficult to assess.
Authors: We agree that the abstract would benefit from explicit quantitative backing to make the central claim immediately verifiable. In the revised manuscript, we will add specific MAE and RMSE values from the RGBT-CC and DroneRGBT experiments, including direct numerical comparisons to the cited state-of-the-art RGB-T fusion baselines. These figures are already reported in the experimental section and will be concisely incorporated into the abstract without exceeding length limits. revision: yes
-
Referee: [Method and Experiments] Method and Experiments: the assertion that single-step LCM denoising yields features 'most faithful to the structural content of the depth conditioning signal' while multi-step runs accumulate errors is load-bearing for the reported advantage over plain thermal baselines, yet no direct faithfulness metrics (e.g., SSIM, edge preservation) or controlled ablations isolating step count are described.
Authors: We acknowledge that direct quantitative faithfulness metrics and isolated step-count ablations were not included, which would strengthen the justification for preferring single-step denoising. We will add these in the revised manuscript: (i) SSIM and edge-preservation scores measuring feature fidelity to the depth conditioning signal across step counts, and (ii) a controlled ablation table varying denoising steps while holding all other components fixed. These additions will provide direct evidence that single-step LCM denoising best preserves structural content. revision: yes
Circularity Check
No circularity; novel pipeline with external experimental validation
full rationale
The paper introduces a new thermal-only crowd counting framework that uses a depth-to-RGB diffusion model (LCM) as a cross-modal bridge to enhance thermal features, with the explicit goal of eliminating RGB at inference for privacy. The assertion that single-step LCM denoising yields features most faithful to the depth signal is presented as an empirical demonstration rather than a definitional necessity or fitted input renamed as a prediction. No equations, self-definitional reductions, load-bearing self-citations, or uniqueness theorems imported from the authors' prior work appear in the described derivation. Performance is validated through direct comparisons on the external RGBT-CC and DroneRGBT datasets against RGB-T fusion baselines, rendering the chain self-contained against independent benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Depth-to-RGB diffusion models can extract discriminative features that enhance thermal representations and mitigate thermal ambiguity
- domain assumption Single-step LCM denoising yields features most faithful to the structural content of the depth conditioning signal
Reference graph
Works this paper leans on
-
[1]
Counting crowds in bad weather,
Z.-K. Huang, W.-T. Chen, Y .-C. Chiang, S.-Y . Kuo, and M.-H. Yang, “Counting crowds in bad weather,” in2023 IEEE/CVF International Conference on Computer Vision, 2023, pp. 23 251–23 262
work page 2023
-
[2]
Scene-adaptive unsupervised crowd counting for video surveillance,
R. Ma, Y . Hou, C. Li, H. Jia, and X. Xie, “Scene-adaptive unsupervised crowd counting for video surveillance,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 7, pp. 6910–6925, 2025
work page 2025
-
[3]
Frame-recurrent video crowd counting,
Y . Hou, S. Zhang, R. Ma, H. Jia, and X. Xie, “Frame-recurrent video crowd counting,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 5186–5199, 2023
work page 2023
-
[4]
L. Liu, J. Chen, H. Wu, G. Li, C. Li, and L. Lin, “Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 4823–4833
work page 2021
-
[5]
Mc 3 net: multimodality cross-guided compensation coordination network for rgb-t crowd count- ing,
W. Zhou, X. Yang, J. Lei, W. Yan, and L. Yu, “Mc 3 net: multimodality cross-guided compensation coordination network for rgb-t crowd count- ing,”IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 5, pp. 4156–4165, 2023
work page 2023
-
[6]
W. Kong, J. Liu, Y . Hong, H. Li, and J. Shen, “Cross-modal collaborative feature representation via transformer-based multimodal mixers for rgb-t crowd counting,”Expert Systems with Applications, vol. 255, p. 124483, 2024
work page 2024
-
[7]
Misf- net: Modality-invariant and-specific fusion network for rgb-t crowd counting,
B. Mu, F. Shao, Z. Xie, H. Chen, Z. Zhu, X. Li, and Q. Jiang, “Misf- net: Modality-invariant and-specific fusion network for rgb-t crowd counting,”IEEE Transactions on Multimedia, 2025
work page 2025
-
[8]
Multi-modal crowd counting via modal emulation,
C. Wang, X. Hong, Z. Ma, Y . Wei, Y . Wang, and X. Fan, “Multi-modal crowd counting via modal emulation,” in35th British Machine Vision Conference 2024, BMVC 2024, Glasgow, UK, November 25-28, 2024. BMV A, 2024. [Online]. Available: https://papers.bmvc2024.org/0115.pdf
work page 2024
-
[9]
Y . Qian, X. Hong, Z. Guo, O. Arandjelovi ´c, and C. R. Donovan, “Semi-supervised crowd counting with contextual modeling: Facilitating holistic understanding of crowd scenes,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 9, pp. 8230–8241, 2024
work page 2024
-
[10]
Perspective-assisted prototype-based learning for semi-supervised crowd counting,
Y . Qian, L. Zhang, Z. Guo, X. Hong, O. Arandjelovi ´c, and C. R. Dono- van, “Perspective-assisted prototype-based learning for semi-supervised crowd counting,”Pattern Recognition, vol. 158, p. 111073, 2025
work page 2025
-
[11]
Learning crowd scale and distribution for weakly supervised crowd counting and localization,
Y . Fan, J. Wan, and A. J. Ma, “Learning crowd scale and distribution for weakly supervised crowd counting and localization,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 713– 727, 2025
work page 2025
-
[12]
Single-image crowd counting via multi-column convolutional neural network,
Y . Zhang, D. Zhou, S. Chen, S. Gao, and Y . Ma, “Single-image crowd counting via multi-column convolutional neural network,” in2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597
work page 2016
-
[13]
Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,
Y . Li, X. Zhang, and D. Chen, “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1091–1100
work page 2018
-
[14]
Bayesian loss for crowd count estimation with point supervision,
Z. Ma, X. Wei, X. Hong, and Y . Gong, “Bayesian loss for crowd count estimation with point supervision,” inProceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6142–6151
work page 2019
-
[15]
Distribution matching for crowd counting,
B. Wang, H. Liu, D. Samaras, and M. Hoai, “Distribution matching for crowd counting,” inAdvances in Neural Information Processing Systems, 2020
work page 2020
-
[16]
Direct measure matching for crowd counting,
H. Lin, X. Hong, Z. Ma, X. Wei, Y . Qiu, Y . Wang, and Y . Gong, “Direct measure matching for crowd counting,” inProceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. ijcai.org, 2021, pp. 837–844
work page 2021
-
[17]
Focus for free in density-based counting,
Z. Shi, P. Mettes, and C. G. M. Snoek, “Focus for free in density-based counting,”ArXiv, vol. abs/2306.05129, 2023
-
[18]
Pcc net: Perspective crowd counting via spatial convolutional network,
J. Gao, Q. Wang, and X. Li, “Pcc net: Perspective crowd counting via spatial convolutional network,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, pp. 3486–3498, 2019
work page 2019
-
[19]
Boosting crowd counting with transformers,
G. Sun, Y . Liu, T. Probst, D. P. Paudel, N. Popovic, and L. V . Gool, “Boosting crowd counting with transformers,”ArXiv, vol. abs/2105.10926, 2021
-
[20]
Seg- mentation assisted u-shaped multi-scale transformer for crowd counting
Y . Qian, L. Zhang, X. Hong, C. Donovan, and O. Arandjelovic, “Seg- mentation assisted u-shaped multi-scale transformer for crowd counting.” inBMVC, 2022, p. 397
work page 2022
-
[21]
Boosting crowd counting via multifaceted attention,
H. Lin, Z. Ma, R. Ji, Y . Wang, and X. Hong, “Boosting crowd counting via multifaceted attention,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 2022, pp. 19 596–19 605
work page 2022
-
[22]
Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting,
H. Tang, Y . Wang, and L.-P. Chau, “Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting,” in2022 IEEE international symposium on circuits and systems (ISCAS). IEEE, 2022, pp. 3299– 3303
work page 2022
-
[23]
Spatio-channel attention blocks for cross-modal crowd counting,
Y . Zhang, S. Choi, and S. Hong, “Spatio-channel attention blocks for cross-modal crowd counting,” inProceedings of the Asian conference on computer vision, 2022, pp. 90–107
work page 2022
-
[24]
Z. Xie, F. Shao, B. Mu, H. Chen, Q. Jiang, C. Lu, and Y .-S. Ho, “Bgdfnet: bidirectional gated and dynamic fusion network for rgb-t crowd counting in smart city system,”IEEE Transactions on Instru- mentation and Measurement, 2024
work page 2024
-
[25]
Rgb-t multi-modal crowd counting based on transformer,
Z. Liu, W. Wu, Y . Tan, and G. Zhang, “Rgb-t multi-modal crowd counting based on transformer,” inThe 33rd British Machine Vision Conference, 2022
work page 2022
-
[26]
Free lunch enhancements for multi-modal crowd counting,
H. Meng, X. Hong, Z. Lai, and M. Shang, “Free lunch enhancements for multi-modal crowd counting,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 14 013–14 023
work page 2025
-
[27]
Denoising diffusion probabilistic models,
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 6840–6851
work page 2020
-
[28]
Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models,
H. Sasaki, C. G. Willcocks, and T. P. Breckon, “Unit-ddpm: Unpaired image translation with denoising diffusion probabilistic models,”arXiv preprint arXiv:2104.05358, 2021
-
[29]
Semantic image synthesis via diffusion models,
W. Wang, J. Bao, W. Zhou, D. Chen, D. Chen, L. Yuan, and H. Li, “Semantic image synthesis via diffusion models,”arXiv preprint arXiv:2207.00050, 2022
-
[30]
Image-to-image translation with conditional adversarial networks,
P. Isola, J.-Y . Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125– 1134
work page 2017
-
[31]
Bbdm: Image-to-image translation with brownian bridge diffusion models,
B. Li, K. Xue, B. Liu, and Y .-K. Lai, “Bbdm: Image-to-image translation with brownian bridge diffusion models,” in2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 1952– 1961
work page 2023
-
[32]
Adding conditional control to text-to-image diffusion models,
L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 3836–3847
work page 2023
-
[33]
LCM-LoRA: A universal stable-diffusion acceleration module.arXiv preprint arXiv:2311.05556,
S. Luo, Y . Tan, S. Patil, D. Gu, P. von Platen, A. Passos, L. Huang, J. Li, and H. Zhao, “Lcm-lora: A universal stable-diffusion acceleration module, 2023,”URL https://arxiv. org/abs/2311.05556, 2023
-
[34]
Clip-ebc: Clip can count ac- curately through enhanced blockwise classification,
Y . Ma, V . Sanchez, and T. Guha, “Clip-ebc: Clip can count ac- curately through enhanced blockwise classification,”arXiv preprint arXiv:2403.09281, 2024
-
[35]
Label-efficient semantic segmentation with diffusion models,
D. Baranchuk, A. V oynov, I. Rubachev, V . Khrulkov, and A. Babenko, “Label-efficient semantic segmentation with diffusion models,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=SlxSY2UZQT
work page 2022
-
[36]
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion, March 2024
D. Kim, C.-H. Lai, W.-H. Liao, N. Murata, Y . Takida, T. Uesaka, Y . He, Y . Mitsufuji, and S. Ermon, “Consistency trajectory models: Learning probability flow ode trajectory of diffusion,”arXiv preprint arXiv:2310.02279, 2023
-
[37]
Multimodal crowd counting with mutual attention transformers,
Z. Wu, L. Liu, Y . Zhang, M. Mao, L. Lin, and G. Li, “Multimodal crowd counting with mutual attention transformers,” in2022 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2022, pp. 1–6
work page 2022
-
[38]
Defnet: Dual-branch enhanced feature fusion network for rgb-t crowd counting,
W. Zhou, Y . Pan, J. Lei, L. Ye, and L. Yu, “Defnet: Dual-branch enhanced feature fusion network for rgb-t crowd counting,”IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 12, pp. 24 540–24 549, 2022
work page 2022
-
[39]
Learning the cross-modal discriminative feature representation for rgb-t crowd counting,
H. Li, S. Zhang, and W. Kong, “Learning the cross-modal discriminative feature representation for rgb-t crowd counting,”Knowledge-Based Systems, vol. 257, p. 109944, 2022
work page 2022
-
[40]
Mc3net: Multimodality cross-guided compensation coordination network for rgb-t crowd count- ing,
W. Zhou, X. Yang, J. Lei, W. Yan, and L. Yu, “Mc3net: Multimodality cross-guided compensation coordination network for rgb-t crowd count- ing,”IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 5, pp. 4156–4165, 2024
work page 2024
-
[41]
Visual prompt multi-branch fusion network for rgb-thermal crowd counting,
B. Mu, F. Shao, Z. Xie, H. Chen, Q. Jiang, and Y .-S. Ho, “Visual prompt multi-branch fusion network for rgb-thermal crowd counting,” IEEE Internet of Things Journal, 2024. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11
work page 2024
-
[42]
Consistency-constrained rgb- t crowd counting via mutual information maximization,
Q. Guo, P. Yuan, X. Huang, and Y . Ye, “Consistency-constrained rgb- t crowd counting via mutual information maximization,”Complex & Intelligent Systems, vol. 10, no. 4, pp. 5049–5070, 2024
work page 2024
-
[43]
Cagnet: Coordinated attention guidance network for rgb-t crowd counting,
X. Yang, W. Zhou, W. Yan, and X. Qian, “Cagnet: Coordinated attention guidance network for rgb-t crowd counting,”Expert Systems with Applications, vol. 243, p. 122753, 2024
work page 2024
-
[44]
W. Zhou, X. Yang, X. Dong, M. Fang, W. Yan, and T. Luo, “Mjpnet- s*: Multistyle joint-perception network with knowledge distillation for drone rgb-thermal crowd density estimation in smart cities,”IEEE Internet of Things Journal, 2024
work page 2024
-
[45]
Multi-modal crowd counting via a broker modality,
H. Meng, X. Hong, C. Wang, M. Shang, and W. Zuo, “Multi-modal crowd counting via a broker modality,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 231–250
work page 2024
-
[46]
B. Mu, F. Shao, Z. Xie, L. Xu, and Q. Jiang, “Rgbt-booster: Detail- boosted fusion network for rgb-thermal crowd counting with local contrastive learning,”IEEE Internet of Things Journal, 2025
work page 2025
-
[47]
Memory-efficient cross-modal atten- tion for rgb-x segmentation and crowd counting,
Y . Zhang, S. Choi, and S. Hong, “Memory-efficient cross-modal atten- tion for rgb-x segmentation and crowd counting,”Pattern Recognition, p. 111376, 2025
work page 2025
-
[48]
Modal-adaptive spatial-aware- fusion and propagation network for multimodal vision crowd counting,
K. Liu, X. Zou, P. Zhu, and J. Sang, “Modal-adaptive spatial-aware- fusion and propagation network for multimodal vision crowd counting,” IEEE Transactions on Consumer Electronics, 2025
work page 2025
-
[49]
Cmfx: Cross-modal fusion network for rgb-x crowd counting,
X.-M. Duan, H.-M. Sun, Z.-M. Zhang, L.-X. Qin, and R.-S. Jia, “Cmfx: Cross-modal fusion network for rgb-x crowd counting,”Neural Networks, vol. 184, p. 107070, 2025
work page 2025
-
[50]
Rgb-t crowd counting from drone: A bench- mark and mmccn network,
T. Peng, Q. Li, and P. Zhu, “Rgb-t crowd counting from drone: A bench- mark and mmccn network,” inProceedings of the Asian Conference on Computer Vision (ACCV), November 2020
work page 2020
-
[51]
Extremely overlapping vehicle counting,
R. Guerrero-G ´omez-Olmedo, B. Torre-Jim ´enez, R. L ´opez-Sastre, S. Maldonado-Basc ´on, and D. Onoro-Rubio, “Extremely overlapping vehicle counting,” inPattern Recognition and Image Analysis: 7th Iberian Conference, IbPRIA 2015, Santiago de Compostela, Spain, June 17-19, 2015, Proceedings 7. Springer, 2015, pp. 423–431
work page 2015
-
[52]
L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”Advances in Neural Information Processing Sys- tems, vol. 37, pp. 21 875–21 911, 2024
work page 2024
-
[53]
Improved Techniques for Training Consistency Models
Y . Song and P. Dhariwal, “Improved techniques for training consistency models,”arXiv preprint arXiv:2310.14189, 2023
work page internal anchor Pith review arXiv 2023
-
[54]
An image is worth 16x16 words: Transformers for image recognition at scale,
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,”ICLR, 2021
work page 2021
-
[55]
Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,
W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578
work page 2021
-
[56]
Z. Liu, H. Mao, C.-Y . Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,”Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022
work page 2022
-
[57]
High- resolution image synthesis with latent diffusion models,
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695
work page 2022
-
[58]
Extracting training data from diffusion models,
N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V . Sehwag, F. Tramer, B. Balle, D. Ippolito, and E. Wallace, “Extracting training data from diffusion models,” in32nd USENIX security symposium (USENIX Se- curity 23), 2023, pp. 5253–5270
work page 2023
-
[59]
Gans trained by a two time-scale update rule converge to a local nash equilibrium,
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” inAdvances in neural information processing systems, 2017, pp. 6626–6637
work page 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.