Pith · machine review for the scientific record

arXiv: 2604.09689 · v2 · submitted 2026-04-06 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

Face Density as a Proxy for Data Complexity: Quantifying the Hardness of Instance Count


Pith reviewed 2026-05-10 18:53 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords face density · data complexity · instance count · object detection · model performance · domain shift · curriculum learning · performance degradation

The pith

Instance density measured by exact face count drives data complexity and degrades model performance monotonically from 1 to 18 faces per image.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that the number of faces present in an image functions as a primary and independent source of difficulty for machine learning models. By creating specially balanced data slices where only the face count changes while every other factor stays equal, the experiments show performance declining steadily as the count rises. A reader would care because this turns a vague sense that crowded pictures are harder into a measurable property that can be isolated and studied. The work further shows that models trained only on sparse images systematically undercount when shown denser ones, with error rates rising sharply. This frames density as a built-in property of the data that shapes what models can and cannot learn.

Core claim

The central claim is that model performance on face classification, regression, and detection tasks degrades monotonically as the number of faces per image increases from 1 to 18, even when training and test distributions are perfectly balanced across density levels on the WIDER FACE and Open Images datasets. Models trained exclusively on low-density images exhibit a systematic under-counting bias when evaluated on higher-density images, with error rates increasing by up to 4.6 times. The authors interpret this as density acting as a domain shift, establishing instance density as an intrinsic, quantifiable dimension of data hardness that motivates density-aware curriculum learning and density-stratified evaluation.

What carries the argument

Instance density defined strictly as the exact count of faces per image, isolated by restricting images to one of 18 discrete counts and enforcing equal numbers of samples at each count.
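The balanced-count slicing described above can be sketched as follows. This is an editorial illustration, not the authors' released code; the `annotations` layout of `(image_id, face_count)` pairs and the 100-samples-per-count default are placeholder assumptions.

```python
import random
from collections import defaultdict

def balanced_density_slices(annotations, min_count=1, max_count=18,
                            samples_per_count=100, seed=0):
    """Group images by exact face count and draw an equal number per count.

    `annotations` is assumed to be an iterable of (image_id, face_count)
    pairs; images outside [min_count, max_count] are discarded, so every
    surviving slice holds exactly one discrete count with equal sample size.
    """
    by_count = defaultdict(list)
    for image_id, face_count in annotations:
        if min_count <= face_count <= max_count:
            by_count[face_count].append(image_id)

    rng = random.Random(seed)
    slices = {}
    for count in range(min_count, max_count + 1):
        pool = by_count[count]
        if len(pool) < samples_per_count:
            raise ValueError(f"only {len(pool)} images with {count} faces")
        slices[count] = rng.sample(pool, samples_per_count)
    return slices
```

With such slices, only the face count varies between groups by construction; everything else varies freely, which is exactly the premise the referee report below questions.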

If this is right

  • Performance falls steadily as face count rises from 1 to 18 under balanced conditions.
  • Models trained on low-density data under-count objects when tested on high-density data, with errors rising up to 4.6 times.
  • Density functions as a domain shift even when models see the full range during training.
  • Curriculum learning that presents densities in increasing order and density-stratified evaluation are direct responses to the observed pattern.
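The density-ordered curriculum in the last bullet could be prototyped as a staged sampler. A minimal sketch under editorial assumptions: the three-stage split is an arbitrary choice, and `slices` reuses the count-to-images mapping from balanced sampling, not any structure the paper specifies.

```python
def density_curriculum(slices, stages=3):
    """Yield training pools of progressively higher face density.

    `slices` maps face count -> list of image ids. Stage k adds the next
    block of counts in ascending order, so early training sees only sparse
    images and later stages accumulate denser ones.
    """
    counts = sorted(slices)
    per_stage = max(1, len(counts) // stages)
    pool = []
    for stage in range(stages):
        new_counts = counts[stage * per_stage:(stage + 1) * per_stage]
        if stage == stages - 1:  # last stage absorbs any remainder
            new_counts = counts[stage * per_stage:]
        for c in new_counts:
            pool.extend(slices[c])
        yield stage, list(pool)
```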

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same controlled-count approach could be applied to other counting or detection tasks such as people in crowds or cells in microscope images to test whether density effects generalize.
  • Datasets could be explicitly stratified or augmented by density level to prevent hidden performance gaps from appearing only in real-world crowded scenes.
  • Models might be trained with an auxiliary density-prediction head that allows them to adapt their detection strategy based on the predicted count before final output.

Load-bearing premise

That fixing the face count and balancing sample sizes across counts removes all other sources of performance variation such as face size differences, occlusion patterns, or image quality.

What would settle it

Run the same balanced-count protocol while also matching face-size distributions and occlusion statistics across the 1-to-18 count groups and check whether the monotonic performance drop disappears.
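One way to run that matched protocol is to subsample each count group to a common face-size histogram before evaluating. A minimal sketch, assuming per-image records and a user-supplied `size_bin` function (both hypothetical; occlusion statistics could be matched the same way with a second binning function):

```python
import random
from collections import defaultdict

def match_size_distribution(groups, size_bin, bins, seed=0):
    """Subsample each count group so its face-size histogram matches a target.

    `groups` maps face count -> list of image records; `size_bin(record)`
    maps a record to a discrete bin in `bins`. The per-bin quota is the
    minimum occupancy across groups, so every group ends up with the same
    size histogram (at the cost of discarding images).
    """
    binned = {c: defaultdict(list) for c in groups}
    for c, records in groups.items():
        for r in records:
            binned[c][size_bin(r)].append(r)

    quota = {b: min(len(binned[c][b]) for c in groups) for b in bins}
    rng = random.Random(seed)
    matched = {}
    for c in groups:
        kept = []
        for b in bins:
            kept.extend(rng.sample(binned[c][b], quota[b]))
        matched[c] = kept
    return matched
```

If the monotonic drop survives this matching, face count is doing the work; if it vanishes, the size and occlusion covariates were.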

Figures

Figures reproduced from arXiv: 2604.09689 by Abolfazl Mohammadi-Seif, Ricardo Baeza-Yates.

Figure 1. Visualizing Instance Complexity (WIDER FACE): Increasing den…

Figure 2. Exp 1: Misclassification rate (%) vs. classification pair in the n-vs-(n+1) chain on WIDER FACE and Open Images. The average misclassification rate climbs from 35.3% at the lowest density point (1–2 faces) to a staggering 50.3% at the highest density (17–18 faces). This signifies that even when the numerical delta between classes remains constant at exactly one face, the model's discriminative power is er…

Figure 5. Exp 3: MAE vs. true face count when training only on 1 to 9 faces and evaluating on the full 1 to 18 range. The identical protocol is applied to both datasets. Results: in both datasets, in-domain MAE (1 to 9 faces) remains low (WIDER FACE: 1.66, Open Images: 1.62), confirming successful learning on the training distribution. However, out-of-domain MAE (10 to 18 faces) grows to 7.73 (WIDER FACE) and 7.49 (Open Images), ≈ 4.6× …

Figure 4. Exp 2: Average accuracy and MCC across all gaps; provides evidence that face density itself, independent of the decision boundary (gap size), is a good proxy for task difficulty. C. Exp 3: Transfer from Low to High Density. Motivation: real-world datasets are often dominated by low-density images. We test whether a model trained exclusively on easy (low-density) scenes can generalize to denser images, or w…

Figure 7. Exp 4: MAE in Low (1 to 6), Medium (7 to 12), and High (13 to 18) bins after full balanced training.

Figure 8. Exp 4: MSE vs. exact face count (CSRNet, full end-to-end training). E. Exp 5: Detection-Based Counting with Modern Detectors. Motivation: we verify that the observed phenomenon is not an artifact of regression or classification training but affects even the best publicly available face detectors that were trained on massive heterogeneous data. Setup: we evaluated three pre-trained, off-the-shelf detectors, …

Figure 10. Exp 6 (control): Prediction bias (predicted − true count) vs. true face count when training on the full balanced 1 to 18 distribution. Bias becomes progressively negative beyond 12 faces, reaching −4.31 (WIDER FACE) and −4.22 (Open Images) at 18 faces, nearly identical across the two independent datasets. The curves follow a similar trajectory: an initial oscillatory phase with small positive bias gives way to …

Figure 9. Exp 5: MAE vs. true face count for three state-of-the-art off-the-shelf detectors on WIDER FACE and Open Images: 2) Open Images + MTCNN, 3) WIDER FACE + YOLOv9, 4) Open Images + YOLOv9, 5) WIDER FACE + RetinaFace, 6) Open Images + RetinaFace. Even the strongest model (RetinaFace on Open Images) degrades beyond 10 faces. The consistency of this hierarchy across two independent large-scale datasets, despite diffe…

Figure 11. Exp 7: Stability Analysis. Comparison of prediction bias between our Balanced Model (100 samples/count) and a model trained on the Full Biased WIDER FACE set (all available images, thousands per count). This comparison highlights a critical relationship between data volume and balance. In Exp 6 (…
Original abstract

Machine learning progress has historically prioritized model-centric innovations, yet achievable performance is frequently capped by the intrinsic complexity of the data itself. In this work, we isolate and quantify the impact of instance density (measured by face count) as a primary driver of data complexity. Rather than simply observing that ``crowded scenes are harder,'' we rigorously control for class imbalance to measure the precise degradation caused by density alone. Controlled experiments on the WIDER FACE and Open Images datasets, restricted to exactly 1 to 18 faces per image with perfectly balanced sampling, reveal that model performance degrades monotonically with increasing face count. This trend holds across classification, regression, and detection paradigms, even when models are fully exposed to the entire density range. Furthermore, we demonstrate that models trained on low-density regimes fail to generalize to higher densities, exhibiting a systematic under-counting bias, with error rates increasing by up to 4.6x, which suggests density acts as a domain shift. These findings establish instance density as an intrinsic, quantifiable dimension of data hardness and motivate specific interventions in curriculum learning and density-stratified evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that instance density, measured by face count per image, serves as a quantifiable proxy for data complexity in ML. Using controlled experiments on WIDER FACE and Open Images restricted to images with exactly 1–18 faces and perfectly balanced sampling across density bins, it reports monotonic performance degradation in classification, regression, and detection tasks as face count rises. Models trained on low-density data exhibit up to 4.6× higher error on high-density images, interpreted as a domain shift, motivating density-aware curriculum learning and stratified evaluation.

Significance. If the isolation of density holds, the work supplies a concrete, measurable dimension of data hardness that could reshape dataset curation, training curricula, and evaluation protocols in computer vision. The balanced-sampling design and consistency across task types provide an empirical basis for data-centric approaches, with potential to improve handling of crowded scenes.

major comments (2)
  1. [Abstract] Abstract and experimental description: the reported monotonic degradation and 4.6× error increase lack error bars, statistical tests, or explicit details on density operationalization (e.g., exact binning and sampling procedure), leaving the central empirical claim only moderately supported.
  2. [Controlled experiments on WIDER FACE and Open Images] Controlled experiments section: balancing image counts per face-density bin does not address systematic correlations between face count and other factors (smaller average bounding-box sizes, higher occlusion rates, differing scene compositions) known to exist in WIDER FACE and Open Images. No stratification, matching, or regression controls on these covariates are described, so the performance drop cannot be attributed solely to instance count.
minor comments (1)
  1. [Abstract] The claim that models are 'fully exposed to the entire density range' would benefit from an explicit reference to the training protocol and data split used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better substantiate our central claims about instance density as a proxy for data complexity. We address each major comment below and will incorporate revisions to strengthen the empirical support and address potential confounding factors.

Point-by-point responses
  1. Referee: [Abstract] Abstract and experimental description: the reported monotonic degradation and 4.6× error increase lack error bars, statistical tests, or explicit details on density operationalization (e.g., exact binning and sampling procedure), leaving the central empirical claim only moderately supported.

    Authors: We agree that the abstract and experimental description would benefit from greater statistical rigor and transparency. In the revised manuscript, we will add error bars (or confidence intervals) to all performance plots and reported metrics, include formal statistical tests (e.g., linear trend tests or ANOVA with post-hoc comparisons) to confirm the significance of the monotonic degradation and the 4.6× error increase, and provide explicit details on density operationalization, including the precise density levels (exact counts of 1 through 18 faces per image) and the exact procedure for achieving perfectly balanced image counts across them. These changes will make the central empirical claims more robustly supported. revision: yes
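A distribution-free version of the promised trend test can be sketched as a permutation test on the count–error covariance. This is an editorial illustration of one defensible choice, not the authors' planned analysis; `counts` and `errors` are per-slice summaries assumed for the example.

```python
import random

def trend_permutation_test(counts, errors, n_perm=10000, seed=0):
    """One-sided permutation test for an increasing error-vs-density trend.

    The statistic is the sample covariance between face count and error;
    the null distribution shuffles errors across counts. Returns the
    observed statistic and a permutation p-value.
    """
    def cov(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

    observed = cov(counts, errors)
    rng = random.Random(seed)
    shuffled = list(errors)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cov(counts, shuffled) >= observed:
            exceed += 1
    # add-one smoothing keeps the p-value strictly positive
    return observed, (exceed + 1) / (n_perm + 1)
```

A small p-value here says the monotone rise in error is unlikely under a no-trend null, which is the minimal statistical backing the referee asks for.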

  2. Referee: [Controlled experiments on WIDER FACE and Open Images] Controlled experiments section: balancing image counts per face-density bin does not address systematic correlations between face count and other factors (smaller average bounding-box sizes, higher occlusion rates, differing scene compositions) known to exist in WIDER FACE and Open Images. No stratification, matching, or regression controls on these covariates are described, so the performance drop cannot be attributed solely to instance count.

    Authors: The referee correctly notes that face count is correlated with other scene factors in these datasets, and our current design primarily balances image counts per density bin rather than explicitly controlling for those covariates. While the balanced sampling isolates density under the constraint of equal image representation, we acknowledge this does not fully rule out confounding. In the revision, we will add: (i) quantitative correlation analysis between face density and covariates such as average bounding-box size and occlusion rate; (ii) additional matched or stratified subsampling experiments on key covariates where data permits; and (iii) regression-based controls (e.g., including occlusion and box-size terms) to estimate the unique contribution of density. We will also expand the discussion of limitations regarding residual confounding. This will provide a more complete attribution analysis while preserving the core finding that density serves as a useful, measurable proxy. revision: yes
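The regression-based control promised in (iii) could look like the following sketch, which estimates the partial effect of face count on error while holding nuisance covariates fixed. The covariate names (`box_size`, `occlusion`) are editorial assumptions standing in for whatever per-image statistics the revision measures.

```python
import numpy as np

def density_coefficient(face_count, box_size, occlusion, error):
    """Estimate the unique contribution of face count to error via OLS.

    Regresses per-image error on face count plus nuisance covariates
    (mean bounding-box size, occlusion rate); the returned coefficient
    on face count is its partial effect with covariates held fixed.
    """
    X = np.column_stack([
        np.ones(len(face_count)),               # intercept
        np.asarray(face_count, dtype=float),
        np.asarray(box_size, dtype=float),
        np.asarray(occlusion, dtype=float),
    ])
    y = np.asarray(error, dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]  # coefficient on face count
```

If this coefficient stays positive and significant after conditioning on size and occlusion, the density effect is not just those covariates in disguise.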

Circularity Check

0 steps flagged

No circularity: claims rest on controlled empirical experiments

full rationale

The paper presents no derivation chain, equations, or fitted parameters that reduce to their own inputs. Its central results come from direct experiments on WIDER FACE and Open Images with explicit balancing of image counts across 1-18 faces per image, reporting observed monotonic performance degradation. No self-citations are invoked as load-bearing uniqueness theorems, no ansatzes are smuggled, and no known results are renamed as novel derivations. The work is self-contained against external benchmarks via public datasets and standard evaluation protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard supervised learning assumptions and public datasets without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption Standard assumptions of supervised learning hold for the chosen architectures and loss functions.
    The experiments presuppose that typical CNN or transformer training dynamics apply without additional regularization effects from density.

pith-pipeline@v0.9.0 · 5505 in / 1085 out tokens · 38967 ms · 2026-05-10T18:53:31.912350+00:00 · methodology


Reference graph

Works this paper leans on

21 extracted references · 3 canonical work pages · 1 internal anchor

  1. Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 589–597.
  2. Y. Li, X. Zhang, and D. Chen, "CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1091–1100.
  3. B. Wang, H. Liu, D. Samaras, and M. H. Nguyen, "Distribution matching for crowd counting," Advances in Neural Information Processing Systems, vol. 33, pp. 1595–1607, 2020.
  4. W. Liu, M. Salzmann, and P. Fua, "Context-aware crowd counting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5099–5108.
  5. J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, "DecideNet: Counting varying density crowds through attention guided detection and density estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5197–5206.
  6. Q. Wang, J. Gao, W. Lin, and X. Li, "NWPU-Crowd: A large-scale benchmark for crowd counting and localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 2141–2149, 2020.
  7. W. Liu, N. Durasov, and P. Fua, "Leveraging self-supervision for cross-domain crowd counting," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5341–5352.
  8. Y. Zhang, B. Kang, B. Hooi, S. Yan, and J. Feng, "Deep long-tailed learning: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10795–10816, 2023.
  9. B. Kang, S. Xie, M. Rohrbach, Z. Yan, A. Gordo, J. Feng, and Y. Kalantidis, "Decoupling representation and classifier for long-tailed recognition," in International Conference on Learning Representations, 2019.
  10. J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou, "RetinaFace: Single-shot multi-level face localisation in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 5203–5212.
  11. S. Yang, P. Luo, C.-C. Loy, and X. Tang, "WIDER FACE: A face detection benchmark," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5525–5533.
  12. A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov et al., "The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale," International Journal of Computer Vision, vol. 128, no. 7, pp. 1956–1981, 2020.
  13. M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
  14. D. Liang, X. Chen, W. Xu, Y. Zhou, and X. Bai, "TransCrowd: Weakly-supervised crowd counting with transformers," Science China Information Sciences, vol. 65, no. 6, p. 160104, 2022.
  15. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
  16. C.-Y. Wang, I.-H. Yeh, and H.-Y. Mark Liao, "YOLOv9: Learning what you want to learn using programmable gradient information," in European Conference on Computer Vision. Springer, 2024, pp. 1–21.
  17. K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
  18. Y. Bengio, J. Louradour, R. Collobert, and J. Weston, "Curriculum learning," in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 41–48.
  19. D. Hendrycks, N. Carlini, J. Schulman, and J. Steinhardt, "Unsolved problems in ML safety," arXiv preprint arXiv:2109.13916, 2021.
  20. G. Xu, J. Yin, R. Zhang, Y. Dang, F. Zhou, and B. Yu, "L2HCount: Generalizing crowd counting from low to high crowd density via density simulation," arXiv preprint arXiv:2503.12935, 2025.
  21. D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, Z. Jiang, S. Zhong, and X. Hu, "Data-centric artificial intelligence: A survey," ACM Computing Surveys, vol. 57, no. 5, pp. 1–42, 2025.