
arxiv: 2605.13854 · v1 · pith:E77H2NACnew · submitted 2026-04-01 · 💻 cs.CV · cs.GR · cs.MM · eess.IV

Contrastive Multi-Modal Hypergraph Reasoning for 3D Crowd Mesh Recovery

Pith reviewed 2026-05-15 06:16 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.MM · eess.IV
keywords 3D crowd reconstruction · hypergraph reasoning · contrastive learning · multi-modal fusion · occlusion handling · mesh recovery · pose estimation

The pith

A shared-topology hypergraph fuses RGB, geometric and pose cues to recover 3D crowd meshes despite heavy occlusion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets multi-person 3D mesh recovery in scenes where bodies overlap and depth is uncertain. It starts by merging RGB image features, geometric priors and incomplete pose estimates into node representations, then anchors them with a pelvis depth indicator for scale-free ordering. A single shared-topology hypergraph connects these nodes to capture higher-order group relations that pairwise graphs miss. Contrastive learning on the hypergraph pushes features apart within each modality while forcing orthogonality across modalities, so missing body parts can be inferred from collective context. The result is reported state-of-the-art accuracy on the Panoptic and GigaCrowd benchmarks.
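The "shared topology" idea can be illustrated with a small sketch. It assumes an HGNN-style normalized propagation in the spirit of Feng et al. [14]; the paper's actual layer, hyperedge construction and feature sizes are not given on this page, so the incidence matrix `H`, the function name `hypergraph_operator` and the 8-dimensional features below are hypothetical. The point shown is only that one incidence matrix is reused to propagate all three modalities.

```python
import numpy as np

def hypergraph_operator(H, edge_w=None):
    """Normalized operator Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2} (HGNN-style).

    H: (N, E) binary incidence matrix, H[i, e] = 1 if person i is in group e.
    edge_w: optional (E,) hyperedge weights; defaults to all ones.
    """
    H = np.asarray(H, dtype=float)
    n_edges = H.shape[1]
    w = np.ones(n_edges) if edge_w is None else np.asarray(edge_w, dtype=float)
    node_deg = H @ w              # weighted degree of each node
    edge_deg = H.sum(axis=0)      # number of nodes in each hyperedge
    dv = 1.0 / np.sqrt(np.maximum(node_deg, 1e-12))
    de = 1.0 / np.maximum(edge_deg, 1e-12)
    left = dv[:, None] * H * (w * de)[None, :]   # Dv^{-1/2} H W De^{-1}
    return left @ (H.T * dv[None, :])            # ... H^T Dv^{-1/2}

# Hypothetical crowd: 4 people, two groups {0, 1, 2} and {2, 3}.
H = np.array([[1, 0],
              [1, 0],
              [1, 1],
              [0, 1]])
A = hypergraph_operator(H)

# Placeholder features for the three modalities named in the paper.
rng = np.random.default_rng(0)
features = {name: rng.normal(size=(4, 8)) for name in ("rgb", "geometry", "pose")}

# Shared topology: the same operator A propagates every modality.
propagated = {name: A @ X for name, X in features.items()}
```

In this toy example persons 0 and 3 share no hyperedge, so `A[0, 3]` is zero and information reaches person 3 from person 0 only through person 2 after a second propagation step.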

Core claim

Constructing a shared-topology hypergraph from multi-modal node features and training it with a contrastive scheme that simultaneously raises intra-modal discriminability and cross-modal orthogonality lets global crowd context propagate to occluded individuals, producing complete 3D meshes without relying on single-modality inputs or isolated subject processing.

What carries the argument

A shared-topology hypergraph that models higher-order crowd dynamics together with a hypergraph-based contrastive loss that enforces cross-modal orthogonality during feature fusion.

If this is right

  • Group-level context becomes usable for resolving depth ambiguity in any crowded view.
  • Reconstruction no longer requires complete pose input for every person.
  • The same hypergraph structure scales to the large numbers of subjects in GigaCrowd.
  • A metric-scale-agnostic depth ordering is obtained without camera calibration.
  • Feature representations become reusable across semantic, geometric and kinematic modalities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding temporal edges to the hypergraph would allow the same machinery to track meshes across video frames.
  • The orthogonality constraint could be tested on other multi-modal problems such as audio-visual scene understanding.
  • The pelvis anchor might simplify calibration-free applications like mobile-phone crowd capture.
  • Performance gains may shrink if input poses are generated by detectors that already incorporate group context.

Load-bearing premise

The shared-topology hypergraph and hypergraph-based contrastive learning scheme can effectively model higher-order crowd dynamics and enforce cross-modal orthogonality to infer missing information under severe occlusion.

What would settle it

An ablation study on the Panoptic benchmark in which removing the contrastive orthogonality term leaves accuracy unchanged, so that the term contributes no gain over prior single-modality or pairwise-graph baselines, would falsify the claim that the mechanism enables reliable inference of occluded geometry.

Figures

Figures reproduced from arXiv: 2605.13854 by Buzhen Huang, Chongyang Xu, Kun Li, Minghao Sun, Yitao Xie.

Figure 1: Our method reconstructs spatially consistent human meshes in …
Figure 2: Overview of CoMHR. We begin with (a) Multi-Modal Initialization, which lifts 2D keypoints via pseudo-depth maps and explicitly anchors RGB features with Pelvis Depth to initialize multi-modal features. These cues are fused to construct (b) a Contrastive Hypergraph, where a shared topology is regularized by contrastive learning. This structure enables (c) High-Order Reasoning to propagate collective context…
Figure 3: Qualitative comparison on GigaCrowd [21]. GroupRec [3] often shows floating feet, drifting bodies, and incorrect depth ordering, while our method maintains accurate depth, stable spatial layout, and consistent reconstruction in dense crowds.

… apply a negative average cosine similarity loss with ReLU activation:

$$\mathcal{L}_{\text{cross}} = \frac{1}{|V|} \sum_{i \in V} \max\left(0,\; -\frac{1}{3} \sum_{m_1 < m_2} \cos\left(h_i^{(m_1)}, h_i^{(m_2)}\right)\right) \tag{5}$$

This formulation penali…
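The cross-modal loss quoted in the excerpt above (Eq. 5) can be written out in a few lines. This is a minimal NumPy sketch of the formula as transcribed on this page, not the authors' released code; the function names, the `eps` guard and the assumption that the three modalities are RGB, geometry and pose features of shape `(N, d)` are illustrative choices.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (N, d) arrays."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def cross_modal_loss(h_rgb, h_geo, h_pose):
    """L_cross = mean_i max(0, -(1/3) * sum_{m1<m2} cos(h_i^(m1), h_i^(m2)))."""
    pairs = [(h_rgb, h_geo), (h_rgb, h_pose), (h_geo, h_pose)]
    avg_cos = sum(cosine(a, b) for a, b in pairs) / 3.0   # shape (N,)
    return float(np.mean(np.maximum(0.0, -avg_cos)))      # ReLU, then node mean

# Two hand-made cases with a single node and 2-D features.
x = np.array([[1.0, 0.0]])
loss_aligned = cross_modal_loss(x, x, x)      # all pairs have cosine +1
loss_opposed = cross_modal_loss(x, -x, -x)    # pair cosines are -1, -1, +1
```

As transcribed, the hinge is active only when the average pairwise cosine of a node is negative: the aligned case gives a loss of zero, while the opposed case has an average cosine of -1/3 and a loss of about 1/3.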
Figure 4: Qualitative comparison of multi-person reconstruction among Multi-HMR [9], GroupRec [3], and our method. Red boxes show baseline failures such as pose misalignment, missing people, and depth errors, while our method produces accurate and consistent reconstructions.

Table I: Comparison with multi-person mesh recovery methods on the Panoptic dataset. MPJPE (mm); '–' indicates unavailable results. Best in bold, se…
Figure 5: Ablation results on the Panoptic dataset. Top to bottom: (1) RGB-only baseline with depth errors, (2) RGB+Depth+Pose+Tz with improved 3D structure, (3) full model with contrastive learning achieving accurate depth and coherent meshes.
Figure 6: Additional qualitative results showing high-order interactions and …
Figure 7: Multi-view results on GigaCrowd. Two scenes are displayed, each showing front, side, and top views to illustrate the model's ability to handle dense crowds and maintain depth consistency.

… stable convergence. Eight data-loading worker threads were employed to enable parallelized data I/O, reducing training latency. For training, the COCO and MPII CLIFF datasets were used to provide diverse multi-person imag…
Original abstract

Multi-person 3D reconstruction is pivotal for real-world interaction analysis, yet remains challenging due to severe occlusions and depth ambiguity. Current approaches typically rely on single-modality inputs, which inherently lack geometric guidance. Furthermore, these methods often reconstruct subjects in isolation, neglecting the collective group context essential for resolving ambiguities in crowded scenes. To address these limitations, we propose Contrastive Multi-modal Hypergraph Reasoning to synergize semantic, geometric, and pose cues for crowd reconstruction. We first initialize robust node representations by combining RGB features, geometric priors, and occlusion-aware incomplete poses. Additionally, we introduce a pelvis depth indicator as a global spatial anchor, aligning visual features with a metric-scale-agnostic depth ordering. Subsequently, we construct a shared-topology hypergraph that moves beyond pairwise constraints to model higher-order crowd dynamics. To improve feature fusion, we design a hypergraph-based contrastive learning scheme that jointly enhances intra-modal discriminability and enforces cross-modal orthogonality. This mechanism enables the network to propagate global context effectively, allowing it to infer missing information even under severe occlusion. Extensive experiments on the Panoptic and GigaCrowd benchmarks confirm that our method achieves new state-of-the-art performance. Code and pre-trained models are available at https://github.com/SunMH-try/CoMHR.
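The abstract does not say how the pelvis depth indicator is computed. One way to picture a "metric-scale-agnostic depth ordering" is the sketch below, which is purely an assumption for illustration: it samples a pseudo-depth map at each person's pelvis pixel and min-max normalizes the samples across the crowd, so the result is unchanged by any positive rescaling or offset of the depth map. The function name `pelvis_depth_indicator`, the `(x, y)` pixel convention and the toy depth map are hypothetical.

```python
import numpy as np

def pelvis_depth_indicator(depth_map, pelvis_xy):
    """Scale-agnostic depth indicator per person (illustrative assumption).

    depth_map: (H, W) pseudo-depth values with arbitrary scale and offset.
    pelvis_xy: (N, 2) pelvis pixel coordinates as (x, y).
    Returns values in [0, 1]; only the relative ordering is meaningful.
    """
    pelvis_xy = np.asarray(pelvis_xy)
    height, width = depth_map.shape
    xs = np.clip(np.round(pelvis_xy[:, 0]).astype(int), 0, width - 1)
    ys = np.clip(np.round(pelvis_xy[:, 1]).astype(int), 0, height - 1)
    d = depth_map[ys, xs].astype(float)
    span = d.max() - d.min()
    if span < 1e-12:                 # everyone at the same pseudo-depth
        return np.zeros_like(d)
    return (d - d.min()) / span

# Toy 3x4 depth map and three hypothetical pelvis locations.
depth_map = np.arange(12, dtype=float).reshape(3, 4)
pelvis_xy = np.array([[0, 0], [3, 2], [1, 1]])
indicator = pelvis_depth_indicator(depth_map, pelvis_xy)

# The indicator is the same after an affine change of depth units.
rescaled = pelvis_depth_indicator(3.0 * depth_map + 2.0, pelvis_xy)
```

Whether a smaller value means nearer or farther depends on the convention of the depth estimator, which this sketch leaves open.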

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Contrastive Multi-Modal Hypergraph Reasoning (CoMHR) for 3D crowd mesh recovery. It initializes multi-modal node features from RGB, geometric priors, and occlusion-aware poses; introduces a pelvis depth indicator as a metric-scale-agnostic spatial anchor; builds a shared-topology hypergraph to capture higher-order crowd relations beyond pairwise edges; and applies a hypergraph-based contrastive scheme that promotes intra-modal discriminability while enforcing cross-modal orthogonality. The method is evaluated on the Panoptic and GigaCrowd benchmarks, where it claims new state-of-the-art performance in reconstructing meshes under severe occlusion.

Significance. If the experimental claims hold, the work offers a principled way to integrate semantic, geometric, and pose cues via higher-order hypergraph reasoning and contrastive orthogonality, potentially improving robustness in crowded scenes where single-modality or isolated-person methods fail. The release of code and pre-trained models supports reproducibility and follow-up research.

major comments (3)
  1. Experimental Results section: the SOTA claim on Panoptic and GigaCrowd is asserted without reported error bars, statistical significance tests, or per-scene occlusion-level breakdowns; this leaves the central performance claim without visible quantitative support for the contribution of the hypergraph and contrastive components.
  2. Method section on hypergraph construction: the shared-topology assumption is introduced without an ablation comparing it to per-instance or learned topologies; if the topology is fixed across all crowds, the higher-order modeling benefit may be overstated for scenes with varying group structures.
  3. Contrastive learning scheme: the cross-modal orthogonality loss is described qualitatively but lacks a quantitative metric (e.g., average cosine similarity before/after training) showing reduced feature correlation, which is load-bearing for the claim that the scheme enables inference of missing information under occlusion.
minor comments (2)
  1. The abstract and introduction cite prior single-modality methods but omit explicit comparison to recent multi-person hypergraph or graph-based reconstruction works; adding 2-3 key references would clarify novelty.
  2. Notation for the pelvis depth indicator is introduced without an equation defining its computation from the input pose; a short formula would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below, providing clarifications and committing to revisions that strengthen the quantitative support and experimental validation of our claims.

Point-by-point responses
  1. Referee: Experimental Results section: the SOTA claim on Panoptic and GigaCrowd is asserted without reported error bars, statistical significance tests, or per-scene occlusion-level breakdowns; this leaves the central performance claim without visible quantitative support for the contribution of the hypergraph and contrastive components.

    Authors: We agree that the absence of error bars, statistical tests, and occlusion-stratified breakdowns weakens the presentation of our SOTA claims. In the revised manuscript we will report mean and standard deviation across multiple random seeds for all key metrics on both Panoptic and GigaCrowd. We will add per-scene tables grouped by occlusion severity and include paired t-tests against the strongest baselines to establish statistical significance. These additions will directly quantify the contribution of the hypergraph and contrastive components. revision: yes

  2. Referee: Method section on hypergraph construction: the shared-topology assumption is introduced without an ablation comparing it to per-instance or learned topologies; if the topology is fixed across all crowds, the higher-order modeling benefit may be overstated for scenes with varying group structures.

    Authors: The shared topology is chosen because the evaluated crowd benchmarks exhibit recurring higher-order interaction patterns that can be captured once and reused, reducing computational cost while still modeling group-level relations beyond pairwise edges. Nevertheless, we acknowledge that an explicit ablation is needed to substantiate this design choice. In the revision we will add a new ablation table comparing the fixed shared topology against per-instance topologies and a learned topology variant, reporting both accuracy and runtime differences to justify the shared-topology assumption. revision: yes

  3. Referee: Contrastive learning scheme: the cross-modal orthogonality loss is described qualitatively but lacks a quantitative metric (e.g., average cosine similarity before/after training) showing reduced feature correlation, which is load-bearing for the claim that the scheme enables inference of missing information under occlusion.

    Authors: We will augment the manuscript with a quantitative analysis of the orthogonality effect. Specifically, we will report the average cosine similarity between cross-modal feature pairs before and after the contrastive training stage, together with the corresponding intra-modal discriminability scores. This metric will be presented in both the Method and Experiments sections to empirically demonstrate how the loss reduces unwanted correlation and supports inference of occluded information. revision: yes
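The statistical reporting promised in the first response (mean and standard deviation across seeds, plus a paired t-test against a baseline) amounts to the computation sketched here. The per-seed numbers are made-up placeholders chosen for easy arithmetic, not results from the paper, and the function name `paired_t_statistic` is hypothetical.

```python
import math
import statistics

def paired_t_statistic(ours, baseline):
    """Paired t statistic and degrees of freedom for matched per-seed errors."""
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)            # sample standard deviation
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# PLACEHOLDER per-seed errors in mm (illustrative only, not from the paper).
placeholder_baseline = [100.0, 102.0, 101.0, 103.0, 99.0]
placeholder_ours = [98.0, 99.0, 100.0, 100.0, 98.0]

mean_ours = statistics.mean(placeholder_ours)
std_ours = statistics.stdev(placeholder_ours)
t_stat, dof = paired_t_statistic(placeholder_ours, placeholder_baseline)
```

A negative statistic here means the first argument has the lower error on average. Converting the statistic to a p-value needs the Student t distribution with `dof` degrees of freedom, for example via `scipy.stats.ttest_rel`.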

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's derivation chain introduces new components (pelvis depth indicator as global anchor, shared-topology hypergraph for higher-order relations, and hypergraph-based contrastive scheme enforcing intra-modal discriminability plus cross-modal orthogonality) that are defined directly from the problem of multi-person 3D reconstruction under occlusion rather than reducing to fitted parameters or prior self-referential results. Node initialization combines RGB features, geometric priors, and incomplete poses via standard fusion; the contrastive loss is a standard orthogonality regularizer applied to the new hypergraph structure. Performance claims rest on external benchmarks (Panoptic, GigaCrowd) with conventional metrics, and no equations or self-citations in the abstract or described pipeline collapse the claimed outputs back to the inputs by construction. The method is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the effectiveness of the new hypergraph construction and contrastive orthogonality mechanism for inferring occluded geometry; the abstract reports overall benchmark results but supplies no independent evidence that isolates these inventions.

axioms (1)
  • [domain assumption] Neural networks can learn effective fusion of multi-modal features via hypergraph message passing
    Invoked as the basis for the shared-topology hypergraph and contrastive scheme.
invented entities (2)
  • pelvis depth indicator (no independent evidence)
    purpose: global spatial anchor for metric-scale-agnostic depth ordering
    New component introduced to align visual features with depth ordering.
  • shared-topology hypergraph (no independent evidence)
    purpose: model higher-order crowd dynamics beyond pairwise constraints
    New structure for propagating global context in crowded scenes.

pith-pipeline@v0.9.0 · 5548 in / 1304 out tokens · 49510 ms · 2026-05-15T06:16:22.765292+00:00 · methodology



Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Dycrowd: Towards dynamic crowd reconstruction from a large-scene video,

    H. Wen, H. Kang, J. Ma, J. Huang, Y. Yang, H. Lin, Y.-K. Lai, and K. Li, “Dycrowd: Towards dynamic crowd reconstruction from a large-scene video,” IEEE TPAMI, 2025

  2. [2]

    Crowd3d: Towards hundreds of people reconstruction from a single image,

    H.-K. Wen, J. Huang, H. Cui, H.-X. Lin, Y. Lai, L. Fang, and K.-J. Li, “Crowd3d: Towards hundreds of people reconstruction from a single image,” CVPR, pp. 8937–8946, 2023

  3. [3]

    Reconstructing groups of people with hypergraph relational reasoning,

    B. Huang, J. Ju, Z. Li, and Y. Wang, “Reconstructing groups of people with hypergraph relational reasoning,” ICCV, pp. 14827–14837, 2023

  4. [4]

    Closely interactive human reconstruction with proxemics and physics-guided adaption,

    B. Huang, C. Li, C. Xu, L. Pan, Y. Wang, and G. H. Lee, “Closely interactive human reconstruction with proxemics and physics-guided adaption,” CVPR, pp. 1011–1021, 2024

  5. [5]

    Reconstructing close human interaction with appearance and proxemics reasoning,

    B. Huang, C. Li, C. Xu, D. Lu, J. Chen, Y. Wang, and G. H. Lee, “Reconstructing close human interaction with appearance and proxemics reasoning,” CVPR, pp. 17475–17485, 2025

  6. [6]

    Object-occluded human shape and pose estimation with probabilistic latent consistency,

    B. Huang, T. Zhang, and Y. Wang, “Object-occluded human shape and pose estimation with probabilistic latent consistency,” IEEE TPAMI, pp. 5010–5026, 2022

  7. [7]

    Putting people in their place: Monocular regression of 3d people in depth,

    Y. Sun, W. Liu, Q. Bao, Y. Fu, T. Mei, and M. J. Black, “Putting people in their place: Monocular regression of 3d people in depth,” CVPR, pp. 13233–13242, 2021

  8. [8]

    Depth anything: Unleashing the power of large-scale unlabeled data,

    L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” CVPR, pp. 10371–10381, 2024

  9. [9]

    Multi-hmr: Multi-person whole-body human mesh recovery in a single shot,

    F. Baradel, M. Armando, S. Galaaoui, R. Brégier, P. Weinzaepfel, G. Rogez, and T. Lucas, “Multi-hmr: Multi-person whole-body human mesh recovery in a single shot,” ArXiv, vol. abs/2402.14654, 2024

  10. [10]

    Score-guided diffusion for 3d human recovery,

    A. Stathopoulos, L. Han, and D. N. Metaxas, “Score-guided diffusion for 3d human recovery,” CVPR, pp. 906–915, 2024

  11. [11]

    Monocular, one-stage, regression of multiple 3d people,

    Y. Sun, Q. Bao, W. Liu, Y. Fu, M. J. Black, and T. Mei, “Monocular, one-stage, regression of multiple 3d people,” ICCV, pp. 11159–11168, 2020

  12. [12]

    Futr3d: A unified sensor fusion framework for 3d detection,

    X. Chen, T. Zhang, Y. Wang, Y. Wang, and H. Zhao, “Futr3d: A unified sensor fusion framework for 3d detection,” CVPRW, pp. 172–181, 2022

  13. [13]

    Immfusion: Robust mmwave-rgb fusion for 3d human body reconstruction in all weather conditions,

    A. Chen, X. Wang, K. Shi, S. Zhu, Y. Chen, B. Fang, J. Chen, Y. Huo, and Q. Ye, “Immfusion: Robust mmwave-rgb fusion for 3d human body reconstruction in all weather conditions,” ICRA, pp. 2752–2758, 2022

  14. [14]

    Hypergraph neural networks,

    Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao, “Hypergraph neural networks,” in AAAI, 2018

  15. [15]

    Openpose: Realtime multi-person 2d pose estimation using part affinity fields,

    Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” TPAMI, vol. 43, no. 1, pp. 172–186, 2019

  16. [16]

    Pixels to Graphs by Associative Embedding

    A. Newell and J. Deng, “Pixels to graphs by associative embedding,” arXiv, vol. abs/1706.07365, 2017. [Online]. Available: https://api.semanticscholar.org/CorpusID:4409978

  17. [17]

    Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image,

    G. Moon, J. Y. Chang, and K. M. Lee, “Camera distance-aware top-down approach for 3d multi-person pose estimation from a single rgb image,” ICCV, pp. 10132–10141, 2019

  18. [18]

    Groupnet: Multiscale hypergraph neural networks for trajectory prediction with relational reasoning,

    C. Xu, M. Li, Z. Ni, Y. Zhang, and S. Chen, “Groupnet: Multiscale hypergraph neural networks for trajectory prediction with relational reasoning,” CVPR, pp. 6488–6497, 2022

  19. [19]

    Multi-agent dynamic relational reasoning for social robot navigation,

    J. Li, C. Hua, H. Ma, J. Park, V. M. Dax, and M. J. Kochenderfer, “Multi-agent dynamic relational reasoning for social robot navigation,” arXiv, 2024

  20. [20]

    Hypergraph based human mesh hierarchical representation and reconstruction from a single image,

    C. Hao, D. Kong, J. Li, C. Liu, and B. Yin, “Hypergraph based human mesh hierarchical representation and reconstruction from a single image,” Comput. Graph., pp. 339–347, 2023

  21. [21]

    Deep social grouping network for large scenes with multiple subjects,

    W. Li and X. Sun, “Deep social grouping network for large scenes with multiple subjects,” Scientia Sinica Informationis, 2021

  22. [22]

    Cliff: Carrying location information in full frames into human pose and shape estimation,

    Z. Li, J. Liu, Z. Zhang, S. Xu, and Y. Yan, “Cliff: Carrying location information in full frames into human pose and shape estimation,” in ECCV, 2022

  23. [23]

    Microsoft coco: Common objects in context,

    T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014

  24. [24]

    2d human pose estimation: New benchmark and state of the art analysis,

    M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” CVPR, pp. 3686–3693, 2014

  25. [25]

    Panoptic studio: A massively multiview system for social motion capture,

    H. Joo, H. Liu, L. Tan, L. Gui, B. C. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh, “Panoptic studio: A massively multiview system for social motion capture,” ICCV, pp. 3334–3342, 2015

  26. [26]

    Crowdpose: Efficient crowded scenes pose estimation and a new benchmark,

    J. Li, C. Wang, H. Zhu, Y. Mao, H. Fang, and C. Lu, “Crowdpose: Efficient crowded scenes pose estimation and a new benchmark,” CVPR, pp. 10855–10864, 2018

  27. [27]

    Monocular 3d pose and shape estimation of multiple people in natural scenes: The importance of multiple scene constraints,

    A. Zanfir, E. Marinoiu, and C. Sminchisescu, “Monocular 3d pose and shape estimation of multiple people in natural scenes: The importance of multiple scene constraints,” CVPR, 2018

  28. [28]

    Deep network for the integrated 3d sensing of multiple people in natural images,

    A. Zanfir, E. Marinoiu, M. Zanfir, A. Popa, and C. Sminchisescu, “Deep network for the integrated 3d sensing of multiple people in natural images,” in NeurIPS, 2018

  29. [29]

    Coherent reconstruction of multiple humans from a single image,

    W. B. Jiang, N. Kolotouros, G. Pavlakos, X. Zhou, and K. Daniilidis, “Coherent reconstruction of multiple humans from a single image,” CVPR, pp. 5578–5587, 2020

  30. [30]

    Body meshes as points,

    J. Zhang, D. Yu, J. H. Liew, X. Nie, and J. Feng, “Body meshes as points,” CVPR, pp. 546–556, 2021

  31. [31]

    Pose2uv: Single-shot multiperson mesh recovery with deep uv prior,

    B. Huang, T. Zhang, and Y. Wang, “Pose2uv: Single-shot multiperson mesh recovery with deep uv prior,” IEEE TIP, pp. 4679–4692, 2022

  32. [32]

    Learning to estimate robust 3d human mesh from in-the-wild crowded scenes,

    H. Choi, G. Moon, J. H. Park, and K. M. Lee, “Learning to estimate robust 3d human mesh from in-the-wild crowded scenes,” CVPR, pp. 1465–1474, 2021

  33. [33]

    Scene-aware 3d multi-human motion capture from a single camera,

    D. C. Luvizon, M. Habermann, V. Golyanik, A. Kortylewski, and C. Theobalt, “Scene-aware 3d multi-human motion capture from a single camera,” CGF, 2023

  34. [34]

    Smap: Single-shot multi-person absolute 3d pose estimation,

    J. Zhen, Q. Fang, J. Sun, W. Liu, W. Jiang, H. Bao, and X. Zhou, “Smap: Single-shot multi-person absolute 3d pose estimation,” arXiv, 2020

  35. [35]

    Sat-hmr: Real-time multi-person 3d mesh estimation via scale-adaptive tokens,

    C. Su, X. Ma, J. Su, and Y. Wang, “Sat-hmr: Real-time multi-person 3d mesh estimation via scale-adaptive tokens,” in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), June 2025, pp. 16796–16806

  36. [36]

    Crowdrec: 3d crowd reconstruction from single color images,

    B. Huang, J. Ju, and Y. Wang, “Crowdrec: 3d crowd reconstruction from single color images,” arXiv, 2023.

    Supplementary Material. In this document, we provide the following supplementary contents to further clarify the proposed method and facilitate reproducibility: ...

  37. [37]

    Specifically, we provide quantitative evaluations on crowd-level spatial metrics and qualitative visualizations to demonstrate step-by-step improvements of our CoMHR framework

    Extended Ablation Studies In this section, we present ablation studies to further validate the individual contributions of our proposed modules. Specifically, we provide quantitative evaluations on crowd-level spatial metrics and qualitative visualizations to demonstrate step-by-step improvements of our CoMHR framework. Quantitative Analysis on Crowd-Le...

  38. [38]

    Fig. 6 shows three representative examples from the CrowdPose dataset, which is known for its crowded scenes, frequent inter-person occlusions, and challenging pose variations

    Additional Qualitative Results To further demonstrate our method’s robustness and versatility across diverse datasets and crowd densities, Fig. 6 shows three representative examples from the CrowdPose dataset, which is known for its crowded scenes, frequent inter-person occlusions, and challenging pose variations. Despite these difficulties, our model r...

  39. [39]

    Implementation Details To facilitate future research and ensure the full reproducibility of our work, we provide comprehensive implementation details of the proposed CoMHR framework in this section. Specifically, we elaborate on the hardware environment and training hyperparameters, the precise dimensions of our network architecture, and the exact con...

  40. [40]

    Environmental Degradation. We tested the model’s resistance to severe environmental noise on the Panoptic dataset (Baseline MPJPE = 104.18 mm)

    Robustness to Noise and Upstream Failures To comprehensively evaluate the resilience of CoMHR, we conducted stress tests under both environmental degradation and extreme upstream foundation model failures. Environmental Degradation. We tested the model’s resistance to severe environmental noise on the Panoptic dataset (Baseline MPJPE = 104.18 mm). Under ...

  41. [41]

    On a single NVIDIA RTX 4090 GPU, our full relation reasoning stage takes approximately 28 ms per frame for standard groups (6-16 individuals)

    Efficiency and Scalability CoMHR comprises 53.03M parameters in total, where the core hypergraph accounts for only 17.2% (∼9.1M), highlighting its parameter efficiency. On a single NVIDIA RTX 4090 GPU, our full relation reasoning stage takes approximately 28 ms per frame for standard groups (6-16 individuals). To address scalability in massive crowds (e...

  42. [42]

    For instance, foreground truncation and sensor contamination increase the error by merely 2.41 mm and 1.18 mm, respectively

    Limitations Extensive stress tests demonstrate CoMHR’s strong resistance to severe environmental degradation and upstream module failures. For instance, foreground truncation and sensor contamination increase the error by merely 2.41 mm and 1.18 mm, respectively. Similarly, extreme upstream corruptions, such as pelvis detection failure (altering err...

  43. [43]

    Multimodal Contrastive Learning Pseudocode We present the pseudocode for the multimodal contrastive learning component of our CoMHR model to facilitate deeper understanding of our core training strategy. This pseudocode highlights the extraction of RGB, depth, and pose features, their relational encoding via PastEncoder, and the computation of intra- and ...