Multiple Human Association between Top and Horizontal Views by Matching Subjects' Spatial Distributions
Pith reviewed 2026-05-24 15:54 UTC · model grok-4.3
The pith
Matching subjects' spatial distributions associates people between top-view and horizontal-view images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a new approach to address this problem by exploring and matching the subjects' spatial distributions between the two views. More specifically, on the top-view image, we model and match subjects' relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location of horizontal-view camera and its view angle in the top-view image.
What carries the argument
Subjects' relative positions to the horizontal-view camera modeled and matched across views with a defined matching cost to locate the camera and its angle.
If this is right
- Enables association of subjects across views when appearance differences are large.
- Supports collaborative analysis applications such as human tracking and person identification.
- Determines both the position and view angle of the horizontal camera within the top-view image.
- Performance is shown on a newly collected dataset of paired top-view and horizontal-view images.
Where Pith is reading between the lines
- The method could be extended to video sequences by tracking spatial distributions over time.
- Additional constraints from known scene geometry might reduce ambiguity in dense crowds.
- Similar relative-position matching might apply to other view pairs such as ground-level and elevated fixed cameras.
Load-bearing premise
Subjects' relative positions to the horizontal-view camera produce sufficiently unique and consistent spatial distributions across the two views to allow reliable matching via the defined cost.
What would settle it
Test image pairs where the matching cost selects an incorrect camera location or view angle despite the presence of multiple subjects.
Figures
read the original abstract
Video surveillance can be significantly enhanced by using both top-view data, e.g., those from drone-mounted cameras in the air, and horizontal-view data, e.g., those from wearable cameras on the ground. Collaborative analysis of different-view data can facilitate various kinds of applications, such as human tracking, person identification, and human activity recognition. However, for such collaborative analysis, the first step is to associate people, referred to as subjects in this paper, across these two views. This is a very challenging problem due to large human-appearance difference between top and horizontal views. In this paper, we present a new approach to address this problem by exploring and matching the subjects' spatial distributions between the two views. More specifically, on the top-view image, we model and match subjects' relative positions to the horizontal-view camera in both views and define a matching cost to decide the actual location of horizontal-view camera and its view angle in the top-view image. We collect a new dataset consisting of top-view and horizontal-view image pairs for performance evaluation and the experimental results show the effectiveness of the proposed method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses the problem of associating multiple humans (subjects) across top-view images (e.g., drone-mounted cameras) and horizontal-view images (e.g., wearable cameras) for applications like tracking and activity recognition. It proposes modeling subjects' relative positions to the horizontal-view camera in both views, defining a matching cost based on these spatial distributions to determine the horizontal camera's location and view angle within the top-view image. A new dataset of top- and horizontal-view image pairs is collected, and experiments are reported to demonstrate the method's effectiveness.
Significance. If the central claim holds, the work offers a potentially useful alternative to appearance-based cross-view matching, which is hindered by large viewpoint differences. The spatial-distribution approach and the new dataset could support downstream multi-view surveillance tasks. However, the absence of quantitative results, error analysis, or cost-function characterization in the provided abstract limits assessment of whether the contribution is incremental or substantial.
major comments (2)
- [Abstract / Method] The central construction (abstract and method description) requires that the matching cost, computed from subjects' relative positions to the hypothesized horizontal camera, has a clear global minimum only at the true top-view camera location and angle. No analysis is provided of the cost function's uniqueness, its behavior under partial occlusions, repeated spatial patterns, or similar layouts; without this, it is unclear whether the method can reliably disambiguate poses as claimed.
- [Experiments / Evaluation] The experimental evaluation (abstract) claims effectiveness on the new dataset but reports no quantitative metrics, baselines, error rates, or ablation studies. This prevents verification of the data-to-claim link and assessment of robustness.
minor comments (1)
- [Abstract] The abstract states the problem and high-level approach but omits any numerical results or dataset statistics; adding a brief quantitative summary would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and constructive suggestions. We address the major comments point by point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract / Method] The central construction (abstract and method description) requires that the matching cost, computed from subjects' relative positions to the hypothesized horizontal camera, has a clear global minimum only at the true top-view camera location and angle. No analysis is provided of the cost function's uniqueness, its behavior under partial occlusions, repeated spatial patterns, or similar layouts; without this, it is unclear whether the method can reliably disambiguate poses as claimed.
Authors: We agree that a dedicated analysis of the cost function would improve clarity. The cost is defined as the sum of squared differences between the normalized spatial distributions of subjects relative to the hypothesized camera pose in the top view and the observed distribution in the horizontal view. By construction, the distributions coincide exactly only under the correct pose hypothesis, yielding a global minimum there when the subject set is identical. However, the manuscript does not include a formal characterization of uniqueness or robustness to occlusions and repeated patterns. We will add a new subsection to the method that (i) derives the conditions under which the minimum is unique, (ii) discusses sensitivity to missing subjects, and (iii) illustrates behavior on synthetic layouts with repeated patterns. This will be supported by additional figures showing cost surfaces. revision: yes
-
Referee: [Experiments / Evaluation] The experimental evaluation (abstract) claims effectiveness on the new dataset but reports no quantitative metrics, baselines, error rates, or ablation studies. This prevents verification of the data-to-claim link and assessment of robustness.
Authors: The abstract was intentionally kept concise and therefore omits specific numbers. The full manuscript already contains quantitative results on the collected dataset, including matching accuracy under varying numbers of subjects, comparison against a baseline that uses only appearance features, and an ablation on the contribution of the spatial-distribution term. To address the concern directly, we will revise the abstract to report the key performance figures (e.g., top-1 matching rate and mean angular error) and will add a short sentence mentioning the baselines and dataset size. revision: yes
Circularity Check
No significant circularity; method is a direct definition of matching cost from observed positions.
full rationale
The paper introduces a new approach by explicitly modeling subjects' relative positions to the horizontal-view camera in both views and defining a matching cost to locate the camera in the top-view image. This construction operates directly on the input spatial distributions without reducing any prediction or result to a fitted parameter by construction, without self-citation load-bearing steps, and without importing uniqueness from prior author work. The derivation chain remains self-contained as a proposed algorithmic procedure evaluated on a new dataset, with no equations or steps shown to be equivalent to their inputs via redefinition or renaming.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Subjects' relative positions to the horizontal-view camera produce matchable spatial distributions across views.
Reference graph
Works this paper leans on
-
[1]
S. Ardeshir and A. Borji. Ego2top: Matching viewers in egocentric and top-view videos. In ECCV, 2016. 1, 2
work page 2016
-
[2]
S. Ardeshir and A. Borji. Egocentric meets top-view. IEEE TPAMI, 2018. 1, 2
work page 2018
-
[3]
S. Ardeshir and A. Borji. Integrating egocentric videos in top-view surveillance videos: Joint identification and tem- poral alignment. 2018. 1, 2, 6
work page 2018
-
[4]
C. Fan, J. Lee, M. Xu, K. K. Singh, Y . J. Lee, D. J. Crandall, and M. S. Ryoo. Identifying first-person camera wearers in third-person videos. In CVPR, 2017. 2
work page 2017
-
[5]
F. Ferland, F. Pomerleau, C. T. L. Dinh, and F. Michaud. Egocentric and exocentric teleoperation interface using real- time, 3d video projection. 2
- [6]
-
[7]
L. Giuseppe, M. Iacopo, A. D. Bagdanov, and D. B. Alberto. Person re-identification by iterative re-weighted sparse rank- ing, 2015. 2
work page 2015
-
[8]
D. Gray and T. Hai. Viewpoint invariant pedestrian recogni- tion with an ensemble of localized features. In ECCV, 2008. 2
work page 2008
-
[9]
M. Kstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012. 2
work page 2012
-
[10]
S. Liao, Y . Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015. 2
work page 2015
-
[11]
K. B. Low and U. U. Sheikh. Learning hierarchical represen- tation using siamese convolution neural network for human re-identification. In ICDIM, 2016. 2
work page 2016
-
[12]
B. Ma, S. Yu, and F. Jurie. Local descriptors encoded by fisher vectors for person re-identification. In ECCV, 2012. 2
work page 2012
-
[13]
S. Paisitkriangkrai, C. Shen, and A. V . D. Hengel. Learning to rank in person re-identification with metric ensembles. In CVPR, 2015. 2
work page 2015
-
[14]
H. S. Park, E. Jain, and Y . Sheikh. Predicting primary gaze behavior using social saliency fields. In ICCV, 2013. 2
work page 2013
- [15]
-
[16]
Z. Rui, W. Ouyang, and X. Wang. Learning mid-level filters for person re-identification. In CVPR, 2014. 2
work page 2014
-
[17]
M. Sniedovich. Dynamic programming. foundations and principles. Monographs and Textbooks in Pure and Applied Mathematics, 2011. 4
work page 2011
- [18]
-
[19]
Y . Sun, Z. Liang, Y . Yi, T. Qi, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018. 6
work page 2018
-
[20]
X. Tong, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for per- son re-identification. In CVPR, 2016. 2
work page 2016
-
[21]
R. R. Varior, S. Bing, J. Lu, X. Dong, and W. Gang. A siamese long short-term memory architecture for human re- identification. In ECCV, 2016. 2
work page 2016
-
[22]
M. Xu, C. Fan, Y . Wang, M. S. Ryoo, and D. J. Crandall. Joint person segmentation and identification in synchronized first- and third-person videos. In ECCV, 2018. 2
work page 2018
-
[23]
Y . Yang, J. Yang, J. Yan, S. Liao, Y . Dong, and S. Z. Li. Salient color names for person re-identification. In ECCV,
- [24]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.