Simultaneous multi-view instance detection with learned geometric soft-constraints

Ahmed Samy Nassar; Jan D. Wegner; Sebastien Lefevre

arXiv: 1907.10892 · v1 · pith:VFRJSKHMnew · submitted 2019-07-25 · 💻 cs.LG · cs.CV · stat.ML

Pith reviewed 2026-05-24 16:21 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML

keywords multi-view object detection · instance re-identification · geometric soft constraints · end-to-end learning · street-level panoramas · cross-view detection · urban object detection

The pith

Jointly learning detection and cross-view re-identification lets a single network capture both appearance and geometric soft constraints end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make multi-view instance detection more robust by converting the separate problems of object detection and instance re-identification into one joint training task. This lets the model absorb both visual appearance cues and learned geometric relationships between views without requiring explicit camera poses or separate 3D modules. The authors release a large new dataset of street-level urban panoramas and a dedicated annotation tool to support the task, then show that the combined model beats several baselines.

Core claim

By turning object detection and instance re-identification in different views into a joint learning task, both image appearance and geometric soft constraints can be incorporated into a single, multi-view detection process that is learnable end-to-end.

What carries the argument

A neural network that performs simultaneous multi-view detection and re-identification while learning geometric soft constraints directly from paired image data.
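
The material reviewed here does not show the paper's loss terms, so the following is only a minimal sketch of what a joint objective of this kind could look like, not the authors' formulation. It adds per-view detection losses to a contrastive re-identification term and a soft geometric penalty on a box projected from one view into the other. Every name, weight, and loss choice below (`contrastive_loss`, `smooth_l1`, `w_reid`, `w_geo`) is a hypothetical stand-in.

```python
import math


def contrastive_loss(emb_a, emb_b, same_instance, margin=1.0):
    """Pull embeddings of the same instance together, push others apart.

    Assumed form for illustration; the paper's re-identification loss
    is not given in the reviewed text.
    """
    distance = math.dist(emb_a, emb_b)
    if same_instance:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def smooth_l1(pred_box, target_box):
    """Smooth L1 distance summed over box coordinates."""
    total = 0.0
    for p, t in zip(pred_box, target_box):
        diff = abs(p - t)
        total += 0.5 * diff ** 2 if diff < 1.0 else diff - 0.5
    return total


def joint_loss(det_loss_a, det_loss_b, emb_a, emb_b, same_instance,
               projected_box, target_box, w_reid=1.0, w_geo=1.0):
    """Hypothetical combined objective for one candidate pair across two views.

    det_loss_a / det_loss_b: detection losses already computed per view.
    projected_box: a box from view A projected into view B.
    target_box: the matching ground-truth box in view B.
    The geometric term acts as a soft constraint: it is a weighted penalty,
    not a hard rule, and only applies to pairs that show the same instance.
    """
    reid = contrastive_loss(emb_a, emb_b, same_instance)
    geo = smooth_l1(projected_box, target_box) if same_instance else 0.0
    return det_loss_a + det_loss_b + w_reid * reid + w_geo * geo
```

Because all terms are summed into one scalar, gradients from matching and geometry would reach the same shared features as the detection losses, which is the sense in which such a pipeline is trainable end-to-end.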

If this is right

Detection remains accurate despite large viewpoint, lighting, and scale changes across views.
The approach produces a single trainable pipeline instead of separate detection, matching, and geometry stages.
A large public dataset of urban panoramas becomes available for multi-view instance tasks.
A custom annotation tool tailored to labeling the same instances across multiple views is released.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may reduce the need for calibrated multi-camera rigs in applications such as traffic monitoring.
Similar joint learning could be tested on video sequences where frames act as changing views of the same objects.
If the soft constraints prove sufficient, future work could drop explicit 3D reconstruction steps in other cross-view problems.

Load-bearing premise

Geometric relationships between views can be captured as learnable soft constraints inside the same network that processes appearance, without explicit camera poses or separate geometric modules.

What would settle it

An experiment in which the joint end-to-end model does not outperform a pipeline that first runs independent detectors and then applies explicit geometric matching or separate re-identification on the street-level panorama dataset.

Figures

Figures reproduced from arXiv: 1907.10892 by Ahmed Samy Nassar, Jan D. Wegner, Sebastien Lefevre.

**Figure 1.** A pair of images is fed to our multi-view object detectors, matching projected predictions is learned, and the geo-coordinate of …

**Figure 2.** C∗: camera with geo-position. T: the tree, with its actual geographic coordinates and its location within the panorama. a°: heading angle inside the panorama. v: distance between cameras.

**Figure 3.** Tree instance re-identification problem (color indicates …

**Figure 4.** Our network design: Images along with their GMD are …

**Figure 5.** Illustration of predictions (b∗) projected (b′∗) in other views and their ground truth (g∗). Projection Net: This network component fine-tunes projected predictions b′∗ by learning to regress the discrepancy between them and the other block's ground truth as illustrated in …

**Figure 7.** Our annotation tool provides 4 multi-view panoramas …

**Figure 8.** Detection and Re-identification using our method.

**Figure 9.** Small subset of tree predictions (red) overlaid to an aerial …
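
The quantities named in the Figure 2 caption (camera geo-positions, a heading angle inside the panorama, and the distance between cameras) can be made concrete with standard spherical formulas. The sketch below is an illustration under stated assumptions, not code from the paper: the function names are hypothetical, the Earth is treated as a sphere, and `panorama_column` assumes an equirectangular panorama whose centre column points along the camera heading, which the caption does not confirm.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two geo-positions (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial compass bearing from point 1 to point 2, in [0, 360)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def panorama_column(object_bearing, camera_heading, width_px):
    """Pixel column of an object in a panorama.

    Assumption: equirectangular image, centre column = camera heading.
    """
    offset = (object_bearing - camera_heading + 180.0) % 360.0
    return offset / 360.0 * width_px
```

With two camera positions and one tree position, `haversine_m` gives the camera-to-camera distance and `bearing_deg` the direction from each camera to the tree, so the same tree lands at a different column in each panorama. That shift is the kind of cross-view relationship a learned geometric constraint would have to absorb.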

Original abstract

We propose to jointly learn multi-view geometry and warping between views of the same object instances for robust cross-view object detection. What makes multi-view object instance detection difficult are strong changes in viewpoint, lighting conditions, high similarity of neighbouring objects, and strong variability in scale. By turning object detection and instance re-identification in different views into a joint learning task, we are able to incorporate both image appearance and geometric soft constraints into a single, multi-view detection process that is learnable end-to-end. We validate our method on a new, large data set of street-level panoramas of urban objects and show superior performance compared to various baselines. Our contribution is threefold: a large-scale, publicly available data set for multi-view instance detection and re-identification; an annotation tool custom-tailored for multi-view instance detection; and a novel, holistic multi-view instance detection and re-identification method that jointly models geometry and appearance across views.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's real value is the new public dataset of street-level panoramas plus an annotation tool; the joint learning claim for geometry and appearance is plausible but not yet shown to be a clear advance over existing multi-view fusion.

The headline takeaway is a new, large public dataset of urban street-level panoramas aimed at multi-view instance detection and re-identification, together with a custom annotation tool. That part is concrete and useful for anyone working on city-scale mapping or repeated object detection across cameras. The method itself turns detection and cross-view matching into one end-to-end network that adds learned geometric soft constraints on top of appearance features, and the abstract reports better numbers than several baselines on their data. Those two contributions are the parts that actually move the needle.

The geometric component is the softer spot. The description stays at the level of 'jointly learn multi-view geometry and warping' without showing the loss terms, how the constraints are parameterized, or whether they require any camera calibration at all. If the geometry module turns out to be a standard epipolar or homography layer wrapped in a network, the novelty shrinks. The experiments claim superiority, but without ablations that isolate the geometric term it is hard to tell how much of the gain comes from the new data versus the modeling choice.

The paper is aimed at computer-vision groups doing multi-camera urban applications. It is coherent on its own terms and the dataset alone makes it worth a referee's time, even if the method section needs tightening. I would send it to review rather than desk-reject.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes jointly learning multi-view geometry and warping for cross-view object instance detection by turning detection and re-identification into a single end-to-end task that incorporates image appearance and geometric soft constraints. It contributes a new large-scale dataset of street-level urban panoramas, a custom annotation tool, and experimental results claiming superior performance over baselines.

Significance. If the end-to-end integration of learned geometric soft-constraints without explicit camera poses or separate modules holds, the approach could simplify multi-view detection pipelines and improve robustness to viewpoint and scale changes in applications such as urban object monitoring. The public dataset release is a clear positive contribution to the field.

minor comments (2)

[Abstract] The claim of 'superior performance' would benefit from a brief quantitative statement (e.g., mAP improvement) rather than a qualitative assertion alone.
[Abstract] The threefold contribution list in the abstract repeats the dataset and tool descriptions; consolidating this paragraph would improve readability.
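
For context on the metric the first comment mentions: mAP is the mean, over classes, of average precision (AP). As a hedged illustration of what such a quantitative statement would summarize, the sketch below computes AP for a single class from scored detections that have already been marked as true or false positives. The function name, the all-point interpolation, and the toy inputs are assumptions for illustration; the reviewed text does not say which evaluation protocol the paper uses.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve (all-point interpolation).

    scores: confidence of each detection.
    is_true_positive: whether each detection matched a ground-truth object.
    num_ground_truth: number of ground-truth objects of this class.
    """
    if num_ground_truth == 0 or not scores:
        return 0.0

    # Rank detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    precisions, recalls = [], []
    tp = fp = 0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)

    # Make precision non-increasing as recall grows (precision envelope).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

A statement such as "AP rises from X to Y on the test split" compresses this whole ranking into one number per method, which is why it communicates more than "superior performance" alone.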

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. We are pleased that the end-to-end integration of learned geometric soft-constraints and the public dataset release are viewed as potentially valuable contributions.

Circularity Check

0 steps flagged

No significant circularity

The provided abstract and text describe a joint end-to-end learning task for multi-view detection that incorporates appearance and geometric soft constraints without any equations, loss terms, or derivation steps shown. No self-citations, fitted parameters renamed as predictions, or self-definitional constructs are present in the given material. The central claim of a learnable holistic method stands as an independent proposal validated on a new dataset, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5697 in / 1067 out tokens · 24021 ms · 2026-05-24T16:21:57.075422+00:00 · methodology
