arxiv: 1907.09381 · v1 · pith:SF7ZVH2Xnew · submitted 2019-07-22 · 💻 cs.CV

Visualizing the Invisible: Occluded Vehicle Segmentation and Recovery

Pith reviewed 2026-05-24 18:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords occluded vehicle segmentation · appearance recovery · iterative multi-task learning · adversarial discriminators · 3D model pool · vehicle tracking · synthetic to real generalization

The pith

An iterative framework alternates between completing occluded vehicle segmentation masks and recovering hidden appearances to progressively refine both.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a multi-task method that repeatedly switches between filling gaps in a vehicle's outline mask and restoring the look of the blocked sections. Two coupled discriminators are trained against silhouettes drawn from a 3D model collection to sharpen the mask step, while a two-path network with shared weights handles the appearance step. Each round of mask work feeds the appearance work and vice versa, so the outputs improve together. The authors release a dataset of both synthetic and real occluded vehicle images to measure the gains. They also show the recovered appearances help track vehicles across real video frames where parts stay hidden.

Core claim

We propose a novel iterative multi-task framework to complete the segmentation mask of an occluded vehicle and recover the appearance of its invisible parts. In particular, to improve the quality of the segmentation completion, we present two coupled discriminators and introduce an auxiliary 3D model pool for sampling authentic silhouettes as adversarial samples. In addition, we propose a two-path structure with a shared network to enhance the appearance recovery capability. By iteratively performing the segmentation completion and the appearance recovery, the results will be progressively refined.

What carries the argument

Iterative loop coupling segmentation completion (via two discriminators and 3D silhouette adversarial samples) with appearance recovery (via two-path shared network).
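
The alternation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `iterative_refine`, `toy_complete`, and `toy_recover` are hypothetical names, and the two toy stand-ins replace the trained networks with trivial rules so that only the control flow is shown.

```python
from typing import Callable, List, Tuple

Mask = List[List[int]]      # binary mask, 1 = vehicle pixel
Image = List[List[float]]   # single-channel image, kept tiny for brevity


def iterative_refine(
    image: Image,
    incomplete_mask: Mask,
    complete_mask: Callable[[Image, Mask], Mask],
    recover_appearance: Callable[[Image, Mask], Image],
    num_iters: int = 2,
) -> Tuple[Mask, Image]:
    """Alternate mask completion and appearance recovery for num_iters rounds."""
    mask, current = incomplete_mask, image
    for _ in range(num_iters):
        mask = complete_mask(current, mask)          # mask step sees the latest image
        current = recover_appearance(current, mask)  # appearance step sees the latest mask
    return mask, current


def toy_complete(image: Image, mask: Mask) -> Mask:
    # Hypothetical stand-in: grow each mask row one pixel to the right.
    out = [row[:] for row in mask]
    for r, row in enumerate(mask):
        for c in range(len(row) - 1):
            if row[c] == 1:
                out[r][c + 1] = 1
    return out


def toy_recover(image: Image, mask: Mask) -> Image:
    # Hypothetical stand-in: paint empty masked pixels with the mean of known masked pixels.
    known = [image[r][c] for r in range(len(mask))
             for c in range(len(mask[0])) if mask[r][c] and image[r][c] > 0]
    fill = sum(known) / len(known) if known else 0.0
    return [[fill if mask[r][c] and image[r][c] == 0 else image[r][c]
             for c in range(len(mask[0]))] for r in range(len(mask))]


final_mask, final_image = iterative_refine(
    [[0.5, 0.0, 0.0, 0.0]], [[1, 0, 0, 0]], toy_complete, toy_recover, num_iters=2
)
```

With these toy rules, each round extends the mask by one pixel and the appearance step fills the newly covered pixel, which mirrors the mutual feeding of the two tasks without saying anything about how the real networks behave.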

If this is right

  • Each iteration improves both the completed mask and the restored appearance over the previous round.
  • The method exceeds prior approaches on the new Occluded Vehicle dataset for both mask completion and appearance recovery accuracy.
  • The recovered appearances directly improve tracking accuracy for vehicles that remain partially hidden across video frames.
  • The 3D pool supplies silhouettes that train the discriminators without needing real occluded masks as adversarial targets.
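
The last point, that the pool can stand in for real complete masks, could look roughly like the following when assembling a discriminator batch. This is a hedged sketch: `build_discriminator_batch` is a hypothetical helper, it models only a generic real/fake labelling, and it does not reproduce the separate tasks of the paper's two coupled discriminators.

```python
import random
from typing import List, Tuple

Mask = List[List[int]]  # binary silhouette or completed mask


def build_discriminator_batch(
    pool: List[Mask], generated: List[Mask], rng: random.Random
) -> List[Tuple[Mask, int]]:
    """Label pool silhouettes as real (1) and generator outputs as fake (0)."""
    # One pool silhouette is drawn per generated mask so the batch is balanced.
    real = [(rng.choice(pool), 1) for _ in generated]
    fake = [(mask, 0) for mask in generated]
    batch = real + fake
    rng.shuffle(batch)
    return batch


pool = [[[1, 1]], [[1, 0]]]                    # silhouettes rendered from 3D models (toy)
generated = [[[0, 1]], [[0, 0]], [[1, 1]]]     # masks produced by the completion network (toy)
batch = build_discriminator_batch(pool, generated, random.Random(0))
```

No ground-truth mask of a real occluded vehicle appears anywhere in the batch, which is the property the bullet describes.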

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alternation pattern could be tested on other rigid objects such as traffic signs or furniture where partial views are common.
  • If the iteration stabilizes quickly, the approach might reduce the number of camera views needed in multi-view 3D reconstruction pipelines.
  • Failure cases where the 3D pool lacks matching vehicle types would point to a need for on-the-fly 3D model adaptation during inference.

Load-bearing premise

Silhouettes sampled from the auxiliary 3D model pool will act as realistic enough adversarial examples to let the model generalize from synthetic training images to real-world occluded vehicles without harmful domain shift.
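
One hedged way to probe this premise is to ask how closely each real mask is matched by its best silhouette in the pool. The sketch below is an assumption of this review, not an analysis from the paper: `iou` and `nearest_pool_iou` are hypothetical helpers, and the masks are assumed to be already aligned to a common size and pose.

```python
from typing import List

Mask = List[List[int]]  # binary mask, same height and width for all inputs


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = sum(x & y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    union = sum(x | y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return inter / union if union else 1.0  # two empty masks count as identical


def nearest_pool_iou(real_masks: List[Mask], pool: List[Mask]) -> List[float]:
    """For each real mask, the IoU with its best-matching pool silhouette."""
    return [max(iou(mask, silhouette) for silhouette in pool) for mask in real_masks]


pool = [[[1, 1], [0, 0]], [[1, 0], [1, 0]]]
real_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
coverage = nearest_pool_iou(real_masks, pool)
```

Low values in `coverage` would flag real shapes that the pool does not represent, which is where domain shift would be most likely to bite.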

What would settle it

Run the trained model on a held-out collection of real occluded vehicle photos whose vehicle shapes are absent from the 3D model pool and check whether mask and appearance errors stop decreasing after the first iteration.
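
The stopping check in this test can be written down directly. The sketch below is illustrative only: `first_stall` is a hypothetical helper, and the sample inputs are the 1st and 2nd iteration values from the L1 and L2 columns of the table fragment captured under Figure 9, which are synthetic-data results rather than the held-out real set the test calls for.

```python
from typing import List, Optional


def first_stall(errors: List[float], tol: float = 0.0) -> Optional[int]:
    """Return the 1-based iteration after which the error stops decreasing.

    errors[k] is the error measured after iteration k + 1. The function
    returns None when every iteration improves on the previous one by more
    than tol.
    """
    for i in range(1, len(errors)):
        if errors[i - 1] - errors[i] <= tol:
            return i
    return None


l1_stall = first_stall([0.0159, 0.0158])  # L1 after iterations 1 and 2
l2_stall = first_stall([0.0038, 0.0039])  # L2 after iterations 1 and 2
```

On these two sample columns the L1 error still decreases while the L2 error does not, so the answer can depend on the metric chosen.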

Figures

Figures reproduced from arXiv: 1907.09381 by Feigege Wang, Jia Pan, Shengfeng He, Wenxi Liu, Xiaosheng Yan, Yuanlong Yu.

Figure 1: (a-b) Given the input image with an occluded car, our approach can recover the appearance of its invisible part. Illustration of tracking occluded vehicles in the original video (c) and our processed video (d) that recovers the appearance of the target vehicles from occlusions.
… consists of two modules: a segmentation completion network that aims at completing the incomplete segmentation mask of the occlude…
Figure 2: Illustration of the segmentation completion network. The input image I is passed to the pretrained segmentation model F and then concatenated with the computed incomplete mask M̂ to produce the recovered segmentation mask M. In our framework, we present two coupled discriminators, both of which are fed with the same samples for different classification tasks. For the object discriminator Dobj, it aims to …
Figure 3: Illustration of the appearance recovery network. On the training stage, its generator has two paths to perform separate generation tasks but shares the same network. The first path aims at filling in colors for invisible parts. The second path aims at inpainting the image foreground given the image culling out the foreground object. At test time, only the first path is adopted for generating appearance. cl…
Figure 4: Generated complete segmentation masks on exemplar synthetic and real images for evaluating our discriminators.
… that is a single discriminator network for real/fake classification. As illustrated in Tab. I, the quality of segmentation completion from our proposed model is generally improved. As shown in …
Figure 5: Recovered appearance on exemplar synthetic and real images for evaluating our proposed two-path structure.
Figure 6: Illustration of an example on our iterative refinement. The first column refers to the input image and its corresponding incomplete mask. The second and third columns refer to the results produced at the first and the second iteration, respectively.
… 1, 2, and 3 iterations in the tasks of segmentation completion and appearance recovery. In specific, the model running for 1 iteration refers to the process of pas…
Figure 7: Examples of the segmentation completion comparison. Columns: Input, Incomplete mask, Deepfill [48], pix2pix [19], SeGAN [11], Ours, Ground truth.
Figure 8: Examples of the appearance recovery comparison.
Figure 9: Recovered real occluded vehicles from other public datasets, including vehicle-occluding-vehicle, multi-person-occluding-vehicle, and truncated vehicles.
Adjacent table, truncated in the source (Type: Syn.; Input: Mgt):
Model               L1 ↓     L2 ↓     ICP ↑    SS ↑
Deepfill [48]       0.0284   0.0107   0.5620   0.8295
pix2pix [19]        0.0174   0.0060   0.7081   0.9410
SeGAN [11]          0.0181   0.0055   0.6662   0.9371
Ours (1st iter.)    0.0159   0.0038   0.7436   0.9458
Ours (2nd iter.)    0.0158   0.0039   0.7267   0…
Original abstract

In this paper, we propose a novel iterative multi-task framework to complete the segmentation mask of an occluded vehicle and recover the appearance of its invisible parts. In particular, to improve the quality of the segmentation completion, we present two coupled discriminators and introduce an auxiliary 3D model pool for sampling authentic silhouettes as adversarial samples. In addition, we propose a two-path structure with a shared network to enhance the appearance recovery capability. By iteratively performing the segmentation completion and the appearance recovery, the results will be progressively refined. To evaluate our method, we present a dataset, the Occluded Vehicle dataset, containing synthetic and real-world occluded vehicle images. We conduct comparison experiments on this dataset and demonstrate that our model outperforms the state-of-the-art in tasks of recovering segmentation mask and appearance for occluded vehicles. Moreover, we also demonstrate that our appearance recovery approach can benefit the occluded vehicle tracking in real-world videos.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes an iterative multi-task framework for occluded vehicle segmentation mask completion and appearance recovery of invisible parts. It introduces two coupled discriminators trained adversarially using silhouettes sampled from an auxiliary 3D model pool, a two-path network with shared weights for appearance recovery, and iterative alternation between the tasks to progressively refine outputs. A new Occluded Vehicle dataset (synthetic + real) is presented, with claims of outperforming prior methods on mask and appearance recovery plus downstream benefits for real-world occluded vehicle tracking.

Significance. If the empirical gains hold under rigorous baselines and the 3D-pool silhouettes transfer without domain artifacts, the work would offer a practical advance for occlusion handling in vehicle perception, supported by the release of a dedicated dataset. The iterative coupling idea is conceptually appealing for mutual refinement of mask and appearance, though its value depends on the auxiliary pool's fidelity.

major comments (2)
  1. [Abstract] Abstract / method description: The central claim that sampling silhouettes from the auxiliary 3D model pool produces sufficiently authentic adversarial samples for the coupled discriminators (enabling generalization of iterative mask+appearance refinement to real occluded vehicles) is load-bearing. No analysis of shape/viewpoint/occlusion distribution mismatch between the pool and real test images is provided; such mismatch would produce inaccurate masks that degrade the shared network and allow error amplification across iterations rather than refinement.
  2. [Experiments] Experiments section: The abstract asserts outperformance on the new dataset without reporting quantitative metrics, ablation studies on the 3D pool or coupled discriminators, or error analysis. This makes it impossible to verify whether gains survive strong baselines or are attributable to the proposed components versus dataset-specific tuning.
minor comments (2)
  1. The two-path structure with shared network would benefit from an explicit diagram or equations showing weight sharing and how appearance recovery conditions on the completed mask.
  2. Clarify whether the Occluded Vehicle dataset splits are balanced across synthetic/real and occlusion levels, and whether any real images were used in training the 3D-pool discriminators.
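
As a stopgap for the diagram requested in minor comment 1, the weight sharing described in the Figure 3 caption can be sketched as two input-preparation paths that call one generator object. This is a toy illustration under loud assumptions: `SharedGenerator`, `path_one`, and `path_two` are hypothetical names, the "network" is a single scalar weight and bias, and images are flat lists of pixels.

```python
from typing import List


class SharedGenerator:
    """Toy stand-in for the shared network: one scalar weight and one bias."""

    def __init__(self, weight: float, bias: float) -> None:
        self.weight = weight
        self.bias = bias

    def __call__(self, pixels: List[float]) -> List[float]:
        return [self.weight * p + self.bias for p in pixels]


def path_one(gen: SharedGenerator, image: List[float], invisible: List[bool]) -> List[float]:
    # Path 1: blank the pixels marked invisible, then ask the generator to fill them.
    masked = [0.0 if hide else p for p, hide in zip(image, invisible)]
    return gen(masked)


def path_two(gen: SharedGenerator, image: List[float], foreground: List[bool]) -> List[float]:
    # Path 2: cull the whole foreground object, then ask the same generator to inpaint.
    culled = [0.0 if fg else p for p, fg in zip(image, foreground)]
    return gen(culled)


gen = SharedGenerator(weight=1.0, bias=0.5)
image = [1.0, 2.0, 3.0]
before = (path_one(gen, image, [False, True, False]),
          path_two(gen, image, [True, True, False]))

gen.weight = 2.0  # a single parameter update, e.g. from a training step on either path
after = (path_one(gen, image, [False, True, False]),
         path_two(gen, image, [True, True, False]))
```

Because both paths hold a reference to the same `gen`, one parameter update changes the output of both, which is the property a diagram or equation in the paper would need to make explicit. Per the caption, only the first path would be used at test time.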

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of validating the 3D model pool and strengthening the experimental evidence. We address each major comment below and commit to revisions that incorporate additional analysis and studies.

  1. Referee: [Abstract] Abstract / method description: The central claim that sampling silhouettes from the auxiliary 3D model pool produces sufficiently authentic adversarial samples for the coupled discriminators (enabling generalization of iterative mask+appearance refinement to real occluded vehicles) is load-bearing. No analysis of shape/viewpoint/occlusion distribution mismatch between the pool and real test images is provided; such mismatch would produce inaccurate masks that degrade the shared network and allow error amplification across iterations rather than refinement.

    Authors: We agree that an explicit analysis of potential mismatches in shape, viewpoint, and occlusion distributions between the 3D model pool and real test images would strengthen the justification for using the pool to generate authentic adversarial samples. The current manuscript relies on the diversity of the auxiliary pool and empirical results on the mixed synthetic-real dataset to support generalization, but does not include a dedicated distributional comparison. In the revised version, we will add a new analysis subsection (likely in the experiments or method section) that quantifies these distributions and discusses their implications for iterative refinement. This will directly address the risk of error amplification. revision: yes

  2. Referee: [Experiments] Experiments section: The abstract asserts outperformance on the new dataset without reporting quantitative metrics, ablation studies on the 3D pool or coupled discriminators, or error analysis. This makes it impossible to verify whether gains survive strong baselines or are attributable to the proposed components versus dataset-specific tuning.

    Authors: The abstract provides a high-level summary of outperformance, consistent with typical abstract constraints that preclude detailed metrics or ablations. The experiments section does report comparison results against prior methods on the Occluded Vehicle dataset (synthetic and real), demonstrating improvements in mask completion and appearance recovery. However, we acknowledge that dedicated ablations on the 3D pool and coupled discriminators, plus error analysis, are not present and would improve verifiability. We will revise the experiments section to include these elements, such as ablation tables isolating each component and quantitative error breakdowns, to confirm gains over strong baselines and attribute improvements to the proposed iterative multi-task framework. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical iterative framework with no derivation reducing to fitted inputs or self-citations

The paper describes a data-driven iterative multi-task network using coupled discriminators, an auxiliary 3D model pool for silhouettes, and a two-path shared network for appearance recovery. No equations, uniqueness theorems, or first-principles derivations are presented that reduce performance claims to quantities defined by the method's own fitted parameters or prior self-citations. The central claims rest on empirical training, dataset construction, and comparison experiments, which are externally falsifiable via the reported Occluded Vehicle dataset and SOTA benchmarks. This matches the default expectation for non-circular empirical CV papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on the effectiveness of the newly introduced components (coupled discriminators, 3D pool, two-path structure) and on the assumption that iterative alternation improves both tasks without compounding errors; these are not derived from prior literature but introduced in the paper.

axioms (1)
  • domain assumption: Iterative alternation between segmentation completion and appearance recovery will progressively refine both outputs without introducing compounding artifacts.
    Stated directly in the abstract as the operating principle of the framework.
invented entities (3)
  • coupled discriminators (no independent evidence)
    purpose: Improve quality of segmentation mask completion via adversarial training with authentic silhouettes.
    New architectural component introduced to address segmentation quality.
  • auxiliary 3D model pool (no independent evidence)
    purpose: Supply authentic vehicle silhouettes as adversarial samples during training.
    New data source introduced for the discriminator training.
  • two-path structure with shared network (no independent evidence)
    purpose: Enhance appearance recovery capability.
    New network design introduced for the recovery task.

pith-pipeline@v0.9.0 · 5694 in / 1528 out tokens · 41258 ms · 2026-05-24T18:13:38.849477+00:00 · methodology

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 2 internal anchors

  1. [1]

    M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017

  2. [2]

    A. Behl, O. H. Jafari, S. K. Mustikovela, H. A. Alhaija, C. Rother, and A. Geiger. Bounding boxes, segmentations and object coordinates: How important is recognition for 3d scene flow estimation in autonomous driving scenarios? In ICCV, 2017

  3. [3]

    A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], 2015

  4. [4]

    J. Chang, L. Wang, G. Meng, S. Xiang, and C. Pan. Vision-based occlusion handling and vehicle classification for traffic surveillance systems. IEEE Intelligent Transportation Systems Magazine, 2018

  5. [5]

    L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018

  6. [6]

    L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2018

  7. [7]

    X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. NeurIPS, 2016

  8. [8]

    Y.-T. Chen, X. Liu, and M.-H. Yang. Multi-instance object segmentation with occlusion handling. In CVPR, 2015

  9. [9]

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016

  10. [10]

    J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016

  11. [11]

    K. Ehsani, R. Mottaghi, and A. Farhadi. Segan: Segmenting and generating the invisible. In CVPR, 2018

  12. [12]

    P. Follmann, R. Konig, P. Hartinger, and M. Klostermann. Learning to see the invisible: End-to-end trainable amodal instance segmentation. arXiv preprint arXiv:1804.08864, 2018

  13. [13]

    T. Gao, B. Packer, and D. Koller. A segmentation-aware object detection model with occlusion handling. In CVPR, 2011

  14. [14]

    I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014

  15. [15]

    K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask r-cnn. In ICCV, 2017

  16. [16]

    J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 2015

  17. [17]

    E. Hsiao and M. Hebert. Occlusion reasoning for object detection under arbitrary viewpoint. In CVPR, 2012

  18. [18]

    Y. Hua, K. Alahari, and C. Schmid. Occlusion and motion reasoning for long-term tracking. In ECCV, 2014

  19. [19]

    P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017

  20. [20]

    G. Kanizsa, P. Legrenzi, and P. Bozzi. Organization in vision: essays on gestalt perception. 1979

  21. [21]

    J. Kim and J. F. Canny. Interpretable learning for self-driving cars by visualizing causal attention. In 2017 IEEE International Conference on Computer Vision (ICCV), 2017

  22. [22]

    D. Koller, J. Weber, and J. Malik. Robust multiple car tracking with occlusion reasoning. In ECCV, 1994

  23. [23]

    J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In International IEEE Workshop on 3D Representation and Recognition, Sydney, Australia, 2013

  24. [24]

    K. Li and J. Malik. Amodal instance segmentation. ECCV, 2016

  25. [25]

    T.-Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. ECCV, 2014

  26. [26]

    S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In CVPR, 2018

  27. [27]

    J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015

  28. [28]

    X. Ma and X. Sun. Detection and segmentation of occluded vehicles based on symmetry analysis. In International Conference on Systems and Informatics, 2017

  29. [29]

    A. I. Maqueda, A. Loquercio, G. Gallego, N. N. García, and D. Scaramuzza. Event-based vision meets deep learning on steering prediction for self-driving cars. In CVPR, 2018

  30. [30]

    X. Mei and H. Ling. Robust visual tracking and vehicle classification via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011

  31. [31]

    F. Mueller, D. Mehta, O. Sotnychenko, S. Sridhar, D. Casas, and C. Theobalt. Real-time hand tracking under occlusion from an egocentric rgb-d sensor. In ICCVW, 2017

  32. [32]

    C. C. C. Pang, W. W. L. Lam, and N. H. C. Yung. A novel method for resolving vehicle occlusion in a monocular traffic-image sequence. IEEE Transactions on Intelligent Transportation Systems, 2004

  33. [33]

    D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016

  34. [34]

    P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In ECCV, 2016

  35. [35]

    V. Ramanishka, Y.-T. Chen, T. Misu, and K. Saenko. Toward driving scene understanding: A dataset for learning driver behavior and causal reasoning. In CVPR, 2018

  36. [36]

    T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training GANs. In NeurIPS, 2016

  37. [37]

    Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In ICCV, 2017

  38. [38]

    G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah. Part-based multiple-person tracking with partial occlusion handling. In CVPR, 2012

  39. [39]

    S. Sivaraman and M. M. Trivedi. Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Transactions on Intelligent Transportation Systems, 2013

  40. [40]

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016

  41. [41]

    J. Tighe, M. Niethammer, and S. Lazebnik. Scene parsing with object instances and occlusion ordering. In CVPR, 2014

  42. [42]

    Y.-H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M.-H. Yang. Deep image harmonization. In CVPR, 2017

  43. [43]

    Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In ICCV, 2017

  44. [44]

    K. Yan, Y. Tian, Y. Wang, W. Zeng, and T. Huang. Exploiting multi-grain ranking constraints for precisely searching visually-similar vehicles. In ICCV, 2017

  45. [45]

    T. Yang, Q. Pan, J. Li, and S. Li. Real-time multiple objects tracking with occlusion handling in dynamic scenes. In CVPR, 2005

  46. [46]

    Y. Yang, S. Hallman, D. Ramanan, and C. C. Fowlkes. Layered object models for image segmentation. TPAMI, 2012

  47. [47]

    F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. ICLR, 2016

  48. [48]

    J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In CVPR, 2018

  49. [49]

    H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. TPAMI, 2018

  50. [50]

    S. Zhang, G. Wu, J. P. Costeira, and J. M. F. Moura. Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras. In ICCV, 2017

  51. [51]

    T. Zhang, K. Jia, C. Xu, Y. Ma, and N. Ahuja. Partial occlusion handling for visual tracking via robust part matching. In CVPR, 2014

  52. [52]

    W. Zhang, Q. Wu, X. Yang, and X. Fang. Multilevel framework to detect and handle vehicle occlusion. IEEE Transactions on Intelligent Transportation Systems, 2008

  53. [53]

    N. Zhao, Y. Xia, C. Xu, X. Shi, and Y. Liu. Appos: An adaptive partial occlusion segmentation method for multiple vehicles tracking. Journal of Visual Communication and Image Representation, 2016

  54. [54]

    Y. Zhou and L. Shao. Viewpoint-aware attentive multi-view inference for vehicle re-identification. In CVPR, 2018

  55. [55]

    J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017

  56. [56]

    Y. Zhu, Y. Tian, D. N. Metaxas, and P. Dollár. Semantic amodal segmentation. In CVPR, 2017