arxiv: 2511.05782 · v2 · submitted 2025-11-08 · 💻 cs.CV

TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation

Pith reviewed 2026-05-17 23:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised domain adaptation · medical image segmentation · vision-language alignment · cross-modality adaptation · text-driven semantics · prototype alignment · cardiac segmentation · domain shift reduction

The pith

Textual class descriptions can align visual features across CT and MRI to reduce domain shift in medical image segmentation without target labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that domain-invariant text descriptions of anatomical classes can steer the training of image encoders so that features become consistent across modalities. It does this by adding a covariance cosine loss that matches image features to textual semantic relations between classes and by aligning class prototypes at the pixel level. A sympathetic reader would care because medical segmentation models often fail when moved from one scanner or modality to another, and collecting new labels for each target domain is costly. If the approach works, adaptation becomes possible using only source labels plus general language descriptions of the classes.

Core claim

TCSA-UDA uses domain-invariant textual class descriptions to guide visual representation learning. A vision-language covariance cosine loss aligns image encoder features with inter-class textual semantic relations, and a prototype alignment module matches class-wise pixel distributions across domains, reducing residual category-level discrepancies in cross-modality cardiac, abdominal, and brain tumor segmentation.
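The prototype alignment module named in the core claim can be pictured with a minimal sketch. The abstract gives no equations, so the following rests on assumptions: a prototype is taken to be the mean feature vector of all pixels in a class, target-domain class assignments are taken to come from pseudo-labels (no target labels exist), and the mismatch is measured with a squared Euclidean distance. The function names and toy features are hypothetical.

```python
def class_prototypes(features, labels, num_classes):
    """Mean feature vector per class; None for classes with no pixels."""
    dim = len(features[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for vec, cls in zip(features, labels):
        counts[cls] += 1
        for d, value in enumerate(vec):
            sums[cls][d] += value
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else None
        for c in range(num_classes)
    ]


def prototype_alignment_loss(src_protos, tgt_protos):
    """Mean squared Euclidean distance between matching class prototypes.

    Assumption: the paper may use a different distance or weighting.
    """
    dists = []
    for p_src, p_tgt in zip(src_protos, tgt_protos):
        if p_src is None or p_tgt is None:
            continue  # class absent from one domain in this batch
        dists.append(sum((a - b) ** 2 for a, b in zip(p_src, p_tgt)))
    return sum(dists) / len(dists) if dists else 0.0


# Toy pixel features (hypothetical): 2-D features, 2 classes.
src_feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
src_labels = [0, 0, 1]            # ground-truth source labels
tgt_feats = [[2.0, 1.0], [0.0, 4.0]]
tgt_pseudo = [0, 1]               # assumed pseudo-labels on the target domain

src_protos = class_prototypes(src_feats, src_labels, 2)
tgt_protos = class_prototypes(tgt_feats, tgt_pseudo, 2)
loss = prototype_alignment_loss(src_protos, tgt_protos)
print(src_protos, tgt_protos, loss)
```

Minimizing such a term pulls each target-domain class center toward its source-domain counterpart, which is one way to read "matches class-wise pixel distributions across domains".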

What carries the argument

The vision-language covariance cosine loss that directly aligns image encoder features with inter-class textual semantic relations to produce modality-invariant representations.
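One way to picture this loss, as a minimal sketch and not the paper's formulation (the abstract gives no equation): build a class-by-class covariance matrix from per-class visual features and another from the class text embeddings, then penalize one minus the cosine similarity of the two flattened matrices. The function names, the choice to treat feature dimensions as samples, and the toy vectors are all assumptions made for illustration.

```python
from math import sqrt


def interclass_covariance(class_vectors):
    """C x C covariance between class vectors, treating dimensions as samples."""
    dim = len(class_vectors[0])
    centered = []
    for vec in class_vectors:
        mean = sum(vec) / dim
        centered.append([v - mean for v in vec])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (dim - 1) for cj in centered]
        for ci in centered
    ]


def covariance_cosine_loss(visual_protos, text_embeds):
    """1 - cosine similarity between flattened inter-class covariance matrices.

    Assumed form only; the paper's exact definition may differ.
    """
    cov_v = [x for row in interclass_covariance(visual_protos) for x in row]
    cov_t = [x for row in interclass_covariance(text_embeds) for x in row]
    dot = sum(a * b for a, b in zip(cov_v, cov_t))
    norm = sqrt(sum(a * a for a in cov_v)) * sqrt(sum(b * b for b in cov_t))
    return 1.0 - dot / norm if norm > 0 else 1.0


# Hypothetical text embeddings for two classes (3-D for readability).
text_embeds = [[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]]

# Visual features with the same inter-class structure, in a 4-D space.
visual_same = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, -1.0, 0.0]]
# Visual features whose second class relates to the first with opposite sign.
visual_flipped = [[1.0, 0.0, -1.0], [0.0, -1.0, 1.0]]

loss_same = covariance_cosine_loss(visual_same, text_embeds)
loss_flipped = covariance_cosine_loss(visual_flipped, text_embeds)
print(loss_same, loss_flipped)
```

Because only the C x C relation matrices are compared in this sketch, the visual and text embeddings do not need to share a dimensionality; the loss is near zero when the visual classes relate to each other the way the text classes do, and grows when those relations disagree.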

If this is right

  • Segmentation accuracy on target-domain MRI improves when only CT source labels and text descriptions are available.
  • The same text-guided alignment produces gains on abdominal and brain tumor cross-modality tasks.
  • Residual category-level domain gaps are reduced beyond what image-only alignment achieves.
  • The method consistently exceeds prior state-of-the-art unsupervised domain adaptation performance on the tested benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same text-alignment principle could be tested on ultrasound or PET by writing equivalent class descriptions.
  • Hospitals might maintain one segmentation model across scanner vendors if text prompts are standardized.
  • Automatically deriving the textual descriptions from radiology textbooks or ontologies would remove the need for manual prompt engineering.

Load-bearing premise

Domain-invariant textual class descriptions exist and can reliably guide visual representation learning across modalities.

What would settle it

Replace the chosen domain-invariant text prompts with random or modality-specific prompts, then check whether the reported gains over baseline UDA methods disappear on the same cross-modality benchmarks.

Figures

Figures reproduced from arXiv: 2511.05782 by Honghai Liu, Lalit Maurya, Reyer Zwiggelaar.

Figure 1: Class-specific visual features (squares and circles) from source …
Figure 2: Schematic representation of the proposed TCSA-UDA framework, comprising: (a) text-driven semantic covariance learning via …
Figure 3: Qualitative segmentation results by different comparison algorithms. The blue, green, yellow and red represent the AA, LAC, LVC and MYO, …
Figure 4: Qualitative results of abdominal organ segmentation by different comparison algorithms. The blue, green, yellow and red represent the Spleen, RK, …
Figure 5: Qualitative results of brain tumor segmentation by different compar…
Figure 6: GradCAM visualization results comparing attention regions across …
Original abstract

Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes TCSA-UDA, a Text-driven Cross-Semantic Alignment framework for unsupervised domain adaptation in medical image segmentation. It uses domain-invariant textual class descriptions to guide visual representation learning through a vision-language covariance cosine loss that aligns image encoder features with inter-class textual semantic relations, plus a prototype alignment module to align class-wise pixel-level feature distributions across domains. The authors report that experiments on cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks show the framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods.

Significance. If the claimed outperformance holds with substantial and statistically validated gains, the work would be significant for introducing language-driven semantics into UDA for medical imaging, offering a new direction to handle cross-modality shifts (e.g., CT-MRI) where purely visual methods often fall short. The combination of covariance alignment and prototype matching is a plausible extension of vision-language techniques, and reproducible code or parameter-free elements would strengthen its contribution.

major comments (2)
  1. [Abstract] The central claim of consistent outperformance and significant domain-shift reduction is presented without any quantitative deltas, ablation results, or mention of statistical testing. This absence is load-bearing because the empirical superiority over prior UDA methods cannot be assessed from the given description alone.
  2. [Method] Method description (vision-language covariance cosine loss and prototype alignment): the approach assumes fixed textual class descriptions encode modality-robust semantics that remain invariant under large intensity and contrast differences between CT and MRI. No sensitivity analysis to prompt choice or physics-based justification is referenced, which directly affects whether the alignment losses can produce the claimed modality-invariant representations.
minor comments (2)
  1. [Abstract] Consider including at least one key quantitative result (e.g., Dice improvement on a benchmark) or a pointer to the main results table to make the performance claim concrete.
  2. Notation: the terms 'vision-language covariance cosine loss' and 'high-level semantic prototypes' would benefit from an explicit equation or diagram reference on first use to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will incorporate to strengthen the presentation of our contributions.

Point-by-point responses
  1. Referee: [Abstract] The central claim of consistent outperformance and significant domain-shift reduction is presented without any quantitative deltas, ablation results, or mention of statistical testing. This absence is load-bearing because the empirical superiority over prior UDA methods cannot be assessed from the given description alone.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative evidence. In the revised manuscript, we will update the abstract to report specific performance deltas (e.g., average Dice score improvements over the strongest baseline across the three benchmarks) and note that statistical significance was assessed via paired tests. This change directly addresses the concern while preserving the abstract's brevity. revision: yes

  2. Referee: [Method] Method description (vision-language covariance cosine loss and prototype alignment): the approach assumes fixed textual class descriptions encode modality-robust semantics that remain invariant under large intensity and contrast differences between CT and MRI. No sensitivity analysis to prompt choice or physics-based justification is referenced, which directly affects whether the alignment losses can produce the claimed modality-invariant representations.

    Authors: We appreciate the referee's point on the underlying assumption. The class descriptions are drawn from standard medical terminology to emphasize anatomical and pathological properties rather than modality-specific appearance. To address the gap, we will add a sensitivity analysis to prompt variations (reported in the supplement) and a concise justification in the method section explaining the expected robustness to intensity/contrast shifts. These additions will better support the design of the covariance and prototype alignment losses. revision: yes
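The first response promises Dice deltas and "paired tests" without naming a test. As a hedged illustration of what such a check could look like, the sketch below computes a Dice score for binary masks and runs an exact two-sided sign-flip permutation test on paired per-case scores. The choice of test is an assumption, and the per-case Dice values are made-up toy numbers, not results from the paper.

```python
from itertools import product


def dice_score(pred, truth):
    """Dice overlap of two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    return 2.0 * intersection / total if total else 1.0  # both empty: perfect


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign patterns, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total


example_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])

# Toy per-case Dice scores (hypothetical, for illustration only).
method_scores = [0.80, 0.82, 0.79, 0.85]
baseline_scores = [0.75, 0.78, 0.77, 0.80]
p_value = paired_sign_flip_pvalue(method_scores, baseline_scores)
print(example_dice, p_value)
```

With only four paired cases, even a uniformly better method cannot reach a p-value below 2/16 under this test, which shows why the number of test cases matters when a paper claims significant gains.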

Circularity Check

0 steps flagged

Derivation is self-contained; new losses and modules defined independently without reduction to inputs

full rationale

The TCSA-UDA framework defines its core components—the vision-language covariance cosine loss for aligning image features with textual semantic relations and the prototype alignment module for class-wise feature distributions—directly from standard feature extraction and cross-modal alignment principles. These constructions use explicit mathematical formulations based on encoder outputs, inter-class textual descriptions, and high-level prototypes without any fitted parameters from target-domain data being repurposed as predictions, nor any self-definitional loops where outputs are presupposed in the inputs. No load-bearing self-citations or uniqueness theorems from the authors' prior work are invoked to justify the central alignment mechanism; the approach extends existing UDA techniques with independently specified text-driven terms. The experimental claims rest on benchmark comparisons rather than circular derivations, making the chain self-contained against external validation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Abstract provides limited technical detail; the approach rests on standard deep-learning assumptions plus the domain-specific premise that textual descriptions remain semantically stable across modalities.

free parameters (1)
  • loss weighting coefficients
    Relative weights between the covariance cosine loss, prototype alignment loss, and segmentation loss are expected to be tuned on validation data.
axioms (1)
  • domain assumption: Textual class descriptions are domain-invariant and semantically meaningful for guiding visual features
    Invoked when the authors state that language descriptions guide modality-invariant representation learning.
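
The free parameter in this ledger can be made concrete with a generic weighted objective. This is an assumed form, not an equation taken from the paper: the additive combination and the symbols $\lambda_{\text{cov}}$ and $\lambda_{\text{proto}}$ are hypothetical names for the weights on the covariance cosine loss and the prototype alignment loss relative to the segmentation loss.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{seg}}
  + \lambda_{\text{cov}} \, \mathcal{L}_{\text{cov}}
  + \lambda_{\text{proto}} \, \mathcal{L}_{\text{proto}}
```

Under this reading, the two $\lambda$ values are the quantities that would be tuned on validation data.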


