pith. machine review for the scientific record. sign in

arxiv: 2603.11795 · v2 · submitted 2026-03-12 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Intrinsic Concept Extraction Based on Compositional Interpretability

Authors on Pith no claims yet

Pith reviewed 2026-05-15 12:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords concept extractiondiffusion modelshyperbolic geometrycompositional interpretabilityimage disentanglementintrinsic conceptsunsupervised learning
0
0 comments X

The pith

HyperExpress extracts composable object and attribute concepts from one image using hyperbolic disentanglement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the CI-ICE task to extract intrinsic concepts from a single image such that those concepts can be recombined to reconstruct the original. HyperExpress achieves this by placing concept learning in hyperbolic space, which naturally captures hierarchies and relations among concepts, then applying concept-wise optimization to keep the extracted pieces combinable. The authors report that the resulting concepts are both accurate and reusable without extra training data or labels. A sympathetic reader would care because this turns opaque image generation into a modular process where individual parts can be isolated and swapped. The work rests on the assumption that diffusion models already hold the necessary structure for such extraction.

Core claim

We propose HyperExpress to solve the CI-ICE task by leveraging hyperbolic space's hierarchical modeling capability for accurate concept disentanglement that preserves relational dependencies, together with a concept-wise optimization step that maps the embedding space to maintain inter-concept relationships while ensuring the concepts remain composable, thereby allowing reconstruction of the original image from their combination.

What carries the argument

Hyperbolic concept disentanglement, which models concepts in hyperbolic space to capture hierarchy and dependencies, paired with concept-wise optimization to ensure composability.

If this is right

  • Extracted concepts can be recombined to reconstruct the input image with high fidelity.
  • The method simultaneously handles object-level and attribute-level concepts from one image.
  • Relational dependencies among concepts are preserved during the disentanglement process.
  • The extraction requires no additional labeled data or multiple example images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The extracted concepts could support targeted image editing by swapping or modifying single parts while leaving others unchanged.
  • The same hyperbolic optimization approach might extend to other generative models to improve their interpretability.
  • Collections of such concepts could serve as reusable visual primitives for building new images from everyday photos.

Load-bearing premise

Diffusion-based text-to-image models already contain accurate, disentangled concepts that can be isolated and recombined through hyperbolic mapping and optimization without further supervision.

What would settle it

A test where the extracted concepts are recombined in the diffusion model and the output image differs substantially from the input in layout, objects, or attributes.

Figures

Figures reproduced from arXiv: 2603.11795 by Chi-Man Pun, Guoheng Huang, Hanyu Shi, Hong Tao, Jianbin Jiang, Pan Pan, Shanhu Wang, Xuhang Chen.

Figure 1
Figure 1. Figure 1: The difference between unsupervised concept extrac￾tion (UCE) methods [5, 8, 41] and the composable and inter￾pretable intrinsic concept extraction method. Object-level con￾cept extraction method [8, 15] can only extract object-level con￾cepts and is unable to extract attribute-level concepts such as color and material. Although the intrinsic concept extraction method [5] can extract both object-level conc… view at source ↗
Figure 2
Figure 2. Figure 2: The Proposed Method and Its Components. (a) The overall structure of HyperExpress: It addresses the CI-ICE task from two aspects: concept learning and concept-wise optimization. (b) Concept learning: It leverages triplet loss Ltriplet and hyperbolic entailment loss Lentail to learn the hierarchical structure and associative relationships between object-level concepts and attribute-level concepts. (c) Conce… view at source ↗
Figure 3
Figure 3. Figure 3: Explanation of the proposed HEL module. Un￾like the HCL module, the hyperbolic entailment loss is computed within the Lorentz model. If the object-level concept (v obj k ) and attribute-level concept (v color k and v material k ) satisfy the condition in Eq. (9), the entailment loss will be 0; otherwise, the correspond￾ing entailment loss will be calculated. concepts and attribute-level concepts is shown a… view at source ↗
Figure 5
Figure 5. Figure 5: In concept extraction, HyperExpress distinguishes itself from ICE [5] by learning associative relationships be￾tween concepts, leading to the extraction of specific, con￾crete object concepts from images. This specificity en￾hances the interpretability of the compositional process. Consequently, in compositional reconstruction, HyperEx￾press generates more interpretable pathways than ICE. For instance, as … view at source ↗
read the original abstract

Unsupervised Concept Extraction aims to extract concepts from a single image; however, existing methods suffer from the inability to extract composable intrinsic concepts. To address this, this paper introduces a new task called Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE). The CI-ICE task aims to leverage diffusion-based text-to-image models to extract composable object-level and attribute-level concepts from a single image, such that the original concept can be reconstructed through the combination of these concepts. To achieve this goal, we propose a method called HyperExpress, which addresses the CI-ICE task through two core aspects. Specifically, first, we propose a concept learning approach that leverages the inherent hierarchical modeling capability of hyperbolic space to achieve accurate concept disentanglement while preserving the hierarchical structure and relational dependencies among concepts; second, we introduce a concept-wise optimization method that maps the concept embedding space to maintain complex inter-concept relationships while ensuring concept composability. Our method demonstrates outstanding performance in extracting compositionally interpretable intrinsic concepts from a single image.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Compositional and Interpretable Intrinsic Concept Extraction (CI-ICE) task for unsupervised extraction of composable object-level and attribute-level concepts from a single image. It proposes the HyperExpress method, which uses hyperbolic geometry to model hierarchical concept structures and disentangle concepts while a concept-wise optimization maps embeddings to preserve inter-concept relationships and enable reconstruction of the original image semantics via linear or compositional recombination. The abstract asserts that the approach demonstrates outstanding performance.

Significance. If the central claims are validated, the work could advance unsupervised concept learning in computer vision by providing a framework for extracting reusable, hierarchically structured concepts directly from single images via diffusion models. The integration of hyperbolic embeddings for relational dependencies offers a potentially useful inductive bias for compositional interpretability. However, the absence of any quantitative results means the significance cannot yet be assessed beyond the conceptual level.

major comments (2)
  1. [Abstract] Abstract: the claim of 'outstanding performance' is unsupported because no quantitative metrics (PSNR, CLIP similarity, reconstruction error, composability scores, or human interpretability ratings), baselines, or experimental protocol are reported, rendering the central claim unverifiable.
  2. [Method] Method description: no equations are supplied for the hyperbolic embedding loss, the concept-wise optimization objective, or any regularizers (reconstruction, orthogonality, hierarchy), so it is impossible to check whether the optimization actually enforces composability or collapses to trivial solutions.
minor comments (1)
  1. [Title/Abstract] The title refers to 'Intrinsic Concept Extraction Based on Compositional Interpretability' while the abstract defines the task as CI-ICE; align the terminology for consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the current version lacks sufficient quantitative evidence and explicit mathematical details to fully substantiate the claims. We will revise the paper accordingly by adding experiments and equations.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'outstanding performance' is unsupported because no quantitative metrics (PSNR, CLIP similarity, reconstruction error, composability scores, or human interpretability ratings), baselines, or experimental protocol are reported, rendering the central claim unverifiable.

    Authors: We acknowledge that the abstract's claim of 'outstanding performance' is not supported by quantitative results in the current manuscript. In the revised version, we will add a comprehensive experimental section reporting metrics such as PSNR, CLIP similarity, reconstruction error, and composability scores, along with baselines and a detailed experimental protocol. The abstract will be updated to reflect these results rather than asserting outstanding performance without evidence. revision: yes

  2. Referee: [Method] Method description: no equations are supplied for the hyperbolic embedding loss, the concept-wise optimization objective, or any regularizers (reconstruction, orthogonality, hierarchy), so it is impossible to check whether the optimization actually enforces composability or collapses to trivial solutions.

    Authors: We agree that the absence of explicit equations makes it difficult to verify the optimization details. The revised manuscript will include the full mathematical formulations for the hyperbolic embedding loss, the concept-wise optimization objective, and all regularizers (including reconstruction, orthogonality, and hierarchy terms). These additions will clarify how the objectives enforce composability and prevent collapse to trivial solutions. revision: yes

Circularity Check

0 steps flagged

No circularity: method claims rest on proposed hyperbolic disentanglement and optimization without reduction to fitted inputs or self-citations

full rationale

The provided abstract and description introduce the CI-ICE task and HyperExpress method via two components (hyperbolic hierarchical modeling for disentanglement and concept-wise optimization for composability), but contain no equations, loss definitions, or self-citations that reduce any prediction or result to the inputs by construction. No self-definitional loops, fitted parameters called predictions, or load-bearing uniqueness theorems from prior author work are visible. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on standard properties of hyperbolic geometry for hierarchy and diffusion model capabilities for reconstruction; no free parameters or invented entities are specified in the abstract.

axioms (1)
  • domain assumption Hyperbolic space possesses inherent hierarchical modeling capability suitable for concept disentanglement
    Invoked to achieve accurate disentanglement while preserving structure and dependencies

pith-pipeline@v0.9.0 · 5493 in / 1028 out tokens · 38239 ms · 2026-05-15T12:12:05.292891+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 3 internal anchors

  1. [1]

    GPT-4o System Card

    Hurst A, Lerer A, and et al. Goucher A P. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 7

  2. [2]

    Break-a-scene: Extracting multiple concepts from a single image

    Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen- Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. InSIGGRAPH Asia, 2023. 1, 2, 7

  3. [3]

    Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst

    Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond euclidean data.IEEE Signal Processing Mag- azine, 2017. 3

  4. [4]

    Unsupervised learn- ing of visual features by contrasting cluster assignments

    Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Pi- otr Bojanowski, and Armand Joulin. Unsupervised learn- ing of visual features by contrasting cluster assignments. In NeurIPS, 2020. 3

  5. [5]

    Ice: Intrinsic con- cept extraction from a single image via diffusion models

    Fernando Julio Cendra and Kai Han. Ice: Intrinsic con- cept extraction from a single image via diffusion models. In CVPR, 2025. 1, 2, 3, 4, 5, 6, 7, 8

  6. [6]

    Horopca: Hyperbolic dimensionality reduction via horo- spherical projections

    Ines Chami, Albert Gu, Dat Nguyen, and Christopher R ´e. Horopca: Hyperbolic dimensionality reduction via horo- spherical projections. InICML, 2021. 6

  7. [7]

    Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W. Cohen. Subject-driven text-to-image generation via apprenticeship learning. InNeurIPS, 2023. 2

  8. [8]

    Au- toconcept: Unsupervised extraction of constituent concepts from single image

    Pranav Singh Chib, Kirtankumar Vijaykumar Patel, Mudit Gupta, Pise Ashutosh Kalidas, and Pravendra Singh. Au- toconcept: Unsupervised extraction of constituent concepts from single image. InICCV Workshops, 2025. 1, 2, 7

  9. [9]

    Hyperbolic image- text representations

    Karan Desai, Maximilian Nickel, Tanmay Rajpurohit, Justin Johnson, and Ramakrishna Vedantam. Hyperbolic image- text representations. InICML, 2023. 3, 5

  10. [10]

    Bermano, Gal Chechik, and Daniel Cohen-Or

    Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image gen- eration using textual inversion. InICLR, 2023. 2

  11. [11]

    Bermano, Gal Chechik, and Daniel Cohen-Or

    Rinon Gal, Moab Arar, Yuval Atzmon, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models.ACM Transactions on Graphics (TOG), 2023. 2

  12. [12]

    Hyperbolic entailment cones for learning hierarchical em- beddings

    Octavian Ganea, Gary Becigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical em- beddings. InICML, 2018. 5

  13. [13]

    Hyperbolic neural networks

    Octavian-Eugen Ganea, Gary B ´ecigneul, and Thomas Hof- mann. Hyperbolic neural networks. InNeurIPS, 2018. 3, 4, 5

  14. [14]

    Shaozhe Hao, Kai Han, Shihao Zhao, and Kwan-Yee K. Wong. Vico: Plug-and-play visual condition for personalized text-to-image generation.arXiv preprint arXiv:2306.00971,

  15. [15]

    Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, and Kwan-Yee K. Wong. ConceptExpress: Harnessing diffusion models for single-image unsupervised concept extraction. In ECCV, 2024. 1, 2, 6, 7

  16. [16]

    Denoising diffu- sion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu- sion probabilistic models. InNeurIPS, 2020. 3, 4

  17. [17]

    Intriguing properties of hyperbolic embed- dings in vision-language models.Transactions on Machine Learning Research, 2024

    Sarah Ibrahimi, Mina Ghadimi Atigh, Pascal Mettes, and Marcel Worring. Intriguing properties of hyperbolic embed- dings in vision-language models.Transactions on Machine Learning Research, 2024. 3

  18. [18]

    Xuhui Jia, Yang Zhao, Kelvin C. K. Chan, Yandong Li, Han-Ying Zhang, Boqing Gong, Tingbo Hou, H. Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models.arXiv preprint arXiv:2304.02642, 2023. 2

  19. [19]

    An image is worth multiplewords: dis- covering object level concepts using multi-concept prompt learning

    Chen Jin, Ryutaro Tanno, Amrutha Saseendran, Tom Diethe, and Philip Teare. An image is worth multiplewords: dis- covering object level concepts using multi-concept prompt learning. InICML, 2024. 1, 2

  20. [20]

    Mixco: Mix-up contrastive learning for visual representa- tion.arXiv preprint arXiv:2010.06300, 2020

    Sungnyun Kim, Gihun Lee, Sangmin Bae, and Seyoung Yun. Mixco: Mix-up contrastive learning for visual representa- tion.arXiv preprint arXiv:2010.06300, 2020. 3

  21. [21]

    Attention-based ensemble for deep met- ric learning

    Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon. Attention-based ensemble for deep met- ric learning. InECCV, 2018. 3

  22. [22]

    Multi-concept customization of text-to-image diffusion

    Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. InCVPR, 2023. 2

  23. [23]

    Diffusion models already have a semantic latent space

    Mingi Kwon, Jaeseok Jeong, and Youngjung Uh. Diffusion models already have a semantic latent space. InICLR, 2023. 3

  24. [24]

    Inferring concept hierarchies from text corpora via hyperbolic embeddings

    Matthew Le, Stephen Roller, Laetitia Papaxanthos, Douwe Kiela, and Maximilian Nickel. Inferring concept hierarchies from text corpora via hyperbolic embeddings. InACL, 2019. 5

  25. [25]

    Dongxu Li, Junnan Li, and Steven C. H. Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to- image generation and editing. InNeurIPS, 2023. 2

  26. [26]

    Enhancing hyperbolic graph embeddings via contrastive learning.arXiv preprint arXiv:2201.08554, 2022

    Jiahong Liu, Menglin Yang, Min Zhou, Shanshan Feng, and Philippe Fournier-Viger. Enhancing hyperbolic graph embeddings via contrastive learning.arXiv preprint arXiv:2201.08554, 2022. 3

  27. [27]

    Tenenbaum, and Antonio Torralba

    Nan Liu, Yilun Du, Shuang Li, Joshua B. Tenenbaum, and Antonio Torralba. Unsupervised compositional concepts dis- covery with text-to-image generative models. InICCV, 2023. 2

  28. [28]

    Ma, Huan Yang, Wenjing Wang, Jianlong Fu, and Jiay- ing Liu

    Y . Ma, Huan Yang, Wenjing Wang, Jianlong Fu, and Jiay- ing Liu. Unified multi-modal latent diffusion for joint sub- ject and text conditional image generation.arXiv preprint arXiv:2303.09319, 2023. 2

  29. [29]

    Glide: Towards photorealistic image genera- tion and editing with text-guided diffusion models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image genera- tion and editing with text-guided diffusion models. InICML,

  30. [30]

    Compositional entailment learning for hyperbolic vision-language models

    Avik Pal, Max van Spengler, Guido Maria D’Amely di Me- lendugno, Alessandro Flaborea, Fabio Galasso, and Pascal Mettes. Compositional entailment learning for hyperbolic vision-language models. InICLR, 2025. 3

  31. [31]

    Hyperbolic deep neural networks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

    Wei Peng, Tuomas Varanka, Abdelrahman Mostafa, Henglin Shi, and Guoying Zhao. Hyperbolic deep neural networks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 4, 5

  32. [32]

    Hyperbolic safety-aware vision-language models

    Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, and Rita Cucchiara. Hyperbolic safety-aware vision-language models. InCVPR, 2025. 5

  33. [33]

    Controlling text-to-image diffusion by orthogo- nal finetuning

    Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Sch¨olkopf. Controlling text-to-image diffusion by orthogo- nal finetuning. InNeurIPS, 2023. 2

  34. [34]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, 2021. 4

  35. [35]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image gen- eration with clip latents.arXiv preprint arXiv:2204.06125,

  36. [36]

    Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer

    Robin Rombach, A. Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InCVPR, 2022. 2, 3, 4

  37. [37]

    Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. InCVPR, 2023. 2

  38. [38]

    Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Lit, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Raphael Gontijo-Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, 2022. 3, 4

  39. [39]

    Instant- booth: Personalized text-to-image generation without test- time finetuning

    Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instant- booth: Personalized text-to-image generation without test- time finetuning. InCVPR, 2024. 2

  40. [40]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020. 3, 4

  41. [41]

    Towards compositionality in concept learning

    Adam Stein, Aaditya Naik, Yinjun Wu, Mayur Naik, and Eric Wong. Towards compositionality in concept learning. InICML, 2024. 1, 2, 3, 4, 6

  42. [42]

    Key-locked rank one editing for text-to-image personaliza- tion

    Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personaliza- tion. InSIGGRAPH Asia, 2023. 2

  43. [43]

    Concept decomposition for visual exploration and inspiration.ACM Transactions on Graphics (TOG), 2023

    Yael Vinker, Andrey V oynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration.ACM Transactions on Graphics (TOG), 2023. 3

  44. [44]

    Concept algebra for (score-based) text-controlled generative models

    Zihao Wang, Lin Gui, Jeffrey Negrea, and Victor Veitch. Concept algebra for (score-based) text-controlled generative models. InNeurIPS, 2023. 3

  45. [45]

    Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation

    Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual con- cepts into textual embeddings for customized text-to-image generation. InICCV, 2023. 2

  46. [46]

    Compo- sitional generalization in unsupervised compositional repre- sentation learning: a study on disentanglement and emergent language

    Zhenlin Xu, Marc Niethammer, and Colin Raffel. Compo- sitional generalization in unsupervised compositional repre- sentation learning: a study on disentanglement and emergent language. InNeurIPS, 2022. 3

  47. [47]

    Sope: Spherical coordinate- based positional embedding for enhancing spatial perception of 3d lvlms.arXiv preprint arXiv:2602.22716, 2026

    Guanting Ye, Qiyan Zhao, Wenhao Yu, Liangyu Yuan, Mingkai Li, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, Qing Jiang, and Ka-Veng Yuen. Sope: Spherical coordinate- based positional embedding for enhancing spatial perception of 3d lvlms.arXiv preprint arXiv:2602.22716, 2026. 1

  48. [48]

    Cˆ2rope: Causal continuous rotary positional encoding for 3d large multimodal-models reasoning

    Guanting Ye, Qiyan Zhao, Wenhao Yu, Xiaofeng Zhang, Jianmin Ji, Yanyong Zhang, and Ka-Veng Yuen. Cˆ2rope: Causal continuous rotary positional encoding for 3d large multimodal-models reasoning. InICRA, 2026. 1

  49. [49]

    Understanding hyperbolic metric learning through hard neg- ative sampling

    Yun Yue, Fangzhou Lin, Guanyi Mou, and Ziming Zhang. Understanding hyperbolic metric learning through hard neg- ative sampling. InWACV, 2024. 3

  50. [50]

    Hallucination be- gins where saliency drops

    Xiaofeng Zhang, Yuanchao Zhu, Chaochen Gu, Xiaosong Yuan, Qiyan Zhao, Jiawei Cao, Feilong Tang, Sinan Fan, Yaomin Shen, Chen Shen, and Hao Tang. Hallucination be- gins where saliency drops. InICLR, 2026. 1

  51. [51]

    Attention calibration for disentangled text-to-image person- alization

    Yanbing Zhang, Mengping Yang, Qin Zhou, and Zhe Wang. Attention calibration for disentangled text-to-image person- alization. InCVPR, 2024. 2