pith. machine review for the scientific record.

arxiv: 2604.24642 · v1 · submitted 2026-04-27 · 💻 cs.CV

Recognition: unknown

Probing CLIP's Comprehension of 360-Degree Textual and Visual Semantics

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 04:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords CLIP · 360-degree panoramic · semantic alignment · circular shift invariance · LoRA fine-tuning · textual semantics · visual semantics · panoramic evaluation

The pith

CLIP models understand explicit 360-degree text labels but fail to keep semantic scores stable when panoramic images are rotated horizontally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines 360-degree textual semantics as the cues carried by explicit format words in captions and 360-degree visual semantics as the meanings that remain unchanged when a panoramic image is shifted horizontally around its circle. It tests CLIP by rewriting captions to remove or add 360 indicators and by applying controlled circular shifts to images, then measures how much the model’s similarity scores change. The results show that CLIP responds reliably to the text labels yet produces inconsistent scores after even modest image rotations. The authors therefore introduce a LoRA fine-tuning procedure that trains the model to treat shifted versions of the same panorama as equivalent, raising robustness to rotations at the cost of a slight drop in semantic evaluation performance on the original, unshifted images.
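
To make the caption side of this probe concrete, here is a minimal sketch of the kind of keyword manipulation described above: an explicit 360-degree format identifier (the paper's V* cue) is swapped for a generic cue (U*) or stripped entirely. The cue mapping and helper names are illustrative, not the paper's exact implementation.

```python
# Hypothetical sketch of the keyword-manipulation probe: replace an explicit
# 360-degree format identifier (V*) with a generic cue (U*), or strip it entirely.
FORMAT_TO_GENERIC = {
    "a 360 degree view of": "a photo of",   # V* -> U* (assumed mapping)
    "360 panorama": "photo",
    "360 photo": "photo",
}

def swap_format_cue(caption: str) -> str:
    """Replace the 360-degree panoramic identifier with a generic cue."""
    for cue, generic in FORMAT_TO_GENERIC.items():
        caption = caption.replace(cue, generic)
    return caption

def strip_format_cue(caption: str) -> str:
    """Remove the 360-degree panoramic identifier altogether."""
    for cue in FORMAT_TO_GENERIC:
        caption = caption.replace(cue, "").strip(" ,")
    return caption

print(swap_format_cue("a 360 degree view of a park with a lake and a beach"))
# -> "a photo of a park with a lake and a beach"
```

Scoring the same image against the original, swapped, and stripped captions is what isolates the contribution of the textual identifier.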

Core claim

CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics, yet they fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. A LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts improves the second capability, although it produces a slight degradation in original semantic evaluation performance.
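
A rough sketch of what a shift-invariance fine-tuning objective can look like is given below: one term pulls embeddings of shifted panoramas toward the unshifted original, a second preserves the original image-text alignment, and a weight λ trades them off (the paper selects λ by knee-point detection on the loss curve; see Figures 10-12). Function and variable names are illustrative; this is not the paper's exact L_FT.

```python
import torch
import torch.nn.functional as F

def shift_invariance_objective(img_feats, shifted_feats, text_feats, lam=0.99):
    """Illustrative fine-tuning objective (not the paper's exact L_FT):
    lam weights an invariance term (shifted panoramas should embed like the
    original) against an alignment term (keep the original image-text score)."""
    img_feats = F.normalize(img_feats, dim=-1)
    shifted_feats = F.normalize(shifted_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    invariance = (1.0 - (img_feats * shifted_feats).sum(dim=-1)).mean()
    alignment = (1.0 - (img_feats * text_feats).sum(dim=-1)).mean()
    return lam * invariance + (1.0 - lam) * alignment

# Toy check with random features (batch of 4, dim 512).
f_img, f_shift, f_txt = (torch.randn(4, 512) for _ in range(3))
print(shift_invariance_objective(f_img, f_shift, f_txt))
```

In the paper, only low-rank (LoRA) adapters are trained rather than all weights, with variants adapting the image encoder alone (I) or both encoders (I & T); see Figures 14-15.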

What carries the argument

Keyword manipulation on captions combined with horizontal circular shifts of varying magnitudes on panoramic images, scored by cosine similarity in CLIP space and analyzed statistically across model variants, followed by LoRA adapters that enforce shift invariance.
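
A minimal sketch of the visual probe, assuming the open_clip package and a LAION-400M ViT-B/32 checkpoint (the model tag and file path are illustrative):

```python
import numpy as np
import torch
from PIL import Image
import open_clip

# Load one CLIP variant of the kind probed in the paper (checkpoint tag assumed).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion400m_e32")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

def circular_shift(image: Image.Image, delta: int) -> Image.Image:
    """Horizontal circular shift by delta pixels; content wraps around the panorama seam."""
    return Image.fromarray(np.roll(np.asarray(image), shift=delta, axis=1))

@torch.no_grad()
def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between image and caption in CLIP space, scaled by 100."""
    img = model.encode_image(preprocess(image).unsqueeze(0))
    txt = model.encode_text(tokenizer([caption]))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float(100.0 * (img @ txt.T))

pano = Image.open("pano.jpg")  # equirectangular panorama of width W (placeholder path)
caption = "<360panorama>, a hallway in a building"
scores = [clip_score(circular_shift(pano, j * pano.width // 8), caption) for j in range(8)]
# A shift-invariant evaluator would keep these eight scores (shifts of 0, W/8, ..., 7W/8) nearly constant.
```

The textual probe reuses the same scoring function with manipulated captions instead of shifted images.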

Load-bearing premise

Horizontal circular shifts of panoramic images leave their underlying semantic content unchanged, and inserting or removing 360-related keywords affects only the targeted 360-degree semantics.

What would settle it

Collect a held-out set of real 360-degree images, apply random horizontal shifts, recompute CLIP similarity to matched captions, and check whether the score distribution remains statistically identical before and after the shifts.
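
One plausible way to run that check, mirroring the paper's own setup of a Shapiro-Wilk normality test on score differences followed by a one-sided test against a stability bound β (the helper name and toy data are illustrative):

```python
import numpy as np
from scipy import stats

def stability_check(scores_orig, scores_shifted, beta):
    """Test H0: |s - s_delta| >= beta against H1: |s - s_delta| < beta.
    Shapiro-Wilk checks normality of the differences first; the Wilcoxon
    signed-rank test is a non-parametric option when normality is rejected."""
    diffs = np.abs(np.asarray(scores_orig) - np.asarray(scores_shifted))
    normality = stats.shapiro(diffs)
    one_sided = stats.wilcoxon(diffs - beta, alternative="less")
    return normality, one_sided

# Toy usage with synthetic scores; beta = 1.7919 is the ViT-B/32 bound quoted in Figure 4.
rng = np.random.default_rng(0)
s_before = 30.0 + rng.normal(0.0, 2.0, size=200)
s_after = s_before - np.abs(rng.normal(0.0, 1.5, size=200))
print(stability_check(s_before, s_after, beta=1.7919))
```

Rejecting H0 on held-out real panoramas would mean scores stay within β after shifts; failing to reject would corroborate the paper's negative finding for frozen CLIP.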

Figures

Figures reproduced from arXiv: 2604.24642 by Hai Wang, Jing-Hao Xue, Mingzhi Dong, Xiaochen Yang.

Figure 1. Example of two types of image-text pairs. Textually, the explicit format identifier is highlighted (left in each pair). Visually, the corresponding horizontally circular-shifted versions are shown (right in each pair).
Figure 2. Overview of our framework to evaluate CLIP models' understanding of 360-degree textual semantics. The format cue V* is a keyword explicitly identifying the 360-degree panoramic image format (e.g., "360 panorama", "360 photo"), while U* is a generic cue (e.g., "photo", "image") that lacks specific 360-degree panoramic format information.
Figure 3. Overview of our framework to assess CLIP models' understanding of 360-degree visual semantics. Image I_δ is obtained by applying a horizontal circular shift of δ pixels to I of size H × W.
Figure 4. (a) Example of a 360-degree panoramic image-text pair and (c) its CLIP score differences (s − s_δ) using ViT-B/32 across diverse shift distances, where the stability bound β = 1.7919.
Figure 6. (a) Example of an image-text pair containing the directional cue ("in the middle"). (b) Examples of base prompts (first row) and standardized prompts (second row) from BLIP-2 and ChatGPT. Format identifiers are highlighted in red font.
Figure 7. Boxplots of absolute score differences (|s_i − s_i^T|) under two diverse transformations for three CLIP models on the 360_real dataset.
Figure 8. CLIP scores of original 360-degree panoramic images using a frozen CLIP model and its three fine-tuned (FT) versions.
Figure 9. Flowchart for producing the two paired image-text datasets (360_real and 360_syn) used in the evaluation experiments.
Figure 10. Fine-tuning loss curves using different λ values of ViT-B/32 (OpenCLIP, LAION-400M).
Figure 11. Fine-tuning loss curves using different λ values of ViT-B/16 (OpenCLIP, LAION-400M).
Figure 12. Fine-tuning loss curves using different λ values of ViT-L/14 (OpenCLIP, LAION-400M).
Figure 13. CLIP scores of original 360-degree panoramic images using a frozen CLIP model and its three fine-tuned (FT) versions.
Figure 14. CLIP scores of original 360-degree panoramic images using a frozen CLIP model and its three fine-tuned versions with different fine-tuning methods. (I) and (I & T) denote fine-tuning on the image encoder and on both encoders, respectively.
Figure 15. CLIP scores of original 360-degree panoramic images using a frozen CLIP model and its three fine-tuned versions with different fine-tuning methods. (I) and (I & T) denote fine-tuning on the image encoder and on both encoders, respectively.
Figure 16. Examples of image-text pairs together with their horizontally flipped and circular-shifted versions. Unlike 360-degree panoramic images, circular shifts in perspective images introduce clear semantic distortions, providing a controlled way to test whether a model's score meaningfully reflects semantic alignment.
Figure 17. Boxplots of absolute score differences (|s_i − s_i^T|) under three diverse transformations for (a) SigLIP and (b) CLIP models on the per_syn dataset.
Figure 18. [ViT-B/32, OpenCLIP, LAION-400M] CLIP scores of an original 360-degree panoramic image ("<360panorama>, a hallway in a building") and its horizontally circular-shifted versions (δ_j = W/8 through 7W/8) using frozen and fine-tuned CLIP models, respectively; δ_j and W denote the shift distance and the image width.
Figure 19. [ViT-B/16, OpenCLIP, LAION-400M] CLIP scores of an original 360-degree panoramic image ("<360panorama>, a park with a lake and a beach") and its horizontally circular-shifted versions (δ_j = W/8 through 7W/8) using frozen and fine-tuned CLIP models, respectively; δ_j and W denote the shift distance and the image width.
Figure 20. [ViT-L/14, OpenCLIP, LAION-400M] CLIP scores of an original 360-degree panoramic image and its horizontally circular-shifted versions using frozen and fine-tuned CLIP models, respectively; δ_j and W denote the shift distance and the image width.
read the original abstract

The dream of instantly creating rich 360-degree panoramic worlds from text is rapidly becoming a reality, yet a crucial gap exists in our ability to reliably evaluate their semantic alignment. Contrastive Language-Image Pre-training (CLIP) models, standard AI evaluators, predominantly trained on perspective image-text pairs, face an open question regarding their understanding of the unique characteristics of 360-degree panoramic image-text pairs. This paper addresses this gap by first introducing two concepts: 360-degree textual semantics, semantic information conveyed by explicit format identifiers, and 360-degree visual semantics, invariant semantics under horizontal circular shifts. To probe CLIP's comprehension of these semantics, we then propose novel evaluation methodologies using keyword manipulation and horizontal circular shifts of varying magnitudes. Rigorous statistical analyses across popular CLIP configurations reveal that: (1) CLIP models effectively leverage explicit textual identifiers, demonstrating an understanding of 360-degree textual semantics; and (2) CLIP models fail to robustly preserve semantic alignment under horizontal circular shifts, indicating limited comprehension of 360-degree visual semantics. To address this limitation, we propose a LoRA-based fine-tuning framework that explicitly instills invariance to circular shifts. Our fine-tuned models exhibit improved comprehension of 360-degree visual semantics, though with a slight degradation in original semantic evaluation performance, highlighting a fundamental trade-off in adapting CLIP to 360-degree panoramic images. Code is available at https://github.com/littlewhitesea/360Semantics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the concepts of 360-degree textual semantics (semantic information from explicit format identifiers in text) and 360-degree visual semantics (invariant semantics under horizontal circular shifts of panoramic images). It proposes evaluation methods using keyword manipulation and horizontal circular shifts of varying magnitudes to probe CLIP models, finding that CLIP effectively leverages textual identifiers but fails to robustly preserve alignment under shifts. A LoRA-based fine-tuning framework is introduced to instill shift invariance, yielding improved 360 visual semantics performance at the cost of a slight degradation in standard semantic evaluation.

Significance. If the evaluation methodology is robust, this work identifies a meaningful limitation in CLIP for 360-degree panoramic content, which is increasingly relevant for text-to-360 generation. The empirical results across CLIP configurations and the proposed adaptation method provide actionable insights, with the public code release at https://github.com/littlewhitesea/360Semantics supporting reproducibility.

major comments (2)
  1. [Probing methodology for 360-degree visual semantics] The definition of 360-degree visual semantics as invariance under horizontal circular shifts assumes that such shifts preserve underlying semantic content exactly. In equirectangular projections, even small rolls can reposition polar distortion regions relative to CLIP's fixed patch grid or expose seam artifacts; these are non-semantic changes that could legitimately alter embeddings. This assumption is load-bearing for attributing any cosine-similarity drop to missing invariance rather than altered input (see the probing methodology and results sections).
  2. [Experimental setup and results] The abstract reports 'rigorous statistical analyses' and a 'performance trade-off after fine-tuning,' yet the support for the central claims cannot be fully verified without the full methods, exact shift magnitudes, datasets, and statistical tests (e.g., p-values or effect sizes). Please expand the experimental setup section to include these details.
minor comments (2)
  1. [Abstract] The abstract could briefly quantify the number of CLIP configurations tested and the magnitude of the observed trade-off to give readers immediate context.
  2. [Keyword manipulation procedure] Clarify whether the keyword manipulation for textual semantics introduces any unintended changes in sentence structure or length that might confound the isolation of format-identifier effects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for highlighting areas where the manuscript can be strengthened. We address each major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Probing methodology for 360-degree visual semantics] The definition of 360-degree visual semantics as invariance under horizontal circular shifts assumes that such shifts preserve underlying semantic content exactly. In equirectangular projections, even small rolls can reposition polar distortion regions relative to CLIP's fixed patch grid or expose seam artifacts; these are non-semantic changes that could legitimately alter embeddings. This assumption is load-bearing for attributing any cosine-similarity drop to missing invariance rather than altered input (see the probing methodology and results sections).

    Authors: We acknowledge that equirectangular projections introduce non-semantic variations under horizontal shifts, including changes in polar distortion relative to the patch grid and potential seam artifacts. Our definition of 360-degree visual semantics relies on the standard assumption in panoramic vision that horizontal circular shifts should preserve semantics for a rotationally invariant model. To address the concern, we will revise the probing methodology section to explicitly discuss these potential confounds, provide visualizations of shifted images highlighting distortion effects, and include a control analysis (e.g., comparing horizontal vs. vertical shifts) to isolate the contribution of shift invariance from projection artifacts. This will strengthen the attribution of observed similarity drops primarily to CLIP's limited robustness rather than input alterations. revision: partial

  2. Referee: [Experimental setup and results] The abstract reports 'rigorous statistical analyses' and a 'performance trade-off after fine-tuning,' yet the support for the central claims cannot be fully verified without the full methods, exact shift magnitudes, datasets, and statistical tests (e.g., p-values or effect sizes). Please expand the experimental setup section to include these details.

    Authors: We agree that the current experimental setup section lacks sufficient detail for full verification and reproducibility. In the revised manuscript, we will expand this section to specify: the exact horizontal shift magnitudes tested (multiples of one-eighth of the image width, δ_j = W/8 through 7W/8), the datasets used (including sources of panoramic image-text pairs and any preprocessing), and the statistical procedures (e.g., paired t-tests or Wilcoxon tests with reported p-values, effect sizes such as Cohen's d, and confidence intervals). The 'rigorous statistical analyses' refer to these tests applied to cosine similarity differences across conditions and models. We will also update the code repository to include the exact scripts and random seeds used for these analyses. revision: yes
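
The control analysis proposed in the response to point 1 — contrasting horizontal circular shifts, which should preserve panoramic semantics, with vertical shifts, which break the equirectangular geometry — could be run with a sketch like the following (numpy and PIL assumed; the file path is a placeholder):

```python
import numpy as np
from PIL import Image

def circular_shift(image: Image.Image, delta: int, axis: int) -> Image.Image:
    """Circular shift by delta pixels. axis=1 rolls horizontally (semantics-preserving
    for an equirectangular panorama); axis=0 rolls vertically, a control transformation
    that distorts the projection and should genuinely change the semantics."""
    return Image.fromarray(np.roll(np.asarray(image), shift=delta, axis=axis))

pano = Image.open("pano.jpg")
horizontal = circular_shift(pano, pano.width // 4, axis=1)   # score expected to stay stable
vertical = circular_shift(pano, pano.height // 4, axis=0)    # score expected to drop
```

Comparing the distributions of score drops under the two transformations helps separate projection and seam artifacts, which affect both, from the specifically semantics-preserving nature of horizontal rolls.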

Circularity Check

0 steps flagged

No circularity; purely empirical probing with independent measurements

full rationale

The paper defines 360-degree textual semantics (via explicit format identifiers) and 360-degree visual semantics (via invariance under horizontal circular shifts), then applies keyword manipulation and shift-based tests to pre-trained CLIP models and reports direct statistical outcomes. No equations, predictions, or first-principles derivations reduce the findings to quantities fitted from the same data. The LoRA fine-tuning step is a separate intervention whose results are measured independently. No load-bearing self-citations or self-definitional reductions appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on two newly introduced concepts whose definitions are internal to the paper and on the domain assumption that horizontal shifts preserve semantics.

axioms (1)
  • domain assumption Horizontal circular shifts of 360-degree images preserve semantic content for the purpose of alignment evaluation
    Invoked when using shift magnitude as a probe for visual semantics invariance.
invented entities (2)
  • 360-degree textual semantics no independent evidence
    purpose: Semantic information conveyed by explicit format identifiers in text
    Newly defined concept used to structure the keyword-manipulation experiments.
  • 360-degree visual semantics no independent evidence
    purpose: Invariant semantics under horizontal circular shifts
    Newly defined concept used to structure the circular-shift experiments.

pith-pipeline@v0.9.0 · 5573 in / 1479 out tokens · 61034 ms · 2026-05-08T04:22:13.309849+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

28 extracted references · 20 canonical work pages · 7 internal anchors

  1. [1]

    A survey of representation learning, optimization strategies, and applications for omnidirectional vision

    Ai, H., Cao, Z., and Wang, L. A survey of representation learning, optimization strategies, and applications for omnidirectional vision. arXiv preprint arXiv:2502.10444, 2025.

  2. [2]

    Pali: A jointly-scaled multilingual language-image model

    Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022.

  3. [3]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

  4. [4]

    Data Filtering Networks

    Fang, A., Jose, A. M., Jain, A., Schmidt, L., Toshev, A., and Shankar, V. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.

  5. [5]

    Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models

    Feng, M., Liu, J., Cui, M., and Xie, X. Diffusion360: Seamless 360 degree panoramic image generation based on diffusion models. arXiv preprint arXiv:2311.13141, 2023.

  6. [6]

    Clipscore: A reference-free evaluation metric for image captioning

    Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528, 2021.

  7. [7]

    Decoupled Weight Decay Regularization

    Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

  8. [8]

    Fastscene: Text-driven fast 3d indoor scene generation via panoramic gaussian splatting

    Ma, Y., Zhan, D., and Jin, Z. FastScene: Text-driven fast 3D indoor scene generation via panoramic Gaussian splatting. arXiv preprint arXiv:2405.05768, 2024.

  9. [9]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

  10. [10]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  11. [11]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

  12. [12]

    EVA-CLIP: Improved Training Techniques for CLIP at Scale

    Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.

  13. [13]

    Tang, Z., Lian, L., Eisape, S., Wang, X., Herzig, R., Yala, A., Suhr, A., Darrell, T., and Chan, D. M. TULIP: Towards unified language-image pretraining. arXiv preprint arXiv:2503.15485, 2025.

  14. [14]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Tschannen, M., Gritsenko, A., Wang, X., Naeem, M. F., Alabdulmohsin, I., Parthasarathy, N., Evans, T., Beyer, L., Xia, Y., Mustafa, B., et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

  15. [15]

    Customizing 360-degree panoramas through text-to-image diffusion models

    Wang, H., Xiang, X., Fan, Y., and Xue, J.-H. Customizing 360-degree panoramas through text-to-image diffusion models. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4933–4943, 2024.

  16. [16]

    360-degree panorama generation from few unregistered NFoV images

    Wang, J., Chen, Z., Ling, J., Xie, R., and Song, L. 360-degree panorama generation from few unregistered NFoV images. arXiv preprint arXiv:2308.14686, 2023.

  17. [17]

    CLIP in mirror: Disentangling text from visual images through reflection

    Wang, T., Yang, Y., Yang, L., Lin, S., Zhang, J., Guo, G., and Zhang, B. CLIP in mirror: Disentangling text from visual images through reflection. Advances in Neural Information Processing Systems, 37:24523–24546, 2024.

  18. [18]

    Demystifying CLIP data

    Xu, H., Xie, S., Tan, X. E., Huang, P.-Y., Howes, R., Sharma, V., Li, S.-W., Ghosh, G., Zettlemoyer, L., and Feichtenhofer, C. Demystifying CLIP data. arXiv preprint arXiv:2309.16671, 2023.

  19. [19]

    LayerPano3D: Layered 3D panorama for hyper-immersive scene generation

    Yang, S., Tan, J., Zhang, M., Wu, T., Li, Y., Wetzstein, G., Liu, Z., and Lin, D. LayerPano3D: Layered 3D panorama for hyper-immersive scene generation. arXiv preprint arXiv:2408.13252, 2024.

  20. [20]

    DiffPano: Scalable and consistent text to panorama generation with spherical epipolar-aware diffusion

    Ye, W., Ji, C., Chen, Z., Gao, J., Huang, X., Zhang, S.-H., Ouyang, W., He, T., Zhao, C., and Zhang, G. DiffPano: Scalable and consistent text to panorama generation with spherical epipolar-aware diffusion. arXiv preprint arXiv:2410.24203, 2024.

  21. [21]

    <360panorama>,

    Appendix A.1, Flowchart of Dataset Generation: Fig. 9 shows the flowchart to produce the two paired image-text datasets (360_real and 360_syn) used in the evaluation experiments.

  22. [22]

    U∗ = "" and U∗ = "image"

    Normality-test results for different CLIP models on the two paired image-text datasets (360_real and 360_syn), where the null hypothesis is that the distribution of the score differences (s − s_u) is normal and the significance level (α) is 0.01. The p-values less than α are in bold.

  23. [23]

    The results, detailed in Table 10, indicate that for all datasets, the p-values were below a commonly used significance level (α = 0.01)

    For evaluating the null hypothesis that |s − s_δj| is statistically greater than or equal to the stability bound β, we assessed the normality of the absolute score differences (|s − s_δj|) between the original and shifted CLIP scores using the Shapiro-Wilk test (Shapiro & Wilk, 1965). The results, detailed in Table 10, indicate that for all datasets the p-values were below a commonly used significance level (α = 0.01).

  24. [24]

    The p-values less than α are in bold

    Results under horizontal circular shifts of various δ_j pixels for different CLIP models on the 360_syn dataset, where the null hypothesis (H0) is that |s − s_δj| is greater than or equal to the stability bound β, and the significance level (α) is 0.01. The p-values less than α are in bold (ViT-B/32 stability bound β = 1.7699).

  25. [25]

    The p-values less than α are in bold

    Results under horizontal circular shifts of various δ_j pixels for different CLIP models on the 360_syn dataset, where the null hypothesis (H0) is that |s − s_δj| is greater than or equal to the stability bound β, and the significance level (α) is 0.01. The p-values less than α are in bold (ViT-B/32 stability bound β = 1.0822).

  26. [26]

    Generalization Capability of Fine-Tuned Models (Appendix G.1)

    Appendix G, Generalization Capability of Fine-Tuned Models: statistical results for fine-tuned ViT-B/32, ViT-B/16, and ViT-L/14 models under horizontal circular shifts of δ_j = W/8 through 7W/8 across λ values.

  27. [27]

    a long table with lots of plants in a greenhouse

    and their corresponding perspective images (1024×512 resolution) synthesized with SDXL (Podell et al., 2023), which we refer to as per_syn. Fig. 16 shows examples of the image-text pairs together with horizontally flipped and circular-shifted versions. Unlike 360-degree panoramic images, circular shifts in perspective images introduce clear semantic distortions.

  28. [28]

    <360panorama>, a large room with a large window

    For frozen CLIP models, the scores vary noticeably across different circular shifts, indicating that they fail to preserve stable semantic alignment under this transformation, consistent with our statistical results. In contrast, our fine-tuned models maintain stable scores across all shift magnitudes, demonstrating a stable and robust understanding of 360-degree visual semantics.