pith. machine review for the scientific record.

arxiv: 2604.00912 · v1 · submitted 2026-04-01 · 💻 cs.CV · cs.MM

ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Pith reviewed 2026-05-13 22:45 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords spatial augmented reality · projection-aware captioning · vision-language models · semantic segmentation · RGBP dataset · decoupled annotations · dual-captioning protocol

The pith

ProCap decouples projected digital content from physical scenes in spatial augmented reality to enable accurate captioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to let vision-language models reason correctly about spatial augmented reality by treating projected content and physical scenes as separate layers rather than a single mixed image. Standard models fail because projections overlap and distort real surfaces, creating semantic ambiguity. ProCap fixes this with a two-stage process: automated segmentation first isolates the virtual and physical parts of the image, then region-aware retrieval pulls descriptions that respect each layer's context and avoid distortion errors. A new RGBP dataset supplies over 180,000 projections across 65 scenes with dense decoupled annotations, and a dual-captioning protocol evaluates each layer independently. If the approach works, SAR systems gain a reliable semantic base for queries and interactions that treat real and projected elements distinctly.
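A minimal runnable sketch of that two-stage flow, to make the data dependencies concrete. Every module below (the brightness-threshold segmenter, the toy retrieval corpus, the template captioner) is a hypothetical stand-in for ProCap's learned components; only the segment-then-retrieve-then-caption structure comes from the paper.

```python
"""Toy sketch of decoupled SAR captioning: segment, then retrieve and
caption each layer independently. All modules are illustrative stand-ins."""
import numpy as np


def segment_projection(image: np.ndarray, thresh: float = 0.8) -> np.ndarray:
    # Stand-in for stage 1: projected pixels tend to be brighter than the
    # surface they land on, so a brightness threshold crudely isolates them.
    return image.mean(axis=-1) > thresh


def retrieve_context(mask: np.ndarray, layer: str) -> list[str]:
    # Stand-in for stage 2 region-aware retrieval: descriptions are keyed by
    # layer, so the other layer's (possibly distorted) semantics cannot leak in.
    corpus = {
        "physical": ["a wooden tabletop", "a plaster wall"],
        "projected": ["a projected traffic light", "a projected dolphin"],
    }
    return corpus[layer] if mask.any() else []


def caption_layer(layer: str, context: list[str]) -> str:
    # Stand-in for the VLM decoder, conditioned on layer-specific context.
    return f"[{layer}] " + (context[0] if context else "nothing detected")


def decoupled_captions(image: np.ndarray) -> dict[str, str]:
    mask = segment_projection(image)  # stage 1: isolate the two layers
    return {
        "physical": caption_layer("physical", retrieve_context(~mask, "physical")),
        "projected": caption_layer("projected", retrieve_context(mask, "projected")),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.uniform(size=(64, 64, 3))      # toy RGB frame
    img[16:48, 16:48] *= 1.5                 # a bright "projected" patch
    print(decoupled_captions(np.clip(img, 0.0, 1.0)))
```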

Core claim

ProCap shows that explicitly separating projected content from physical scenes, first isolating the layers via segmentation and then applying region-aware retrieval, produces accurate, independent captions for each layer, overcoming the virtual-physical confusion that defeats standard vision-language models in SAR environments.

What carries the argument

The two-stage ProCap pipeline: automated visual segmentation to isolate the virtual and physical layers, followed by region-aware retrieval to handle projection-induced semantic distortion.

Load-bearing premise

Standard vision-language models cannot resolve virtual-physical ambiguity in SAR scenes on their own, and the segmentation-plus-retrieval steps cleanly separate the layers without adding new errors from imperfect isolation or mismatched retrieval.

What would settle it

Running ProCap on fresh SAR scenes outside the RGBP dataset and finding that the generated captions still mix physical and projected elements when segmentation boundaries are inaccurate would show that the decoupling does not reliably resolve the ambiguity.
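A hedged sketch of how such a test could be run: perturb the projection mask boundary and count how often each layer's caption names objects from the other layer. The `caption_layers` hook and both vocabularies are hypothetical; a real study would plug in ProCap's released models and held-out SAR scenes.

```python
"""Boundary-stress harness: erode or dilate the projection mask and measure
cross-layer leakage in the resulting captions. Illustrative only."""
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

PROJECTED_WORDS = {"dolphin", "traffic light"}  # assumed projected-only terms
PHYSICAL_WORDS = {"tabletop", "wall"}           # assumed physical-only terms


def caption_layers(image: np.ndarray, mask: np.ndarray) -> dict[str, str]:
    """Hook for the captioner under test: per-layer captions given a mask."""
    raise NotImplementedError  # plug in the model being evaluated


def leak_score(captions: dict[str, str]) -> float:
    # 1.0 if both layers mention the other layer's objects, 0.0 if neither.
    phys_leak = any(w in captions["physical"] for w in PROJECTED_WORDS)
    proj_leak = any(w in captions["projected"] for w in PHYSICAL_WORDS)
    return (phys_leak + proj_leak) / 2


def boundary_stress_test(image: np.ndarray, true_mask: np.ndarray,
                         offsets=(-8, -4, 0, 4, 8)) -> dict[int, float]:
    # Negative offsets erode the mask boundary, positive offsets dilate it.
    scores = {}
    for px in offsets:
        m = true_mask
        if px > 0:
            m = binary_dilation(true_mask, iterations=px)
        elif px < 0:
            m = binary_erosion(true_mask, iterations=-px)
        scores[px] = leak_score(caption_layers(image, m))
    return scores
```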

Figures

Figures reproduced from arXiv: 2604.00912 by Bingyao Huang, Haibin Ling, Yuchen Deng, Zimo Cao.

Figure 1. ProCap framework for decoupled SAR scene captioning. Standard VLMs […]
Figure 2. Configuration of the RGBP dataset capture environment.
Figure 3. RGBP scenes used to train and evaluate models. We show […]
Figure 4. Overview of the proposed ProCap architecture. Given an observed image I containing both physical scene and projected content, a frozen vision transformer (ViT-g) backbone first extracts coarse features Zc, which are refined into U(Zc) by a feature refinement module U(·). A projection segmentation module S is employed to estimate a coarse projection mask Im, enabling mask pooling to retain projection features […]
Figure 5. Qualitative comparison of descriptive ability in complex SAR scenes. We evaluate ProCap […]
Supplementary Figure 1. This figure shows the specific details of 60 seen scenes with projected content A (traffic lights) and projected content B (dolphins).
Supplementary Figure 2. Examples of projection mapping based on TOSHIBA TDP-T100C DLP projector (1024 […]
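The Figure 4 caption names a mask-pooling step that keeps only the refined patch features U(Zc) lying under the estimated projection mask Im. A minimal sketch under stated assumptions (a square ViT token grid, majority-vote downsampling of the pixel mask to that grid); the symbol names come from the caption, the implementation details do not.

```python
"""Sketch of mask pooling over ViT patch tokens, per the Figure 4 caption.
Grid shape and mask downsampling rule are assumptions."""
import torch
import torch.nn.functional as F


def mask_pool(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """features: (B, N, D) refined patch tokens U(Zc), N a square number
    mask:     (B, H, W) binary projection mask Im
    returns:  (B, D) feature averaged over projection-covered patches."""
    B, N, D = features.shape
    side = int(N ** 0.5)  # assume a square token grid
    # Downsample the pixel mask to the token grid, then call a patch
    # "projection" if the majority of its pixels are masked.
    m = F.adaptive_avg_pool2d(mask.float().unsqueeze(1), side)  # (B, 1, s, s)
    m = (m > 0.5).flatten(1).float()                            # (B, N)
    denom = m.sum(dim=1, keepdim=True).clamp(min=1.0)           # avoid /0
    return (features * m.unsqueeze(-1)).sum(dim=1) / denom      # (B, D)


# shape check: 16x16 tokens (as from a ViT-g/14 at 224 px) and a centered mask
feats = torch.randn(2, 256, 1024)
mask = torch.zeros(2, 224, 224)
mask[:, 64:160, 64:160] = 1.0
print(mask_pool(feats, mask).shape)  # torch.Size([2, 1024])
```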
original abstract

Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.
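The dual-captioning protocol above scores each layer independently by switching a task-specific token in the query. A sketch of what such a call could look like; the token strings and the `generate` hook are assumptions, as the paper defines its own tokens.

```python
"""Illustrative dual-captioning query: one independent call per layer,
selected by a task token. Token names and the model hook are assumed."""

TASK_TOKENS = {"physical": "<physical>", "projection": "<projection>"}


def generate(prompt: str, image_path: str) -> str:
    """Hypothetical hook to the captioning model."""
    raise NotImplementedError


def dual_caption(image_path: str) -> dict[str, str]:
    # Querying per layer lets each caption be scored against its own
    # decoupled reference set, never against the mixed scene.
    return {
        layer: generate(f"{token} Describe this layer.", image_path)
        for layer, token in TASK_TOKENS.items()
    }
```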

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProCap, a two-stage framework for projection-aware captioning in spatial augmented reality that decouples projected digital content from physical scenes: first via automated segmentation for visual layer isolation, then region-aware retrieval to mitigate projection distortion and VLM ambiguity. It contributes the RGBP dataset (65 scenes, >180k projections with dense decoupled annotations) and a dual-captioning evaluation protocol using task-specific tokens. Experiments are claimed to demonstrate robustness for semantic distinction in SAR scenes.

Significance. If the decoupling holds without error propagation, the work supplies the first large-scale SAR semantic benchmark and a practical pipeline that could enable reliable VLM-based reasoning and interaction in projector-based environments, directly supporting future SAR research.

major comments (2)
  1. [Abstract and Experiments] The claim that experiments demonstrate robustness is unsupported by any reported quantitative metrics, baselines, error analysis, or explicit handling of projection distortion, leaving the central robustness assertion without empirical grounding.
  2. [Methods] In the two-stage pipeline's visual isolation stage, segmentation accuracy is not validated on real SAR scenes (no IoU, no ablation on geometric distortion, lighting bleed, or texture overlap); mis-isolated pixels would corrupt the input to region-aware retrieval and undermine the decoupling claim.
minor comments (1)
  1. [Dataset] Provide explicit statistics on annotation density, scene diversity metrics, and projection variation to allow readers to assess benchmark quality.
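For the minor comment, an illustrative sketch of the requested statistics, computed from a hypothetical per-annotation table; the column names are assumptions, not the RGBP release format.

```python
"""Toy benchmark-statistics pass: annotation density and projection variety
per scene, over an assumed annotation table."""
import pandas as pd

ann = pd.DataFrame({            # toy rows standing in for RGBP annotations
    "scene_id":   [0, 0, 1, 1, 1, 2],
    "projection": ["dolphin", "traffic", "dolphin", "logo", "logo", "text"],
    "n_regions":  [4, 6, 3, 5, 5, 2],
})

density = ann.groupby("scene_id")["n_regions"].mean()      # annotation density
variety = ann.groupby("scene_id")["projection"].nunique()  # projection variation
print("mean annotated regions per scene:\n", density, sep="")
print("distinct projections per scene:\n", variety, sep="")
```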

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional quantitative evidence is needed to support the robustness claims and will revise the manuscript accordingly. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The claim that experiments demonstrate robustness is unsupported by any reported quantitative metrics, baselines, error analysis, or explicit handling of projection distortion, leaving the central robustness assertion without empirical grounding.

    Authors: We acknowledge that the current Experiments section relies primarily on qualitative examples to illustrate ProCap's decoupling performance. In the revision we will add quantitative metrics including captioning scores (e.g., BLEU, CIDEr) comparing ProCap against standard VLM baselines, an explicit error analysis of projection distortion effects, and ablation results that quantify robustness gains. These additions will provide the requested empirical grounding. revision: yes

  2. Referee: [Methods] In the two-stage pipeline's visual isolation stage, segmentation accuracy is not validated on real SAR scenes (no IoU, no ablation on geometric distortion, lighting bleed, or texture overlap); mis-isolated pixels would corrupt the input to region-aware retrieval and undermine the decoupling claim.

    Authors: We agree that explicit validation of the visual isolation stage on real SAR data is necessary. The revised Methods section will include IoU measurements on a held-out subset of RGBP real scenes, together with ablations that isolate the effects of geometric distortion, lighting bleed, and texture overlap. These results will quantify segmentation reliability and show that residual errors do not materially degrade the subsequent region-aware retrieval. revision: yes
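Both responses promise quantitative grounding; a self-contained sketch of the two measurements involved follows. Mask IoU covers the isolation stage and sentence-level BLEU covers per-layer captions (CIDEr would need the coco-caption tooling and is omitted); every mask, caption, and reference below is illustrative.

```python
"""Self-contained examples of the promised metrics: mask IoU for the
segmentation stage, per-layer BLEU for the captions. Data is illustrative."""
import numpy as np
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Standard intersection-over-union of two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)


# segmentation check: a 64x64 square predicted 8 px off its true position
gt = np.zeros((128, 128), dtype=bool)
gt[32:96, 32:96] = True
pred = np.zeros_like(gt)
pred[40:104, 32:96] = True
print(f"segmentation IoU = {mask_iou(pred, gt):.3f}")  # ~0.778

# captioning check: per-layer BLEU against decoupled reference captions
refs = {
    "physical": [["a", "wooden", "table", "near", "a", "wall"]],
    "projected": [["a", "dolphin", "projected", "on", "the", "table"]],
}
hyps = {
    "physical": ["a", "wooden", "table", "by", "a", "wall"],
    "projected": ["a", "projected", "dolphin", "on", "a", "table"],
}
smooth = SmoothingFunction().method1
for layer in refs:
    bleu = sentence_bleu(refs[layer], hyps[layer], smoothing_function=smooth)
    print(f"{layer} BLEU = {bleu:.3f}")
```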

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces ProCap as a new two-stage pipeline (visual isolation via segmentation followed by region-aware retrieval) and the RGBP dataset as independent contributions. No equations, fitted parameters renamed as predictions, self-citations that bear the central load, or self-definitional reductions appear in the abstract or described content. The decoupling claim is presented as a methodological advance rather than reducing by construction to its inputs, and the dual-captioning protocol is defined separately. This is a standard case of a self-contained empirical contribution with no circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that VLMs inherently confuse virtual and physical layers and that automated segmentation plus retrieval can isolate them reliably; no free parameters or invented entities beyond the framework and dataset are specified in the abstract.

axioms (1)
  • domain assumption: Standard VLMs struggle with virtual-physical ambiguity due to projection distortion
    Explicitly stated as motivation in the abstract.
invented entities (2)
  • ProCap framework (no independent evidence)
    purpose: Decouple projected content from physical scenes for accurate captioning
    Newly proposed two-stage pipeline
  • RGBP dataset (no independent evidence)
    purpose: Large-scale SAR semantic benchmark with decoupled annotations
    Claimed as the first such dataset

pith-pipeline@v0.9.0 · 5525 in / 1123 out tokens · 30330 ms · 2026-05-13T22:45:28.330225+00:00 · methodology
