pith. machine review for the scientific record.

arxiv: 2604.00912 · v1 · submitted 2026-04-01 · 💻 cs.CV · cs.MM

ProCap: Projection-Aware Captioning for Spatial Augmented Reality

Pith reviewed 2026-05-13 22:45 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords spatial augmented reality · projection-aware captioning · vision-language models · semantic segmentation · RGBP dataset · decoupled annotations · dual-captioning protocol

The pith

ProCap decouples projected digital content from physical scenes in spatial augmented reality to enable accurate captioning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to let vision-language models reason correctly about spatial augmented reality by treating projected content and physical scenes as separate layers rather than a single mixed image. Standard models fail because projections overlap and distort real surfaces, creating semantic ambiguity. ProCap fixes this with a two-stage process: automated segmentation first isolates the virtual and physical parts of the image, then region-aware retrieval pulls descriptions that respect each layer's context and avoid distortion errors. A new RGBP dataset supplies over 180,000 projections across 65 scenes with dense decoupled annotations, and a dual-captioning protocol evaluates each layer independently. If the approach works, SAR systems gain a reliable semantic base for queries and interactions that treat real and projected elements distinctly.
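A minimal runnable sketch of that two-stage flow, to make the data dependencies concrete. Every module below (the brightness-threshold segmenter, the toy retrieval corpus, the template captioner) is a hypothetical stand-in for ProCap's learned components; only the segment-then-retrieve-then-caption structure comes from the paper.

```python
"""Toy sketch of decoupled SAR captioning: segment, then retrieve and
caption each layer independently. All modules are illustrative stand-ins."""
import numpy as np


def segment_projection(image: np.ndarray, thresh: float = 0.8) -> np.ndarray:
    # Stand-in for stage 1: projected pixels tend to be brighter than the
    # surface they land on, so a brightness threshold crudely isolates them.
    return image.mean(axis=-1) > thresh


def retrieve_context(mask: np.ndarray, layer: str) -> list[str]:
    # Stand-in for stage 2 region-aware retrieval: descriptions are keyed by
    # layer, so the other layer's (possibly distorted) semantics cannot leak in.
    corpus = {
        "physical": ["a wooden tabletop", "a plaster wall"],
        "projected": ["a projected traffic light", "a projected dolphin"],
    }
    return corpus[layer] if mask.any() else []


def caption_layer(layer: str, context: list[str]) -> str:
    # Stand-in for the VLM decoder, conditioned on layer-specific context.
    return f"[{layer}] " + (context[0] if context else "nothing detected")


def decoupled_captions(image: np.ndarray) -> dict[str, str]:
    mask = segment_projection(image)  # stage 1: isolate the two layers
    return {
        "physical": caption_layer("physical", retrieve_context(~mask, "physical")),
        "projected": caption_layer("projected", retrieve_context(mask, "projected")),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.uniform(size=(64, 64, 3))      # toy RGB frame
    img[16:48, 16:48] *= 1.5                 # a bright "projected" patch
    print(decoupled_captions(np.clip(img, 0.0, 1.0)))
```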

Core claim

ProCap shows that explicitly separating projected content from physical scenes, first isolating the layers via segmentation and then applying region-aware retrieval, produces accurate, independent captions for each layer, overcoming the virtual-physical confusion that defeats standard vision-language models in SAR environments.

What carries the argument

The two-stage ProCap pipeline: automated visual segmentation to isolate the virtual and physical layers, followed by region-aware retrieval to handle projection-induced semantic distortion.

Load-bearing premise

Standard vision-language models cannot resolve virtual-physical ambiguity in SAR scenes on their own, and the segmentation-plus-retrieval steps cleanly separate the layers without adding new errors from imperfect isolation or mismatched retrieval.

What would settle it

Running ProCap on fresh SAR scenes outside the RGBP dataset and finding that the generated captions still mix physical and projected elements when segmentation boundaries are inaccurate would show that the decoupling does not reliably resolve the ambiguity.
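A hedged sketch of how such a test could be run: perturb the projection mask boundary and count how often each layer's caption names objects from the other layer. The `caption_layers` hook and both vocabularies are hypothetical; a real study would plug in ProCap's released models and held-out SAR scenes.

```python
"""Boundary-stress harness: erode or dilate the projection mask and measure
cross-layer leakage in the resulting captions. Illustrative only."""
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

PROJECTED_WORDS = {"dolphin", "traffic light"}  # assumed projected-only terms
PHYSICAL_WORDS = {"tabletop", "wall"}           # assumed physical-only terms


def caption_layers(image: np.ndarray, mask: np.ndarray) -> dict[str, str]:
    """Hook for the captioner under test: per-layer captions given a mask."""
    raise NotImplementedError  # plug in the model being evaluated


def leak_score(captions: dict[str, str]) -> float:
    # 1.0 if both layers mention the other layer's objects, 0.0 if neither.
    phys_leak = any(w in captions["physical"] for w in PROJECTED_WORDS)
    proj_leak = any(w in captions["projected"] for w in PHYSICAL_WORDS)
    return (phys_leak + proj_leak) / 2


def boundary_stress_test(image: np.ndarray, true_mask: np.ndarray,
                         offsets=(-8, -4, 0, 4, 8)) -> dict[int, float]:
    # Negative offsets erode the mask boundary, positive offsets dilate it.
    scores = {}
    for px in offsets:
        m = true_mask
        if px > 0:
            m = binary_dilation(true_mask, iterations=px)
        elif px < 0:
            m = binary_erosion(true_mask, iterations=-px)
        scores[px] = leak_score(caption_layers(image, m))
    return scores
```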

Figures

Figures reproduced from arXiv: 2604.00912 by Bingyao Huang, Haibin Ling, Yuchen Deng, Zimo Cao.

Figure 1. ProCap framework for decoupled SAR scene captioning. Standard VLMs […]
Figure 2. Configuration of the RGBP dataset capture environment.
Figure 3. RGBP scenes used to train and evaluate models. We show […]
Figure 4. Overview of the proposed ProCap architecture. Given an observed image I containing both physical scene and projected content, a frozen vision transformer (ViT-g) backbone first extracts coarse features Zc, which are refined into U(Zc) by a feature refinement module U(·). A projection segmentation module S is employed to estimate a coarse projection mask Im, enabling mask pooling to retain projection features […]
Figure 5. Qualitative comparison of descriptive ability in complex SAR scenes. We evaluate ProCap […]
Supplementary Figure 1. This figure shows the specific details of 60 seen scenes with projected content A (traffic lights) and projected content B (dolphins).
Supplementary Figure 2. Examples of projection mapping based on TOSHIBA TDP-T100C DLP projector (1024 […]
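The Figure 4 caption names a mask-pooling step that keeps only the refined patch features U(Zc) lying under the estimated projection mask Im. A minimal sketch under stated assumptions (a square ViT token grid, majority-vote downsampling of the pixel mask to that grid); the symbol names come from the caption, the implementation details do not.

```python
"""Sketch of mask pooling over ViT patch tokens, per the Figure 4 caption.
Grid shape and mask downsampling rule are assumptions."""
import torch
import torch.nn.functional as F


def mask_pool(features: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """features: (B, N, D) refined patch tokens U(Zc), N a square number
    mask:     (B, H, W) binary projection mask Im
    returns:  (B, D) feature averaged over projection-covered patches."""
    B, N, D = features.shape
    side = int(N ** 0.5)  # assume a square token grid
    # Downsample the pixel mask to the token grid, then call a patch
    # "projection" if the majority of its pixels are masked.
    m = F.adaptive_avg_pool2d(mask.float().unsqueeze(1), side)  # (B, 1, s, s)
    m = (m > 0.5).flatten(1).float()                            # (B, N)
    denom = m.sum(dim=1, keepdim=True).clamp(min=1.0)           # avoid /0
    return (features * m.unsqueeze(-1)).sum(dim=1) / denom      # (B, D)


# shape check: 16x16 tokens (as from a ViT-g/14 at 224 px) and a centered mask
feats = torch.randn(2, 256, 1024)
mask = torch.zeros(2, 224, 224)
mask[:, 64:160, 64:160] = 1.0
print(mask_pool(feats, mask).shape)  # torch.Size([2, 1024])
```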
original abstract

Spatial augmented reality (SAR) directly projects digital content onto physical scenes using projectors, creating immersive experience without head-mounted displays. However, for SAR to support intelligent interaction, such as reasoning about the scene or answering user queries, it must semantically distinguish between the physical scene and the projected content. Standard Vision Language Models (VLMs) struggle with this virtual-physical ambiguity, often confusing the two contexts. To address this issue, we introduce ProCap, a novel framework that explicitly decouples projected content from physical scenes. ProCap employs a two-stage pipeline: first it visually isolates virtual and physical layers via automated segmentation; then it uses region-aware retrieval to avoid ambiguous semantic context due to projection distortion. To support this, we present RGBP (RGB + Projections), the first large-scale SAR semantic benchmark dataset, featuring 65 diverse physical scenes and over 180,000 projections with dense, decoupled annotations. Finally, we establish a dual-captioning evaluation protocol using task-specific tokens to assess physical scene and projection descriptions independently. Our experiments show that ProCap provides a robust semantic foundation for future SAR research. The source code, pre-trained models and the RGBP dataset are available on the project page: https://ZimoCao.github.io/ProCap/.
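The dual-captioning protocol above scores each layer independently by switching a task-specific token in the query. A sketch of what such a call could look like; the token strings and the `generate` hook are assumptions, as the paper defines its own tokens.

```python
"""Illustrative dual-captioning query: one independent call per layer,
selected by a task token. Token names and the model hook are assumed."""

TASK_TOKENS = {"physical": "<physical>", "projection": "<projection>"}


def generate(prompt: str, image_path: str) -> str:
    """Hypothetical hook to the captioning model."""
    raise NotImplementedError


def dual_caption(image_path: str) -> dict[str, str]:
    # Querying per layer lets each caption be scored against its own
    # decoupled reference set, never against the mixed scene.
    return {
        layer: generate(f"{token} Describe this layer.", image_path)
        for layer, token in TASK_TOKENS.items()
    }
```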

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProCap, a two-stage framework for projection-aware captioning in spatial augmented reality that decouples projected digital content from physical scenes: first via automated segmentation for visual layer isolation, then region-aware retrieval to mitigate projection distortion and VLM ambiguity. It contributes the RGBP dataset (65 scenes, >180k projections with dense decoupled annotations) and a dual-captioning evaluation protocol using task-specific tokens. Experiments are claimed to demonstrate robustness for semantic distinction in SAR scenes.

Significance. If the decoupling holds without error propagation, the work supplies the first large-scale SAR semantic benchmark and a practical pipeline that could enable reliable VLM-based reasoning and interaction in projector-based environments, directly supporting future SAR research.

major comments (2)
  1. [Abstract and Experiments] The claim that experiments demonstrate robustness is unsupported by any reported quantitative metrics, baselines, error analysis, or explicit handling of projection distortion, leaving the central robustness assertion without empirical grounding.
  2. [Methods] In the two-stage pipeline's visual isolation stage, segmentation accuracy is not validated on real SAR scenes (no IoU, no ablation on geometric distortion, lighting bleed, or texture overlap); mis-isolated pixels would corrupt the input to region-aware retrieval and undermine the decoupling claim.
minor comments (1)
  1. [Dataset] Provide explicit statistics on annotation density, scene diversity metrics, and projection variation to allow readers to assess benchmark quality.
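For the minor comment, an illustrative sketch of the requested statistics, computed from a hypothetical per-annotation table; the column names are assumptions, not the RGBP release format.

```python
"""Toy benchmark-statistics pass: annotation density and projection variety
per scene, over an assumed annotation table."""
import pandas as pd

ann = pd.DataFrame({            # toy rows standing in for RGBP annotations
    "scene_id":   [0, 0, 1, 1, 1, 2],
    "projection": ["dolphin", "traffic", "dolphin", "logo", "logo", "text"],
    "n_regions":  [4, 6, 3, 5, 5, 2],
})

density = ann.groupby("scene_id")["n_regions"].mean()      # annotation density
variety = ann.groupby("scene_id")["projection"].nunique()  # projection variation
print("mean annotated regions per scene:\n", density, sep="")
print("distinct projections per scene:\n", variety, sep="")
```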

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that additional quantitative evidence is needed to support the robustness claims and will revise the manuscript accordingly. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The claim that experiments demonstrate robustness is unsupported by any reported quantitative metrics, baselines, error analysis, or explicit handling of projection distortion, leaving the central robustness assertion without empirical grounding.

    Authors: We acknowledge that the current Experiments section relies primarily on qualitative examples to illustrate ProCap's decoupling performance. In the revision we will add quantitative metrics including captioning scores (e.g., BLEU, CIDEr) comparing ProCap against standard VLM baselines, an explicit error analysis of projection distortion effects, and ablation results that quantify robustness gains. These additions will provide the requested empirical grounding. revision: yes

  2. Referee: [Methods] In the two-stage pipeline's visual isolation stage, segmentation accuracy is not validated on real SAR scenes (no IoU, no ablation on geometric distortion, lighting bleed, or texture overlap); mis-isolated pixels would corrupt the input to region-aware retrieval and undermine the decoupling claim.

    Authors: We agree that explicit validation of the visual isolation stage on real SAR data is necessary. The revised Methods section will include IoU measurements on a held-out subset of RGBP real scenes, together with ablations that isolate the effects of geometric distortion, lighting bleed, and texture overlap. These results will quantify segmentation reliability and show that residual errors do not materially degrade the subsequent region-aware retrieval. revision: yes
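Both responses promise quantitative grounding; a self-contained sketch of the two measurements involved follows. Mask IoU covers the isolation stage and sentence-level BLEU covers per-layer captions (CIDEr would need the coco-caption tooling and is omitted); every mask, caption, and reference below is illustrative.

```python
"""Self-contained examples of the promised metrics: mask IoU for the
segmentation stage, per-layer BLEU for the captions. Data is illustrative."""
import numpy as np
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Standard intersection-over-union of two boolean masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(np.logical_and(pred, gt).sum() / union)


# segmentation check: a 64x64 square predicted 8 px off its true position
gt = np.zeros((128, 128), dtype=bool)
gt[32:96, 32:96] = True
pred = np.zeros_like(gt)
pred[40:104, 32:96] = True
print(f"segmentation IoU = {mask_iou(pred, gt):.3f}")  # ~0.778

# captioning check: per-layer BLEU against decoupled reference captions
refs = {
    "physical": [["a", "wooden", "table", "near", "a", "wall"]],
    "projected": [["a", "dolphin", "projected", "on", "the", "table"]],
}
hyps = {
    "physical": ["a", "wooden", "table", "by", "a", "wall"],
    "projected": ["a", "projected", "dolphin", "on", "a", "table"],
}
smooth = SmoothingFunction().method1
for layer in refs:
    bleu = sentence_bleu(refs[layer], hyps[layer], smoothing_function=smooth)
    print(f"{layer} BLEU = {bleu:.3f}")
```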

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces ProCap as a new two-stage pipeline (visual isolation via segmentation followed by region-aware retrieval) and the RGBP dataset as independent contributions. No equations, fitted parameters renamed as predictions, self-citations that bear the central load, or self-definitional reductions appear in the abstract or described content. The decoupling claim is presented as a methodological advance rather than reducing by construction to its inputs, and the dual-captioning protocol is defined separately. This is a standard case of a self-contained empirical contribution with no circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that VLMs inherently confuse virtual and physical layers and that automated segmentation plus retrieval can isolate them reliably; no free parameters or invented entities beyond the framework and dataset are specified in the abstract.

axioms (1)
  • domain assumption: Standard VLMs struggle with virtual-physical ambiguity due to projection distortion
    Explicitly stated as motivation in the abstract.
invented entities (2)
  • ProCap framework (no independent evidence)
    purpose: Decouple projected content from physical scenes for accurate captioning
    Newly proposed two-stage pipeline
  • RGBP dataset (no independent evidence)
    purpose: Large-scale SAR semantic benchmark with decoupled annotations
    Claimed as the first such dataset

pith-pipeline@v0.9.0 · 5525 in / 1123 out tokens · 30330 ms · 2026-05-13T22:45:28.330225+00:00 · methodology
