Recognition: unknown
E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
Pith reviewed 2026-05-10 05:17 UTC · model grok-4.3
The pith
Vision-language models show a large gap from humans on active 5-DoF viewpoint planning for 3D visual search.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
E3VS-Bench shows that all evaluated VLMs exhibit a substantial performance gap from humans on tasks requiring active 5-DoF viewpoint control and coherent planning, even though the scenes are rendered with 3D Gaussian Splatting to preserve fine-grained visual details that demand multi-view inspection.
What carries the argument
E3VS-Bench: a collection of 99 3D Gaussian Splatting scenes and 2,014 question-driven episodes that force agents to execute sequences of 5-DoF viewpoint changes to resolve viewpoint-dependent questions.
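The excerpt above does not publish a data schema; as a minimal sketch of what a question-driven episode with 5-DoF viewpoint control might look like (all field and type names here are hypothetical, not the benchmark's actual format):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Viewpoint5DoF:
    """Camera state with 5 degrees of freedom: 3D position plus yaw and pitch (no roll)."""
    x: float
    y: float
    z: float
    yaw: float    # rotation about the vertical axis, radians
    pitch: float  # elevation of the view direction, radians

@dataclass
class Episode:
    """One question-driven episode; illustrative fields only."""
    scene_id: str                    # one of the 99 3DGS scenes
    question: str                    # viewpoint-dependent question
    start_view: Viewpoint5DoF        # initial camera placement
    ground_truth_answer: str
    visited_views: List[Viewpoint5DoF] = field(default_factory=list)

def apply_action(v: Viewpoint5DoF, dx=0.0, dy=0.0, dz=0.0, dyaw=0.0, dpitch=0.0) -> Viewpoint5DoF:
    """A 5-DoF viewpoint change: translate the camera and re-aim it."""
    return Viewpoint5DoF(v.x + dx, v.y + dy, v.z + dz, v.yaw + dyaw, v.pitch + dpitch)
```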
If this is right
- Embodied visual search requires agents to plan and execute viewpoint sequences rather than rely on passive or single-frame observations.
- 3D Gaussian Splatting reconstructions enable questions based on small text, internal contents, and angle-specific attributes that mesh-based simulators often lose.
- Strong 2D reasoning in current models does not automatically produce competent 3D active perception under full viewpoint freedom.
- Human performance on these episodes provides a concrete target for measuring progress in coherent viewpoint planning.
- The benchmark isolates the specific failure mode of active perception that must be addressed for real-world 3D tasks.
Where Pith is reading between the lines
- Adding explicit 3D world models or memory of past viewpoints could help agents generate better planning sequences; a minimal sketch of such a memory-keeping agent loop follows this list.
- The same 5-DoF requirement may appear in robotic manipulation tasks where camera placement directly affects grasp or inspection success.
- Training regimes that reward multi-view consistency or reconstruction accuracy might narrow the observed human-model gap.
- Extending the benchmark to dynamic scenes or real-robot captures would test whether the identified limitations persist outside simulation.
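As a concrete reading of the first bullet above, here is a minimal sketch of an agent loop that keeps an explicit memory of past viewpoints and rendered observations. It reuses the hypothetical Viewpoint5DoF/apply_action types from the episode sketch earlier; `renderer` and `policy_vlm` are assumed interfaces, not the paper's code:

```python
def run_episode(episode, policy_vlm, renderer, max_steps=10):
    """Sketch of an active-perception loop with explicit viewpoint memory.
    `renderer` and `policy_vlm` are assumed interfaces, not the paper's pipeline."""
    view = episode.start_view
    memory = []  # (viewpoint, rendered image) pairs seen so far
    for _ in range(max_steps):
        image = renderer(episode.scene_id, view)       # free-viewpoint 3DGS render
        memory.append((view, image))
        # The VLM sees the question plus the full trajectory so far and either
        # answers or proposes the next 5-DoF move (assumed decision format).
        decision = policy_vlm(question=episode.question, history=memory)
        if decision["type"] == "answer":
            return decision["answer"]
        view = apply_action(view, **decision["delta"])  # 5-DoF viewpoint change
    # Out of budget: force the model to commit to an answer from what it has seen.
    return policy_vlm(question=episode.question, history=memory, force_answer=True)["answer"]
```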
Load-bearing premise
The questions in the benchmark genuinely cannot be answered from a single view or constrained motion and instead require active 5-DoF inspection, while 3D Gaussian Splatting scenes retain the fine details needed for those questions.
What would settle it
A model achieving human-level accuracy on E3VS-Bench while restricted to single-view or 2D inputs, or human subjects dropping to model-level performance when limited to the same restricted inputs.
read the original abstract
Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.
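For concreteness, the "unrestricted 5-DoF viewpoint control" described in the abstract amounts to choosing a 3D camera position plus yaw and pitch, with roll held fixed. A small numpy sketch, under one plausible convention (y-up world, yaw about the vertical axis, then pitch), of turning such a pose into a camera-to-world matrix:

```python
import numpy as np

def pose_5dof_to_c2w(x, y, z, yaw, pitch):
    """Camera-to-world matrix from a 5-DoF pose (position + yaw + pitch, zero roll).
    Convention assumed here: y-up world, yaw about +y, pitch about the camera's x-axis."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    R_yaw = np.array([[ cy, 0.0,  sy],
                      [0.0, 1.0, 0.0],
                      [-sy, 0.0,  cy]])
    R_pitch = np.array([[1.0, 0.0, 0.0],
                        [0.0,  cp, -sp],
                        [0.0,  sp,  cp]])
    R = R_yaw @ R_pitch          # aim the camera: pitch in the camera frame, then yaw in the world
    c2w = np.eye(4)
    c2w[:3, :3] = R
    c2w[:3, 3] = [x, y, z]
    return c2w
```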
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces E3VS-Bench, a benchmark for embodied 3D visual search consisting of 99 high-fidelity scenes reconstructed via 3D Gaussian Splatting and 2,014 question-driven episodes. Agents must actively control 5-DoF viewpoints to gather evidence for questions that depend on fine-grained, viewpoint-specific details (e.g., visibility changes from vertical shifts, contents inside containers, or angle-dependent attributes) that cannot be resolved from single views or constrained motion. The work evaluates multiple state-of-the-art VLMs against human performance and reports a substantial gap, attributing it to limitations in active perception and coherent viewpoint planning under full 5-DoF control.
Significance. If the episodes are validated to require unrestricted 5-DoF exploration and 3DGS is shown to preserve necessary fine details better than mesh-based alternatives, the benchmark would fill a clear gap in existing embodied AI and visual search evaluations (which rely on static observations or limited egocentric motion). The human-model comparison provides a concrete baseline for measuring progress in viewpoint-dependent reasoning, and the photorealistic rendering choice is a technical strength that enables more realistic question design than prior simulators.
major comments (2)
- [Benchmark construction (§3) and Experiments (§5)] The headline claim that models lag humans specifically due to failures in active 5-DoF viewpoint planning (abstract and §5) depends on every one of the 2,014 episodes being unsolvable from a single view or constrained (e.g., 3-DoF) motion. No single-view oracle accuracy, restricted-DoF human trials, or per-question validation is reported to confirm this; without such checks the observed gap could arise from generic VLM reasoning limits rather than the claimed 5-DoF active-perception deficit.
- [Benchmark construction (§3)] §4 (or equivalent benchmark design section): the assertion that 3DGS 'preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators' is load-bearing for the claim that the questions genuinely require multi-view 5-DoF inspection, yet no quantitative comparison (e.g., detail preservation metrics or ablation against mesh renderings) is provided to support this over alternatives.
minor comments (2)
- [Experiments (§5)] The abstract and evaluation sections would benefit from explicit reporting of error bars, exact metrics (e.g., accuracy per question type), episode length statistics, and data exclusion rules to allow assessment of result robustness; a minimal sketch of such per-type reporting follows this list.
- [Benchmark construction (§3)] Clarify the exact process for question generation and human annotation protocol (e.g., how questions were filtered to ensure they require 5-DoF) to improve reproducibility.
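A minimal sketch of the kind of per-question-type accuracy reporting with uncertainty that the first minor comment asks for, assuming a hypothetical per-episode record format:

```python
import random
from collections import defaultdict

def accuracy_with_ci(records, n_boot=1000, seed=0):
    """Per-question-type accuracy with 95% bootstrap confidence intervals.
    `records` is a list of dicts like {"qtype": "counting", "correct": True};
    the field names are illustrative, not the benchmark's actual format."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for r in records:
        by_type[r["qtype"]].append(1.0 if r["correct"] else 0.0)
    out = {}
    for qtype, scores in by_type.items():
        point = sum(scores) / len(scores)
        boots = []
        for _ in range(n_boot):
            sample = [rng.choice(scores) for _ in scores]  # resample episodes with replacement
            boots.append(sum(sample) / len(sample))
        boots.sort()
        lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
        out[qtype] = (point, lo, hi)
    return out
```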
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Benchmark construction (§3) and Experiments (§5)] The headline claim that models lag humans specifically due to failures in active 5-DoF viewpoint planning (abstract and §5) depends on every one of the 2,014 episodes being unsolvable from a single view or constrained (e.g., 3-DoF) motion. No single-view oracle accuracy, restricted-DoF human trials, or per-question validation is reported to confirm this; without such checks the observed gap could arise from generic VLM reasoning limits rather than the claimed 5-DoF active-perception deficit.
Authors: We agree that quantitative validation would make the central claim more robust. The 2,014 episodes were constructed through a human annotation process in which annotators explicitly verified that each question requires viewpoint-specific information unavailable from the initial observation or under constrained motion; this per-question validation is described in §3 but was not quantified with oracle baselines. To directly address the concern, we will add single-view VLM accuracies (oracle performance when models are given only the starting view) in the revised experiments section. This will demonstrate that the performance gap is not solely attributable to generic reasoning limitations. We will also expand the description of the annotation protocol to include more details on how viewpoint dependence was ensured. revision: yes
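A minimal sketch of the promised single-view oracle baseline alongside the full active condition; `vlm`, `renderer`, and `evaluate_answer` are assumed interfaces, and `run_episode` refers to the agent-loop sketch earlier on this page:

```python
def single_view_accuracy(episodes, vlm, renderer, evaluate_answer):
    """Oracle-style control: the model answers from the starting view only.
    If active 5-DoF accuracy is much higher than this, the human-model gap is
    not explained by generic single-image reasoning alone."""
    correct = 0
    for ep in episodes:
        image = renderer(ep.scene_id, ep.start_view)
        answer = vlm(question=ep.question, images=[image])
        correct += evaluate_answer(answer, ep.ground_truth_answer)  # 1 if correct, else 0
    return correct / len(episodes)

def active_accuracy(episodes, policy_vlm, renderer, evaluate_answer):
    """Full benchmark condition: the model controls the viewpoint in 5-DoF."""
    correct = 0
    for ep in episodes:
        answer = run_episode(ep, policy_vlm, renderer)  # see the agent-loop sketch above
        correct += evaluate_answer(answer, ep.ground_truth_answer)
    return correct / len(episodes)
```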
-
Referee: [Benchmark construction (§3)] §4 (or equivalent benchmark design section): the assertion that 3DGS 'preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators' is load-bearing for the claim that the questions genuinely require multi-view 5-DoF inspection, yet no quantitative comparison (e.g., detail preservation metrics or ablation against mesh renderings) is provided to support this over alternatives.
Authors: The statement draws on well-documented properties of 3D Gaussian Splatting in the novel-view-synthesis literature, where it avoids the meshing and texturing artifacts that degrade high-frequency details. We did not include a direct quantitative ablation because the benchmark scenes are native 3DGS reconstructions without paired mesh versions. In revision we will (i) expand the justification with specific citations to comparative studies on detail preservation and (ii) add qualitative side-by-side renderings of representative scenes (text, fine textures, interior views) to illustrate the difference. A full quantitative metric comparison across all 99 scenes would require additional reconstruction work and is therefore only partially feasible; we will note this limitation explicitly. revision: partial
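If paired mesh reconstructions did exist for some scenes, one illustrative detail-preservation check (a sketch of the comparison the referee asks for, not the authors' protocol) would compare 3DGS and mesh renders of the same pose against a held-out reference photo on crops around small text or fine texture, e.g. with scikit-image metrics:

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def crop_metrics(reference, gs_render, mesh_render, box):
    """Compare a 3DGS render and a mesh render against a held-out reference photo
    on one crop (e.g., a region containing small text). `box` = (top, left, h, w).
    Inputs are HxWx3 uint8 arrays rendered/captured from the same camera pose."""
    t, l, h, w = box
    ref = reference[t:t + h, l:l + w]
    gs = gs_render[t:t + h, l:l + w]
    mesh = mesh_render[t:t + h, l:l + w]
    return {
        "psnr_3dgs": peak_signal_noise_ratio(ref, gs),
        "psnr_mesh": peak_signal_noise_ratio(ref, mesh),
        "ssim_3dgs": structural_similarity(ref, gs, channel_axis=-1),
        "ssim_mesh": structural_similarity(ref, mesh, channel_axis=-1),
    }
```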
Circularity Check
No circularity: benchmark construction is independent of any self-referential derivation
full rationale
The paper introduces E3VS-Bench as a new evaluation resource consisting of 99 3DGS scenes and 2,014 episodes, motivated by the claim that 3DGS enables questions unsolvable from single views. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central empirical result (VLMs lag humans under 5-DoF) is obtained by running external models on the benchmark and comparing to human performance; it does not reduce to any quantity defined by the authors' prior work or by construction within the paper itself. Self-citations, if present, are not load-bearing for the benchmark's validity or the reported gap.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators
Reference graph
Works this paper leans on
- [1] Aloimonos, Y., Weiss, I., Bandyopadhyay, A.: Active vision. International Journal of Computer Vision 1, 333–356 (1988)
- [2] Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: ScanQA: 3D question answering for spatial scene understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19129–19139 (June 2022)
- [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, ... arXiv (2025)
- [4] Bajcsy, R.: Active perception. Proceedings of the IEEE 76(8), 966–1005 (1988)
- [5] Chaplot, D.S., Jiang, H., Gupta, S., Gupta, A.: Semantic curiosity for active visual learning. In: Eur. Conf. Comput. Vis. (2020)
- [6] Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Eur. Conf. Comput. Vis. (2020)
- [7] Cheng, K., Li, Z., Sun, X., Min, B.C., Bedi, A.S., Bera, A.: EfficientEQA: An efficient approach to open-vocabulary embodied question answering. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2025)
- [8] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied Question Answering. In: IEEE Conf. Comput. Vis. Pattern Recog. (2018)
- [9] Feng, T., Wang, X., Jiang, Y.G., Zhu, W.: Embodied AI: From LLMs to world models. arXiv preprint arXiv:2509.20021 (2025)
- [10] Ginting, M.F., Kim, D.K., Meng, X., Reinke, A.M., Krishna, B.J., Kayhani, N., Peltzer, O., Fan, D., Shaban, A., Kim, S.K., Kochenderfer, M., Agha-mohammadi, A.a., Omidshafiei, S.: Enter the mind palace: Reasoning and planning for long-term active embodied question answering. In: Proc. Conference on Robot Learning (CoRL) (2025)
- [11] Huang, A., Yao, C., Han, C., Wan, F., Guo, H., Lv, H., Zhou, H., Wang, J., Zhou, J., Sun, J., Hu, J., Lin, K., Zhao, L., Huang, M., Yuan, S., Qu, W., Wang, X., Lai, Y., Zhao, Y., Zhang, Y., Shi, Y., Chen, Y., Weng, Z., Meng, Z., Li, A., Kong, A., Dong, B., Wan, C., Wang, D., Qi, D., Li, D., Yu, E., Li, G., Yin, H., Zhou, H., Zhang, H., Yan, H., Zhou, H., ... arXiv preprint arXiv:2601.09668
- [12] Jiang, K., Liu, Y., Chen, W., Luo, J., Chen, Z., Pan, L., Li, G., Lin, L.: Beyond the destination: A novel benchmark for exploration-aware embodied question answering. In: Int. Conf. Comput. Vis. pp. 9091–9101 (October 2025)
- [13] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG) 42(4), 1–14 (July 2023)
- [14] Kerr, J., Hari, K., Weber, E., Kim, C.M., Yi, B., Bonnen, T., Goldberg, K., Kanazawa, A.: Eye, Robot: Learning to look to act with a BC-RL perception-action loop. In: Proc. Conference on Robot Learning (CoRL) (2025), http://arxiv.org/abs/2506.10968
- [15] Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. In: Int. Conf. Learn. Represent. (2026)
- [16] Lee, J., Miyanishi, T., Kurita, S., Sakamoto, K., Azuma, D., Matsuo, Y., Inoue, N.: CityNav: Language-goal aerial navigation dataset with geographic information. In: Int. Conf. Comput. Vis. pp. 5912–5922 (October 2025)
- [17] Li, G., Xu, J., Zhao, Y., Peng, Y.: DyFo: A training-free dynamic focus visual search for enhancing LMMs in fine-grained visual understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9098–9108 (June 2025)
- [18] Li, K., Yao, L., Wu, J., Yu, T., Chen, J., Bai, H., Hou, L., Hong, L., Zhang, W., Zhang, N.L.: Insight-o3: Empowering multimodal foundation models with generalized visual search. In: Int. Conf. Learn. Represent. (2026)
- [19] Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., Wu, Q.: AerialVLN: Vision-and-language navigation for UAVs. In: Int. Conf. Comput. Vis. pp. 15384–15394 (October 2023)
- [20] Ma, M., Ma, Q., Li, Y., Cheng, J., Yang, R., Ren, B., Popovic, N., Wei, M., Sebe, N., Van Gool, L., et al.: SceneSplat++: A large dataset and comprehensive benchmark for language Gaussian splatting. In: Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
- [21] Ma, X., Yong, S., Zheng, Z., Li, Q., Liang, Y., Zhu, S.C., Huang, S.: SQA3D: Situated question answering in 3D scenes. In: Int. Conf. Learn. Represent. (2023)
- [22] Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., McVay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, S., Rajeswaran, A.: OpenEQA: Embodied question answering in the era of foundation models. In: IEEE Conf. Comput. Vis. Pattern Recog. (2024)
- [23] Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hengel, A.: REVERIE: Remote embodied visual referring expression in real indoor environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
- [24] Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hengel, A.: REVERIE: Remote embodied visual referring expression in real indoor environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2020)
- [25] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A platform for embodied AI research. In: Int. Conf. Comput. Vis. (October 2019)
- [26] Saxena, S., Buchanan, B., Paxton, C., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: GraphEQA: Using 3D semantic scene graphs for real-time embodied question answering. In: Proc. Conference on Robot Learning (CoRL) (2025)
- [27] Scarpellini, G., Rosa, S., Morerio, P., Natale, L., Bue, A.D.: Look around and learn: Self-improving object detection by exploration. In: Eur. Conf. Comput. Vis. (2024)
- [28] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., Fox, D.: ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2020)
- [29] Su, A., Wang, H., Ren, W., Lin, F., Chen, W.: Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. In: Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
- [30] Torralba, A., Oliva, A., Castelhano, M.S., Henderson, J.M.: Contextual guidance of eye movements and attention in real-world scenes. Psychological Review (2006)
- [31] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
- [32] Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D., Batra, D.: Embodied Question Answering in Photorealistic Environments with Point Cloud Perception. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6659–6668 (2019)
- [33] Wolfe, J.M.: Visual search: How do we find what we are looking for? Annual Review of Vision Science 6, 539–562 (Sep 2020). https://doi.org/10.1146/annurev-vision-091718-015048
- [34] Wolfe, J.M., Vo, M.L.H., Evans, K.K., Greene, M.R.: Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences (2011)
- [35] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal LLMs. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 13084–13094 (June 2024)
- [36] Xiong, H., Xu, X., Wu, J., Hou, Y., Bohg, J., Song, S.: Vision in Action: Learning active perception from human demonstrations. In: Proc. Conference on Robot Learning (CoRL) (2025)
- [37] Yang, J., Ren, Z., Xu, M., Chen, X., Crandall, D., Parikh, D., Batra, D.: Embodied amodal recognition: Learning to move to perceive objects. In: Int. Conf. Comput. Vis. pp. 2040–2050 (2019)
- [38] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: Int. Conf. Comput. Vis. pp. 12–22 (2023)
- [39] Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Shorinwa, O., Majumdar, A.: WoMAP: World models for embodied open-vocabulary object localization. In: Proc. Conference on Robot Learning (CoRL) (2025)
- [40]
- [41] Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., et al.: Thyme: Think beyond images. arXiv preprint arXiv:2508.11630 (2025)
discussion (0)