DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes

Hyunwoo Ha; Tae-Hyun Oh; Wonjun Jo; Yohan Park

arXiv: 2512.24985 · v4 · submitted 2025-12-31 · cs.CV · cs.AI · cs.LG · cs.RO


Pith reviewed 2026-05-16 18:39 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG · cs.RO

keywords vision-language models · low-light benchmark · visual primitives · sensor noise · image enhancement · indoor scenes · embodied perception · synthetic data


The pith

Vision-language models degrade consistently when answering basic visual questions in low-light indoor scenes, and low-light enhancement offers only partial, unstable recovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DarkQA creates a controlled benchmark of 9.4K question-image pairs that isolate five families of visual primitives under graded low-light conditions. Degradations are synthesized in linear RAW space to model illumination loss and sensor noise, then passed through an ISP-style pipeline and validated against real paired camera data. Tests on representative VLMs show clear performance drops as light levels fall and noise rises. Standard low-light image enhancement methods improve results in a severity-dependent way but produce inconsistent outcomes across models and degradation strengths. The benchmark separates perceptual failures from higher-level embodied reasoning so developers can diagnose where robustness is missing.
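The evaluation described above can be pictured as a per-level accuracy table, computed with and without an enhancement step. The sketch below is a minimal illustration, not the paper's evaluation code: the sample layout, the exact-match scoring rule, and the toy stand-ins (`toy_model`, `toy_enhance`, and an "image" reduced to a single brightness number) are all assumptions made for the example.

```python
from collections import defaultdict

def accuracy_by_level(samples, model, enhance=None):
    """Return {level: accuracy} for samples of (level, image, question, answer).

    `model(image, question)` returns an answer string and `enhance(image)` is
    an optional preprocessing step (for example an LLIE method). Both are
    placeholders, and exact string match is an assumed scoring rule.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, image, question, answer in samples:
        if enhance is not None:
            image = enhance(image)
        prediction = model(image, question)
        total[level] += 1
        correct[level] += int(prediction.strip().lower() == answer.strip().lower())
    return {level: correct[level] / total[level] for level in sorted(total)}

# Toy stand-ins (hypothetical): an "image" is just a mean brightness in [0, 1].
def toy_model(image, question):
    # Pretend the model only recognizes the object when the scene is bright enough.
    return "chair" if image >= 0.2 else "unknown"

def toy_enhance(image):
    # Pretend enhancement is a fixed 4x gain, clipped to the valid range.
    return min(image * 4.0, 1.0)

samples = [
    ("L0", 0.80, "What is the closest object?", "chair"),
    ("L3", 0.10, "What is the closest object?", "chair"),
    ("L5", 0.01, "What is the closest object?", "chair"),
]

plain = accuracy_by_level(samples, toy_model)
enhanced = accuracy_by_level(samples, toy_model, enhance=toy_enhance)
```

In this toy setup the fixed gain rescues the moderate level but not the darkest one, which is the kind of severity-dependent recovery the summary describes; real LLIE methods and real VLMs would of course behave less cleanly.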

Core claim

DarkQA demonstrates that current vision-language models exhibit reliable accuracy drops on elementary visual-primitive questions once illumination falls and sensor noise increases, while common low-light image enhancement pipelines deliver only severity-dependent and unstable recovery.

What carries the argument

DarkQA benchmark consisting of deterministically generated question-image pairs with physics-based low-light synthesis performed in linear RAW space followed by ISP rendering.
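To make "synthesis in linear RAW space" concrete, the sketch below darkens 8-bit sRGB values by an exposure-value (EV) drop in linear light, adds signal-dependent shot noise plus signal-independent read noise, and re-applies the sRGB transfer curve. It is a heavy simplification under stated assumptions: it works per value, skips the Bayer mosaic, white balance, and color-matrix steps of a real unprocessing pipeline, and approximates shot noise as Gaussian. The function names and the noise parameters `shot_gain` and `read_std` are hypothetical and are not taken from the paper.

```python
import random

def srgb_to_linear(v: float) -> float:
    """Invert the standard sRGB transfer curve for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v: float) -> float:
    """Apply the standard sRGB transfer curve to a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

def synthesize_low_light(pixels, ev_drop, shot_gain=0.0, read_std=0.0, seed=0):
    """Darken 8-bit sRGB values by `ev_drop` stops in linear space, then add noise.

    Noise variance is modeled as shot_gain * signal + read_std ** 2.
    Both parameters are illustrative assumptions, not calibrated values.
    """
    rng = random.Random(seed)  # a fixed seed keeps the output deterministic
    out = []
    for p in pixels:
        linear = srgb_to_linear(p / 255.0)
        dark = linear * 2.0 ** (-ev_drop)  # each stop halves the light
        sigma = (shot_gain * dark + read_std ** 2) ** 0.5
        noisy = dark + rng.gauss(0.0, sigma) if sigma > 0 else dark
        clipped = min(max(noisy, 0.0), 1.0)
        out.append(round(linear_to_srgb(clipped) * 255))
    return out
```

Applying the darkening before the transfer curve, rather than scaling the sRGB values directly, is the point of the design: light halves per stop in linear space, and the noise is added where a sensor would actually incur it.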

If this is right

VLMs will continue to produce unreliable answers on simple perceptual tasks in dark indoor environments without targeted robustness improvements.
LLIE preprocessing cannot be treated as a dependable plug-in solution for VLM inputs across varying light levels.
Benchmarks that isolate low-level visual primitives can expose failure modes before they compound in full embodied tasks.
Training or architectural changes that address illumination drop and noise directly will be necessary for reliable 24/7 VLM operation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Embodied agents that rely on VLMs for navigation or manipulation will encounter compounded errors when low-light perceptual failures interact with planning modules.
The same synthesis approach could be extended to other degradations such as motion blur or haze to create a broader suite of controlled perception tests.
Developers should incorporate low-light synthetic data into VLM fine-tuning loops to close the observed gap before real-world deployment.

Load-bearing premise

The synthetic low-light images generated in RAW space and rendered through an ISP pipeline match the statistics of real camera captures closely enough to predict actual VLM failures.

What would settle it

Collect real paired low-light and normal-light images of the same indoor scenes, ask the same questions to the same VLMs, and check whether the performance gap matches the synthetic results within a small margin.
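The check proposed above reduces to comparing accuracy drops relative to the well-lit reference. A minimal sketch follows, assuming real captures can be binned into the same level labels as the synthetic ladder and that a fixed tolerance is an acceptable definition of "a small margin"; the function names and the default `margin` are hypothetical.

```python
def accuracy_drop(acc_by_level, reference="L0"):
    """Drop in accuracy of each level relative to the well-lit reference."""
    base = acc_by_level[reference]
    return {lvl: base - acc for lvl, acc in acc_by_level.items() if lvl != reference}

def gaps_match(synthetic, real, margin=0.05):
    """Compare synthetic and real accuracy drops level by level.

    Returns (ok, diffs), where ok is True if every shared level's drop
    differs by at most `margin`, and diffs maps level -> absolute difference.
    """
    syn_drop = accuracy_drop(synthetic)
    real_drop = accuracy_drop(real)
    shared = sorted(set(syn_drop) & set(real_drop))
    if not shared:
        raise ValueError("no degradation levels in common")
    diffs = {lvl: abs(syn_drop[lvl] - real_drop[lvl]) for lvl in shared}
    return all(d <= margin for d in diffs.values()), diffs
```

A per-level comparison of this kind only addresses aggregate accuracy; it would not show whether the same questions fail in both settings, which needs a per-item comparison.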

Figures

Figures reproduced from arXiv: 2512.24985 by Hyunwoo Ha, Tae-Hyun Oh, Wonjun Jo, Yohan Park.

**Figure 1.** Illustration of the DarkEQA benchmark. Traditional Embodied Question Answering (EQA) primarily evaluates VLMs on well-lit images, overlooking their robustness to real-world low-light conditions. We present DarkEQA, a new benchmark designed to address this evaluation void. DarkEQA assesses VLM performance under two distinct conditions: clean, well-lit inputs (L0) and a multi-level ladder of physics-based lo…

**Figure 2.** Low-light synthesis pipeline with disentangled illumination and noise factors. To generate controlled low-light inputs for our benchmark, we adopt an ISP-inspired unprocessing and noise formulation from prior work [18], [19]. Crucially, we produce paired variants for each original image to disentangle failure sources in VLM-based EQA: (a) a physics-based branch (top) that unprocesses sRGB to Bayer RAW, inj…

**Figure 3.** Example low-light image synthesization. Synthesized low-light image examples across degradation levels L0–L5. The top row shows EV drop only, while the bottom row shows EV drop combined with noise injection. The lower-right insets show 1/4-image crops with pixel intensities amplified for visibility; the numbers (×10, ×20, ×50) indicate the amplification factor.

**Figure 2.** Unprocessing (sRGB → RAW). We first normalize an 8-bit sRGB image $I \in \{0, \ldots, 255\}^{H \times W \times 3}$, where $H$ and $W$ denote the image height and width, respectively, to $I_{\mathrm{sRGB}} = I / 255 \in [0, 1]^{H \times W \times 3}$. To obtain a camera-linear RAW image from $I_{\mathrm{sRGB}}$, we invert the ISP following [18]. We denote the unprocessing operator by $u(\cdot)$, and express the resulting Bayer RAW mosaic as $B = u(I_{\mathrm{sRGB}})$ (Eq. 4), where $B \in [0, 1]^{\frac{H}{2} \times \frac{W}{2} \ldots}$…

**Figure 4.** Question family of our DarkEQA benchmark. Five DarkEQA question categories with examples. DarkEQA encompasses questions asking room-type recognition, room affordance check, object recognition, object attribute. between top-two closest objects exceeds a minimum threshold to ensure perceptual validity. If satisfied, the closest object is determined as the ground-truth answer. In this example, “chair” is iden…

**Figure 6.** Summary of the evaluation results on our DarkEQA. Degradation level indicates the severity of low-light corruption: L0 corresponds to the original (well-lit) input, and higher levels (L1 → L5) denote progressively darker (lower-illumination) inputs. We evaluate a range of open-source VLMs (LLaVA [12], [13], InternVL [14], and Qwen-VL [15] series, 7B–32B). The shaded regions in (a) and (b) denote the minimu…

**Figure 7.** Question-wise accuracy. We plot VLM accuracy across different question types under increasing low-light degradation, where darker lines indicate more severe degradation and the gray dashed line denotes the GPT-4 Blind-LLM baseline. We observe significant drops in “Room Type Recognition” and “Object Attribute – Color,” where VLM performance falls below the GPT-4 Blind-LLM baseline. Effectiveness of low-ligh…

Original abstract

Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

DarkQA is a practical new benchmark for low-light VLM testing via RAW-space synthesis and deterministic primitives, but the synthetic-to-real error match for actual question-answering needs tighter evidence.


DarkQA introduces a benchmark that measures VLM performance on five visual-primitive question families under controlled low-light indoor conditions. The setup generates 9.4K image-question pairs by applying physics-based illumination drop and sensor noise in linear RAW space, then rendering through an ISP pipeline, with validation against real paired camera captures. Results show steady VLM accuracy drops as light decreases and sensor noise rises, while LLIE methods give partial but unstable gains.

This is the main advance: a focused, reproducible instrument that isolates low-light perceptual issues before they mix with full embodied tasks. The deterministic question generation and open release are clear positives for anyone who needs repeatable low-light evaluation. The synthesis approach is more grounded than simple brightness reduction or post-capture noise addition used in earlier work.

The soft spot is the validation step. The paper confirms the generated images resemble real low-light captures at the pixel or perceptual level, yet it does not show whether the same VLMs produce matching error patterns or confusion matrices on the exact questions when run on synthetic versus real pairs. Image similarity alone does not guarantee the benchmark isolates the failure modes it claims to measure. This is a moderate gap rather than a fatal one, since the pipeline is physics-motivated and the trends are consistent.

The paper is aimed at researchers working on robust vision for robots or continuous-operation agents. It is solid enough to deserve referee time; the empirical claims are testable and the benchmark itself is a usable contribution.

Referee Report

2 major / 2 minor

Summary. The paper introduces DarkQA, a benchmark of 9.4K deterministically generated question-image pairs for evaluating VLMs on five families of visual primitives in low-light indoor egocentric scenes. Degradations are synthesized in linear RAW space with physics-based illumination drop and sensor noise, rendered via an ISP pipeline, and validated against real paired low-light camera captures. Experiments on representative VLMs and LLIE methods report consistent performance degradation under low illumination and noise, with LLIE recovery being severity-dependent but unstable.

Significance. If the central empirical claims hold, DarkQA fills a clear gap by providing a controlled, physically motivated instrument for isolating low-light perceptual failures in VLMs before they compound in embodied tasks. The open release of the dataset and code supports reproducibility and future work on robust 24/7 vision-language reasoning.

major comments (2)

[§4] Synthesis validation: The manuscript validates the RAW-space synthesis against real paired camera data using image-level metrics, yet provides no quantitative comparison of VLM accuracy drops, confusion matrices, or primitive-specific error patterns between the synthetic and real sets. Image similarity alone does not establish that the benchmark triggers the same VLM failure modes claimed to be measured.
[Abstract and §5] Results: Reported VLM accuracy degradation trends lack error bars, standard deviations across runs, or statistical significance tests, making it difficult to judge the reliability and consistency of the observed drops across models and degradation levels.

minor comments (2)

[Abstract] The abstract states that 'a wide range of state-of-the-art VLMs and LLIE models' were evaluated but does not specify the exact count or selection criteria; adding this detail would improve clarity.
[§5] Figure captions and axis labels in the results section could more explicitly indicate whether plotted values are means over multiple seeds or single-run results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and planned updates to the manuscript.


Referee: [§4] Synthesis validation: The manuscript validates the RAW-space synthesis against real paired camera data using image-level metrics, yet provides no quantitative comparison of VLM accuracy drops, confusion matrices, or primitive-specific error patterns between the synthetic and real sets. Image similarity alone does not establish that the benchmark triggers the same VLM failure modes claimed to be measured.

Authors: We agree that direct VLM-level comparisons would provide stronger evidence that the synthetic degradations elicit the same failure modes as real captures. Our §4 validation prioritizes image-level metrics (PSNR, SSIM, and noise statistics) because they quantify the physics-based fidelity of illumination drop and sensor noise in linear RAW space before ISP rendering. In the revised manuscript we will add a supplementary analysis comparing VLM accuracy, degradation trends, and primitive-specific error patterns on a subset of the real paired low-light captures (where question annotations can be aligned), or explicitly note data limitations if full coverage is not feasible.
revision: partial

Referee: [Abstract and §5] Results: Reported VLM accuracy degradation trends lack error bars, standard deviations across runs, or statistical significance tests, making it difficult to judge the reliability and consistency of the observed drops across models and degradation levels.

Authors: We thank the referee for highlighting this. The reported numbers reflect deterministic evaluation runs chosen for reproducibility across the 9.4K pairs. In the revision we will re-execute the VLM evaluations with multiple random seeds (where model inference involves any stochasticity), report standard deviations, add error bars to the accuracy plots in §5, and include statistical significance tests (e.g., paired t-tests) between illumination levels. These changes will be reflected in both the main text and figures.
revision: yes
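The statistics promised in this response are simple to compute once per-seed accuracies exist. The sketch below is a standard-library illustration under assumptions: accuracies are paired by seed across two illumination levels, the function names are hypothetical, and only the t statistic and degrees of freedom are returned (the p-value would come from a t table or a statistics package). It is not the authors' analysis code.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(accs):
    """Mean and sample standard deviation of accuracies across seeds."""
    return mean(accs), stdev(accs)

def paired_t_statistic(a, b):
    """Paired t statistic for matched accuracies a[i], b[i] (same seed and model).

    Returns (t, degrees_of_freedom).
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        # Happens when every seed gives the same difference, e.g. with
        # fully deterministic decoding; the statistic is then undefined.
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1

# Made-up accuracies for three seeds at two levels, for illustration only.
acc_l0 = [0.80, 0.82, 0.81]
acc_l3 = [0.70, 0.71, 0.73]
t_stat, dof = paired_t_statistic(acc_l0, acc_l3)
```

The zero-variance guard matters here because the response itself notes that the original runs were deterministic; in that case seeds add no spread, and resampling over questions would be needed to obtain error bars.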

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluation are self-contained


The manuscript presents DarkQA as an external measurement instrument: synthetic RAW-space degradations plus ISP rendering are generated deterministically, then validated against real paired camera captures. No equations, fitted parameters, or predictions are defined in terms of the target VLM scores; the reported degradation patterns and LLIE recovery statistics are direct outputs of running existing VLMs on the held-out benchmark images. The validation step compares image-level statistics to real data rather than re-using the VLM question-answering results to justify the synthesis itself. Consequently the derivation chain contains no self-definitional, fitted-input, or self-citation reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms or free parameters; its central contribution is an empirical measurement instrument whose validity rests on the physical fidelity of the image synthesis pipeline.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "physics-based low-light synthesis pipeline... EV drop... sensor noise... ISP-inspired rendering"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "five visual-primitive families... Room-Type Recognition, Object Recognition..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 6 internal anchors

[1] S. Peng, K. Genova, C. Jiang, A. Tagliasacchi, M. Pollefeys, T. Funkhouser, et al., “Openscene: 3d scene understanding with open vocabularies,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 815–824, 2023.
[2] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7061–7070, 2023.
[3] V. S. Dorbala, G. Sigurdsson, R. Piramuthu, J. Thomason, and G. S. Sukhatme, “Clip-nav: Using clip for zero-shot vision-and-language navigation,” arXiv preprint arXiv:2211.16649, 2022.
[4] J. Yang, S. Yang, A. W. Gupta, R. Han, L. Fei-Fei, and S. Xie, “Thinking in space: How multimodal large language models see, remember, and recall spaces,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10632–10643, 2025.
[5] B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahid, et al., “Rt-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, pp. 2165–2183, PMLR, 2023.
[6] D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, et al., “Palm-e: An embodied multimodal language model,” 2023.
[7] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, et al., “Do as i can, not as i say: Grounding language in robotic affordances,” arXiv preprint arXiv:2204.01691, 2022.
[8] A. Das, D. Gordon, C. Divi, D. Batra, G. Gkioxari, and D. Parikh, “Embodied question answering,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2054–2063, 2018.
[9] A. Majumdar, A. Ajay, X. Zhang, P. Putta, S. Yenamandra, M. Henaff, S. Silwal, P. Mcvay, O. Maksymets, S. Arnaud, et al., “Openeqa: Embodied question answering in the era of foundation models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16488–16498, 2024.
[10] I. Keller and K. S. Lohan, “On the illumination influence for object learning on robot companions,” Frontiers in Robotics and AI, vol. 6, p. 154, 2020.
[11] I. Shumailov, Z. Shumaylov, Y. Zhao, N. Papernot, R. Anderson, and Y. Gal, “Ai models collapse when trained on recursively generated data,” Nature, vol. 631, no. 8022, pp. 755–759, 2024.
[12] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 26296–26306, 2024.
[13] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[14] W. Wang, Z. Gao, L. Gu, H. Pu, L. Cui, X. Wei, Z. Liu, L. Jing, S. Ye, J. Shao, et al., “Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency,” arXiv preprint arXiv:2508.18265, 2025.
[15] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
[16] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
[17] D. Feijoo, J. C. Benito, A. Garcia, and M. V. Conde, “Darkir: Robust low-light image restoration,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10879–10889, 2025.
[18] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron, “Unprocessing images for learned raw denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045, 2019.
[19] K. Wei, Y. Fu, Y. Zheng, and J. Yang, “Physics-based noise modeling for extreme low-light photography,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 8520–8537, 2021.
[20] D. Azuma, T. Miyanishi, S. Kurita, and M. Kawanabe, “Scanqa: 3d question answering for spatial scene understanding,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19129–19139, 2022.
[21] X. Zhu et al., “CoSpace: Benchmarking continuous space perception ability for vision-language models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[22] R. Dang et al., “ECBench: Can multi-modal foundation models understand the egocentric world? A holistic embodied cognition benchmark,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
[23] W. Zhang, Z. Zhou, Z. Zheng, C. Gao, J. Cui, Y. Li, X. Chen, and X.-P. Zhang, “Open3dvqa: A benchmark for comprehensive spatial reasoning with multimodal large language model in open space,” arXiv preprint arXiv:2503.11094, 2025.
[24] C. H. Song, V. Blukis, J. Tremblay, S. Tyree, Y. Su, and S. Birchfield, “RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. Oral Presentation.
[25] M. Zhai, Z. Gao, Y. Wu, and Y. Jia, “Memory-centric embodied question answer,” arXiv preprint arXiv:2505.13948, 2025.
[26] I. Armeni, Z.-Y. He, G. Gkioxari, A. R. Zamir, M. Fischer, and S. Savarese, “3D scene graphs: A structure for unified semantics, 3D space, and camera,” in IEEE International Conference on Computer Vision, pp. 5664–5673, 2019.
[27] S. Saxena, B. Buchanan, C. Paxton, P. Liu, B. Chen, N. Vaskevicius, L. Palmieri, J. Francis, and O. Kroemer, “Grapheqa: Using 3d semantic scene graphs for real-time embodied question answering,” arXiv preprint arXiv:2412.14480, 2024.
[28] M. Q. Ali, S. Nair, A. Wong, Y. Cui, and Y. Chen, “Graphpad: Inference-time 3d scene graph updates for embodied question answering,” arXiv preprint arXiv:2506.01174, 2025.
[29] K. Hasegawa, W. Imrattanatrai, Z.-Q. Cheng, M. Asada, S. Holm, Y. Wang, K. Fukuda, and T. Mitamura, “ProMQA: Question answering dataset for multimodal procedural activity understanding,” in Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Lingui...
[30] A. Kamath, A. Kembhavi, and L. Paull, “GPR: Grounded procedural reasoning for embodied AI,” in IEEE International Conference on Computer Vision, pp. 22096–22106, 2023.
[31] T. Wu, C. Zhou, Y. H. Wong, L. Gu, and J. Yang, “Noisyeqa: Benchmarking embodied question answering against noisy queries,” arXiv preprint arXiv:2412.10726, 2024.
[32] S. Gasperini, N. Morbitzer, H. Jung, N. Navab, and F. Tombari, “Robust monocular depth estimation under challenging conditions,” in IEEE International Conference on Computer Vision, pp. 8177–8186, 2023.
[33] K. Wang, Z. Zhang, Z. Yan, X. Li, B. Xu, J. Li, and J. Yang, “Regularizing nighttime weirdness: Efficient self-supervised monocular depth estimation in the dark,” in IEEE International Conference on Computer Vision, pp. 16055–16064, 2021.
[34] S. Lee, J. Rim, B. Jeong, G. Kim, B. Woo, H. Lee, S. Cho, and S. Kwak, “Human pose estimation in extremely low-light conditions,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 704–714, 2023.
[35] Y. Sasagawa and H. Nagahara, “Yolo in the dark-domain adaptation method for merging multiple models,” in European Conference on Computer Vision, pp. 345–359, Springer, 2020.
[36] J. Spencer, R. Bowden, and S. Hadfield, “Defeat-net: General monocular depth via simultaneous unsupervised representation learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14402–14413, 2020.
[37] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300, 2018.
[38] K. G. Lore, A. Akintayo, and S. Sarkar, “Llnet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognition, vol. 61, pp. 650–662, 2017.
[39] F. Lv, F. Lu, J. Wu, and C. Lim, “Mbllen: Low-light image/video enhancement using cnns,” in British Machine Vision Conference, p. 4, Northumbria University, 2018.
[40] Z. Yuan, J. Zeng, Z. Wei, L. Jin, S. Zhao, X. Liu, Y. Zhang, and G. Zhou, “Clahe-based low-light image enhancement for robust object detection in overhead power transmission system,” IEEE Transactions on Power Delivery, vol. 38, no. 3, pp. 2240–2243, 2023.
[41] L. Li, R. Wang, W. Wang, and W. Gao, “A low-light image enhancement method for both denoising and contrast enlarging,” in IEEE International Conference on Image Processing, pp. 3730–3734, 2015.
[42] S. K. Ramakrishnan, A. Gokaslan, E. Wijmans, O. Maksymets, A. Clegg, J. M. Turner, E. Undersander, W. Galuba, A. Westbury, A. X. Chang, M. Savva, Y. Zhao, and D. Batra, “Habitat-matterport 3d dataset (HM3d): 1000 large-scale 3d environments for embodied AI,” in Advances in Neural Information Processing Systems Datasets and Benchmarks, 2021.
[43] T. Plotz and S. Roth, “Benchmarking denoising algorithms with real photographs,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595, 2017.
[44] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., “Gpt-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
[45] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., “The llama 3 herd of models,” arXiv e-prints, pp. arXiv–2407, 2024.
