DarkQA: Benchmarking Vision-Language Models on Visual-Primitive Question Answering in Low-Light Indoor Scenes
Pith reviewed 2026-05-16 18:39 UTC · model grok-4.3
The pith
Vision-language models degrade consistently when answering basic visual questions in low-light indoor scenes, and low-light enhancement offers only partial, unstable recovery.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DarkQA demonstrates that current vision-language models exhibit reliable accuracy drops on elementary visual-primitive questions once illumination falls and sensor noise increases, while common low-light image enhancement pipelines deliver only severity-dependent and unstable recovery.
What carries the argument
The DarkQA benchmark: deterministically generated question-image pairs whose low-light versions are synthesized with a physics-based model in linear RAW space and then rendered through an ISP pipeline.
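As a rough illustration of that kind of pipeline, the sketch below darkens a few sRGB values by linearizing them, scaling by an exposure-value (EV) drop, adding signal-dependent noise, and re-encoding. It is not the paper's implementation: the function names, the `shot_gain` and `read_std` parameters, the Gaussian approximation of shot noise, and the use of the sRGB transfer curve as a stand-in for the RAW conversion and ISP are all assumptions made for this example.

```python
import random

def srgb_to_linear(v):
    """Invert the standard sRGB transfer curve for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Apply the standard sRGB transfer curve to a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

def synthesize_low_light(pixels, ev_drop, shot_gain=0.01, read_std=0.002, seed=0):
    """Darken sRGB values in [0, 1] by `ev_drop` stops and add sensor-like noise.

    Assumption: noise is a Gaussian approximation of a Poisson-Gaussian model,
    with variance = shot_gain * signal + read_std ** 2 (hypothetical parameters).
    """
    rng = random.Random(seed)      # fixed seed keeps the output deterministic
    scale = 2.0 ** (-ev_drop)      # each EV stop halves the collected light
    out = []
    for v in pixels:
        lin = srgb_to_linear(v) * scale
        sigma = (shot_gain * lin + read_std ** 2) ** 0.5
        noisy = lin + rng.gauss(0.0, sigma)
        noisy = min(1.0, max(0.0, noisy))   # clip to the valid sensor range
        out.append(linear_to_srgb(noisy))   # stand-in for ISP rendering
    return out

dark = synthesize_low_light([0.2, 0.5, 0.9], ev_drop=3)
print(dark)
```

Applying the noise before the tone curve is the point of working in linear space: the same noise level becomes far more visible in shadows once the display encoding is applied.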
If this is right
- VLMs will continue to produce unreliable answers on simple perceptual tasks in dark indoor environments without targeted robustness improvements.
- LLIE preprocessing cannot be treated as a dependable plug-in solution for VLM inputs across varying light levels.
- Benchmarks that isolate low-level visual primitives can expose failure modes before they compound in full embodied tasks.
- Training or architectural changes that address illumination drop and noise directly will be necessary for reliable 24/7 VLM operation.
Where Pith is reading between the lines
- Embodied agents that rely on VLMs for navigation or manipulation will encounter compounded errors when low-light perceptual failures interact with planning modules.
- The same synthesis approach could be extended to other degradations such as motion blur or haze to create a broader suite of controlled perception tests.
- Developers should incorporate low-light synthetic data into VLM fine-tuning loops to close the observed gap before real-world deployment.
Load-bearing premise
The synthetic low-light images generated in RAW space and rendered through an ISP pipeline match the statistics of real camera captures closely enough to predict actual VLM failures.
What would settle it
Collect real paired low-light and normal-light images of the same indoor scenes, ask the same questions to the same VLMs, and check whether the performance gap matches the synthetic results within a small margin.
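A minimal sketch of that check, assuming per-question correctness flags are available for each condition. The function names, the 0.05 tolerance, and the toy answer lists are hypothetical and are not taken from the paper.

```python
def accuracy(correct):
    """Fraction of questions answered correctly; `correct` is a list of booleans."""
    if not correct:
        raise ValueError("need at least one answer")
    return sum(correct) / len(correct)

def low_light_gap(normal_correct, dark_correct):
    """Accuracy lost when moving from normal-light to low-light inputs."""
    return accuracy(normal_correct) - accuracy(dark_correct)

def gaps_agree(synthetic_gap, real_gap, tolerance=0.05):
    """True if the synthetic and real gaps differ by at most `tolerance` (assumed margin)."""
    return abs(synthetic_gap - real_gap) <= tolerance

# Toy, made-up correctness flags for one model on ten questions per condition.
synthetic_gap = low_light_gap([True] * 8 + [False] * 2, [True] * 5 + [False] * 5)
real_gap = low_light_gap([True] * 8 + [False] * 2, [True] * 6 + [False] * 4)
print(round(synthetic_gap, 3), round(real_gap, 3), gaps_agree(synthetic_gap, real_gap))
```

In practice the comparison would be repeated per model, per degradation level, and per primitive family, since an aggregate match can hide disagreements within individual families.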
Original abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkQA, an open-source benchmark for evaluating perceptual primitives under multi-level low-light conditions in embodied scenarios. DarkQA evaluates single-view egocentric observations across controlled degradation levels, isolating low-light perceptual failures before they are entangled with complex embodied tasks. The benchmark contains 9.4K deterministically generated and verifiable question-image pairs spanning five visual-primitive families. A key design feature of DarkQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline; we further validate the synthesis against real paired low-light camera data. We evaluate representative VLMs and Low-Light Image Enhancement (LLIE) preprocessing methods. Results show consistent VLM degradation under low illumination and sensor noise, while LLIE provides severity-dependent but unstable recovery. We demonstrate the utility of DarkQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models, and systematically reveal VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance. Project website: https://darkqa-benchmark.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DarkQA, a benchmark of 9.4K deterministically generated question-image pairs for evaluating VLMs on five families of visual primitives in low-light indoor egocentric scenes. Degradations are synthesized in linear RAW space with physics-based illumination drop and sensor noise, rendered via an ISP pipeline, and validated against real paired low-light camera captures. Experiments on representative VLMs and LLIE methods report consistent performance degradation under low illumination and noise, with LLIE recovery being severity-dependent but unstable.
Significance. If the central empirical claims hold, DarkQA fills a clear gap by providing a controlled, physically motivated instrument for isolating low-light perceptual failures in VLMs before they compound in embodied tasks. The open release of the dataset and code supports reproducibility and future work on robust 24/7 vision-language reasoning.
Major comments (2)
- §4 (synthesis validation): The manuscript validates the RAW-space synthesis against real paired camera data using image-level metrics, yet provides no quantitative comparison of VLM accuracy drops, confusion matrices, or primitive-specific error patterns between the synthetic and real sets. Image similarity alone does not establish that the benchmark triggers the same VLM failure modes claimed to be measured.
- Abstract and §5 (results): Reported VLM accuracy degradation trends lack error bars, standard deviations across runs, or statistical significance tests, making it difficult to judge the reliability and consistency of the observed drops across models and degradation levels.
Minor comments (2)
- Abstract: The abstract states that 'a wide range of state-of-the-art VLMs and LLIE models' were evaluated but does not specify the exact count or selection criteria; adding this detail would improve clarity.
- §5: Figure captions and axis labels in the results section could more explicitly indicate whether plotted values are means over multiple seeds or single-run results.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below with clarifications and planned updates to the manuscript.
- Referee, §4 (synthesis validation): The manuscript validates the RAW-space synthesis against real paired camera data using image-level metrics, yet provides no quantitative comparison of VLM accuracy drops, confusion matrices, or primitive-specific error patterns between the synthetic and real sets. Image similarity alone does not establish that the benchmark triggers the same VLM failure modes claimed to be measured.
  Authors: We agree that direct VLM-level comparisons would provide stronger evidence that the synthetic degradations elicit the same failure modes as real captures. Our §4 validation prioritizes image-level metrics (PSNR, SSIM, and noise statistics) because they quantify the physics-based fidelity of illumination drop and sensor noise in linear RAW space before ISP rendering. In the revised manuscript we will add a supplementary analysis comparing VLM accuracy, degradation trends, and primitive-specific error patterns on a subset of the real paired low-light captures (where question annotations can be aligned), or explicitly note data limitations if full coverage is not feasible.
  Revision: partial
- Referee, Abstract and §5 (results): Reported VLM accuracy degradation trends lack error bars, standard deviations across runs, or statistical significance tests, making it difficult to judge the reliability and consistency of the observed drops across models and degradation levels.
  Authors: We thank the referee for highlighting this. The reported numbers reflect deterministic evaluation runs chosen for reproducibility across the 9.4K pairs. In the revision we will re-execute the VLM evaluations with multiple random seeds (where model inference involves any stochasticity), report standard deviations, add error bars to the accuracy plots in §5, and include statistical significance tests (e.g., paired t-tests) between illumination levels. These changes will be reflected in both the main text and figures.
  Revision: yes
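The paired t-test mentioned in the second response can be sketched as follows, assuming one accuracy value per model (or per seed) at each of two illumination levels. The function name and the toy accuracies are hypothetical. The sketch returns only the t statistic and degrees of freedom; a p-value would need a t-distribution CDF, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics

def paired_t_statistic(acc_bright, acc_dark):
    """Paired t statistic for matched accuracy measurements at two light levels.

    Each position in the two lists must refer to the same model or seed.
    Returns (t, degrees_of_freedom).
    """
    if len(acc_bright) != len(acc_dark) or len(acc_bright) < 2:
        raise ValueError("need two equally sized lists with at least two pairs")
    diffs = [b - d for b, d in zip(acc_bright, acc_dark)]
    sd = statistics.stdev(diffs)           # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    standard_error = sd / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error, len(diffs) - 1

# Toy, made-up accuracies for four models under bright and dark conditions.
t, dof = paired_t_statistic([0.90, 0.80, 0.85, 0.70], [0.60, 0.50, 0.60, 0.40])
print(round(t, 2), dof)
```

Pairing matters here because models differ widely in baseline accuracy. Testing the per-model differences removes that between-model variance from the comparison.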
Circularity Check
No circularity: benchmark construction and empirical evaluation are self-contained
The manuscript presents DarkQA as an external measurement instrument: synthetic RAW-space degradations plus ISP rendering are generated deterministically, then validated against real paired camera captures. No equations, fitted parameters, or predictions are defined in terms of the target VLM scores; the reported degradation patterns and LLIE recovery statistics are direct outputs of running existing VLMs on the held-out benchmark images. The validation step compares image-level statistics to real data rather than re-using the VLM question-answering results to justify the synthesis itself. Consequently the derivation chain contains no self-definitional, fitted-input, or self-citation reductions.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: physics-based low-light synthesis pipeline... EV drop... sensor noise... ISP-inspired rendering
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: five visual-primitive families... Room-Type Recognition, Object Recognition...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.