
arxiv: 2507.18064 · v2 · submitted 2025-07-24 · 💻 cs.CV

Adapting Large VLMs with Iterative and Manual Instructions for Generative Low-light Enhancement

Pith reviewed 2026-05-19 02:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords low-light image enhancement · vision-language models · diffusion models · generative enhancement · instruction guidance · semantic priors · iterative inference

The pith

Adapting large vision-language models with iterative and manual instructions guides diffusion models to produce more realistic low-light image enhancements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VLM-IMI, a framework that adapts large vision-language models to generate textual descriptions of how a low-light scene should appear under normal lighting. These descriptions serve as semantic cues that direct a diffusion-based enhancement process, addressing the lack of normal-light guidance in prior methods. The approach includes a fusion module to combine image and text features, an iterative refinement strategy for inference without ground-truth references, and support for direct user-provided instructions. This matters because many low-light scenes involve complex lighting where purely visual or prior-based fixes produce unnatural results, and semantic text guidance could yield outputs that better match human expectations of scene content.

Core claim

VLM-IMI consists of two branches. The Normal-Light Instruction Prior Generation branch uses a VLM to produce textual cues describing the desired normal-light content, and the Instruction-aware Light Enhancement Diffusion branch incorporates those cues through a learnable fusion module to steer the generative process. At inference time the system applies iterative instruction refinement to progressively improve quality, and it also accepts manual user instructions fed directly into the language model for customized outputs. Experiments show that these elements together yield higher perceptual quality and realism than existing state-of-the-art low-light enhancement techniques.

What carries the argument

The learnable instruction prior fusion module, which dynamically aligns and merges image features with VLM-generated text features to condition the diffusion enhancement process.
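This page does not give the module's architecture (the referee report raises the same gap), although the excerpt under Figure 9 calls it Transformer-based. As a minimal sketch under that assumption, single-head cross-attention lets each image token query the text tokens and adds the attended text context back through a gate. The name `cross_attention_fuse`, the scalar `gate`, and the omission of learned projection matrices are illustrative choices, not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_fuse(image_tokens, text_tokens, gate=1.0):
    """Fuse text features into image features with single-head cross-attention.

    image_tokens: list of d-dimensional vectors, used as queries.
    text_tokens:  list of d-dimensional vectors, used as keys and values.
    gate:         scalar standing in for a learnable mixing weight (hypothetical).
    """
    dim = len(image_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in image_tokens:
        # Attention weights of this image token over all text tokens.
        weights = softmax([dot(query, key) * scale for key in text_tokens])
        # Weighted sum of text tokens (values) gives the text context.
        context = [
            sum(w * value[i] for w, value in zip(weights, text_tokens))
            for i in range(dim)
        ]
        # Residual connection: image feature plus gated text context.
        fused.append([q + gate * c for q, c in zip(query, context)])
    return fused
```

With `gate=0.0` the image features pass through unchanged, which gives a natural no-text baseline for the kind of ablation the referee report asks for.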

If this is right

  • The framework enables user-controlled enhancement by accepting custom textual instructions that override or supplement the generated priors.
  • Iterative instruction refinement during inference can progressively correct illumination and detail without requiring a ground-truth normal-light reference at test time.
  • Semantic text guidance from VLMs allows the diffusion model to handle complex lighting where brightness adjustments alone produce artifacts or loss of scene identity.
  • Cross-modal fusion of image and text priors leads to outputs with greater detail coherence and realism than methods relying solely on visual priors or low-light inputs.
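The iterative refinement in the second bullet can be sketched as a loop that alternates instruction generation and enhancement. The callables `describe` and `enhance` are hypothetical stand-ins for the VLM and the diffusion branch, and re-enhancing the original low-light input on every pass is an assumption, since this page does not specify the exact loop. The default of two passes follows the "k = 2 (Ours)" panel label in the figure residue.

```python
def iterative_enhance(low_light, describe, enhance, num_iters=2):
    """Alternate instruction generation and enhancement for num_iters passes.

    low_light: the degraded input (any object the callables understand).
    describe:  hypothetical callable, image -> text instruction.
    enhance:   hypothetical callable, (image, instruction) -> enhanced image.
    Returns the final image and the (instruction, image) pair from each pass.
    """
    if num_iters < 1:
        raise ValueError("num_iters must be at least 1")
    current = low_light
    history = []
    for _ in range(num_iters):
        # Describe the best image available so far (the raw input on pass 1).
        instruction = describe(current)
        # Assumption: every pass restarts from the original low-light input,
        # conditioned on the refreshed instruction.
        current = enhance(low_light, instruction)
        history.append((instruction, current))
    return current, history
```

Keeping the per-pass history makes it easy to compare outputs across iteration counts, as the paper's iteration study does for k = 1, 2, 3.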

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same instruction-generation and fusion pattern could be tested on related restoration tasks such as dehazing or shadow removal to check whether semantic cues transfer across degradation types.
  • If the iterative refinement loop converges reliably, it suggests a path toward reference-free enhancement pipelines that improve with each additional instruction pass.
  • Manual instruction support opens the possibility of interactive editing workflows where a user describes desired lighting or style changes in natural language.

Load-bearing premise

Textual descriptions produced by the vision-language model correctly capture the desired normal-light appearance of the scene and can be used as reliable enhancement cues even without any ground-truth normal-light image available.

What would settle it

A controlled test on scenes with unusual or ambiguous objects where the generated text instructions either produce visible semantic mismatches or fail to improve perceptual quality over non-instruction baselines.

Figures

Figures reproduced from arXiv: 2507.18064 by Cong Wang, Jinshan Pan, Kin-man Lam, Liyan Wang, Xiaoran Sun, Yang Yang, Yeying Jin, Zhixun Su.

Figure 1: Model comparisons. (a) Summary of the paradigm of previous methods, which usually embed model priors into diffusion models. However, these approaches cannot flexibly handle low-light images with varying illumination, reflection, and spatial or contextual information. (b) In contrast, we propose a large VLM that dynamically generates visually-pleasing results using different text instructions to solve the…
Figure 2: Visual comparisons. Previous methods (b)-(d) exhibit color distortion, incorrect exposure, or artifacts that degrade visual quality. In contrast, our approach can generate diverse results (e)-(g) based on different instructions, which can meet the requirements for various low-light scenarios. This design enables our method to effectively handle various low-light scenarios.
Figure 3: Overall framework of VLM-IMI. (a) Training Pipeline of VLM-IMI: a large vision-language model (VLM) extracts textual description instructions from normal-light images, capturing lighting conditions, scene semantics, and contextual cues. These instructions are then encoded using a large language model (LLM) text encoder. (b) Instruction Prior Fusion Module: this module facilitates cross-modal interaction b…
Figure 4: Different instructions result in different outputs for enhancing the generation of visually pleasing results. Text instructions can control lighting enhancement and produce results under different lighting conditions. The corresponding Grad-CAM heatmaps [46] highlight the model’s attention areas influenced by the text instructions, such as faces or background regions, showing how instructions affect visua…
Figure 5: Visual comparison on LOL [54] and LSRW [14]. Our VLM-IMI is able to produce more realistic results with sharper structures and textures.
Adjacent body text (Section 4.2, Comparisons with State-of-the-Arts): We compare our method with current representative low-light image enhancement approaches. It is important to note that our reported results represent the average of five independent evaluations. More results are shown in the Appendix…
Figure 6: Visual comparison on real datasets, including DICM [22], NPE [51], and VV [48]. Our VLM-IMI is able to generate results with better naturalness.
Panel labels (from the adjacent figure image, each annotated with a NIQE score): (a) Input, (b) Tab. 2(a), (c) Tab. 2(b), (d) Tab. 2(c), (e) Ours.
Figure 7: Visual comparison on text instructions on LSRW [14] and VV [48]. We note that without using Lighting Information (b), Shadows and Reflections (c), and Spatial and Contextual Information (d), the model cannot produce visually-pleasing images. In contrast, our full model is able to generate results with better naturalness according to visual quality as well as no-reference metrics, NIQE.
Figure 8: Effect of iterative instruction strategy on three real-world datasets. We note that the results improve with the second iteration.
Panel labels (from the adjacent figure image, annotated with NIQE scores): (a) Input, (b) k = 1, (c) k = 2 (Ours), (d) k = 3.
Figure 9: Visual comparison of iterative instruction strategy on DICM [22]. We note that the results improve with the second iteration, while the first iteration leads to under-exposure results, and more iterations produce over-exposure results.
Adjacent body text (Effect of Instruction Prior Fusion Module): To validate the effectiveness of the Transformer-based instruction prior fusion module, we conduct experiments under four settings…
Figure 10: Text instructions can control lighting enhancement, producing results under different…
Figure 11: Visual comparison of text instructions on paired datasets, LOL [54] (upper) and LSRW [14] (lower).
Figure 12: Visual comparison of text instructions on real-world datasets, DICM [22] (upper), NPE [51] (middle), and VV [48] (lower).
Figure 13: Visual comparison of the iterative instruction strategy on real-world datasets: DICM [22] (upper), NPE [51] (middle), and VV [48] (lower).
Figure 14: Visual comparison on the LOL [54] dataset. Our VLM-IMI produces more realistic results with sharper structures and textures.
Figure 15: Visual comparison on the LSRW [14] dataset. Our VLM-IMI generates more realistic results with sharper structures and textures.
Panel labels (from the adjacent figure image, each annotated with a NIQE score): (a) Input, (b) PairLIE, (c) NeRCo, (d) CLIP-LIT, (e) DiffLL, (f) GASD, (g) QuadPrior, (h) Ours.
Figure 16: Visual comparison of real-world images on the DICM [22] dataset. Our VLM-IMI generates results with better naturalness.
Figure 17: Visual comparison of real-world images on the NPE [51] dataset. Our VLM-IMI generates results with better naturalness.
Panel labels (from the adjacent figure image, each annotated with a NIQE score): (a) Input, (b) PairLIE, (c) NeRCo, (d) CLIP-LIT, (e) DiffLL, (f) GASD, (g) QuadPrior, (h) Ours.
Figure 18: Visual comparison of real-world images on the VV [48] dataset. Our VLM-IMI generates results with better naturalness.
Original abstract

Most existing low-light image enhancement (LLIE) methods rely on pre-trained model priors, low-light inputs, or both, while neglecting the semantic guidance available from normal-light images. This limitation hinders their effectiveness in complex lighting conditions. In this paper, we propose VLM-IMI, a framework that adapts large vision-language models with iterative and manual instructions for generative LLIE. VLM-IMI mainly contains two branches: Normal-Light Instruction Prior Generation (NL-IPG) and Instruction-aware Light Enhancement Diffusion (IA-LED). The NL-IPG incorporates textual descriptions of the desired normal-light content as enhancement cues, enabling semantically informed restoration. IA-LED incorporates instruction priors from the NL-IPG to guide the diffusion process, enabling precise illumination enhancement. To effectively integrate cross-modal priors, we introduce a learnable instruction prior fusion module, which dynamically aligns and fuses image and text features, promoting the generation of detailed and semantically coherent outputs. During inference, as the ground-truth normal-light images are not available, we propose an inference with an iterative instructions strategy to refine textual instructions, progressively improving visual quality. Our VLM-IMI also inherently supports manual instruction control by allowing users to directly input custom instructions into the LLM to generate user-expected outputs. Experiments across diverse scenarios demonstrate that VLM-IMI outperforms SOTA methods in terms of perception and realism. The source code is available at: https://github.com/sunxiaoran01/VLM-IMI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes VLM-IMI, a framework adapting large vision-language models for generative low-light image enhancement via iterative and manual instructions. It introduces two branches—Normal-Light Instruction Prior Generation (NL-IPG) to produce textual descriptions of desired normal-light content as semantic cues, and Instruction-aware Light Enhancement Diffusion (IA-LED) to guide a diffusion process using these priors through a learnable instruction prior fusion module. At inference, an iterative instructions strategy refines the texts in the absence of ground-truth normal-light images, while also supporting direct manual user instructions to the LLM. Experiments are claimed to show outperformance over SOTA methods on perception and realism metrics across diverse scenarios, with code released.

Significance. If the claims hold, the work would represent a meaningful step in low-light enhancement by injecting VLM-derived semantic guidance into the restoration process, potentially improving coherence and realism beyond purely image-based or reference-free priors. The public code release supports reproducibility and further exploration. The significance is limited by the unverified reliability of the generated textual priors, which form the core of the proposed advantage.

major comments (2)
  1. [NL-IPG branch and inference strategy (as described in abstract)] The headline claim of SOTA perceptual and realism superiority rests on the NL-IPG branch producing textual descriptions that serve as reliable enhancement cues at inference (when ground-truth normal-light images are unavailable). No evaluation of text quality—such as semantic similarity metrics to held-out normal-light captions, human ratings of description accuracy, or ablation removing the iterative refinement loop—is reported, leaving open the possibility that hallucinations or omissions in VLM outputs undermine the downstream IA-LED gains.
  2. [Instruction prior fusion module] The learnable instruction prior fusion module is presented as dynamically aligning image and text features, yet no details on its architecture, training objective, or ablation isolating its contribution versus simpler concatenation or attention mechanisms are visible, making it difficult to assess whether the cross-modal integration is the load-bearing factor for the reported improvements.
minor comments (2)
  1. [Experiments] The abstract references 'experiments across diverse scenarios' but provides no specifics on datasets, number of test images, statistical significance testing, or full set of baselines and metrics; these details are essential for evaluating robustness.
  2. [Inference with iterative instructions] Clarify the exact iterative loop procedure (number of iterations, stopping criterion, and how manual instructions interact with the automatic refinement) to aid reproducibility.
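The statistical significance testing mentioned in the first minor comment could take several forms. One option, sketched here as an assumption rather than anything the paper reports, is a paired bootstrap confidence interval on per-image score differences between a baseline and the proposed method (for a lower-is-better metric such as NIQE, a positive difference favors the proposed method). The function name and all inputs are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, num_resamples=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for the mean of (scores_a - scores_b).

    scores_a, scores_b: per-image metric values for two methods on the
    same images, in the same order. Returns (mean_diff, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(num_resamples):
        # Resample image indices with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval from the sorted bootstrap means.
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return sum(diffs) / n, lower, upper
```

If the whole interval lies above zero, the baseline's scores are reliably higher than the proposed method's on that test set, which for NIQE would support the claimed improvement.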

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify key aspects of our work. We provide point-by-point responses below and indicate planned revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [NL-IPG branch and inference strategy (as described in abstract)] The headline claim of SOTA perceptual and realism superiority rests on the NL-IPG branch producing textual descriptions that serve as reliable enhancement cues at inference (when ground-truth normal-light images are unavailable). No evaluation of text quality—such as semantic similarity metrics to held-out normal-light captions, human ratings of description accuracy, or ablation removing the iterative refinement loop—is reported, leaving open the possibility that hallucinations or omissions in VLM outputs undermine the downstream IA-LED gains.

    Authors: We agree that direct evaluation of the textual priors would provide stronger evidence for their reliability as enhancement cues, particularly at inference without ground-truth normal-light images. While the reported end-to-end gains in perceptual and realism metrics offer indirect support for the NL-IPG branch and iterative strategy, we acknowledge that explicit validation is missing. In the revised version we will add semantic similarity metrics (e.g., CLIP-based cosine similarity) against held-out normal-light captions, a small-scale human rating study of description accuracy, and an ablation that disables the iterative refinement loop to quantify its impact. revision: yes

  2. Referee: [Instruction prior fusion module] The learnable instruction prior fusion module is presented as dynamically aligning image and text features, yet no details on its architecture, training objective, or ablation isolating its contribution versus simpler concatenation or attention mechanisms are visible, making it difficult to assess whether the cross-modal integration is the load-bearing factor for the reported improvements.

    Authors: We accept that the current manuscript provides insufficient architectural and experimental detail on the instruction prior fusion module. We will expand the relevant section to describe the module’s exact architecture, the training objective used to optimize the cross-modal alignment, and new ablation studies that compare the proposed fusion against simpler baselines such as feature concatenation and standard attention. These additions will allow readers to evaluate whether the learnable fusion is the primary driver of the observed gains. revision: yes
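The CLIP-based cosine similarity proposed in the first response reduces to comparing embedding vectors of generated instructions against embeddings of reference captions. The sketch below assumes the embeddings have already been computed by some text encoder; no CLIP model is loaded here, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def mean_text_similarity(generated_embeddings, reference_embeddings):
    """Average pairwise similarity between matched instruction/caption embeddings."""
    if len(generated_embeddings) != len(reference_embeddings) or not generated_embeddings:
        raise ValueError("embedding lists must be non-empty and of equal length")
    scores = [
        cosine_similarity(g, r)
        for g, r in zip(generated_embeddings, reference_embeddings)
    ]
    return sum(scores) / len(scores)
```

A score near 1 for a pair suggests the generated instruction and the reference caption are close in the encoder's embedding space; it does not by itself rule out hallucinated details, which is why the response pairs it with human ratings.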

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper proposes VLM-IMI as a new framework with two branches (NL-IPG for textual prior generation from VLMs and IA-LED for instruction-aware diffusion enhancement) plus a learnable fusion module whose parameters are trained on data. Claims of SOTA perceptual performance rest on experimental comparisons across scenarios rather than any self-referential equations, fitted inputs renamed as predictions, or load-bearing self-citations. The iterative inference strategy for refining instructions when ground-truth is unavailable is an operational procedure, not a closed loop that reduces the output to the input by construction. No uniqueness theorems, ansatzes smuggled via citation, or renamings of known results appear in the provided sections. This is a standard architectural proposal with independent empirical support.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on pre-trained VLM and diffusion model capabilities plus the assumption that cross-modal alignment can be learned effectively; it introduces one new module and two new inference strategies but does not postulate new physical entities.

free parameters (1)
  • learnable weights in instruction prior fusion module
    These parameters are trained to align image and text features and are central to the claimed performance gains.
axioms (1)
  • domain assumption: Pre-trained large VLMs can produce textual descriptions that serve as reliable semantic priors for normal-light image content
    Invoked in the NL-IPG branch to generate enhancement cues without ground-truth normal-light images.
invented entities (1)
  • Instruction prior fusion module (no independent evidence)
    purpose: Dynamically align and fuse image features with textual instruction priors
    New component introduced to integrate cross-modal information inside the diffusion process.

pith-pipeline@v0.9.0 · 5816 in / 1430 out tokens · 60652 ms · 2026-05-19T02:59:49.094867+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · 5 internal anchors

  1. [1]

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. NeurIPS (2022)

  2. [2]

    Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. 2022. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)

  3. [3]

    Alan C Bovik. 2010. Handbook of image and video processing. Academic Press

  4. [4]

    Rongtai Cai and Zekun Chen. 2023. Brain-like retinex: A biologically plausible retinex algorithm for low light image enhancement. PR (2023)

  5. [5]

    Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, and Yulun Zhang. 2023. Retinexformer: One-stage retinex-based transformer for low-light image enhancement. In ICCV

  6. [6]

    Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang

  7. [7]

    Black-box prompt optimization: Aligning large language models without model training. arXiv preprint arXiv:2311.04155 (2023)

  8. [8]

    Duc-Tien Dang-Nguyen, Cecilia Pasquini, Valentina Conotter, and Giulia Boato. 2015. Raise: A raw images dataset for digital image forensics. In ACM MMSys

  9. [9]

    Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. 2019. Arcface: Additive angular margin loss for deep face recognition. In CVPR

  10. [10]

    Ben Fei, Zhaoyang Lyu, Liang Pan, Junzhe Zhang, Weidong Yang, Tianyue Luo, Bo Zhang, and Bo Dai

  11. [11]

    Generative diffusion prior for unified image restoration and enhancement. In CVPR

  12. [12]

    Zhenqi Fu, Yan Yang, Xiaotong Tu, Yue Huang, Xinghao Ding, and Kai-Kuang Ma. 2023. Learning a simple low-light image enhancer from paired low-light instances. In CVPR

  13. [13]

    Chunle Guo, Chongyi Li, Jichang Guo, Chen Change Loy, Junhui Hou, Sam Kwong, and Runmin Cong

  14. [14]

    Zero-reference deep curve estimation for low-light image enhancement. In CVPR

  15. [15]

    Jinpei Guo, Zheng Chen, Wenbo Li, Yong Guo, and Yulun Zhang. 2025. Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal. arXiv:2502.09873

  16. [16]

    Xiaojie Guo, Yu Li, and Haibin Ling. 2016. LIME: Low-light image enhancement via illumination map estimation. IEEE TIP (2016)

  17. [17]

    Jiang Hai, Zhu Xuan, Ren Yang, Yutong Hao, Fengzhu Zou, Fang Lin, and Songchen Han. 2023. R2rnet: Low-light image enhancement via real-low to real-normal network. JVCIR (2023)

  18. [18]

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. NeurIPS (2020)

  19. [19]

    Jinhui Hou, Zhiyu Zhu, Junhui Hou, Hui Liu, Huanqiang Zeng, and Hui Yuan. 2023. Global structure-aware diffusion process for low-light image enhancement. NeurIPS (2023)

  20. [20]

    Hai Jiang, Ao Luo, Haoqiang Fan, Songchen Han, and Shuaicheng Liu. 2023. Low-light image enhancement with wavelet-based diffusion models. ACM TOG (2023)

  21. [21]

    Yifan Jiang, Xinyu Gong, Ding Liu, Yu Cheng, Chen Fang, Xiaohui Shen, Jianchao Yang, Pan Zhou, and Zhangyang Wang. 2021. Enlightengan: Deep light enhancement without paired supervision. IEEE TIP (2021)

  22. [22]

    Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. 2021. Musiq: Multi-scale image quality transformer. In ICCV

  23. [23]

    Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  24. [24]

    Edwin H Land. 1977. The retinex theory of color vision. SciAm (1977)

  25. [25]

    Chulwoo Lee, Chul Lee, and Chang-Su Kim. 2013. Contrast enhancement based on layered difference representation of 2D histograms. IEEE TIP (2013)

  26. [26]

    JDMCK Lee and K Toutanova. 2018. Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  27. [27]

    Xiaozhou Lei, Zixiang Fei, Wenju Zhou, Huiyu Zhou, and Minrui Fei. 2022. Low-light image enhancement using the cell vibration model. IEEE TMM (2022)

  28. [28]

    Jinlong Li, Baolu Li, Zhengzhong Tu, Xinyu Liu, Qing Guo, Felix Juefei-Xu, Runsheng Xu, and Hongkai Yu. 2024. Light the night: A multi-condition diffusion framework for unpaired low-light enhancement in autonomous driving. In CVPR

  29. [29]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML

  30. [30]

    Mading Li, Jiaying Liu, Wenhan Yang, Xiaoyan Sun, and Zongming Guo. 2018. Structure-revealing low-light image enhancement via robust retinex model. IEEE TIP (2018)

  31. [31]

    Zhexin Liang, Chongyi Li, Shangchen Zhou, Ruicheng Feng, and Chen Change Loy. 2023. Iterative prompt learning for unsupervised backlit image enhancement. In ICCV

  32. [32]

    Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In CVPR

  33. [33]

    Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, and Yujiu Yang. 2023. Accelerating diffusion models for inverse problems through shortcut sampling. arXiv preprint arXiv:2305.16965 (2023)

  34. [34]

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual instruction tuning. NeurIPS (2023)

  35. [35]

    Yuhao Liu, Zhanghan Ke, Fang Liu, Nanxuan Zhao, and Rynson WH Lau. 2024. Diff-plugin: Revitalizing details for diffusion-based low-level tasks. In CVPR

  36. [36]

    Kin Gwn Lore, Adedotun Akintayo, and Soumik Sarkar. 2017. LLNet: A deep autoencoder approach to natural low-light image enhancement. (2017)

  37. [37]

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR

  38. [38]

    Ziwei Luo, Fredrik K Gustafsson, Zheng Zhao, Jens Sjölund, and Thomas B Schön. 2023. Controlling vision-language models for universal image restoration. arXiv preprint arXiv:2310.01018 (2023)

  39. [39]

    Long Ma, Dian Jin, Nan An, Jinyuan Liu, Xin Fan, Zhongxuan Luo, and Risheng Liu. 2023. Bilevel fast scene adaptation for low-light image enhancement. IJCV (2023)

  40. [40]

    Long Ma, Tengyu Ma, Risheng Liu, Xin Fan, and Zhongxuan Luo. 2022. Toward fast, flexible, and robust low-light image enhancement. In CVPR

  41. [41]

    Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. 2012. Making a “completely blind” image quality analyzer. IEEE SPL (2012)

  42. [42]

    Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In ICML

  43. [43]

    Ozan Özdenizci and Robert Legenstein. 2023. Restoring vision in adverse weather conditions with patch-based denoising diffusion models. IEEE TPAMI (2023)

  44. [44]

    Yunpeng Qu, Kun Yuan, Kai Zhao, Qizhi Xie, Jinhua Hao, Ming Sun, and Chao Zhou. 2024. Xpsr: Cross-modal priors for diffusion-based image super-resolution. In ECCV

  45. [45]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML

  46. [46]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020)

  47. [47]

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In CVPR

  48. [48]

    Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR

  49. [49]

    Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In ICCV

  50. [50]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  51. [51]

    Vassilios Vonikakis, Rigas Kouskouridas, and Antonios Gasteratos. 2018. On the evaluation of illumination compensation algorithms. MTAP (2018)

  52. [52]

    Haoyuan Wang, Ke Xu, and Rynson WH Lau. 2022. Local color distributions prior for image enhancement. In ECCV

  53. [53]

    Ruixing Wang, Qing Zhang, Chi-Wing Fu, Xiaoyong Shen, Wei-Shi Zheng, and Jiaya Jia. 2019. Underexposed photo enhancement using deep illumination estimation. In CVPR

  54. [54]

    Shuhang Wang, Jin Zheng, Hai-Miao Hu, and Bo Li. 2013. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE TIP (2013)

  55. [55]

    Wenjing Wang, Huan Yang, Jianlong Fu, and Jiaying Liu. 2024. Zero-reference low-light enhancement via physical quadruple priors. In CVPR

  56. [56]

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE TIP (2004)

  57. [57]

    Chen Wei, Wenjing Wang, Wenhan Yang, and Jiaying Liu. 2018. Deep retinex decomposition for low-light enhancement. arXiv preprint arXiv:1808.04560 (2018)

  58. [58]

    Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. 2022. Deblurring via stochastic refinement. In CVPR

  59. [59]

    Rongyuan Wu, Tao Yang, Lingchen Sun, Zhengqiang Zhang, Shuai Li, and Lei Zhang. 2024. Seesr: Towards semantics-aware real-world image super-resolution. In CVPR

  60. [60]

    Wenhui Wu, Jian Weng, Pingping Zhang, Xu Wang, Wenhan Yang, and Jianmin Jiang. 2022. Uretinex-net: Retinex-based deep unfolding network for low-light image enhancement. In CVPR

  61. [61]

    Xin Xu, Shiqin Wang, Zheng Wang, Xiaolong Zhang, and Ruimin Hu. 2021. Exploring image enhancement for salient object detection in low light images. ACM TOMM (2021)

  62. [62]

    Shuzhou Yang, Moxuan Ding, Yanmin Wu, Zihan Li, and Jian Zhang. 2023. Implicit neural representation for cooperative low-light image enhancement. In ICCV

  63. [63]

    Shaoliang Yang, Dongming Zhou, Jinde Cao, and Yanbu Guo. 2022. Rethinking low-light enhancement via transformer-GAN. IEEE SPL (2022)

  64. [64]

    Wenhan Yang, Shiqi Wang, Yuming Fang, Yue Wang, and Jiaying Liu. 2020. From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In CVPR

  65. [65]

    Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, and Jiayi Ma. 2023. Diff-retinex: Rethinking low-light image enhancement with a generative diffusion model. In ICCV

  66. [66]

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. In ICCV

  67. [67]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR

  68. [68]

    Yonghua Zhang, Xiaojie Guo, Jiayi Ma, Wei Liu, and Jiawan Zhang. 2021. Beyond brightening low-light images. IJCV (2021)

    A Additional Controllable Result

    Figure 10 presents additional controllable enhancement results alongside corresponding Grad-CAM visualizations [46], demonstrating how different text instructions lead to distinct output variations. The...

  69. [69]

    B.3 More Results on the Iterative Instruction strategy

    Figure 13 provides more visualizations of different iterative instruction strategies applied to three real-world datasets. When k = 2, the contrast and brightness of the image in the shadowed areas improve significantly, revealing clearer details. The overall image appears more natural in subje...