SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
Pith reviewed 2026-05-10 02:06 UTC · model grok-4.3
The pith
SmartPhotoCrafter automatically edits photos by reasoning about quality deficiencies and generating targeted enhancements without user instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SmartPhotoCrafter formulates automatic photographic image editing as a tightly coupled reasoning-to-generation process in which an Image Critic module identifies aesthetic deficiencies and a Photographic Artist module performs targeted edits. Trained via foundation pretraining, reasoning-guided multi-edit supervision, and coordinated reinforcement learning, the model delivers photo-realistic results on restoration and retouching tasks while adhering to color- and tone-related semantics.
What carries the argument
The unified reasoning-to-generation pipeline that pairs an Image Critic for deficiency identification with a Photographic Artist for edit realization, jointly optimized via multi-stage training that includes reinforcement learning.
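As a sketch of that coupling, the flow below pairs a critic that emits a structured deficiency report with an artist that conditions its edits on that report. The module interfaces, thresholds, and directive strings are illustrative assumptions, not the paper's: the actual Image Critic is a learned quality-comprehension model and the actual Photographic Artist is a generative editor.

```python
from dataclasses import dataclass

@dataclass
class CritiqueReport:
    """Structured critic output (fields are illustrative, not from the paper)."""
    deficiencies: list[str]       # e.g. ["low exposure", "subdued contrast"]
    edit_directives: list[str]    # semantic guidance handed to the artist

def image_critic(image: dict) -> CritiqueReport:
    """Stub critic: a real one would be a learned model scoring quality dimensions."""
    report = CritiqueReport(deficiencies=[], edit_directives=[])
    if image["exposure"] < 0.4:
        report.deficiencies.append("low exposure")
        report.edit_directives.append("raise shadows, preserve highlights")
    if image["contrast"] < 0.3:
        report.deficiencies.append("subdued contrast")
        report.edit_directives.append("increase midtone contrast")
    return report

def photographic_artist(image: dict, report: CritiqueReport) -> dict:
    """Stub artist: a real one would be a conditional generative model."""
    edited = dict(image)
    if "low exposure" in report.deficiencies:
        edited["exposure"] = min(1.0, edited["exposure"] + 0.25)
    if "subdued contrast" in report.deficiencies:
        edited["contrast"] = min(1.0, edited["contrast"] + 0.2)
    return edited

def smart_photo_crafter(image: dict) -> dict:
    """Reasoning-to-generation: critique first, then targeted edits, no user prompt."""
    return photographic_artist(image, image_critic(image))

dull = {"exposure": 0.3, "contrast": 0.25}
print(smart_photo_crafter(dull))  # both exposure and contrast raised
```

The point of the sketch is the data dependency: the artist never sees a user instruction, only the critic's report, which is what the staged RL coordination would have to keep aligned.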
If this is right
- The method supports both image restoration and retouching while maintaining consistent adherence to color- and tone-related semantics.
- It achieves higher tonal sensitivity to retouching needs than existing generative models.
- Photo-realistic enhancements become possible without requiring users to supply explicit aesthetic instructions.
- A stage-specific dataset progressively builds reasoning capability, controllable generation, and cross-module collaboration.
Where Pith is reading between the lines
- The same critic-plus-artist structure with staged reinforcement learning could be adapted to other generative tasks such as video enhancement or style transfer where internal quality assessment is needed.
- If the critic's judgments prove stable across cultural or stylistic variations, the model might reduce reliance on subjective user prompts in consumer photo apps.
- Mobile-camera integration could allow real-time automatic corrections during capture by running the reasoning step on-device before final image output.
Load-bearing premise
The Image Critic can reliably detect aesthetic deficiencies and the training data plus reinforcement learning produce edits that match broad human aesthetic preferences without any explicit instructions.
What would settle it
A side-by-side human evaluation study on the same input photographs in which participants rate SmartPhotoCrafter outputs against those from instruction-based editing models or professional retouchers for realism, tonal accuracy, and overall appeal.
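Such a paired study reduces to counting per-image preferences, which an exact sign test can summarize. A minimal sketch, assuming ties are dropped and each photo pair yields one preference; the tallies below are hypothetical.

```python
from math import comb

def two_sided_sign_test(wins: int, losses: int) -> float:
    """Exact two-sided binomial test on paired preference counts,
    under H0: either system is preferred with probability 0.5."""
    n = wins + losses
    k = max(wins, losses)
    # One tail: P(X >= k) under Binomial(n, 0.5); double it for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical tallies: of 50 photo pairs, raters preferred system A 36 times.
p = two_sided_sign_test(wins=36, losses=14)
print(f"p = {p:.4f}")
```

A real study would add per-rater agreement and separate scales for realism, tonal accuracy, and appeal, but even this bare count is the kind of evidence the claim currently lacks.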
Original abstract
Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to-generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SmartPhotoCrafter, a unified model for automatic photographic image editing formulated as a reasoning-to-generation process. It consists of an Image Critic module that performs image quality comprehension and identifies aesthetic deficiencies, followed by a Photographic Artist module that executes targeted edits for enhancement without requiring explicit human instructions. The approach uses a three-stage training pipeline—foundation pretraining for basic aesthetic understanding, adaptation via reasoning-guided multi-edit supervision, and coordinated reinforcement learning to jointly optimize reasoning and generation—along with stage-specific datasets. Experiments are claimed to show outperformance over existing generative models in photo-realistic enhancement, with strong adherence to color- and tone-related semantics for both restoration and retouching tasks.
Significance. If the empirical results hold after controlling for model scale and data, the work could meaningfully advance automatic, instruction-free image editing by making professional-level photographic adjustments accessible to non-experts. The tight coupling of comprehension and generation through RL coordination, combined with explicit emphasis on photo-realism and tonal sensitivity, offers a coherent pipeline that addresses a practical gap in consumer photography tools. The progressive dataset construction for building cross-module collaboration is a constructive element.
minor comments (2)
- [Abstract] The abstract states that experiments demonstrate outperformance and higher tonal sensitivity but provides no quantitative metrics, baseline models, or dataset sizes; adding these details (even at a high level) would strengthen the summary for readers.
- The description of the Image Critic's deficiency identification and the RL coordination objective remains high-level; a concrete example of a reasoning trace or loss formulation would clarify how the modules interact without explicit instructions.
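On the second point, one way to make the coordination objective concrete is a scalar reward blending quality gain, compliance with the critic's directives, and realism. Nothing below is from the paper; the terms and weights are purely illustrative placeholders for whatever the authors actually optimize.

```python
def coordinated_reward(quality_gain: float,
                       directive_compliance: float,
                       realism: float,
                       w_q: float = 0.5, w_c: float = 0.3, w_r: float = 0.2) -> float:
    """Hypothetical scalar reward for jointly training critic and artist.
    - quality_gain: change in an aesthetic score after editing, in [-1, 1]
    - directive_compliance: how well the edit matches the critic's
      directives, in [0, 1]
    - realism: photo-realism score of the output, in [0, 1]
    Weights are illustrative, not taken from the paper."""
    return w_q * quality_gain + w_c * directive_compliance + w_r * realism

# A well-followed, realistic edit with modest quality gain:
print(round(coordinated_reward(0.4, 0.9, 0.8), 2))  # prints 0.63
```

Including the compliance term is what would tie the artist's policy back to the critic's reasoning; a revision spelling out the real counterpart of each term would resolve the comment.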
Simulated Author's Rebuttal
We thank the referee for the positive assessment of SmartPhotoCrafter, including the recognition of its unified reasoning-to-generation pipeline, multi-stage training, and potential impact on automatic photographic editing. We appreciate the minor-revision recommendation and will incorporate the requested clarifications in the revised manuscript.
Circularity Check
No significant circularity; empirical pipeline with no derivations
full rationale
The paper describes a procedural multi-stage training pipeline (foundation pretraining, reasoning-guided adaptation, and RL coordination) for an image editing model consisting of an Image Critic and Photographic Artist. All central claims of outperformance and tonal sensitivity are presented as empirical experimental results rather than mathematical derivations or first-principles predictions. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text. The method is self-contained as a descriptive architecture whose validity rests on external benchmarks and datasets, not on internal reductions to its own inputs.
However, there is noticeable low exposure, particularly in the darker areas surrounding the dog
Image Quality/Degradations Analysis: The image exhibits no significant motion blur, defocus, or distortion. However, there is noticeable low exposure, particularly in the darker areas surrounding the dog. This creates a lack of detail in the background and could be considered a form of visual interference due to shadow occlusion. There is no apparent digi...
-
[80]
The white balance seems accurate, as the dog’s fur tones look true-to-life
Color Performance&Lighting Analysis: The saturation appears moderate, with natural-looking colors. The white balance seems accurate, as the dog’s fur tones look true-to-life. However, the contrast is somewhat subdued due to the low exposure, leading to less distinction between the subject and its surroundings. The lighting is focused on the dog, creating ...
-
[81]
The focus is sharp on the dog, drawing attention directly to it
Composition&Layout Analysis: The main subject, the dog, is clearly defined and positioned slightly off-center, adhering to the rule of thirds. The focus is sharp on the dog, drawing attention directly to it. The camera viewpoint is at eye level, providing a natural perspective. The background blur is effective in isolating the subject, though it might be ...
-
[82]
The creativity lies in the use of lighting and shadow to evoke emotion
Aesthetic Impression Analysis: The tone style is somewhat muted due to the low exposure, giving the image a contemplative and serene feel. The creativity lies in the use of lighting and shadow to evoke emotion. The emotional expression is calm and introspective, with the dog appearing relaxed yet alert. The semantic richness is moderate, as the image tell...