
arxiv: 2512.00993 · v2 · submitted 2025-11-30 · 💻 cs.CV

PhotoFramer: Multi-modal Image Composition Instruction

Pith reviewed 2026-05-17 02:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords photo composition · multi-modal guidance · image-to-image generation · instruction tuning · photography assistance · view synthesis · synthetic degradation

The pith

A model supplies natural-language advice and corrected example images to fix poorly composed photos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a system that accepts a badly framed photo and returns both natural-language instructions for improvement and a new, better-composed picture. They care about this because many casual photographers struggle to frame well-composed shots. To train the system they create a dataset organized as three sub-tasks: shifting the frame, zooming in, and changing the viewpoint. Shift and zoom data are sampled from existing cropping datasets; viewpoint data are made by training a degradation model on pairs from multi-view datasets and then applying it to expert photos. The resulting model is fine-tuned to process and generate both text and images. Experiments show that the text instructions effectively steer composition, and that pairing them with an example image gives consistent gains over baselines that use only examples.
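As a rough illustration of how shift and zoom-in pairs might be derived from a cropping annotation, the sketch below treats an annotated good crop as the target and builds a poorer source view either by translating the window (shift) or by widening it around its centre (zoom-in). The `Box` type, the function names, and the offset and scale values are hypothetical; the paper's actual sampling procedure is not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop window in pixel coordinates (hypothetical helper)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def clamp_to_image(box: Box, img_w: int, img_h: int) -> Box:
    """Translate the box, without resizing it, so that it lies inside the image.

    Assumes the box is no larger than the image itself.
    """
    dx = max(0, -box.left) - max(0, box.right - img_w)
    dy = max(0, -box.top) - max(0, box.bottom - img_h)
    return Box(box.left + dx, box.top + dy, box.right + dx, box.bottom + dy)


def shifted_source(good: Box, dx: int, dy: int, img_w: int, img_h: int) -> Box:
    """Shift task: the source view is the good crop moved by (dx, dy)."""
    moved = Box(good.left + dx, good.top + dy, good.right + dx, good.bottom + dy)
    return clamp_to_image(moved, img_w, img_h)


def zoomed_out_source(good: Box, scale: float, img_w: int, img_h: int) -> Box:
    """Zoom-in task: the source view is a wider window around the good crop,
    clipped to the image bounds, so that zooming in recovers the good crop."""
    cx = (good.left + good.right) / 2
    cy = (good.top + good.bottom) / 2
    half_w = good.width * scale / 2
    half_h = good.height * scale / 2
    return Box(
        max(0, round(cx - half_w)),
        max(0, round(cy - half_h)),
        min(img_w, round(cx + half_w)),
        min(img_h, round(cy + half_h)),
    )


# Toy example: a 640x480 image with one annotated good crop.
good_crop = Box(100, 50, 500, 350)
shift_pair = (shifted_source(good_crop, 200, 0, 640, 480), good_crop)
zoom_pair = (zoomed_out_source(good_crop, 1.5, 640, 480), good_crop)
```

Each pair is ordered as (source view, target view), which matches the input-to-output direction of the task: the model sees the poorer framing and is asked to produce the better one.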

Core claim

Given a poorly composed input image, PhotoFramer first produces natural-language instructions describing how to improve the composition and then generates a well-composed example image; the model is trained on a hierarchical dataset whose three levels are shift, zoom-in, and view-change, with the last level synthesized by a degradation model applied to expert photographs.

What carries the argument

The hierarchical decomposition of composition guidance into shift, zoom-in, and view-change tasks, together with a two-stage synthetic pipeline that first learns to degrade good photos and then applies the learned degradation to create training pairs.
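The two-stage view-change pipeline can be sketched structurally as follows. This is a minimal outline under stated assumptions, not the authors' implementation: `score`, `degrade`, and `min_gap` are hypothetical stand-ins, and using a composition scorer to decide which view in a multi-view scene counts as the good one is an assumption made for illustration.

```python
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")


def stage1_pairs(
    scenes: Iterable[Sequence[Image]],
    score: Callable[[Image], float],
    min_gap: float,
) -> List[Tuple[Image, Image]]:
    """Stage 1: collect (good, poor) view pairs from multi-view scenes.

    For each scene, the highest-scoring view is paired with every view whose
    score is lower by at least `min_gap` (a hypothetical threshold).
    """
    pairs: List[Tuple[Image, Image]] = []
    for views in scenes:
        ranked = sorted(views, key=score, reverse=True)
        if not ranked:
            continue
        best = ranked[0]
        for other in ranked[1:]:
            if score(best) - score(other) >= min_gap:
                pairs.append((best, other))
    return pairs


# Between the stages, a degradation model would be trained on the stage-1
# pairs to map a good view to a poor one. Training is omitted here.


def stage2_pairs(
    expert_photos: Iterable[Image],
    degrade: Callable[[Image], Image],
) -> List[Tuple[Image, Image]]:
    """Stage 2: apply the degradation model to expert photos.

    Returns (poor_input, good_target) training pairs.
    """
    return [(degrade(photo), photo) for photo in expert_photos]


# Toy usage with strings standing in for images.
toy_scores = {"view_a": 0.9, "view_b": 0.5, "view_c": 0.85}
toy_stage1 = stage1_pairs([["view_a", "view_b", "view_c"]], toy_scores.get, 0.2)
toy_stage2 = stage2_pairs(["expert_1"], lambda img: img + "_degraded")
```

Note the reversal of direction between the stages: stage 1 supplies good-to-poor examples for the degradation model, while stage 2 emits poor-to-good pairs for training the composition model.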

If this is right

  • Textual instructions can effectively steer composition decisions in a multi-modal model.
  • Pairing the same instructions with an illustrative example image produces consistent gains over an example-only baseline.
  • The trained model can be used directly as a practical composition assistant for everyday phone photography.
  • Expert photographic priors can be made accessible without requiring users to study formal rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such guidance could be embedded inside camera apps to give live suggestions before the shutter is pressed.
  • The same hierarchical task structure might transfer to other visual decision tasks such as video framing or product photography.
  • If the synthetic degradation step generalizes, similar pipelines could be used to create training data for other subjective image-quality tasks.

Load-bearing premise

The synthetic degradation model produces poor-composition images that match the distribution of mistakes real users actually make when taking photos.

What would settle it

A user study in which participants are given either the model's text-plus-image guidance or no guidance, then asked to retake the same scene, followed by blind expert ratings of the resulting photographs for composition quality.
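One way the expert ratings from such a study could be compared is an exact one-sided permutation test on the two groups. The sketch below is illustrative only: the ratings are invented integers on a hypothetical 1 to 10 scale, the function name is an assumption, and a real study would also need to account for repeated scenes and rater effects.

```python
from fractions import Fraction
from itertools import combinations
from typing import Sequence


def exact_permutation_p(guided: Sequence[int], unguided: Sequence[int]) -> Fraction:
    """One-sided exact permutation test for 'guided ratings are higher'.

    Enumerates every way of assigning the pooled ratings to a group of the
    same size as `guided` and returns the fraction of assignments whose
    rating sum is at least as large as the observed one. Integer ratings
    keep the comparison exact.
    """
    pooled = list(guided) + list(unguided)
    group_size = len(guided)
    observed = sum(guided)
    total = 0
    at_least_as_extreme = 0
    for indices in combinations(range(len(pooled)), group_size):
        total += 1
        if sum(pooled[i] for i in indices) >= observed:
            at_least_as_extreme += 1
    return Fraction(at_least_as_extreme, total)


# Invented example ratings, not data from the paper.
p_value = exact_permutation_p([7, 8, 9, 8], [4, 5, 3, 5])
```

Full enumeration is only practical for small groups; with larger samples the same statistic is usually estimated from random reassignments instead.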

Figures

Figures reproduced from arXiv: 2512.00993 by Chao Dong, He Zhang, Jinjin Gu, Ke Wang, Tianfan Xue, Xin Cai, Zhiyuan You, Zhoutong Zhang.

Figure 1: We propose PhotoFramer, a model designed for composition instruction during photo capturing. Given a poorly composed …
Figure 2: Task paradigm and data example. Given a poorly composed image, our PhotoFramer is required to generate a …
Figure 3: Dataset construction for the shift and zoom-in tasks. For …
Figure 5: Dataset construction for the view-change task. (a) Leveraging our composition assessment model in Sec. …
Figure 6: PhotoFramer architecture. We adopt Bagel [ …
Figure 7: Qualitative comparison between our PhotoFramer and baseline methods. Open-source editing models fail to improve composi…
Figure 10: Text guidance is important for image generation. If we …
Figure 12: Finetuned Bagel (i.e., on both textual and visual data) successfully includes the whole wooden house as in the text guidance, outperforming finetuned Kontext (i.e., on visual data only). Panel labels: Original, Result, Text Guidance.
Original abstract

Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed input image, the model outputs natural-language guidance on improving the framing together with a generated well-composed example image. Composition guidance is organized hierarchically into shift, zoom-in, and view-change subtasks. Shift and zoom-in pairs are drawn from existing cropping datasets; view-change pairs are synthesized by sampling multi-view data, training a degradation model that maps expert photos to poor compositions, and applying the model to expert images. A joint text-and-image model is then fine-tuned on the resulting dataset. The abstract states that extensive experiments show textual instructions effectively steer composition and that coupling instructions with exemplars yields consistent improvements over exemplar-only baselines.

Significance. If the central claims hold after the requested validation, the work would offer a practical route toward interactive composition assistants that make photographic priors accessible via both language and visual examples. The hierarchical task decomposition and the two-stage synthetic-data pipeline for viewpoint changes constitute a concrete contribution to dataset construction in this area.

major comments (2)
  1. [Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.
  2. [§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by the inclusion of at least one key quantitative result (e.g., preference rate or FID) to ground the 'consistent improvements' statement.
  2. [§3] Notation for the degradation model (e.g., input/output domains and loss terms) should be introduced explicitly when first referenced in the method section.
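The statistical reporting requested in major comment 1 could, for a pairwise preference study, take the form of an exact two-sided binomial (sign) test against a 50/50 null. The sketch below is a minimal illustration; the function name and the counts are hypothetical, and tied judgments are assumed to have been excluded beforehand.

```python
from fractions import Fraction
from math import comb


def sign_test_p_value(wins: int, losses: int) -> Fraction:
    """Exact two-sided binomial test with null probability 0.5.

    `wins` counts comparisons where raters preferred the text-plus-exemplar
    output, `losses` counts those where they preferred the exemplar-only
    output. Ties are assumed to be dropped before calling.
    """
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one non-tied comparison")
    larger = max(wins, losses)
    # Probability of a result at least as lopsided in one direction.
    tail = Fraction(sum(comb(n, i) for i in range(larger, n + 1)), 2 ** n)
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(Fraction(1), 2 * tail)


# Invented counts, not results from the paper.
p_value = sign_test_p_value(9, 1)
```

Reporting the raw counts next to the p-value lets a reader judge the size of the preference as well as its reliability.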

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions we plan to incorporate.

  1. Referee: [Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.

    Authors: We agree that the current description of the human evaluation in §4 is insufficiently detailed to allow full assessment of the reported gains. The preference study compared exemplar-only baselines against our multi-modal outputs, but the manuscript omitted a complete protocol description and statistical reporting. In the revised manuscript we will expand §4 to include: a precise description of the human-preference protocol (number of participants, image sampling method, question wording, and presentation order); quantitative preference percentages with standard error bars; and statistical significance tests (e.g., binomial or McNemar tests). These additions will be reflected in both §4 and the abstract where appropriate.
    Revision: yes

  2. Referee: [§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.

    Authors: We acknowledge that explicit validation of the synthetic poor-composition distribution is important. The degradation model was trained on real multi-view pairs drawn from existing datasets, and its outputs were used to create training pairs for the view-change task. While we did not conduct a dedicated human study directly comparing synthetic degradations to real user framing errors, we evaluated the model with quantitative perceptual metrics on held-out multi-view pairs and observed consistent downstream gains. In the revision we will add these quantitative similarity metrics (e.g., LPIPS and SSIM on validation pairs) and additional qualitative examples to §3.2, together with a clearer discussion of the modeling assumptions and their limitations. A full-scale human comparison against real user mistakes would require new paired data collection and is noted as future work.
    Revision: partial
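For readers unfamiliar with the similarity metrics named in the second response, the sketch below computes a simplified, single-window SSIM between two grayscale images given as flat lists of pixel values. It is an illustration only: standard SSIM averages the same expression over local windows, the function name is an assumption, and LPIPS is not shown because it depends on a learned network.

```python
from typing import Sequence


def global_ssim(
    x: Sequence[float],
    y: Sequence[float],
    dynamic_range: float = 255.0,
) -> float:
    """Single-window SSIM over whole images (simplified illustration).

    Uses population statistics over all pixels and the usual stabilising
    constants C1 = (0.01 L)^2 and C2 = (0.03 L)^2, where L is the dynamic
    range of the pixel values.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


# Toy 4-pixel images: an identical pair and an inverted pair.
same = global_ssim([10, 20, 30, 40], [10, 20, 30, 40])
inverted = global_ssim([0, 255, 0, 255], [255, 0, 255, 0])
```

A pixel-level score of this kind measures how closely a synthetic degradation reproduces a held-out view; it does not by itself show that the degradations resemble real user framing errors, which is the referee's underlying concern.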

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's method curates training data from external cropping datasets for shift/zoom-in tasks and uses a separate two-stage pipeline on multi-view datasets to train a degradation model that synthesizes view-change pairs before finetuning the multi-modal model; experiments then report performance gains. No equation, result, or claim reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The pipeline depends on independent external sources and a separately trained component whose outputs are not tautologically equivalent to the target composition-steering claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the synthetic degradation model produces realistic poor compositions and that the generated examples are objectively better; no explicit free parameters or invented physical entities are introduced in the abstract description.

axioms (1)
  • Domain assumption: Human photographers improve composition by shifting, zooming, or changing viewpoint.
    Invoked to justify the three sub-task hierarchy in the data curation section.

pith-pipeline@v0.9.0 · 5542 in / 1179 out tokens · 36893 ms · 2026-05-17T02:43:01.180311+00:00 · methodology



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

    cs.CV 2026-05 unverdicted novelty 6.0

    CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

  2. LumiVideo: An Intelligent Agentic System for Video Color Grading

    cs.CV 2026-04 unverdicted novelty 6.0

    LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 2 Pith papers · 11 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025. 2, 3, 4, 5, 13, 14

  2. [2]

    FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

    Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742,

  3. [3]

    Instruction-based image manipulation by watching how things move

    Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. In CVPR, 2025. 2

  4. [4]

    ArtiMuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding

    Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. ArtiMuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025. 2

  5. [5]

    Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study

    Yi-Ling Chen, Tzu-Wei Huang, Kai-Han Chang, Yu-Chen Tsai, Hwann-Tzong Chen, and Bing-Yu Chen. Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study. In WACV, 2017. 2, 4

  6. [6]

    Learning to compose with professional photographs on the web

    Yi-Ling Chen, Jan Klopp, Min Sun, Shao-Yi Chien, and Kwan-Liu Ma. Learning to compose with professional photographs on the web. In ACM MM, 2017. 2

  7. [7]

    Mobile computational photography: A tour. Annual Review of Vision Science, 2021

    Mauricio Delbracio, Damien Kelly, Michael S Brown, and Peyman Milanfar. Mobile computational photography: A tour. Annual Review of Vision Science, 2021. 1

  8. [8]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 2, 3, 5, 6, 7

  9. [9]

    Unsplash lite dataset 1.3.0, 2020

    Unsplash Developers. Unsplash lite dataset 1.3.0, 2020. 2, 5, 16

  10. [10]

    Automatic image cropping using visual composition, boundary simplicity and content preservation models

    Chen Fang, Zhe Lin, Radomir Mech, and Xiaohui Shen. Automatic image cropping using visual composition, boundary simplicity and content preservation models. In ACM MM,

  11. [11]

    Intelligent portrait composition assistance: Integrating deep-learned models and photography idea retrieval

    Farshid Farhat, Mohammad Mahdi Kamani, Sahil Mishra, and James Z Wang. Intelligent portrait composition assistance: Integrating deep-learned models and photography idea retrieval. In ACM MM Workshops, 2017. 2

  12. [12]

    CAPTAIN: Comprehensive composition assistance for photo taking. ACM Transactions on Multimedia Computing, Communications, and Applications, 2022

    Farshid Farhat, Mohammad Mahdi Kamani, and James Z Wang. CAPTAIN: Comprehensive composition assistance for photo taking. ACM Transactions on Multimedia Computing, Communications, and Applications, 2022. 2

  13. [13]

    Routledge, 2017

    Michael Freeman. The Photographer’s Eye Digitally Remastered 10th Anniversary Edition: Composition and Design for Better Digital Photos. Routledge, 2017. 2

  14. [14]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. SEED-X: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396,

  15. [15]

    Automatic image cropping for visual aesthetic enhancement using deep neural networks and cascaded regression. IEEE TMM, 2018

    Guanjun Guo, Hanzi Wang, Chunhua Shen, Yan Yan, and Hong-Yuan Mark Liao. Automatic image cropping for visual aesthetic enhancement using deep neural networks and cascaded regression. IEEE TMM, 2018. 2

  16. [16]

    Composing photos like a photographer

    Chaoyi Hong, Shuaiyuan Du, Ke Xian, Hao Lu, Zhiguo Cao, and Weicai Zhong. Composing photos like a photographer. In CVPR, 2021. 2

  17. [17]

    Learning subject-aware cropping by outpainting professional photos

    James Hong, Lu Yuan, Michaël Gharbi, Matthew Fisher, and Kayvon Fatahalian. Learning subject-aware cropping by outpainting professional photos. In AAAI, 2024. 2

  18. [18]

    Learned smartphone ISP on mobile GPUs, mobile AI 2025 challenge: Report

    Andrey Ignatov, Georgii Perevozchikov, Radu Timofte, Cheng Li, Lian Liu, Jun Cao, Heng Sun, Wu Pan, Song Wang, KeQiang Yu, et al. Learned smartphone ISP on mobile GPUs, mobile AI 2025 challenge: Report. In CVPR Workshops, 2025. 1

  19. [19]

    Decoupled weight decay regularization

    Loshchilov Ilya and Hutter Frank. Decoupled weight decay regularization. In ICLR, 2019. 6, 14

  20. [20]

    Rethinking image cropping: Exploring diverse compositions from global views

    Gengyun Jia, Huaibo Huang, Chaoyou Fu, and Ran He. Rethinking image cropping: Exploring diverse compositions from global views. In CVPR, 2022. 2

  21. [21]

    PIPAL: A large-scale image quality assessment dataset for perceptual image restoration

    Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. PIPAL: A large-scale image quality assessment dataset for perceptual image restoration. In ECCV, 2020. 6

  22. [22]

    Adam: A method for stochastic optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. In ICLR, 2015. 12

  23. [23]

    GPT-4V(ision) system card, 2024

    Black Forest Labs. GPT-4V(ision) system card, 2024. 6

  24. [24]

    Routledge, 2013

    Michael Langford. Basic photography. Routledge, 2013. 2

  25. [25]

    Photographic composition classification and dominant geometric element detection for outdoor scenes. Journal of Visual Communication and Image Representation, 2018

    Jun-Tae Lee, Han-Ul Kim, Chul Lee, and Chang-Su Kim. Photographic composition classification and dominant geometric element detection for outdoor scenes. Journal of Visual Communication and Image Representation, 2018. 2, 5, 12

  26. [26]

    A2-RL: Aesthetics aware reinforcement learning for image cropping

    Debang Li, Huikai Wu, Junge Zhang, and Kaiqi Huang. A2-RL: Aesthetics aware reinforcement learning for image cropping. In CVPR, 2018. 2

  27. [27]

    Learning to learn cropping models for different aspect ratio requirements

    Debang Li, Junge Zhang, and Kaiqi Huang. Learning to learn cropping models for different aspect ratio requirements. In CVPR, 2020. 2

  28. [28]

    Composing good shots by exploiting mutual relations

    Debang Li, Junge Zhang, Kaiqi Huang, and Ming-Hsuan Yang. Composing good shots by exploiting mutual relations. In CVPR, 2020. 2

  29. [29]

    Towards smart point-and-shoot photography

    Jiawan Li, Fei Zhou, Zhipeng Zhong, Jiongzhi Lin, and Guoping Qiu. Towards smart point-and-shoot photography. In CVPR, 2025. 2

  30. [30]

    Q-Insight: Understanding image quality via visual reinforcement learning

    Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-Insight: Understanding image quality via visual reinforcement learning. In NeurIPS,

  31. [31]

    Harnessing diffusion-yielded score priors for image restoration. ACM TOG, 2025

    Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S Ren, Jinjin Gu, and Chao Dong. Harnessing diffusion-yielded score priors for image restoration. ACM TOG, 2025. 4, 14

  32. [32]

    DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, 2024. 2, 5, 14

  33. [33]

    Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization. arXiv preprint arXiv:2509.21871, 2025

    Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, and Xuanjing Huang. Unlocking the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization. arXiv preprint arXiv:2509.21871, 2025. 2

  34. [34]

    On the limited memory bfgs method for large scale optimization. Mathematical Programming, 1989

    Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization. Mathematical Programming, 1989. 12

  35. [35]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2024. 2

  36. [36]

    Beyond image borders: Learning feature extrapolation for unbounded image composition

    Xiaoyu Liu, Ming Liu, Junyi Li, Shuai Liu, Xiaotao Wang, Lei Lei, and Wangmeng Zuo. Beyond image borders: Learning feature extrapolation for unbounded image composition. In ICCV, 2023. 2

  37. [37]

    Image and video processing on mobile devices: a survey. The Visual Computer, 2021

    Chamin Morikawa, Michihiro Kobayashi, Masaki Satoh, Yasuhiro Kuroda, Teppei Inomata, Hitoshi Matsuo, Takeshi Miura, and Masaki Hilaga. Image and video processing on mobile devices: a survey. The Visual Computer, 2021. 1

  38. [38]

    AVA: A large-scale database for aesthetic visual analysis

    Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In CVPR, 2012. 4, 5, 12

  39. [39]

    GPT-4V(ision) system card, 2023

    OpenAI. GPT-4V(ision) system card, 2023. 2

  40. [40]

    GPT-4o System Card

    OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024. 7

  41. [41]

    GPT-5 system card, 2025

    OpenAI. GPT-5 system card, 2025. 6, 7

  42. [42]

    DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. 4

  43. [43]

    U2-Net: Going deeper with nested u-structure for salient object detection. PR, 2020

    Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood Dehghan, Osmar R Zaiane, and Martin Jagersand. U2-Net: Going deeper with nested u-structure for salient object detection. PR, 2020. 4

  44. [44]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 4

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022. 2

  46. [46]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepseekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 5, 13

  47. [47]

    Spatial-semantic collaborative cropping for user generated content

    Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, and Qingyao Wu. Spatial-semantic collaborative cropping for user generated content. In AAAI, 2024. 2

  48. [48]

    Emu: Generative pretraining in multimodality

    Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. In ICLR, 2024. 3

  49. [49]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024. 2, 3

  50. [50]

    DXOMARK - quality testing, scores and reviews, 2025

    DXOMARK Team. DXOMARK - quality testing, scores and reviews, 2025. 1

  51. [51]

    MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

    Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. MetaMorph: Multimodal understanding and generation via instruction tuning. arXiv preprint arXiv:2412.14164, 2024. 3

  52. [52]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025. 6

  53. [53]

    Image cropping with composition and saliency aware aesthetic score map

    Yi Tu, Li Niu, Weijie Zhao, Dawei Cheng, and Liqing Zhang. Image cropping with composition and saliency aware aesthetic score map. In AAAI, 2020. 2

  54. [54]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025. 5

  55. [55]

    Deep cropping via attention box prediction and aesthetics assessment

    Wenguan Wang and Jianbing Shen. Deep cropping via attention box prediction and aesthetics assessment. In ICCV,

  56. [56]

    Good view hunting: Learning photo composition from dense view pairs

    Zijun Wei, Jianming Zhang, Xiaohui Shen, Zhe Lin, Radomir Mech, Minh Hoai, and Dimitris Samaras. Good view hunting: Learning photo composition from dense view pairs. In CVPR, 2018. 2, 3, 4, 11, 14

  57. [57]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, 2025. 3

  58. [58]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025. 6, 7

  59. [59]

    Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels

    Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. In ICML,

  60. [60]

    NExT-GPT: Any-to-any multimodal llm

    Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal llm. In ICML,

  61. [61]

    Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank

    Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. Visualquality-r1: Reasoning-induced image quality assessment via reinforcement learning to rank. In NeurIPS, 2025. 5, 13

  62. [62]

    Show-o: One single transformer to unify multimodal understanding and generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025. 2, 3

  63. [63]

    Learning the change for automatic image cropping

    Jianzhou Yan, Stephen Lin, Sing Bing Kang, and Xiaoou Tang. Learning the change for automatic image cropping. In CVPR, 2013. 2, 4, 14

  64. [64]

    Focusing on your subject: Deep subject-aware image composition recommendation networks

    Guo-Ye Yang, Wen-Yang Zhou, Yun Cai, Song-Hai Zhang, and Fang-Lue Zhang. Focusing on your subject: Deep subject-aware image composition recommendation networks. Computational Visual Media, 2023. 2, 4, 14

  65. [65]

    Descriptive image quality assessment in the wild

    Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, and Tianfan Xue. Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842,

  66. [66]

    Depicting beyond scores: Advancing image quality assessment through multi-modal language models

    Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In ECCV, 2024. 3, 6

  67. [67]

    Teaching large language models to regress accurate image quality scores using score distribution

    Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In CVPR,

  68. [68]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014. 14

  69. [69]

    Reliable and efficient image cropping: A grid anchor based approach

    Hui Zeng, Lida Li, Zisheng Cao, and Lei Zhang. Reliable and efficient image cropping: A grid anchor based approach. In CVPR, 2019. 2

  70. [70]

    Grid anchor based image cropping: A new benchmark and an efficient model

    Hui Zeng, Lida Li, Zisheng Cao, and Lei Zhang. Grid anchor based image cropping: A new benchmark and an efficient model. IEEE TPAMI, 2020. 2, 3, 4, 5, 12, 14

  71. [71]

    Image composition assessment with saliency-augmented multi-pattern pooling

    Bo Zhang, Li Niu, and Liqing Zhang. Image composition assessment with saliency-augmented multi-pattern pooling. In BMVC, 2021. 2, 4, 5, 12

  72. [72]

    Human-centric image cropping with partition-aware and content-preserving features

    Bo Zhang, Li Niu, Xing Zhao, and Liqing Zhang. Human-centric image cropping with partition-aware and content-preserving features. In ECCV, 2022. 2

  73. [73]

    ProCrop: Learning aesthetic image cropping from professional compositions

    Ke Zhang, Tianyu Ding, Jiachen Jiang, Tianyi Chen, Ilya Zharkov, Vishal M Patel, and Luming Liang. ProCrop: Learning aesthetic image cropping from professional compositions. arXiv preprint arXiv:2505.22490, 2025. 2

  74. [74]

    Pose-based composition improvement for portrait photographs

    Xiaoyan Zhang, Zhuopeng Li, Martin Constable, Kap Luk Chan, Zhenhua Tang, and Gaoyang Tang. Pose-based composition improvement for portrait photographs. IEEE TCSVT, 2018. 2

  75. [75]

    Can machines understand composition? dataset and benchmark for photographic image composition embedding and understanding

    Zhaoran Zhao, Peng Lu, Anran Zhang, Peipei Li, Xia Li, Xuannan Liu, Yang Hu, Shiyi Chen, Liwei Wang, and Wenhao Guo. Can machines understand composition? dataset and benchmark for photographic image composition embedding and understanding. In CVPR, 2025. 2

  76. [76]

    MiniGPT-4: Enhancing vision-language understanding with advanced large language models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024. 2

  77. [77]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025. 2

Appendix A. More Results

We have added more qualitative results in Figs. A9 to A11, including t...