PhotoFramer: Multi-modal Image Composition Instruction

arxiv: 2512.00993 · v2 · submitted 2025-11-30 · 💻 cs.CV

Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang

Pith reviewed 2026-05-17 02:43 UTC · model grok-4.3

classification 💻 cs.CV

keywords photo composition · multi-modal guidance · image-to-image generation · instruction tuning · photography assistance · view synthesis · synthetic degradation

The pith

A model supplies natural-language advice and corrected example images to fix poorly composed photos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors build a system that accepts a badly framed photo and returns both natural-language instructions for improving it and a new, better-composed picture. Their motivation is that most casual photographers do not know the standard rules of framing and so produce unbalanced shots. To train the system they build a dataset organized into three sub-tasks: shifting the frame, zooming in, and changing the viewpoint. Shift and zoom-in data come from existing cropping datasets; view-change data are made by training a degradation model on multi-view datasets and then applying it to expert photos. A model that handles text and images jointly is fine-tuned on this dataset. The reported experiments indicate that text instructions alone already steer composition, and that pairing them with an example image improves on baselines that use only examples.

Core claim

Given a poorly composed input image, PhotoFramer first produces natural-language instructions describing how to improve the composition and then generates a well-composed example image; the model is trained on a hierarchical dataset whose three levels are shift, zoom-in, and view-change, with the last level synthesized by a degradation model applied to expert photographs.

What carries the argument

The hierarchical decomposition of composition guidance into shift, zoom-in, and view-change tasks, together with a two-stage synthetic pipeline that first learns to degrade good photos and then applies the learned degradation to create training pairs.
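
To make the shift and zoom-in levels of the hierarchy concrete, the sketch below derives a "poor" source crop from a well-composed crop box. It is an illustration under assumptions, not the authors' sampling code: the `Box` type, `clamp_box`, `shift_source`, `zoom_source`, and the displacement and scale parameters are hypothetical names chosen here, and the paper's actual procedure for sampling from cropping datasets may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop in pixel coordinates (left, top, right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def clamp_box(box: Box, img_w: int, img_h: int) -> Box:
    """Translate a box back inside the image without changing its size.

    Assumes the box is no larger than the image.
    """
    dx = max(0, -box.left) - max(0, box.right - img_w)
    dy = max(0, -box.top) - max(0, box.bottom - img_h)
    return Box(box.left + dx, box.top + dy, box.right + dx, box.bottom + dy)


def shift_source(good: Box, dx: int, dy: int, img_w: int, img_h: int) -> Box:
    """Hypothetical shift sample: a same-size crop displaced from the good one."""
    moved = Box(good.left + dx, good.top + dy, good.right + dx, good.bottom + dy)
    return clamp_box(moved, img_w, img_h)


def zoom_source(good: Box, scale: float, img_w: int, img_h: int) -> Box:
    """Hypothetical zoom-in sample: a wider crop that contains the good one."""
    if scale < 1.0:
        raise ValueError("scale must be >= 1 so the source contains the good crop")
    pad_x = round(good.width * (scale - 1.0) / 2)
    pad_y = round(good.height * (scale - 1.0) / 2)
    # Clip to the image; the result may be less than `scale` times larger.
    return Box(
        max(0, good.left - pad_x),
        max(0, good.top - pad_y),
        min(img_w, good.right + pad_x),
        min(img_h, good.bottom + pad_y),
    )
```

In both cases the well-composed crop stays recoverable from the source: by moving the window back for a shift pair, and by cropping inward for a zoom-in pair. The view-change level has no such geometric shortcut, which is why it needs the learned degradation model.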

If this is right

Textual instructions alone are sufficient to improve composition decisions in a multi-modal model.
Pairing the same instructions with an illustrative example image produces consistent gains over an example-only baseline.
The trained model can be used directly as a practical composition assistant for everyday phone photography.
Expert photographic priors can be made accessible without requiring users to study formal rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Such guidance could be embedded inside camera apps to give live suggestions before the shutter is pressed.
The same hierarchical task structure might transfer to other visual decision tasks such as video framing or product photography.
If the synthetic degradation step generalizes, similar pipelines could be used to create training data for other subjective image-quality tasks.

Load-bearing premise

The synthetic degradation model produces poor-composition images that match the distribution of mistakes real users actually make when taking photos.

What would settle it

A user study in which participants are given either the model's text-plus-image guidance or no guidance, then asked to retake the same scene, followed by blind expert ratings of the resulting photographs for composition quality.
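
One way to analyze such a study is a permutation test on the blind expert ratings. The sketch assumes a between-subjects design with one rating per retaken photo. The function name, the rating scale, and any numbers passed to it are hypothetical, and nothing here comes from the paper.

```python
import random


def permutation_test(treated, control, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference of group means.

    `treated` holds ratings for photos retaken with guidance, `control`
    ratings for photos retaken without it. Returns (observed_diff, p_value).
    """
    if not treated or not control:
        raise ValueError("both groups need at least one rating")
    rng = random.Random(seed)
    n_treated = len(treated)

    def mean_diff(values):
        first, second = values[:n_treated], values[n_treated:]
        return sum(first) / len(first) - sum(second) / len(second)

    pooled = list(treated) + list(control)
    observed = mean_diff(pooled)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign ratings to the two groups
        if abs(mean_diff(pooled)) >= abs(observed) - 1e-12:
            hits += 1
    # The add-one correction keeps the estimated p-value above zero.
    return observed, (hits + 1) / (n_resamples + 1)
```

A permutation test makes no normality assumption about the ratings, which suits ordinal expert scores. If the same participants shoot each scene both with and without guidance, a paired test would be the better fit.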

Figures

Figures reproduced from arXiv: 2512.00993 by Chao Dong, He Zhang, Jinjin Gu, Ke Wang, Tianfan Xue, Xin Cai, Zhiyuan You, Zhoutong Zhang.

**Figure 1.** We propose PhotoFramer, a model designed for composition instruction during photo capturing. Given a poorly composed … [PITH_FULL_IMAGE:figures/full_fig_p001_1.png]

**Figure 2.** Task paradigm and data example. Given a poorly composed image, our PhotoFramer is required to generate a … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

**Figure 3.** Dataset construction for the shift and zoom-in tasks. For … [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

**Figure 5.** Dataset construction for the view-change task. (a) Leveraging our composition assessment model in Sec. … [PITH_FULL_IMAGE:figures/full_fig_p005_5.png]

**Figure 6.** PhotoFramer architecture. We adopt Bagel … [PITH_FULL_IMAGE:figures/full_fig_p006_6.png]

**Figure 7.** Qualitative comparison between our PhotoFramer and baseline methods. Open-source editing models fail to improve composi… [PITH_FULL_IMAGE:figures/full_fig_p007_7.png]

**Figure 10.** Text guidance is important for image generation. If we … [PITH_FULL_IMAGE:figures/full_fig_p008_10.png]

**Figure 12.** Finetuned Bagel (i.e., on both textual and visual data) successfully includes the whole wooden house as in the text guidance, outperforming finetuned Kontext (i.e., on visual data only). Panel labels: Original, Result, Text Guidance. Text shown in the figure: "… zooming in slightly to focus more closely on the central figure in the boat … This change eliminates unnecessary background details …" [PITH_FULL_IMAGE:figures/full_fig_p008_12.png]

Original abstract

Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PhotoFramer gives a workable multi-modal setup for composition advice with a clean hierarchical split and synthetic view data, but the experiments are described too thinly to judge the real gains.

This paper describes PhotoFramer, a system that takes a poorly composed photo and outputs both text instructions on how to fix it and a better example image. The authors organize the guidance into shift, zoom-in, and view-change steps, which matches how people actually adjust their framing.

The new part is the two-stage pipeline for view-change data. They pull pairs from multi-view datasets, train a model to degrade good photos into bad ones, and then apply it to create training examples. This lets them scale up data for the harder viewpoint adjustments without needing lots of manual work. They then fine-tune a model that handles both text and image output together. The approach is straightforward and reuses existing cropping and multi-view resources, which is efficient. The abstract says this leads to better results than using exemplars alone.

The main weakness is the evaluation. The claims about consistent improvements and effective steering by text rest on experiments that are not detailed here: no metrics, no error bars, and no explanation of the human study setup. More importantly, there is no check shown that the synthetic bad compositions from the degradation model actually resemble real user framing errors. If they don't, the reported gains could be tied to the artificial data rather than general composition knowledge.

This kind of work would interest researchers developing AI tools for photography or multi-modal image editing. Someone looking for ideas on synthetic data for instructional tasks might pick up the pipeline. I think it deserves peer review. The core idea is clear and applied, so referees can assess the full results and any additional validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed input image, the model outputs natural-language guidance on improving the framing together with a generated well-composed example image. Composition guidance is organized hierarchically into shift, zoom-in, and view-change sub-tasks. Shift and zoom-in pairs are drawn from existing cropping datasets. View-change pairs are synthesized in two stages: pairs with varying viewpoints are sampled from multi-view datasets to train a degradation model that turns well-composed photos into poorly composed ones, and that model is then applied to expert-taken photos. A joint text-and-image model is fine-tuned on the resulting dataset. The abstract states that extensive experiments show textual instructions effectively steer composition and that coupling instructions with exemplars yields consistent improvements over exemplar-only baselines.

Significance. If the central claims hold after the requested validation, the work would offer a practical route toward interactive composition assistants that make photographic priors accessible via both language and visual examples. The hierarchical task decomposition and the two-stage synthetic-data pipeline for viewpoint changes constitute a concrete contribution to dataset construction in this area.

major comments (2)

[Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.
[§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.
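
The distribution check requested in the second major comment could take many forms. As one hedged illustration, the sketch below computes a two-sample Kolmogorov-Smirnov statistic between a framing feature measured on synthetic degradations and the same feature measured on real user photos. The feature (for example, the subject's normalized horizontal offset from the frame center) and the function name are assumptions made here; the paper reports no such analysis.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the two empirical CDFs, a value in
    [0, 1]: 0 means identical empirical distributions, 1 means no overlap.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both CDFs past every value equal to x before comparing.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

A small statistic on several framing features would support the premise that synthetic mistakes resemble real ones, though a one-dimensional check cannot establish that the full image distributions match.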

minor comments (2)

[Abstract] The abstract would be strengthened by the inclusion of at least one key quantitative result (e.g., preference rate or FID) to ground the 'consistent improvements' statement.
[§3] Notation for the degradation model (e.g., input/output domains and loss terms) should be introduced explicitly when first referenced in the method section.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions we plan to incorporate.

Referee: [Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.

Authors: We agree that the current description of the human evaluation in §4 is insufficiently detailed to allow full assessment of the reported gains. The preference study compared exemplar-only baselines against our multi-modal outputs, but the manuscript omitted a complete protocol description and statistical reporting. In the revised manuscript we will expand §4 to include: a precise description of the human-preference protocol (number of participants, image sampling method, question wording, and presentation order); quantitative preference percentages with standard error bars; and statistical significance tests (e.g., binomial or McNemar tests). These additions will be reflected in both §4 and the abstract where appropriate.
Revision: yes

Referee: [§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.

Authors: We acknowledge that explicit validation of the synthetic poor-composition distribution is important. The degradation model was trained on real multi-view pairs drawn from existing datasets, and its outputs were used to create training pairs for the view-change task. While we did not conduct a dedicated human study directly comparing synthetic degradations to real user framing errors, we evaluated the model with quantitative perceptual metrics on held-out multi-view pairs and observed consistent downstream gains. In the revision we will add these quantitative similarity metrics (e.g., LPIPS and SSIM on validation pairs) and additional qualitative examples to §3.2, together with a clearer discussion of the modeling assumptions and their limitations. A full-scale human comparison against real user mistakes would require new paired data collection and is noted as future work.
Revision: partial
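
The statistical reporting promised in the first response can be sketched for a two-alternative preference study. The example assumes each trial is an independent choice between the multi-modal output and the exemplar-only baseline, and tests the win count against chance with an exact two-sided binomial test. `preference_summary` and any counts passed to it are hypothetical; the paper's actual protocol and numbers are not given in this review.

```python
from math import comb, sqrt


def preference_summary(wins, trials):
    """Summarize a two-alternative preference study.

    Returns (preference_rate, standard_error, p_value), where p_value is
    the exact two-sided binomial test against a 50% chance rate.
    """
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    rate = wins / trials
    std_err = sqrt(rate * (1.0 - rate) / trials)
    total = 2 ** trials
    # Probability of each possible win count under the null (p = 0.5).
    probs = [comb(trials, k) / total for k in range(trials + 1)]
    # Two-sided p-value: total mass of outcomes no more likely than the
    # observed one (small tolerance guards against float ties).
    threshold = probs[wins] * (1.0 + 1e-9)
    p_value = min(1.0, sum(p for p in probs if p <= threshold))
    return rate, std_err, p_value
```

The binomial test fits independent trials. If the same raters judge both systems on the same images and the outcome is a per-image pass or fail, the McNemar test named in the response is the paired alternative.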

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's method curates training data from external cropping datasets for shift/zoom-in tasks and uses a separate two-stage pipeline on multi-view datasets to train a degradation model that synthesizes view-change pairs before finetuning the multi-modal model; experiments then report performance gains. No equation, result, or claim reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The pipeline depends on independent external sources and a separately trained component whose outputs are not tautologically equivalent to the target composition-steering claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the synthetic degradation model produces realistic poor compositions and that the generated examples are objectively better; no explicit free parameters or invented physical entities are introduced in the abstract description.

axioms (1)

domain assumption: Human photographers improve composition by shifting, zooming, or changing viewpoint.
Invoked to justify the three sub-task hierarchy in the data curation section.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Extensive experiments demonstrate that textual instructions effectively steer image composition"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
cs.CV · 2026-05 · unverdicted · novelty 6.0
CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

LumiVideo: An Intelligent Agentic System for Video Color Grading
cs.CV · 2026-04 · unverdicted · novelty 6.0
LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · cited by 2 Pith papers · 11 internal anchors

[1]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025. 2, 3, 4, 5, 13, 14

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Stephen Batifol, Andreas Blattmann, Frederic Boesel, Sak- sham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, et al. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Instruction-based image manipulation by watching how things move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, and Zhihao Xia. Instruction-based image manipulation by watching how things move. InCVPR, 2025. 2

work page 2025
[4]

ArtiMuse: Fine-grained image aesthetics as- sessment with joint scoring and expert-level understanding

Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, et al. ArtiMuse: Fine-grained image aesthetics as- sessment with joint scoring and expert-level understanding. arXiv preprint arXiv:2507.14533, 2025. 2

work page arXiv 2025
[5]

Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study

Yi-Ling Chen, Tzu-Wei Huang, Kai-Han Chang, Yu-Chen Tsai, Hwann-Tzong Chen, and Bing-Yu Chen. Quantitative analysis of automatic image cropping algorithms: A dataset and comparative study. InWACV, 2017. 2, 4

work page 2017
[6]

Learning to compose with professional pho- tographs on the web

Yi-Ling Chen, Jan Klopp, Min Sun, Shao-Yi Chien, and Kwan-Liu Ma. Learning to compose with professional pho- tographs on the web. InACM MM, 2017. 2

work page 2017
[7]

Mobile computational photography: A tour.Annual Review of Vision Science, 2021

Mauricio Delbracio, Damien Kelly, Michael S Brown, and Peyman Milanfar. Mobile computational photography: A tour.Annual Review of Vision Science, 2021. 1

work page 2021
[8]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 2, 3, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Unsplash lite dataset 1.3.0, 2020

Unsplash Developers. Unsplash lite dataset 1.3.0, 2020. 2, 5, 16

work page 2020
[10]

Au- tomatic image cropping using visual composition, boundary simplicity and content preservation models

Chen Fang, Zhe Lin, Radomir Mech, and Xiaohui Shen. Au- tomatic image cropping using visual composition, boundary simplicity and content preservation models. InACM MM,

work page
[11]

Intelligent portrait composition as- sistance: Integrating deep-learned models and photography idea retrieval

Farshid Farhat, Mohammad Mahdi Kamani, Sahil Mishra, and James Z Wang. Intelligent portrait composition as- sistance: Integrating deep-learned models and photography idea retrieval. InACM MM Workshops, 2017. 2

work page 2017
[12]

CAPTAIN: Comprehensive composition assistance for photo taking.ACM Transactions on Multimedia Com- puting, Communications, and Applications, 2022

Farshid Farhat, Mohammad Mahdi Kamani, and James Z Wang. CAPTAIN: Comprehensive composition assistance for photo taking.ACM Transactions on Multimedia Com- puting, Communications, and Applications, 2022. 2

work page 2022
[13]

Routledge, 2017

Michael Freeman.The Photographer’s Eye Digitally Remas- tered 10th Anniversary Edition: Composition and Design for Better Digital Photos. Routledge, 2017. 2

work page 2017
[14]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. SEED-X: Multimodal models with unified multi-granularity compre- hension and generation.arXiv preprint arXiv:2404.14396,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Automatic image cropping for vi- sual aesthetic enhancement using deep neural networks and cascaded regression.IEEE TMM, 2018

Guanjun Guo, Hanzi Wang, Chunhua Shen, Yan Yan, and Hong-Yuan Mark Liao. Automatic image cropping for vi- sual aesthetic enhancement using deep neural networks and cascaded regression.IEEE TMM, 2018. 2

work page 2018
[16]

Composing photos like a photographer

Chaoyi Hong, Shuaiyuan Du, Ke Xian, Hao Lu, Zhiguo Cao, and Weicai Zhong. Composing photos like a photographer. InCVPR, 2021. 2

work page 2021
[17]

Learning subject-aware cropping by out- painting professional photos

James Hong, Lu Yuan, Micha ¨el Gharbi, Matthew Fisher, and Kayvon Fatahalian. Learning subject-aware cropping by out- painting professional photos. InAAAI, 2024. 2

work page 2024
[18]

Learned smartphone ISP on mo- bile GPUs, mobile AI 2025 challenge: Report

Andrey Ignatov, Georgii Perevozchikov, Radu Timofte, Cheng Li, Lian Liu, Jun Cao, Heng Sun, Wu Pan, Song Wang, KeQiang Yu, et al. Learned smartphone ISP on mo- bile GPUs, mobile AI 2025 challenge: Report. InCVPR Workshops, 2025. 1

work page 2025
[19]

Decoupled weight decay regularization

Loshchilov Ilya and Hutter Frank. Decoupled weight decay regularization. InICLR, 2019. 6, 14

work page 2019
[20]

Re- thinking image cropping: Exploring diverse compositions from global views

Gengyun Jia, Huaibo Huang, Chaoyou Fu, and Ran He. Re- thinking image cropping: Exploring diverse compositions from global views. InCVPR, 2022. 2

work page 2022
[21]

PIPAL: A large-scale image quality assessment dataset for perceptual image restoration

Gu Jinjin, Cai Haoming, Chen Haoyu, Ye Xiaoxing, Jimmy S Ren, and Dong Chao. PIPAL: A large-scale image quality assessment dataset for perceptual image restoration. InECCV, 2020. 6

work page 2020
[22]

Adam: A method for stochastic opti- mization

Diederik P Kingma. Adam: A method for stochastic opti- mization. InICLR, 2015. 12

work page 2015
[23]

GPT-4V(ision) system card, 2024

Black Forest Labs. GPT-4V(ision) system card, 2024. 6

work page 2024
[24]

Routledge, 2013

Michael Langford.Basic photography. Routledge, 2013. 2

work page 2013
[25]

Photographic composition classification and dominant geo- metric element detection for outdoor scenes.Journal of Vi- sual Communication and Image Representation, 2018

Jun-Tae Lee, Han-Ul Kim, Chul Lee, and Chang-Su Kim. Photographic composition classification and dominant geo- metric element detection for outdoor scenes.Journal of Vi- sual Communication and Image Representation, 2018. 2, 5, 12

work page 2018
[26]

A2- RL: Aesthetics aware reinforcement learning for image crop- ping

Debang Li, Huikai Wu, Junge Zhang, and Kaiqi Huang. A2- RL: Aesthetics aware reinforcement learning for image crop- ping. InCVPR, 2018. 2

work page 2018
[27]

Learning to learn cropping models for different aspect ratio require- ments

Debang Li, Junge Zhang, and Kaiqi Huang. Learning to learn cropping models for different aspect ratio require- ments. InCVPR, 2020. 2

work page 2020
[28]

Composing good shots by exploiting mutual relations

Debang Li, Junge Zhang, Kaiqi Huang, and Ming-Hsuan Yang. Composing good shots by exploiting mutual relations. InCVPR, 2020. 2 9

work page 2020
[29]

Towards smart point-and-shoot photography

Jiawan Li, Fei Zhou, Zhipeng Zhong, Jiongzhi Lin, and Guoping Qiu. Towards smart point-and-shoot photography. InCVPR, 2025. 2

work page 2025
[30]

Q-Insight: Understanding image quality via visual reinforcement learning

Weiqi Li, Xuanyu Zhang, Shijie Zhao, Yabin Zhang, Junlin Li, Li Zhang, and Jian Zhang. Q-Insight: Understanding image quality via visual reinforcement learning. InNeurIPS,

work page
[31]

Harnessing diffusion-yielded score priors for image restoration.ACM TOG, 2025

Xinqi Lin, Fanghua Yu, Jinfan Hu, Zhiyuan You, Wu Shi, Jimmy S Ren, Jinjin Gu, and Chao Dong. Harnessing diffusion-yielded score priors for image restoration.ACM TOG, 2025. 4, 14

work page 2025
[32]

DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. DL3DV-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024. 2, 5, 14

work page 2024
[33]

Unlock- ing the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization.arXiv preprint arXiv:2509.21871, 2025

Boyang Liu, Yifan Hu, Senjie Jin, Shihan Dou, Gonglei Shi, Jie Shao, Tao Gui, and Xuanjing Huang. Unlock- ing the essence of beauty: Advanced aesthetic reasoning with relative-absolute policy optimization.arXiv preprint arXiv:2509.21871, 2025. 2

work page arXiv 2025
[34]

On the limited memory bfgs method for large scale optimization.Mathematical Program- ming, 1989

Dong C Liu and Jorge Nocedal. On the limited memory bfgs method for large scale optimization.Mathematical Program- ming, 1989. 12

work page 1989
[35]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InNeurIPS, 2024. 2

work page 2024
[36]

Beyond image borders: Learn- ing feature extrapolation for unbounded image composition

Xiaoyu Liu, Ming Liu, Junyi Li, Shuai Liu, Xiaotao Wang, Lei Lei, and Wangmeng Zuo. Beyond image borders: Learn- ing feature extrapolation for unbounded image composition. InICCV, 2023. 2

work page 2023
[37]

Image and video processing on mobile devices: a survey.The Visual Computer, 2021

Chamin Morikawa, Michihiro Kobayashi, Masaki Satoh, Ya- suhiro Kuroda, Teppei Inomata, Hitoshi Matsuo, Takeshi Miura, and Masaki Hilaga. Image and video processing on mobile devices: a survey.The Visual Computer, 2021. 1

work page 2021
[38]

A V A: A large-scale database for aesthetic visual analysis

Naila Murray, Luca Marchesotti, and Florent Perronnin. A V A: A large-scale database for aesthetic visual analysis. In CVPR, 2012. 4, 5, 12

work page 2012
[39]

GPT-4V(ision) system card, 2023

OpenAI. GPT-4V(ision) system card, 2023. 2

work page 2023
[40]

GPT-4o System Card

OpenAI. GPT-4o system card.arXiv preprint arXiv:2410.21276, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[41]

GPT-5 system card, 2025

OpenAI. GPT-5 system card, 2025. 6, 7

work page 2025
[42]

DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervi- sion.Transactions on Machine Learning Research, 2024. 4

work page 2024
[43]

U2-Net: Go- ing deeper with nested u-structure for salient object detec- tion.PR, 2020

Xuebin Qin, Zichen Zhang, Chenyang Huang, Masood De- hghan, Osmar R Zaiane, and Martin Jagersand. U2-Net: Go- ing deeper with nested u-structure for salient object detec- tion.PR, 2020. 4

work page 2020
[44]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InICML, 2021. 4

work page 2021
[45]

High-resolution image syn- thesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj¨orn Ommer. High-resolution image syn- thesis with latent diffusion models. InCVPR, 2022. 2

work page 2022
[46]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. DeepseekMath: Pushing the limits of math- ematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. 5, 13

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Spatial-semantic collaborative cropping for user generated content

Yukun Su, Yiwen Cao, Jingliang Deng, Fengyun Rao, and Qingyao Wu. Spatial-semantic collaborative cropping for user generated content. InAAAI, 2024. 2

work page 2024
[48]

Emu: Generative pretraining in multimodality

Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. InICLR, 2024. 3

work page 2024
[49]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024. 2, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

DXOMARK - quality testing, scores and reviews, 2025

DXOMARK Team. DXOMARK - quality testing, scores and reviews, 2025. 1

work page 2025
[51]

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. MetaMorph: Multimodal understanding and generation via instruction tuning.arXiv preprint arXiv:2412.14164, 2024. 3

work page internal anchor Pith review arXiv 2024
[52] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.

[53] Yi Tu, Li Niu, Weijie Zhao, Dawei Cheng, and Liqing Zhang. Image cropping with composition and saliency aware aesthetic score map. In AAAI, 2020.

[54] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025.

[55] Wenguan Wang and Jianbing Shen. Deep cropping via attention box prediction and aesthetics assessment. In ICCV,

[56] Zijun Wei, Jianming Zhang, Xiaohui Shen, Zhe Lin, Radomir Mech, Minh Hoai, and Dimitris Samaras. Good view hunting: Learning photo composition from dense view pairs. In CVPR, 2018.

[57] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In CVPR, 2025.

[58] Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-Image technical report. arXiv preprint arXiv:2508.02324, 2025.

[59] Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. In ICML,

[60] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM. In ICML,

[61] Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, and Kede Ma. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. In NeurIPS, 2025.

[62] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. In ICLR, 2025.

[63] Jianzhou Yan, Stephen Lin, Sing Bing Kang, and Xiaoou Tang. Learning the change for automatic image cropping. In CVPR, 2013.

[64] Guo-Ye Yang, Wen-Yang Zhou, Yun Cai, Song-Hai Zhang, and Fang-Lue Zhang. Focusing on your subject: Deep subject-aware image composition recommendation networks. Computational Visual Media, 2023.

[65] Zhiyuan You, Jinjin Gu, Zheyuan Li, Xin Cai, Kaiwen Zhu, Chao Dong, and Tianfan Xue. Descriptive image quality assessment in the wild. arXiv preprint arXiv:2405.18842,

[66] Zhiyuan You, Zheyuan Li, Jinjin Gu, Zhenfei Yin, Tianfan Xue, and Chao Dong. Depicting beyond scores: Advancing image quality assessment through multi-modal language models. In ECCV, 2024.

[67] Zhiyuan You, Xin Cai, Jinjin Gu, Tianfan Xue, and Chao Dong. Teaching large language models to regress accurate image quality scores using score distribution. In CVPR,

[68] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2014.

[69] Hui Zeng, Lida Li, Zisheng Cao, and Lei Zhang. Reliable and efficient image cropping: A grid anchor based approach. In CVPR, 2019.

[70] Hui Zeng, Lida Li, Zisheng Cao, and Lei Zhang. Grid anchor based image cropping: A new benchmark and an efficient model. IEEE TPAMI, 2020.

[71] Bo Zhang, Li Niu, and Liqing Zhang. Image composition assessment with saliency-augmented multi-pattern pooling. In BMVC, 2021.

[72] Bo Zhang, Li Niu, Xing Zhao, and Liqing Zhang. Human-centric image cropping with partition-aware and content-preserving features. In ECCV, 2022.

[73] Ke Zhang, Tianyu Ding, Jiachen Jiang, Tianyi Chen, Ilya Zharkov, Vishal M Patel, and Luming Liang. ProCrop: Learning aesthetic image cropping from professional compositions. arXiv preprint arXiv:2505.22490, 2025.

[74] Xiaoyan Zhang, Zhuopeng Li, Martin Constable, Kap Luk Chan, Zhenhua Tang, and Gaoyang Tang. Pose-based composition improvement for portrait photographs. IEEE TCSVT, 2018.

[75] Zhaoran Zhao, Peng Lu, Anran Zhang, Peipei Li, Xia Li, Xuannan Liu, Yang Hu, Shiyi Chen, Liwei Wang, and Wenhao Guo. Can machines understand composition? Dataset and benchmark for photographic image composition embedding and understanding. In CVPR, 2025.

[76] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. In ICLR, 2024.

[77] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.

Appendix A. More Results

We have added more qualitative results in Figs. A9 to A11, including t...

