PhotoFramer: Multi-modal Image Composition Instruction
Pith reviewed 2026-05-17 02:43 UTC · model grok-4.3
The pith
A model supplies natural-language advice and corrected example images to fix poorly composed photos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given a poorly composed input image, PhotoFramer first produces natural-language instructions describing how to improve the composition and then generates a well-composed example image; the model is trained on a hierarchical dataset whose three levels are shift, zoom-in, and view-change, with the last level synthesized by a degradation model applied to expert photographs.
What carries the argument
The hierarchical decomposition of composition guidance into shift, zoom-in, and view-change tasks, together with a two-stage synthetic pipeline that first learns to degrade good photos and then applies the learned degradation to create training pairs.
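A minimal sketch of how the first two levels of the hierarchy could be told apart from crop geometry alone. It assumes each pair is described by two axis-aligned boxes in pixel coordinates (the current framing and a better one). The `Box` type, the `classify_edit` and `instruction` helpers, the 5% tolerance, and the wording of the hints are all hypothetical and not taken from the paper. A view-change cannot be expressed as a box edit, which is consistent with the paper handling that level through a separate pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned frame in pixel coordinates (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def center(self) -> tuple:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def classify_edit(current: Box, target: Box, tol: float = 0.05) -> str:
    """Label the move from `current` to `target` (hypothetical rule)."""
    area_ratio = (target.width * target.height) / (current.width * current.height)
    if area_ratio < 1.0 - tol:
        return "zoom-in"  # the better frame covers clearly less area
    dx = target.center[0] - current.center[0]
    dy = target.center[1] - current.center[1]
    if abs(dx) > tol * current.width or abs(dy) > tol * current.height:
        return "shift"  # same size, but the frame has moved
    return "keep"


def instruction(current: Box, target: Box, tol: float = 0.05) -> str:
    """Turn the geometric label into a short text hint (illustrative wording)."""
    kind = classify_edit(current, target, tol)
    if kind == "zoom-in":
        return "Zoom in on the subject."
    if kind == "shift":
        dx = target.center[0] - current.center[0]
        dy = target.center[1] - current.center[1]
        moves = []
        if abs(dx) > tol * current.width:
            moves.append("right" if dx > 0 else "left")
        if abs(dy) > tol * current.height:
            moves.append("down" if dy > 0 else "up")
        return "Shift the frame " + " and ".join(moves) + "."
    return "Keep the current framing."
```

For example, `instruction(Box(0, 0, 100, 100), Box(20, 0, 120, 100))` yields a shift hint, while a target box nested well inside the current one yields a zoom-in hint.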
If this is right
- Textual instructions alone are sufficient to improve composition decisions in a multi-modal model.
- Pairing the same instructions with an illustrative example image produces consistent gains over an example-only baseline.
- The trained model can be used directly as a practical composition assistant for everyday phone photography.
- Expert photographic priors can be made accessible without requiring users to study formal rules.
Where Pith is reading between the lines
- Such guidance could be embedded inside camera apps to give live suggestions before the shutter is pressed.
- The same hierarchical task structure might transfer to other visual decision tasks such as video framing or product photography.
- If the synthetic degradation step generalizes, similar pipelines could be used to create training data for other subjective image-quality tasks.
Load-bearing premise
The synthetic degradation model produces poor-composition images that match the distribution of mistakes real users actually make when taking photos.
What would settle it
A user study in which participants are given either the model's text-plus-image guidance or no guidance, then asked to retake the same scene, followed by blind expert ratings of the resulting photographs for composition quality.
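One way the blind expert ratings from such a study might be compared is a two-sample permutation test on the mean rating. The sketch below assumes two independent groups of participants (guided and unguided) and one numeric composition score per photograph; the function name, the number of permutations, and the fixed seed are illustrative choices, and no real study data is implied.

```python
import random
from statistics import mean


def permutation_test(guided, unguided, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean ratings.

    Returns (observed_difference, p_value). Group labels are shuffled
    n_perm times; the p-value is the share of shuffles whose absolute
    difference is at least as large as the observed one (with the usual
    +1 correction so the estimate is never exactly zero).
    """
    rng = random.Random(seed)
    observed = mean(guided) - mean(unguided)
    pooled = list(guided) + list(unguided)
    n_guided = len(guided)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_guided]) - mean(pooled[n_guided:])
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)
```

A permutation test is used here because expert ratings are ordinal and often few in number, so it avoids assuming a normal distribution; other designs (for example, paired ratings of the same scene) would call for a different test.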
Original abstract
Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this degradation model to expert-taken photos to synthesize poor images to form training pairs. Using this dataset, we finetune a model that jointly processes and generates both text and images, enabling actionable textual guidance with illustrative examples. Extensive experiments demonstrate that textual instructions effectively steer image composition, and coupling them with exemplars yields consistent improvements over exemplar-only baselines. PhotoFramer offers a practical step toward composition assistants that make expert photographic priors accessible to everyday users.
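The data flow of the two-stage view-change pipeline described in the abstract can be sketched as follows. This is only a schematic under stated assumptions: images are represented by file-name strings, `fit_degradation` stands in for training a real image-to-image model (it learns nothing here), and every name in the block (`TrainingPair`, `fit_degradation`, `build_view_change_pairs`, the file names) is hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Image = str  # stand-in for real image data so the sketch stays runnable


@dataclass(frozen=True)
class TrainingPair:
    poor: Image  # poorly composed input
    good: Image  # well-composed target
    task: str    # "shift", "zoom-in", or "view-change"


def fit_degradation(
    multi_view_pairs: Sequence[Tuple[Image, Image]],
) -> Callable[[Image], Image]:
    """Stage 1 (placeholder): learn a good -> poor mapping from multi-view pairs.

    A real implementation would train a generative model on the
    (well-composed, poorly composed) pairs; this stub only tags the input.
    """
    if not multi_view_pairs:
        raise ValueError("need at least one (good, poor) pair")

    def degrade(good: Image) -> Image:
        return f"degraded({good})"

    return degrade


def build_view_change_pairs(
    expert_photos: Sequence[Image],
    degrade: Callable[[Image], Image],
) -> List[TrainingPair]:
    """Stage 2: apply the degradation to expert photos to form training pairs."""
    return [
        TrainingPair(poor=degrade(photo), good=photo, task="view-change")
        for photo in expert_photos
    ]


degrade = fit_degradation([("scene1_good_view.jpg", "scene1_poor_view.jpg")])
pairs = build_view_change_pairs(["expert_001.jpg", "expert_002.jpg"], degrade)
```

The point of the structure is that the expert photo is always the target and the synthesized image is always the input, so the fine-tuned model learns the poor-to-good direction even though the degradation model was trained in the opposite direction.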
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed input image, the model outputs natural-language guidance on improving the framing together with a generated well-composed example image. Composition guidance is organized hierarchically into shift, zoom-in, and view-change subtasks. Shift and zoom-in pairs are drawn from existing cropping datasets. View-change pairs are synthesized in two stages: a degradation model is trained on pairs sampled from multi-view datasets to turn well-composed photos into poorly composed ones, and it is then applied to expert-taken photos. A joint text-and-image model is then fine-tuned on the resulting dataset. The abstract states that extensive experiments show textual instructions effectively steer composition and that coupling instructions with exemplars yields consistent improvements over exemplar-only baselines.
Significance. If the central claims hold after the requested validation, the work would offer a practical route toward interactive composition assistants that make photographic priors accessible via both language and visual examples. The hierarchical task decomposition and the two-stage synthetic-data pipeline for viewpoint changes constitute a concrete contribution to dataset construction in this area.
major comments (2)
- [Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.
- [§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.
minor comments (2)
- [Abstract] The abstract would be strengthened by the inclusion of at least one key quantitative result (e.g., preference rate or FID) to ground the 'consistent improvements' statement.
- [§3] Notation for the degradation model (e.g., input/output domains and loss terms) should be introduced explicitly when first referenced in the method section.
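The error bars and significance tests requested in the major comments could be computed along the following lines. This is a sketch, not the paper's protocol: it assumes each rater makes a forced choice between two outputs, and the function names are illustrative. It shows a preference rate with its standard error, an exact two-sided binomial test against a 50% null, and an exact McNemar test, which reduces to a binomial test on the discordant pairs.

```python
from math import comb


def preference_rate_with_se(wins: int, n: int) -> tuple:
    """Preference rate and its binomial standard error sqrt(p(1-p)/n)."""
    rate = wins / n
    se = (rate * (1.0 - rate) / n) ** 0.5
    return rate, se


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k under the null success probability p.
    """
    probs = [comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    total = sum(q for q in probs if q <= observed * (1.0 + 1e-9))
    return min(1.0, total)


def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact McNemar test for paired binary outcomes.

    b and c are the two discordant counts (A succeeds where B fails, and
    the reverse); concordant pairs carry no information about the difference.
    """
    n = b + c
    if n == 0:
        return 1.0
    return binomial_two_sided_p(min(b, c), n, 0.5)
```

The binomial test fits a single forced-choice comparison per trial, while McNemar fits the case where both systems are scored pass/fail on the same items.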
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below and indicate the revisions we plan to incorporate.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The claim that 'coupling them with exemplars yields consistent improvements over exemplar-only baselines' is presented without any quantitative metrics, error bars, statistical tests, or description of the human-preference protocol. Because this comparison is the primary empirical support for the practical utility of the multi-modal output, the absence of these details makes it impossible to assess the magnitude or reliability of the reported gains.
  Authors: We agree that the current description of the human evaluation in §4 is insufficiently detailed to allow full assessment of the reported gains. The preference study compared exemplar-only baselines against our multi-modal outputs, but the manuscript omitted a complete protocol description and statistical reporting. In the revised manuscript we will expand §4 to include: a precise description of the human-preference protocol (number of participants, image sampling method, question wording, and presentation order); quantitative preference percentages with standard error bars; and statistical significance tests (e.g., binomial or McNemar tests). These additions will be reflected in both §4 and the abstract where appropriate.
  Revision: yes
- Referee: [§3.2 (View-change data generation pipeline)] The degradation model is trained on multi-view pairs to produce synthetic poor compositions, yet no human study, quantitative similarity metric, or comparison against real user framing errors is reported to confirm that the synthetic distribution matches actual user mistakes or that the paired outputs are verifiably superior. This assumption is load-bearing for the view-change component of the hierarchy and therefore for the overall training signal.
  Authors: We acknowledge that explicit validation of the synthetic poor-composition distribution is important. The degradation model was trained on real multi-view pairs drawn from existing datasets, and its outputs were used to create training pairs for the view-change task. While we did not conduct a dedicated human study directly comparing synthetic degradations to real user framing errors, we evaluated the model with quantitative perceptual metrics on held-out multi-view pairs and observed consistent downstream gains. In the revision we will add these quantitative similarity metrics (e.g., LPIPS and SSIM on validation pairs) and additional qualitative examples to §3.2, together with a clearer discussion of the modeling assumptions and their limitations. A full-scale human comparison against real user mistakes would require new paired data collection and is noted as future work.
  Revision: partial
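Of the two similarity metrics named in the response, SSIM has a closed form that is easy to illustrate. The sketch below is a simplified, single-window ("global") variant computed over flattened grayscale pixel lists; the standard metric averages the same expression over local windows, and the function name and the 8-bit `data_range` default are assumptions made here. LPIPS depends on a pretrained network and is not sketched.

```python
def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM between two equal-length grayscale pixel lists.

    Uses the usual constants C1 = (0.01 * L)^2 and C2 = (0.03 * L)^2,
    where L is the dynamic range of the pixel values.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical inputs score 1, and structurally inverted inputs score below zero. Note that a pixel-level score of this kind measures how close a synthetic degradation is to a held-out poor view, not whether it resembles mistakes real users make, which is the referee's underlying concern.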
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper's method curates training data from external cropping datasets for shift/zoom-in tasks and uses a separate two-stage pipeline on multi-view datasets to train a degradation model that synthesizes view-change pairs before finetuning the multi-modal model; experiments then report performance gains. No equation, result, or claim reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The pipeline depends on independent external sources and a separately trained component whose outputs are not tautologically equivalent to the target composition-steering claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Human photographers improve composition by shifting, zooming, or changing viewpoint.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "Extensive experiments demonstrate that textual instructions effectively steer image composition"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
  CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.
- LumiVideo: An Intelligent Agentic System for Video Color Grading
  LumiVideo deploys an LLM-based agent with RAG and Tree of Thoughts to generate ASC-CDL parameters and 3D LUTs for automatic cinematic color grading from raw log video, approaching expert quality.