Semantically Stable Image Composition Analysis via Saliency and Gradient Vector Flow Fusion

Armin Dadras; Franziska Proksa; Markus Seidl; Robert Sablatnig

arXiv: 2604.16500 · v1 · submitted 2026-04-14 · 💻 cs.CV

Pith reviewed 2026-05-10 15:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords photographic composition · image composition analysis · saliency · gradient vector flow · VFCNet · DINOv3 · attention integration · semantic stability

The pith

Fusing saliency and gradient vector flow yields semantically stable features for assessing photographic composition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses how to judge photographic composition without depending on the recognition of specific objects or scenes. It assumes that composition can be read as the flow of visual attention across the geometric layout of edges and shapes. The proposed VFCNet encodes this by fusing saliency and edge information into gradient vector flow fields, combining the two streams with attention, and extracting multi-scale features with a DINOv3 backbone. On the PICD benchmark the authors report CDA-1 of 0.683 and CDA-2 of 0.629, improvements of 33.1% and 36.1% over the previous best method.

Core claim

VFCNet computes dual-stream gradient vector flow representations by fusing saliency and edge information, integrates these streams through an attention mechanism, and extracts multi-scale flow features using a DINOv3 backbone. This produces a low-level representation of composition that is robust to semantic content variations.
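To make the gradient vector flow (GVF) idea concrete, here is a minimal NumPy sketch of the classical iterative GVF computation: the gradient of a scalar map is diffused outward so that flat regions also receive a flow direction. It is an illustration only, not the authors' implementation. The function names, the parameters `mu`, `iterations`, and `dt`, the synthetic image, the synthetic Gaussian "saliency" map, and the choice to compute one flow field per input map are all assumptions made for this example; the page does not describe how VFCNet builds or fuses its two streams.

```python
import numpy as np


def laplacian(a):
    """5-point Laplacian with replicated (edge) borders."""
    p = np.pad(a, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * a


def edge_map(image):
    """Gradient magnitude scaled to [0, 1] (a simple stand-in edge detector)."""
    gy, gx = np.gradient(image)
    mag = np.hypot(gx, gy)
    peak = mag.max()
    return mag / peak if peak > 0 else mag


def gradient_vector_flow(f, mu=0.2, iterations=200, dt=0.5):
    """Diffuse the gradient of a scalar map f (values in [0, 1]) into a field (u, v)."""
    fy, fx = np.gradient(f)        # axis 0 is y (rows), axis 1 is x (columns)
    mag2 = fx ** 2 + fy ** 2       # data-term weight: large only near structure
    u, v = fx.copy(), fy.copy()
    for _ in range(iterations):
        # Explicit Euler step: smooth where mag2 is small,
        # stay close to (fx, fy) where mag2 is large.
        u = u + dt * (mu * laplacian(u) - mag2 * (u - fx))
        v = v + dt * (mu * laplacian(v) - mag2 * (v - fy))
    return u, v


def divergence(u, v):
    """du/dx + dv/dy: negative where flow converges, positive where it spreads."""
    return np.gradient(u, axis=1) + np.gradient(v, axis=0)


# Hypothetical inputs: a bright square and a Gaussian blob standing in for saliency.
image = np.zeros((32, 32))
image[12:20, 12:20] = 1.0
yy, xx = np.mgrid[0:32, 0:32]
saliency = np.exp(-((xx - 16) ** 2 + (yy - 16) ** 2) / (2 * 6.0 ** 2))

edge_flow = gradient_vector_flow(edge_map(image))    # stream 1 (assumed)
saliency_flow = gradient_vector_flow(saliency)       # stream 2 (assumed)
edge_divergence = divergence(*edge_flow)
```

With unit pixel spacing, the explicit update is only stable for small steps; the defaults above keep `dt * (8 * mu + max(mag2))` below 2 for maps in [0, 1]. A larger `mu` spreads the flow further from edges at the cost of blurring fine structure.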

What carries the argument

VFCNet, which fuses saliency and edge-derived information into gradient vector flow fields for attention-based integration and multi-scale feature extraction.

Load-bearing premise

That composition quality corresponds to the flow of visual attention across geometric structure in the image.

What would settle it

Finding a collection of images with matched geometric layouts but differing semantics where the model's composition scores do not align with human ratings of quality.
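A check of this kind could be scored with a rank correlation between model scores and human ratings on the matched-layout set. The sketch below computes Spearman's rank correlation in plain Python. The scores and ratings are invented placeholders, and the choice of Spearman correlation is an assumption of this example rather than a protocol taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


# Hypothetical data: model composition scores and mean human ratings
# for images that share a geometric layout but differ in semantics.
model_scores = [0.2, 0.5, 0.9, 0.7]
human_ratings = [1, 2, 4, 3]
rho = spearman(model_scores, human_ratings)
```

A correlation near zero or negative on such a set would count against the load-bearing premise; a high correlation would be consistent with it but would not establish it.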

Figures

Figures reproduced from arXiv: 2604.16500 by Armin Dadras, Franziska Proksa, Markus Seidl, Robert Sablatnig.

**Figure 2.** VFCNet input representation: saliency captures attention distribution; …

**Figure 3.** Multi-scale GVF features: divergence captures convergence/divergence of …

**Figure 4.** Per-class CDA-2 comparison: VFCNet vs DINOv3+C.

Original abstract

The reliable computational assessment of photographic composition requires features that are discriminative of spatial layout yet robust to semantic content. This paper proposes a low-level representation grounded in the assumption that composition can be understood as the flow of visual attention across geometric structure. We introduce VFCNet, which fuses saliency and edge information into a gradient vector flow (GVF) field. The model computes dual-stream GVF representations, integrates them via attention, and extracts multi-scale flow features with a DINOv3 backbone. VFCNet achieves state-of-the-art performance on the PICD benchmark (CDA-1: 0.683, CDA-2: 0.629), improving by 33.1% and 36.1% over the previous best method. We also show that a simple classifier on self-supervised DINOv3 features substantially outperforms more sophisticated, composition-specialized models. Code is available at https://github.com/ADadras/VFCNet
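The previous best scores are not stated on this page, but they can be roughly back-computed from the reported numbers. The short sketch below does that arithmetic, assuming that "improving by X%" means a relative gain, (new − old) / old, which matches the "relative gains" wording used in the referee summary. The function name is hypothetical, and the results are approximate because the published figures are rounded.

```python
def implied_baseline(new_score, relative_gain_percent):
    """Solve new = old * (1 + gain/100) for old."""
    return new_score / (1.0 + relative_gain_percent / 100.0)


# Reported in the abstract: CDA-1 0.683 (+33.1%), CDA-2 0.629 (+36.1%).
prior_cda1 = implied_baseline(0.683, 33.1)  # about 0.513
prior_cda2 = implied_baseline(0.629, 36.1)  # about 0.462

print(f"implied previous best CDA-1: {prior_cda1:.3f}")
print(f"implied previous best CDA-2: {prior_cda2:.3f}")
```

Under that reading, the absolute gains are roughly 0.17 on both metrics.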

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VFCNet claims big gains on PICD via saliency-GVF fusion with DINOv3, but the fusion's contribution over plain DINOv3 needs ablations to hold up.

This paper introduces VFCNet for the computational assessment of photographic composition. It fuses saliency maps and edge-based gradient vector flow fields through attention, then passes the fused representation to a DINOv3 backbone for multi-scale feature extraction. The model reaches CDA-1 of 0.683 and CDA-2 of 0.629 on the PICD benchmark, with reported relative gains of 33% and 36% over the prior best. It also observes that a simple classifier on self-supervised DINOv3 features already beats earlier composition-specific models. The code release is a practical plus for anyone who wants to inspect or reuse the implementation.

Referee Report

2 major / 0 minor

Summary. The paper proposes VFCNet for computational assessment of photographic composition. It grounds the approach in the assumption that composition is the flow of visual attention across geometric structure, introducing a dual-stream architecture that fuses saliency maps with gradient vector flow (GVF) fields computed from edge information. These are integrated via attention and fed to a DINOv3 backbone for multi-scale feature extraction. The central claims are state-of-the-art results on the PICD benchmark (CDA-1: 0.683, CDA-2: 0.629, representing 33.1% and 36.1% relative gains over the prior best) together with the observation that a simple linear classifier on self-supervised DINOv3 features already outperforms prior composition-specialized models. Code is released at the provided GitHub link.

Significance. If the performance claims can be substantiated with complete experimental protocols and ablations, the work would be significant for demonstrating that low-level geometric priors (saliency + GVF) can yield semantically stable composition features when combined with strong self-supervised backbones. The secondary finding that plain DINOv3 features are already highly effective would usefully shift emphasis away from hand-crafted composition models toward leveraging large-scale pretraining. Public code release is a clear strength supporting reproducibility.

major comments (2)

Abstract: The reported benchmark scores (CDA-1: 0.683, CDA-2: 0.629 with 33.1% and 36.1% relative improvements) are presented without any details on training procedure, data splits, baseline implementations, statistical tests, or ablation studies. This absence directly undermines evaluation of the central SOTA claim.
Abstract: The manuscript simultaneously asserts that a simple DINOv3 classifier substantially outperforms prior composition-specialized models, yet attributes VFCNet's gains to the saliency-GVF fusion and attention streams. No ablation is described that strips the GVF/attention components while retaining the identical DINOv3 extractor and evaluates on the same PICD splits; without this comparison the load-bearing assumption that the proposed flow representation supplies additional semantic-robust features cannot be assessed.
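As an illustration of the kind of statistical comparison these comments ask for, the sketch below runs a paired bootstrap over per-item outcomes of two models evaluated on the same test items, for example VFCNet and a DINOv3-only baseline. Everything in it is an assumption for the example: the page does not define how CDA-1 or CDA-2 are computed, so the sketch uses generic 0/1 per-item correctness, and the outcome lists are invented placeholders.

```python
import random


def paired_bootstrap(correct_a, correct_b, resamples=2000, seed=0):
    """Mean accuracy difference (A - B) with a 95% percentile bootstrap interval."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    per_item = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(per_item) / n
    samples = []
    for _ in range(resamples):
        # Resample items with replacement, keeping each A/B pair together.
        total = sum(per_item[rng.randrange(n)] for _ in range(n))
        samples.append(total / n)
    samples.sort()
    low = samples[int(0.025 * (resamples - 1))]
    high = samples[int(0.975 * (resamples - 1))]
    return observed, (low, high)


# Hypothetical per-item outcomes (1 = correct, 0 = wrong) on shared test items.
full_model = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
backbone_only = [1, 0, 0, 1, 1, 0, 0, 1, 1, 1]
difference, interval = paired_bootstrap(full_model, backbone_only)
```

If the interval excludes zero, the gain attributed to the added components is unlikely to be resampling noise on that test set. Pairing matters here because both models are scored on identical items.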

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript accordingly to provide greater transparency and direct experimental support for our claims.

Referee: Abstract: The reported benchmark scores (CDA-1: 0.683, CDA-2: 0.629 with 33.1% and 36.1% relative improvements) are presented without any details on training procedure, data splits, baseline implementations, statistical tests, or ablation studies. This absence directly undermines evaluation of the central SOTA claim.

Authors: We acknowledge that the abstract's brevity omits these procedural details. In the revised manuscript we will update the abstract to briefly reference the PICD data splits, the re-implementation of baselines under identical conditions, and the use of statistical testing. Full protocols, including training procedure, baseline details, statistical tests, and ablation studies, are already described in Sections 3 and 4; we will add explicit cross-references from the abstract to these sections.
Revision: yes
Referee: Abstract: The manuscript simultaneously asserts that a simple DINOv3 classifier substantially outperforms prior composition-specialized models, yet attributes VFCNet's gains to the saliency-GVF fusion and attention streams. No ablation is described that strips the GVF/attention components while retaining the identical DINOv3 extractor and evaluates on the same PICD splits; without this comparison the load-bearing assumption that the proposed flow representation supplies additional semantic-robust features cannot be assessed.

Authors: We agree that an explicit ablation isolating the contribution of the saliency-GVF fusion and attention is necessary to substantiate the attribution of gains. In the revised manuscript we will add a new ablation study comparing VFCNet against a DINOv3-only baseline that uses the identical backbone, training protocol, and PICD splits but omits the dual-stream GVF/saliency inputs and attention fusion. This will directly quantify the incremental benefit of the proposed flow representation.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results from trained model with external backbone

The paper presents VFCNet as a trained architecture that fuses saliency and GVF fields, integrates via attention, and extracts features using a self-supervised DINOv3 backbone pretrained externally. Performance numbers (CDA-1 0.683, CDA-2 0.629) are reported as empirical outcomes on the PICD benchmark rather than any derived prediction or mathematical reduction. No equations, self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the abstract or described chain; the modeling assumption that composition equals attention flow across geometry is an explicit ansatz, not a circular redefinition of the output. The additional observation that plain DINOv3 already beats prior models further separates the backbone contribution from any fusion-specific claim, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that composition equals attention flow over geometry and that the proposed fusion produces semantic invariance. No free parameters or invented entities are described in the abstract. The model reuses existing components (saliency, GVF, attention, DINOv3).

axioms (1)

Domain assumption: composition can be understood as the flow of visual attention across geometric structure.
Explicitly stated in the abstract as the grounding assumption for the low-level representation.

