Pith Number

pith:KJOUMWUN

pith:2026:KJOUMWUNJKBWB4OVMXNSUXGMFI
not attested · not anchored · not stored · refs resolved

CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Chao Li, Hao Chen, Jie Yu, Zhitong Dong

A vision-language model crops images to match expert aesthetics by reasoning through scene analysis, composition rules, and preference alignment.

arxiv:2605.12545 v1 · 2026-05-09 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{KJOUMWUNJKBWB4OVMXNSUXGMFI}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs the VLM to think like a professional photographer. It deconstructs a complex and subjective aesthetic problem into an 'analysis-proposal-decision' process, reasoning step by step through the analysis of scene elements and compositional principles. Meanwhile, our expert preference alignment module makes the model's decision consistent with human expert aesthetics.

C2 · weakest assumption

That a VLM can be reliably guided through an analysis-proposal-decision process and aligned via an expert preference module to produce cropping decisions that consistently outperform saliency and retrieval baselines across varied scenes.

C3 · one line summary

CROP uses compositional reasoning and expert preference alignment in VLMs to produce aesthetic crops that match human experts more closely than previous methods.

References

14 extracted · 14 resolved · 6 Pith anchors

[1] Qwen3-VL Technical Report · arXiv:2511.21631
[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale · arXiv:2010.11929
[3] Code-optimise: Self-generated preference data for correctness and efficiency · arXiv:2406.12502
[4] Dc-ae 1.5: Efficient image tokenizer for autoregressive visual generation · arXiv:2501.09012
[5] WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct · arXiv:2308.09583

Formal links

1 machine-checked theorem link

Receipt and verification
First computed 2026-05-18T03:10:02.243785Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

525d465a8d4a8360f1d565db2a5ccc2a22e7bce3be758f8ae979af9468e12311

Aliases

arxiv: 2605.12545 · arxiv_version: 2605.12545v1 · doi: 10.48550/arxiv.2605.12545 · pith_short_12: KJOUMWUNJKBW · pith_short_16: KJOUMWUNJKBWB4OV · pith_short_8: KJOUMWUN
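
The identifiers on this page appear to be related: Base32-encoding the first 16 bytes of the canonical hash (RFC 4648 alphabet, padding stripped) reproduces the 26-character Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. The sketch below checks that relationship for this record. The derivation rule is inferred from the values shown here rather than taken from a published specification, and the function names are illustrative only.

```python
import base64

# Canonical hash as published on this page.
CANONICAL_HASH = "525d465a8d4a8360f1d565db2a5ccc2a22e7bce3be758f8ae979af9468e12311"


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a hex digest, without '=' padding.

    Assumption: this rule is inferred from the values on this page,
    not from an official description of the Pith Number scheme.
    """
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12-, and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


pith = pith_number_from_hash(CANONICAL_HASH)
aliases = short_aliases(pith)
print(pith)
print(aliases)
```

For this record the result agrees with the Pith Number and the three `pith_short_*` aliases listed above.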
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/KJOUMWUNJKBWB4OVMXNSUXGMFI \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 525d465a8d4a8360f1d565db2a5ccc2a22e7bce3be758f8ae979af9468e12311
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "75344f0094e27d42850036b76748cc0e826d8798da475221f301f2fd9ffec282",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-09T10:21:51Z",
    "title_canon_sha256": "ea5b40783cdc39c1e045672bc0b75b9565633522cf0f7ecf26d3949bd09e2153"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.12545",
    "kind": "arxiv",
    "version": 1
  }
}
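
The shell pipeline above can also be reproduced offline against the canonical record as displayed here. The following is a minimal sketch that applies the same serialization the one-liner uses (sorted keys, compact separators, UTF-8, no ASCII escaping) and then hashes the result with SHA-256. It assumes the JSON shown on this page is byte-for-byte what the API returns under `canonical_record`; the helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "75344f0094e27d42850036b76748cc0e826d8798da475221f301f2fd9ffec282",
    "cross_cats_sorted": ["cs.AI"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-09T10:21:51Z",
    "title_canon_sha256": "ea5b40783cdc39c1e045672bc0b75b9565633522cf0f7ecf26d3949bd09e2153"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.12545", "kind": "arxiv", "version": 1}
}
"""

# Hash published on this page, used only for comparison.
EXPECTED = "525d465a8d4a8360f1d565db2a5ccc2a22e7bce3be758f8ae979af9468e12311"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror or client happens to store the fields.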