Pith Number

pith:UI644VF2

pith:2026:UI644VF22ISURCEMQ6H7LVJJU7
not attested · not anchored · not stored · refs pending

EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis

Chao Hui, Haohua Chen, Hao Shi, Honghao Cai, Tianze Zhou, Wei Zhu, Xiangyuan Wang, Xu Tang, Yao Hu, Yibo Chen, Yuling Wu, Yunhao Bai

A two-stage SFT and DPO pipeline aligns vision-language models to cut critical errors in image editing instructions from 47.75% to 23%.

arxiv:2604.08213 v2 · 2026-04-09 · cs.CV · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{UI644VF22ISURCEMQ6H7LVJJU7}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Fine-tuned Qwen3-VL models outperform open-source baselines. The 235B model scores 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). In human evaluation, critical errors fall from 47.75% to 23% and correctness rises from 41.75% to 66%.

C2 · weakest assumption

The three identified failure modes (orientation inconsistency, viewpoint ambiguity, insufficient fine-grained attribute description) are assumed to be the dominant sources of unusable instructions, and the human preference data collected for DPO is assumed to capture them faithfully without introducing new selection biases or annotation artifacts.

C3 · one line summary

EditCaption reduces critical errors in automated image editing instructions from 47.75% to 23% via SFT and DPO, yielding fine-tuned models that match or exceed closed-source VLMs on Eval-400 and ByteMorph-Bench.
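
The human-evaluation figures quoted in C1 are easier to compare when restated as absolute and relative changes. The following is a minimal arithmetic sketch using only the numbers quoted above; the helper name `change` is hypothetical and not part of the paper or of Pith.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change as a fraction)."""
    delta = after - before
    return delta, delta / before


# Human-evaluation figures quoted in C1 (percent of instructions).
critical_errors = change(47.75, 23.0)
correctness = change(41.75, 66.0)

# Eval-400 margin of the 235B model over Gemini-3-Pro, as quoted in C1.
eval400_margin = 4.712 - 4.706

print(f"critical errors: {critical_errors[0]:+.2f} points ({critical_errors[1]:+.1%})")
print(f"correctness:     {correctness[0]:+.2f} points ({correctness[1]:+.1%})")
print(f"Eval-400 margin over Gemini-3-Pro: {eval400_margin:.3f}")
```

Read this way, critical errors drop by 24.75 percentage points (roughly half in relative terms), while the Eval-400 lead over Gemini-3-Pro is 0.006 points.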

Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-26T02:05:09.118007Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33

Aliases

arxiv: 2604.08213 · arxiv_version: 2604.08213v2 · doi: 10.48550/arxiv.2604.08213 · pith_short_12: UI644VF22ISU · pith_short_16: UI644VF22ISURCEM · pith_short_8: UI644VF2
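
The identifiers on this page appear to be related: the Pith Number and its short aliases are consistent with being leading characters of the RFC 4648 Base32 encoding of the canonical hash. This is an observation from this one record, not a rule the page states, so the sketch below treats it as an assumption. The helper `base32_prefix` is a hypothetical name.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33"
PITH_NUMBER = "UI644VF22ISURCEMQ6H7LVJJU7"


def base32_prefix(hex_digest: str, length: int) -> str:
    """Base32-encode the digest bytes (RFC 4648 alphabet) and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# Assumption: the full identifier is the first 26 Base32 characters of the hash.
full = base32_prefix(CANONICAL_HASH, len(PITH_NUMBER))

# Assumption: pith_short_8 / _12 / _16 are shorter prefixes of the same encoding.
aliases = {n: base32_prefix(CANONICAL_HASH, n) for n in (8, 12, 16)}

print(full == PITH_NUMBER)
print(aliases)
```

If the relationship holds in general, a short alias can be checked against a canonical hash without contacting the service. Other records would need to be checked before relying on it.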
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/UI644VF22ISURCEMQ6H7LVJJU7 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "3a2e4252e971168d8233c682f68285742af2df9e05ead7a570e68d4258b92e57",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-04-09T13:11:33Z",
    "title_canon_sha256": "c82fb40e08059541fad887d45a11a27ee4e2310312c3b7193b08b4a06ec074a7"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2604.08213",
    "kind": "arxiv",
    "version": 2
  }
}
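
The hashing step of the verification one-liner can also be run offline against the record listed above. The following is a minimal sketch that mirrors the same serialization settings (sorted keys, compact separators, non-ASCII characters left unescaped, UTF-8 bytes). The function name `canonical_sha256` is hypothetical, and whether the digest equals the published canonical hash depends on the served record being identical to this listing.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Canonical record copied from the JSON listing above.
record = {
    "metadata": {
        "abstract_canon_sha256": "3a2e4252e971168d8233c682f68285742af2df9e05ead7a570e68d4258b92e57",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-04-09T13:11:33Z",
        "title_canon_sha256": "c82fb40e08059541fad887d45a11a27ee4e2310312c3b7193b08b4a06ec074a7",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.08213", "kind": "arxiv", "version": 2},
}

# Canonical hash published on this page.
EXPECTED = "a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33"

digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys before hashing is what makes the digest independent of the order in which a mirror happens to store the fields.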