pith:UI644VF2
EditCaption: Human-Refined SFT and HAE-DPO for Image Editing Instruction Synthesis
A two-stage SFT and DPO pipeline aligns vision-language models to cut critical errors in image editing instructions from 47.75% to 23%.
arxiv:2604.08213 v2 · 2026-04-09 · cs.CV · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{UI644VF22ISURCEMQ6H7LVJJU7}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Fine-tuned Qwen3-VL models outperform open-source baselines; the 235B model reaches 4.712 on Eval-400 (vs. Gemini-3-Pro 4.706, GPT-4.1 4.220, Kimi-K2.5 4.111) and 4.588 on ByteMorph-Bench (vs. Gemini-3-Pro 4.522, GPT-4.1 3.412). Human evaluation shows critical errors falling from 47.75% to 23% and correctness rising from 41.75% to 66%.
Assumption: the three identified failure modes (orientation inconsistency, viewpoint ambiguity, insufficient fine-grained attribute description) are the dominant sources of unusable instructions, and the human preference data collected for DPO faithfully captures them without introducing new selection biases or annotation artifacts.
EditCaption reduces critical errors in automated image editing instructions from 47.75% to 23% via SFT and DPO, yielding fine-tuned models that match or exceed closed-source VLMs on Eval-400 and ByteMorph-Bench.
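The human-evaluation figures quoted above can be restated as absolute and relative changes. The following is a minimal arithmetic sketch using only the percentages given in the claims; the helper name `change` is hypothetical and not part of the paper or of this record.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative


# Percentages as quoted in the claims above.
critical_abs, critical_rel = change(47.75, 23.0)
correct_abs, correct_rel = change(41.75, 66.0)

if __name__ == "__main__":
    print(f"critical errors: {critical_abs:+.2f} points, {critical_rel:+.1f}%")
    # critical errors: -24.75 points, -51.8%
    print(f"correctness:     {correct_abs:+.2f} points, {correct_rel:+.1f}%")
    # correctness:     +24.25 points, +58.1%
```

In other words, the reported critical-error rate roughly halves, while the correctness rate rises by a little more than half of its starting value.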
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-26T02:05:09.118007Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) |
| Schema | pith-number/v1.0 |
Canonical hash
a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/UI644VF22ISURCEMQ6H7LVJJU7 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "3a2e4252e971168d8233c682f68285742af2df9e05ead7a570e68d4258b92e57",
"cross_cats_sorted": [
"cs.AI"
],
"license": "http://creativecommons.org/licenses/by/4.0/",
"primary_cat": "cs.CV",
"submitted_at": "2026-04-09T13:11:33Z",
"title_canon_sha256": "c82fb40e08059541fad887d45a11a27ee4e2310312c3b7193b08b4a06ec074a7"
},
"schema_version": "1.0",
"source": {
"id": "2604.08213",
"kind": "arxiv",
"version": 2
}
}
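The same check can be run offline against the record printed above. This is a minimal sketch that mirrors the canonicalization in the shell one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding, SHA-256). The names `canonical_sha256`, `RECORD`, and `EXPECTED` are illustrative, and the sketch assumes the JSON shown on this page is identical to what the API returns under `.canonical_record`.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the way the shell one-liner does: sorted keys, no whitespace."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "3a2e4252e971168d8233c682f68285742af2df9e05ead7a570e68d4258b92e57",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-04-09T13:11:33Z",
        "title_canon_sha256": "c82fb40e08059541fad887d45a11a27ee4e2310312c3b7193b08b4a06ec074a7",
    },
    "schema_version": "1.0",
    "source": {"id": "2604.08213", "kind": "arxiv", "version": 2},
}

# Canonical hash as listed on this page.
EXPECTED = "a23dce54bad22548888c878ff5d529a7eeabbc9527d7c8edc480a0a834b81b33"

if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the JSON. A mismatch would suggest either that the copied record differs from the served one or that the canonicalization assumed here is incomplete.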