EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics

AprilPyone MaungMaung; Isao Echizen; Minh-Triet Tran; Van-Loc Nguyen

arxiv: 2605.08695 · v1 · submitted 2026-05-09 · 💻 cs.CV


Pith reviewed 2026-05-12 01:00 UTC · model grok-4.3


keywords: image edit forensics · reasoning chains · AI image manipulation · grounded explanations · vision-language models · forensic datasets · edit localization


The pith

Grounded reasoning chains for image-edit forensics match label-only classification accuracy while producing verifiable explanatory prose.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs EditSleuth, a large dataset of image-edit triplets that includes deterministic six-step reasoning chains tied directly to computable evidence such as source images, edit masks, and semantic labels. It shows that fine-tuning a vision-language model with these chains as supervision targets yields the same accuracy on edit classification as training with labels alone, yet produces explanatory text that references specific visual evidence. This addresses the gap between binary fake detection and the need for localized, typed, and grounded forensic analysis of AI manipulations. The deterministic pipeline avoids unverifiable LLM-generated rationales by deriving each step from upstream triplet artifacts. Difficulty scoring is refined to better separate examples within categories.

Core claim

EditSleuth supplies 257,725 examples, each with an edited image, source image, binary mask, 12-class edit label, difficulty score, and a six-step reasoning chain generated deterministically from the triplet artifacts; chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers while additionally yielding grounded explanatory prose.
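As a rough illustration of the record described above, the sketch below models one example as a Python dataclass. The class name, field names, and file paths are hypothetical and are not taken from the released dataset; only the list of six fields and the six-step chain length come from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditSleuthExample:
    """Hypothetical record layout; names are illustrative, not the released schema."""
    edited_image: str   # path to the edited image
    source_image: str   # path to the source image
    mask: str           # path to the binary edit mask
    edit_label: str     # one of the 12 taxonomy classes
    difficulty: float   # difficulty score
    chain: List[str]    # reasoning steps, expected to number six

    def validate(self) -> None:
        """Raise ValueError if the chain does not have exactly six steps."""
        if len(self.chain) != 6:
            raise ValueError(f"expected 6 reasoning steps, got {len(self.chain)}")


example = EditSleuthExample(
    edited_image="edited/0001.png",    # hypothetical path
    source_image="source/0001.png",    # hypothetical path
    mask="masks/0001.png",             # hypothetical path
    edit_label="object_addition",
    difficulty=0.11,
    chain=[f"placeholder step {i}" for i in range(1, 7)],
)
example.validate()  # passes silently for a six-step chain
```

Keeping the chain as an ordered list rather than free text makes the six-step constraint checkable at load time.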

What carries the argument

Deterministic six-step reasoning chains, each statement tied to a specific computable source of evidence within the image triplet.
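To make "tied to a computable source of evidence" concrete, here is a minimal sketch of how one chain statement (the mask-coverage sentence) could be produced from a binary mask alone. The function name, the central-third rule for "centered", and the quadrant rule are assumptions for illustration; the paper's actual thresholds and templates are not given on this page.

```python
from typing import List


def describe_mask(mask: List[List[int]]) -> str:
    """Build a mask-coverage sentence from a binary mask (rows of 0/1 values).

    Assumed rules (hypothetical, not from the paper):
    - coverage is the rounded percentage of nonzero pixels;
    - a centroid inside the central third on both axes counts as "centered";
    - otherwise the region is named by the quadrant holding the centroid.
    """
    height, width = len(mask), len(mask[0])
    points = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if not points:
        return "The mask of changed pixels is empty"

    coverage = round(100 * len(points) / (height * width))
    prefix = f"The mask of changed pixels covers roughly {coverage}% of the image"
    if len(points) == height * width:
        return f"{prefix} and spans the entire image"

    # Normalised centroid in [0, 1], measured at pixel centres.
    cy = (sum(r for r, _ in points) / len(points) + 0.5) / height
    cx = (sum(c for _, c in points) / len(points) + 0.5) / width

    if 1 / 3 <= cx <= 2 / 3 and 1 / 3 <= cy <= 2 / 3:
        return f"{prefix} and is centered in the image"

    vertical = "upper" if cy < 0.5 else "lower"
    horizontal = "left" if cx < 0.5 else "right"
    return f"{prefix} and is concentrated in the {vertical}-{horizontal} region"


# Toy 10x10 mask with a 3x4 block of changed pixels in the bottom-left corner.
toy_mask = [[1 if r >= 7 and c < 4 else 0 for c in range(10)] for r in range(10)]
sentence = describe_mask(toy_mask)
```

Because the sentence is a pure function of the mask, anyone holding the mask can recompute it and check the statement, which is the sense of grounding the page describes.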

If this is right

Chain supervision enables models to output explanatory prose without sacrificing edit classification accuracy.
A three-component difficulty formulation increases score dispersion compared with a four-component version.
Difficulty scores vary within edit categories rather than serving as a proxy for edit type.
The dataset supplies both classification targets and grounded explanations from the same triplet artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Forensic systems trained this way could support human analysts by surfacing which visual evidence supports each conclusion.
The deterministic pipeline could be extended to other image-manipulation domains where source and edited pairs are available.
Curriculum training that orders examples by the refined difficulty score might accelerate convergence on harder edits.

Load-bearing premise

The deterministic construction of reasoning chains from triplet-grounded upstream artifacts produces faithful and useful explanations for forensic reasoning.

What would settle it

Inspecting whether model-generated explanations on held-out triplets correctly reference the provided edit mask locations and source-image differences.

Figures

Figures reproduced from arXiv: 2605.08695 by AprilPyone MaungMaung, Isao Echizen, Minh-Triet Tran, Van-Loc Nguyen.

**Figure 2.** V1 vs V2 difficulty score distributions, side-by-side, on Pico-Banana (...

**Figure 3.** Examples of the original images, edited images, ground-truth masks, and generated masks

**Figure 4.** Examples of the original images, edited images, ground-truth masks, and generated masks

Original abstract

Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.
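The abstract does not define the difficulty components, so the following is only a toy illustration of what a rank-2 collapse in a correlation matrix looks like: three invented features, one of which is the sum of the other two, yield a 3×3 correlation matrix of numerical rank 2. The feature values, function names, and tolerance are all assumptions; nothing here reproduces the paper's features or formula.

```python
import math
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(features: List[List[float]]) -> List[List[float]]:
    """Pairwise Pearson correlations between feature columns."""
    return [[pearson(f, g) for g in features] for f in features]


def matrix_rank(m: List[List[float]], tol: float = 1e-9) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    rows, cols = len(a), len(a[0])
    rank = 0
    for col in range(cols):
        pivot = max(range(rank, rows), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for r in range(rank + 1, rows):
            factor = a[r][col] / a[rank][col]
            for c in range(col, cols):
                a[r][c] -= factor * a[rank][c]
        rank += 1
        if rank == rows:
            break
    return rank


# Invented toy features: feat_c is exactly feat_a + feat_b, so it adds no new direction.
feat_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feat_b = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
feat_c = [a + b for a, b in zip(feat_a, feat_b)]

rank_three = matrix_rank(correlation_matrix([feat_a, feat_b, feat_c]))
rank_two = matrix_rank(correlation_matrix([feat_a, feat_b]))
```

In this toy case dropping the redundant feature loses no rank, which is the intuition behind moving from four components to three; whether the paper's collapse arises the same way is not stated here.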

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EditSleuth gives a large, released dataset of deterministically built reasoning chains for image-edit forensics, but the chains' actual forensic value is still an assumption rather than shown.


The main thing to know is that this paper ships a dataset of 257k image-edit triplets, each with a source image, edited image, mask, 12-class taxonomy label, difficulty score, and a six-step reasoning chain built straight from those artifacts instead of an LLM. They also simplify the difficulty formula after spotting a rank-2 correlation collapse in the four-component version and show that training a small VL model on the chains matches label-only accuracy while adding prose output. The pipeline and scripts are released, which is concrete and usable work.

The difficulty analysis is the clearest new piece; it shows the score varies inside categories rather than just tracking edit type. The pilot fine-tuning is modest but honest about what it measures.

The soft spot is exactly the one the stress-test flags: nothing in the paper checks whether the chains are diagnostically useful or just restate the mask and label in six sentences. No expert forensic review, no inter-annotator numbers, no comparison to human-written rationales. That leaves the claim of grounded explanatory value as an untested premise. The pilot also stays small and the abstract leaves splits and exact metrics light.

This is for researchers who need training data for explainable manipulation detection or who want to build on released forensic corpora. It is worth a serious referee because the dataset construction is reproducible and the release lowers the barrier for follow-up work, even if the chains need human validation before anyone treats them as reliable explanations.

Referee Report

2 major / 3 minor

Summary. The manuscript presents EditSleuth, a dataset of 257,725 image-edit triplets derived from existing editing corpora. Each example includes an edited image, source image, binary edit mask, 12-class taxonomy label, difficulty score, and a six-step reasoning chain generated deterministically from upstream triplet artifacts with each step tied to a computable evidence source. The authors analyze a four-component difficulty formulation that exhibits rank-2 correlation collapse among magnitude features and show that a three-component version increases score dispersion on Pico-Banana and MagicBrush; difficulty also varies within categories. In a pilot, Qwen2-VL-2B fine-tuned with LoRA under chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers while producing grounded explanatory prose.

Significance. If the deterministically generated chains prove to be faithful and diagnostically useful, the dataset supplies a reproducible resource for training vision-language models that perform edit localization, taxonomy classification, and explanation in image forensics—an area where existing datasets emphasize detection or rely on unverified LLM rationales. The pilot result that chain supervision preserves accuracy while adding explanatory output is a concrete empirical contribution. Releasing the construction pipeline and training scripts is a clear strength that supports community extension and verification.

major comments (2)

[Pilot fine-tuning experiment] The central claim that chain-as-target supervision matches label-only baseline accuracy among parseable answers is load-bearing for the learning study, yet the manuscript omits the train/validation/test split ratios, the precise definition of 'parseable answers', whether accuracy is computed only on the parseable subset or overall, and any statistical test for equivalence between the two supervision regimes. Without these details it is impossible to assess robustness of the reported match.
[Reasoning chain construction] Although each of the six steps is tied to a computable source artifact, the paper reports no human forensic review, inter-annotator agreement, or comparison against expert-written rationales. This leaves the claim that the chains supply 'grounded explanatory prose' that label-only supervision cannot produce as an unverified assumption rather than a demonstrated property, which is central to the dataset's asserted value for forensic reasoning.
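The first major comment turns on the difference between accuracy over parseable answers and accuracy over all answers. The sketch below shows one way that distinction could be computed. The parsing rule (a regular expression on "classified as <label>", with a fallback to a bare label) is an assumption, and the twelve label strings are those that appear in the example chains extracted on this page; whether they are the paper's complete taxonomy is also an assumption.

```python
import re
from typing import List, Optional, Tuple

# Label strings seen in the example chains on this page (assumed to be the taxonomy).
LABELS = {
    "object_addition", "object_removal", "object_replacement", "attribute_change",
    "style_transfer", "photometric", "scene_transformation", "background_change",
    "text_edit", "geometric", "human_centric", "other",
}

_CLASSIFIED = re.compile(r"classified as ([a-z_]+)")


def parse_label(output: str) -> Optional[str]:
    """Return the taxonomy label found in a model output, or None if unparseable."""
    text = output.strip()
    if text in LABELS:                 # label-only style output
        return text
    match = _CLASSIFIED.search(text)   # chain style output
    if match and match.group(1) in LABELS:
        return match.group(1)
    return None


def accuracy_report(outputs: List[str], golds: List[str]) -> Tuple[Optional[float], float, float]:
    """Return (accuracy on parseable subset, overall accuracy, parse rate).

    Unparseable outputs count as wrong in the overall accuracy and are
    excluded from the parseable-subset accuracy.
    """
    parsed = [parse_label(o) for o in outputs]
    parseable = [(p, g) for p, g in zip(parsed, golds) if p is not None]
    correct = sum(1 for p, g in parseable if p == g)
    acc_parseable = correct / len(parseable) if parseable else None
    return acc_parseable, correct / len(outputs), len(parseable) / len(outputs)


# Invented toy outputs, for illustration only.
toy_outputs = [
    "The edit is classified as object_addition, inferred from the instruction text",
    "object_removal",
    "I am not sure what changed",
    "The edit is classified as text_edit",
]
toy_golds = ["object_addition", "object_removal", "geometric", "geometric"]
acc_parseable, acc_overall, parse_rate = accuracy_report(toy_outputs, toy_golds)
```

Reporting all three numbers together shows how much of an apparent match depends on answers that were dropped before scoring.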

minor comments (3)

The 12-class edit taxonomy should be listed explicitly (with example images or definitions) in the main text or a dedicated table rather than referenced only by name.
Any figures showing difficulty-score distributions should include the exact mathematical definitions of the three-component formulation in the caption or legend.
Acronyms such as LoRA should be expanded on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation of minor revision. We address each major comment point by point below, indicating the changes we will make to the manuscript.


Referee: [Pilot fine-tuning experiment] The central claim that chain-as-target supervision matches label-only baseline accuracy among parseable answers is load-bearing for the learning study, yet the manuscript omits the train/validation/test split ratios, the precise definition of 'parseable answers', whether accuracy is computed only on the parseable subset or overall, and any statistical test for equivalence between the two supervision regimes. Without these details it is impossible to assess robustness of the reported match.

Authors: We agree these details are essential for evaluating the pilot. In the revised manuscript we will specify the train/validation/test split ratios used (80/10/10), define 'parseable answers' as model outputs from which the 12-class taxonomy label can be extracted via deterministic parsing rules, clarify that accuracy is reported only on the parseable subset, and include a statistical equivalence test (two-proportion z-test) between the chain-as-target and label-only regimes. These additions will be placed in the learning study section.
revision: yes

Referee: [Reasoning chain construction] Although each of the six steps is tied to a computable source artifact, the paper reports no human forensic review, inter-annotator agreement, or comparison against expert-written rationales. This leaves the claim that the chains supply 'grounded explanatory prose' that label-only supervision cannot produce as an unverified assumption rather than a demonstrated property, which is central to the dataset's asserted value for forensic reasoning.

Authors: The chains are generated deterministically from upstream artifacts (edit masks, taxonomy labels, source/edited image pairs), with each of the six steps explicitly linked to a computable evidence source. This design guarantees traceability and eliminates LLM hallucination risks, providing grounding by construction rather than post-hoc verification. We did not conduct human forensic review or inter-annotator agreement in the present work. In revision we will expand the dataset construction and limitations sections to explicitly describe the deterministic grounding mechanism, contrast it with LLM-generated rationales, and note human validation as important future work. This addresses the concern without overstating current claims.
revision: partial
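The first response proposes a two-proportion z-test between the two supervision regimes. A minimal standard-library sketch of the pooled two-sided version follows; the function name is hypothetical and the counts in the usage lines are invented, not results from the paper.

```python
import math
from typing import Tuple


def two_proportion_z_test(correct_a: int, n_a: int, correct_b: int, n_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if std_err == 0.0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Invented counts for illustration only.
z_same, p_same = two_proportion_z_test(50, 100, 50, 100)
z_diff, p_diff = two_proportion_z_test(60, 100, 40, 100)
```

This test asks whether the two accuracies differ. A non-significant result is not by itself evidence of equivalence, so a claim that the regimes "match" would normally also need an equivalence margin or a confidence interval on the difference.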

Circularity Check

0 steps flagged

No significant circularity detected


The paper's contributions center on empirical dataset construction via deterministic rules applied to existing upstream artifacts (edited/source images, masks, taxonomy labels) and a pilot fine-tuning experiment comparing chain-as-target vs. label-only supervision. No mathematical derivations, predictions, or first-principles results are claimed that reduce to inputs by construction. Difficulty scores are derived from observable features with explicit empirical analysis of rank correlations leading to a simplified formulation. The supervision comparison is a standard empirical evaluation without any fitted-parameter renaming or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. This is a self-contained empirical paper against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work relies on existing image-editing corpora and standard fine-tuning techniques; the main addition is the deterministic pipeline and analysis, without new physical or mathematical postulates.

free parameters (1)

difficulty score components
The paper evaluates a four-component then three-component formulation for difficulty, implying choices in feature selection and combination that are tuned based on observed correlations.

axioms (1)

domain assumption: Reasoning chains generated deterministically from triplet-grounded artifacts are faithful to visual evidence
The construction assumes that tying each statement to computable sources like masks ensures grounding and faithfulness.



Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages

1. The edit instruction states: “add a polar bear”
2. The mask of changed pixels covers roughly 12% of the image and is concentrated in the lower-left region.
3. Structural change relative to the original is minor (SSIM-based score = 0.09), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as object_addition, inferred from the instruction text via a rule-based keyword match (confidence 0.80).
5. Edits of this type typically exhibit boundary discontinuities at the edge of the inserted region, and lighting or shadow inconsistencies between the new object and its surrounding scene.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.11, instruction complexity = 0.12).

object_removal [category=object_removal, scope=local, difficulty=medium, source=rule_based]

1. The edit instruction states: “get rid of the framed pictures”
2. The mask of changed pixels covers roughly 17% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.15), and the edit region is moderately concentrated.
4. The edit is classified as object_removal, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit inpainting artifacts where the removed object used to be, such as blurred or repeated texture patches that disagree with the surrounding context.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.21, instruction complexity = 0.06).

object_replacement [category=object_replacement, scope=local, difficulty=hard, source=rule_based]

1. The edit instruction states: “replace the stuffed animals with a pillow.”
2. The mask of changed pixels covers roughly 40% of the image and is centered in the image.
3. Structural change relative to the original is moderate (SSIM-based score = 0.32), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as object_replacement, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit boundary mismatches at the silhouette of the new object, plus scale or perspective inconsistencies if the replacement does not match the original object’s geometry.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.28, instruction complexity = 0.15).

attribute_change [category=attribute_change, scope=local, difficulty=easy, source=rule_based]

1. The edit instruction states: “let the apples be changed to orange slices”
2. The mask of changed pixels covers roughly 13% of the image and is concentrated in the lower-right region.
3. Structural change relative to the original is minor (SSIM-based score = 0.12), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as attribute_change, inferred from the instruction text via a rule-based keyword match (confidence 0.80).
5. Edits of this type typically exhibit color or texture discontinuities along the object’s boundary where the edited region meets its preserved surroundings, often without changes elsewhere in the image.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.10, instruction complexity = 0.08).

style_transfer [category=style_transfer, scope=global, difficulty=hard, source=dataset_label]

1. The edit instruction states: “enhance the image to a modern aesthetic by applying a vibrant, high-contrast color grade with crisp details, brightening the overall scene, and subtly smoothing any visible wear or rust on the bridge.”
3. Structural change relative to the original is substantial (SSIM-based score = 0.80), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as style_transfer, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit global texture and brushstroke patterns inconsistent with natural photography, applied uniformly across the image regardless of original content.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.56, instruction complexity = 0.58).

photometric [category=photometric, scope=global, difficulty=medium, source=dataset_label]

1. The edit instruction states: “colorize the black and white image realistically, depicting natural skin tones, jungle foliage, and gear colors, then subtly shift the overall color temperature towards a cooler tone.”
3. Structural change relative to the original is substantial (SSIM-based score = 0.64), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as photometric, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit a global histogram shift or noise overlay applied uniformly to all pixels, with the underlying image content semantically unchanged from the original.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.48, instruction complexity = 0.67).

scene_transformation [category=scene_transformation, scope=local, difficulty=medium, source=rule_based]

1. The edit instruction states: “let the cabinets be made of dark wood”
2. The mask of changed pixels covers roughly 13% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.10), and the edit region is moderately concentrated.
4. The edit is classified as scene_transformation, inferred from the instruction text via a rule-based keyword match (confidence 0.65).
5. Edits of this type typically exhibit globally consistent changes in lighting, color temperature, or weather effects that affect the whole scene coherently rather than any single object.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.19, instruction complexity = 0.08).

background_change [category=background_change, scope=local, difficulty=easy, source=rule_based]

1. The edit instruction states: “it should be a mountain in the background.”
2. The mask of changed pixels covers roughly 10% of the image and is concentrated in the lower-right region.
3. Structural change relative to the original is minor (SSIM-based score = 0.07), and the edit region is moderately concentrated.
4. The edit is classified as background_change, inferred from the instruction text via a rule-based keyword match (confidence 0.75).
5. Edits of this type typically exhibit a sharp transition between a preserved foreground subject and a newly-introduced background, sometimes with mismatched lighting or perspective at the boundary.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.13, instruction complexity = 0.08).

text_edit [category=text_edit, scope=local, difficulty=hard, source=rule_based]

1. The edit instruction states: “change the text on the parking meter to say “NO”.”
3. Structural change relative to the original is minor (SSIM-based score = 0.02), and the edit region is diffuse or split across multiple sub-regions.
4. The edit is classified as text_edit, inferred from the instruction text via a rule-based keyword match (confidence 0.70).
5. Edits of this type typically exhibit font or rendering artifacts in the modified text region: inconsistent letter spacing, mismatched typefaces, or rendering noise distinct from the original photographic text.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.24, instruction complexity = 0.18).

geometric [category=geometric, scope=local, difficulty=hard, source=rule_based]

1. The edit instruction states: “make the piece of paper hanging on the wall a mirror”
2. The mask of changed pixels covers roughly 14% of the image and is concentrated in the upper-left region.
3. Structural change relative to the original is minor (SSIM-based score = 0.08), and the edit region is diffuse or split across multiple sub-regions.
4. The edit is classified as geometric, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit canvas-level transformations such as cropped boundaries, scaled content, or extrapolated regions outside the original frame, rather than localized object edits.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.23, instruction complexity = 0.19).

human_centric [category=human_centric, scope=global, difficulty=hard, source=dataset_label]

1. The edit instruction states: “transform the main subject (the person playing the flute) into a detailed, expressive black ink line-art sketch, utilizing varied line weights to highlight facial features, the texture of the cap.”
2. The mask of changed pixels covers roughly 100% of the image and spans the entire image.
3. Structural change relative to the original is substantial (SSIM-based score = 0.72), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as human_centric, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit subject-localized stylization or attribute changes confined to a person, with the surrounding scene preserved; identity-preserving transformations often introduce distinctive rendering artifacts around the face and hair.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.51, instruction complexity = 0.58).

other [category=other, scope=local, difficulty=medium, source=fallback]

1. The edit instruction states: “have there be a basket of fruit on the counter.”
2. The mask of changed pixels covers roughly 8% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.06), and the edit region is moderately concentrated.
4. The edit is classified as other, could not be determined confidently from the available signals; treated as an unspecified edit type.
5. Edits of this type typically exhibit edit characteristics that depend on the specific operation; without a confirmed category, look broadly for any local boundary discontinuities or global statistical shifts.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.16, instruction complexity = 0.10).

E Qualitative mask examples

Figures 3 and 4 compare EditSleuth’s Stage B masks against MagicBrush’s ground-truth manipulation masks across nine representative edit categories. EditSleuth’s masks consistently localize the edited region, inclu...

[1] [1]

add a polar bear

The edit instruction states: “add a polar bear”

work page

[2] [2]

The mask of changed pixels covers roughly 12% of the image and is concentrated in the lower-left region

work page

[3] [3]

Structural change relative to the original is minor (SSIM-based score = 0.09), and the edit region is well- concentrated in a single coherent region

work page

[4] [4]

The edit is classified as object_addition, inferred from the instruction text via a rule-based keyword match (confidence 0.80)

work page

[5] [5]

Edits of this type typically exhibit boundary discontinuities at the edge of the inserted region, and lighting or shadow inconsistencies between the new object and its surrounding scene

work page

[6] [6]

object_removal [category=object_removal, scope=local, difficulty=medium, source=rule_based]

Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.11, instruction complexity = 0.12). object_removal [category=object_removal, scope=local, difficulty=medium, source=rule_based]

work page

[7] [7]

get rid of the framed pictures

The edit instruction states: “get rid of the framed pictures”

work page

[8] [8]

The mask of changed pixels covers roughly 17% of the image and is centered in the image

work page

[9] [9]

Structural change relative to the original is minor (SSIM-based score = 0.15), and the edit region is moderately concentrated

work page

[10] [10]

The edit is classified as object_removal, inferred from the instruction text via a rule-based keyword match (confidence 0.85)

work page

[11] [11]

Edits of this type typically exhibit inpainting artifacts where the removed object used to be, such as blurred or repeated texture patches that disagree with the surrounding context

work page

[12] [12]

Overall, this triplet is of moderate detection difficulty (difficulty score = 0.21, instruction complexity = 0.06).

object_replacement [category=object_replacement, scope=local, difficulty=hard, source=rule_based]

The edit instruction states: “replace the stuffed animals with a pillow.”

The mask of changed pixels covers roughly 40% of the image and is centered in the image.

Structural change relative to the original is moderate (SSIM-based score = 0.32), and the edit region is well-concentrated in a single coherent region.

The edit is classified as object_replacement, inferred from the instruction text via a rule-based keyword match (confidence 0.85).

Edits of this type typically exhibit boundary mismatches at the silhouette of the new object, plus scale or perspective inconsistencies if the replacement does not match the original object’s geometry.
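The "SSIM-based score" in these chains grows with the amount of structural change, which is consistent with a dissimilarity of the form 1 - SSIM. The sketch below illustrates that idea under explicit assumptions: it uses a single-window (global) SSIM rather than the usual sliding-window version, and the cutoffs 0.2 and 0.5 for "minor", "moderate", and "substantial" are hypothetical values chosen only to agree with the examples shown here. The names `global_ssim` and `structural_change` are likewise illustrative.

```python
import numpy as np


def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx = ((x - mx) ** 2).mean()
    vy = ((y - my) ** 2).mean()
    cov = ((x - mx) * (y - my)).mean()
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return float(numerator / denominator)


def structural_change(source: np.ndarray, edited: np.ndarray) -> tuple:
    """Return (score, label), with score = 1 - SSIM clipped to [0, 1].

    The 0.2 and 0.5 cutoffs are assumptions, not thresholds from the paper.
    """
    score = min(1.0, max(0.0, 1.0 - global_ssim(source, edited)))
    if score < 0.2:
        label = "minor"
    elif score < 0.5:
        label = "moderate"
    else:
        label = "substantial"
    return score, label


# Grayscale toy images in [0, 1]: an unchanged copy and an inverted copy.
gradient = np.linspace(0.0, 1.0, 64 * 64).reshape(64, 64)
same_score, same_label = structural_change(gradient, gradient)
inv_score, inv_label = structural_change(gradient, 1.0 - gradient)
print(same_label, inv_label)
```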

Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.28, instruction complexity = 0.15).

attribute_change [category=attribute_change, scope=local, difficulty=easy, source=rule_based]

The edit instruction states: “let the apples be changed to orange slices.”

The mask of changed pixels covers roughly 13% of the image and is concentrated in the lower-right region.

Structural change relative to the original is minor (SSIM-based score = 0.12), and the edit region is well-concentrated in a single coherent region.

The edit is classified as attribute_change, inferred from the instruction text via a rule-based keyword match (confidence 0.80).

Edits of this type typically exhibit color or texture discontinuities along the object’s boundary where the edited region meets its preserved surroundings, often without changes elsewhere in the image.
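The classification statements above attribute the label either to a rule-based keyword match with a confidence value, to a curated dataset label, or to a fallback. A keyword matcher of that general shape can be sketched as follows. The patterns, the rule order, and the fallback confidence of 0.0 are assumptions for illustration; only the category names and the confidence values 0.85 and 0.75 are borrowed from the examples in this appendix.

```python
import re

# Hypothetical rules: (category, pattern, confidence). The paper's actual
# keyword lists are not given in this section.
RULES = [
    ("object_removal", re.compile(r"\b(remove|get rid of|delete|erase)\b"), 0.85),
    ("object_replacement", re.compile(r"\breplace\b.*\bwith\b"), 0.85),
    ("background_change", re.compile(r"\bbackground\b"), 0.75),
]


def classify_instruction(instruction: str) -> tuple:
    """Return (category, confidence, source) for an edit instruction.

    The first matching rule wins; if none matches, the instruction falls
    back to the 'other' category with an assumed confidence of 0.0.
    """
    text = instruction.lower()
    for category, pattern, confidence in RULES:
        if pattern.search(text):
            return category, confidence, "rule_based"
    return "other", 0.0, "fallback"


for example in (
    "get rid of the framed pictures",
    "replace the stuffed animals with a pillow",
    "have there be a basket of fruit on the counter",
):
    print(example, "->", classify_instruction(example))
```

First-match-wins keeps the behavior deterministic, which matters when each chain step has to be reproducible from the triplet alone.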

Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.10, instruction complexity = 0.08).

style_transfer [category=style_transfer, scope=global, difficulty=hard, source=dataset_label]

The edit instruction states: “enhance the image to a modern aesthetic by applying a vibrant, high-contrast color grade with crisp details, brightening the overall scene, and subtly smoothing any visible wear or rust on the bridge.”

Structural change relative to the original is substantial (SSIM-based score = 0.80), and the edit region is well-concentrated in a single coherent region.

The edit is classified as style_transfer, based on the dataset’s curated edit-type label.

Edits of this type typically exhibit global texture and brushstroke patterns inconsistent with natural photography, applied uniformly across the image regardless of original content.

Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.56, instruction complexity = 0.58).

photometric [category=photometric, scope=global, difficulty=medium, source=dataset_label]

The edit instruction states: “colorize the black and white image realistically, depicting natural skin tones, jungle foliage, and gear colors, then subtly shift the overall color temperature towards a cooler tone.”

Structural change relative to the original is substantial (SSIM-based score = 0.64), and the edit region is well-concentrated in a single coherent region.

The edit is classified as photometric, based on the dataset’s curated edit-type label.

Edits of this type typically exhibit a global histogram shift or noise overlay applied uniformly to all pixels, with the underlying image content semantically unchanged from the original.

Overall, this triplet is of moderate detection difficulty (difficulty score = 0.48, instruction complexity = 0.67).

scene_transformation [category=scene_transformation, scope=local, difficulty=medium, source=rule_based]

The edit instruction states: “let the cabinets be made of dark wood.”

The mask of changed pixels covers roughly 13% of the image and is centered in the image.

Structural change relative to the original is minor (SSIM-based score = 0.10), and the edit region is moderately concentrated.

The edit is classified as scene_transformation, inferred from the instruction text via a rule-based keyword match (confidence 0.65).

Edits of this type typically exhibit globally consistent changes in lighting, color temperature, or weather effects that affect the whole scene coherently rather than any single object.

Overall, this triplet is of moderate detection difficulty (difficulty score = 0.19, instruction complexity = 0.08).

background_change [category=background_change, scope=local, difficulty=easy, source=rule_based]

The edit instruction states: “it should be a mountain in the background.”

The mask of changed pixels covers roughly 10% of the image and is concentrated in the lower-right region.

Structural change relative to the original is minor (SSIM-based score = 0.07), and the edit region is moderately concentrated.

The edit is classified as background_change, inferred from the instruction text via a rule-based keyword match (confidence 0.75).

Edits of this type typically exhibit a sharp transition between a preserved foreground subject and a newly-introduced background, sometimes with mismatched lighting or perspective at the boundary.

Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.13, instruction complexity = 0.08).

text_edit [category=text_edit, scope=local, difficulty=hard, source=rule_based]

The edit instruction states: “change the text on the parking meter to say “NO”.”

Structural change relative to the original is minor (SSIM-based score = 0.02), and the edit region is diffuse or split across multiple sub-regions.

The edit is classified as text_edit, inferred from the instruction text via a rule-based keyword match (confidence 0.70).

Edits of this type typically exhibit font or rendering artifacts in the modified text region: inconsistent letter spacing, mismatched typefaces, or rendering noise distinct from the original photographic text.

Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.24, instruction complexity = 0.18).

geometric [category=geometric, scope=local, difficulty=hard, source=rule_based]

The edit instruction states: “make the piece of paper hanging on the wall a mirror.”

The mask of changed pixels covers roughly 14% of the image and is concentrated in the upper-left region.

Structural change relative to the original is minor (SSIM-based score = 0.08), and the edit region is diffuse or split across multiple sub-regions.

The edit is classified as geometric, inferred from the instruction text via a rule-based keyword match (confidence 0.85).

Edits of this type typically exhibit canvas-level transformations such as cropped boundaries, scaled content, or extrapolated regions outside the original frame, rather than localized object edits.

Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.23, instruction complexity = 0.19).

human_centric [category=human_centric, scope=global, difficulty=hard, source=dataset_label]

The edit instruction states: “transform the main subject (the person playing the flute) into a detailed, expressive black ink line-art sketch, utilizing varied line weights to highlight facial features, the texture of the cap.”

The mask of changed pixels covers roughly 100% of the image and spans the entire image.

Structural change relative to the original is substantial (SSIM-based score = 0.72), and the edit region is well-concentrated in a single coherent region.

The edit is classified as human_centric, based on the dataset’s curated edit-type label.

Edits of this type typically exhibit subject-localized stylization or attribute changes confined to a person, with the surrounding scene preserved; identity-preserving transformations often introduce distinctive rendering artifacts around the face and hair.

Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.51, instruction complexity = 0.58).

other [category=other, scope=local, difficulty=medium, source=fallback]

The edit instruction states: “have there be a basket of fruit on the counter.”

The mask of changed pixels covers roughly 8% of the image and is centered in the image.

Structural change relative to the original is minor (SSIM-based score = 0.06), and the edit region is moderately concentrated.

The edit is classified as other, could not be determined confidently from the available signals; treated as an unspecified edit type.

Edits of this type typically exhibit edit characteristics that depend on the specific operation; without a confirmed category, look broadly for any local boundary discontinuities or global statistical shifts.

Overall, this triplet is of moderate detection difficulty (difficulty score = 0.16, instruction complexity = 0.10).

E Qualitative mask examples

Figures 3 and 4 compare EditSleuth’s Stage B masks against MagicBrush’s ground-truth manipulation masks across nine representative edit categories. EditSleuth’s masks consistently localize the edited region, inclu...