EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics
Pith reviewed 2026-05-12 01:00 UTC · model grok-4.3
The pith
Grounded reasoning chains for image-edit forensics match label-only classification accuracy while producing verifiable explanatory prose.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EditSleuth supplies 257,725 examples, each with an edited image, source image, binary mask, 12-class edit label, difficulty score, and a six-step reasoning chain generated deterministically from the triplet artifacts; chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers while additionally yielding grounded explanatory prose.
What carries the argument
Deterministic six-step reasoning chains, each statement tied to a specific computable source of evidence within the image triplet.
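To make "tied to a computable source of evidence" concrete, here is a minimal sketch of how one chain statement (the mask sentence) could be rendered from the binary mask alone. The function names, the 0.15 centering band, and the 0.99 "entire image" threshold are illustrative assumptions, not the paper's released pipeline; only the sentence template is taken from the extracted chain examples.

```python
from typing import List, Tuple

Mask = List[List[int]]  # rows of 0/1 values; 1 marks a changed pixel


def mask_coverage(mask: Mask) -> float:
    """Fraction of pixels flagged as changed."""
    total = sum(len(row) for row in mask)
    changed = sum(sum(row) for row in mask)
    return changed / total if total else 0.0


def mask_centroid(mask: Mask) -> Tuple[float, float]:
    """Centroid of changed pixels as (x, y), each normalised to [0, 1]."""
    height, width = len(mask), len(mask[0])
    sum_x = sum_y = 0.0
    count = 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                sum_x += x + 0.5
                sum_y += y + 0.5
                count += 1
    if count == 0:
        raise ValueError("mask has no changed pixels")
    return sum_x / count / width, sum_y / count / height


def describe_location(coverage: float, cx: float, cy: float, band: float = 0.15) -> str:
    """Map coverage and centroid to a phrase (thresholds are assumed, not the paper's)."""
    if coverage >= 0.99:
        return "spans the entire image"
    if abs(cx - 0.5) <= band and abs(cy - 0.5) <= band:
        return "is centered in the image"
    vertical = "upper" if cy < 0.5 else "lower"  # row 0 is the top of the image
    horizontal = "left" if cx < 0.5 else "right"
    return f"is concentrated in the {vertical}-{horizontal} region"


def mask_statement(mask: Mask) -> str:
    """Render the mask sentence deterministically from the mask."""
    coverage = mask_coverage(mask)
    cx, cy = mask_centroid(mask)
    percent = round(coverage * 100)
    return (f"The mask of changed pixels covers roughly {percent}% of the image "
            f"and {describe_location(coverage, cx, cy)}")


def block_mask(size: int, top: int, left: int, height: int, width: int) -> Mask:
    """Hypothetical helper: a square mask with one rectangular edited block."""
    return [[1 if top <= y < top + height and left <= x < left + width else 0
             for x in range(size)] for y in range(size)]


demo = block_mask(10, top=6, left=0, height=3, width=4)
print(mask_statement(demo))
```

Because the sentence is a pure function of the mask, the same triplet always yields the same text, which is what makes each statement checkable after the fact.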
If this is right
- Chain supervision enables models to output explanatory prose without sacrificing edit classification accuracy.
- A three-component difficulty formulation increases score dispersion compared with a four-component version.
- Difficulty scores vary within edit categories rather than serving as a proxy for edit type.
- The dataset supplies both classification targets and grounded explanations from the same triplet artifacts.
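One possible reading of the "rank-2 correlation collapse" behind the difficulty-score point above is that the correlation matrix of the four components has only two independent directions. The sketch below shows how such a collapse could be detected; the four synthetic features are invented for illustration and are not the paper's difficulty components, and the tolerance is an assumed value.

```python
import math
import random
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation between two equally long samples."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def correlation_matrix(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise Pearson correlations between feature columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def numerical_rank(matrix: Sequence[Sequence[float]], tol: float = 1e-8) -> int:
    """Rank via Gaussian elimination with partial pivoting (tol is an assumption)."""
    rows = [list(row) for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


rng = random.Random(0)
base_a = [rng.gauss(0, 1) for _ in range(200)]
base_b = [rng.gauss(0, 1) for _ in range(200)]

# Synthetic stand-ins: two of the four features are linear mixes of the others.
collapsed = [
    base_a,
    base_b,
    [a + b for a, b in zip(base_a, base_b)],
    [a - 2 * b for a, b in zip(base_a, base_b)],
]
independent = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(4)]

print(numerical_rank(correlation_matrix(collapsed)))
print(numerical_rank(correlation_matrix(independent)))
```

When several components carry the same underlying signal, a weighted sum of them behaves like a score built from fewer inputs, which is one plausible reason a simpler formulation could spread scores more widely.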
Where Pith is reading between the lines
- Forensic systems trained this way could support human analysts by surfacing which visual evidence supports each conclusion.
- The deterministic pipeline could be extended to other image-manipulation domains where source and edited pairs are available.
- Curriculum training that orders examples by the refined difficulty score might accelerate convergence on harder edits.
Load-bearing premise
The deterministic construction of reasoning chains from triplet-grounded upstream artifacts produces faithful and useful explanations for forensic reasoning.
What would settle it
Inspecting whether model-generated explanations on held-out triplets correctly reference the provided edit mask locations and source-image differences.
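A check of this kind could be partly automated. The following sketch, under stated assumptions, tests only one claim type: whether the mask coverage a generated explanation states agrees with the coverage measured from the provided mask. The regular expression follows the sentence template seen in the extracted chains; the function names and the 5-point tolerance are hypothetical.

```python
import re
from typing import List, Optional

Mask = List[List[int]]  # rows of 0/1 values; 1 marks a changed pixel

# Matches the template "covers roughly N% of the image".
COVERAGE_PATTERN = re.compile(r"covers roughly (\d+)% of the image")


def stated_coverage(text: str) -> Optional[float]:
    """Coverage fraction claimed in the text, or None if no claim is found."""
    match = COVERAGE_PATTERN.search(text)
    return int(match.group(1)) / 100 if match else None


def measured_coverage(mask: Mask) -> float:
    """Fraction of pixels flagged as changed in the mask."""
    total = sum(len(row) for row in mask)
    changed = sum(sum(row) for row in mask)
    return changed / total if total else 0.0


def coverage_claim_is_grounded(text: str, mask: Mask, tolerance: float = 0.05) -> bool:
    """True when the text states a coverage within `tolerance` of the measured one."""
    stated = stated_coverage(text)
    if stated is None:
        return False  # an explanation with no coverage claim cannot be confirmed
    return abs(stated - measured_coverage(mask)) <= tolerance


# A 10x10 mask with 12 changed pixels (12% coverage).
mask = [[1 if 6 <= y < 9 and x < 4 else 0 for x in range(10)] for y in range(10)]
faithful = "The mask of changed pixels covers roughly 12% of the image and is concentrated in the lower-left region"
unfaithful = "The mask of changed pixels covers roughly 40% of the image and is centered in the image"

print(coverage_claim_is_grounded(faithful, mask))
print(coverage_claim_is_grounded(unfaithful, mask))
```

The same pattern would extend to location phrases and SSIM-based scores, each compared against a value recomputed from the held-out triplet.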
Abstract
Forensic analysis of AI-edited images requires more than binary real-versus-fake prediction: a useful system should localize the edit, identify its semantic type, and ground its decisions in visual evidence. Existing image-forensics datasets typically emphasize detection or localization, while reasoning-supervised vision-language datasets rarely target image manipulation and often rely on LLM-generated rationales whose faithfulness is difficult to verify. We introduce EditSleuth, a dataset of 257,725 image-edit triplets constructed from existing image-editing corpora for grounded image-edit forensic reasoning. Each example includes an edited image, its source image, a binary edit mask, a 12-class edit taxonomy label, a difficulty score, and a six-step reasoning chain. EditSleuth chains are generated deterministically from triplet-grounded upstream artifacts, with each statement tied to a specific computable source of evidence. Our analysis reveals that a naive four-component difficulty formulation suffers from a rank-2 correlation collapse among magnitude features; a simplified three-component formulation substantially increases score dispersion on both Pico-Banana and MagicBrush. Difficulty also varies meaningfully within most edit categories, indicating that the score is not a proxy for edit type. As an initial learning study, we fine-tune Qwen2-VL-2B with LoRA and find that chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers, while additionally yielding grounded explanatory prose that label-only supervision cannot produce. We release the dataset, the deterministic construction pipeline, and pilot training scripts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents EditSleuth, a dataset of 257,725 image-edit triplets derived from existing editing corpora. Each example includes an edited image, source image, binary edit mask, 12-class taxonomy label, difficulty score, and a six-step reasoning chain generated deterministically from upstream triplet artifacts with each step tied to a computable evidence source. The authors analyze a four-component difficulty formulation that exhibits rank-2 correlation collapse among magnitude features and show that a three-component version increases score dispersion on Pico-Banana and MagicBrush; difficulty also varies within categories. In a pilot, Qwen2-VL-2B fine-tuned with LoRA under chain-as-target supervision matches a label-only baseline on classification accuracy among parseable answers while producing grounded explanatory prose.
Significance. If the deterministically generated chains prove to be faithful and diagnostically useful, the dataset supplies a reproducible resource for training vision-language models that perform edit localization, taxonomy classification, and explanation in image forensics—an area where existing datasets emphasize detection or rely on unverified LLM rationales. The pilot result that chain supervision preserves accuracy while adding explanatory output is a concrete empirical contribution. Releasing the construction pipeline and training scripts is a clear strength that supports community extension and verification.
major comments (2)
- [Pilot fine-tuning experiment] The central claim that chain-as-target supervision matches label-only baseline accuracy among parseable answers is load-bearing for the learning study, yet the manuscript omits the train/validation/test split ratios, the precise definition of 'parseable answers', whether accuracy is computed only on the parseable subset or overall, and any statistical test for equivalence between the two supervision regimes. Without these details it is impossible to assess the robustness of the reported match.
- [Reasoning chain construction] Although each of the six steps is tied to a computable source artifact, the paper reports no human forensic review, inter-annotator agreement, or comparison against expert-written rationales. The claim that the chains supply 'grounded explanatory prose' that label-only supervision cannot produce therefore remains an unverified assumption rather than a demonstrated property, even though it is central to the dataset's asserted value for forensic reasoning.
minor comments (3)
- The 12-class edit taxonomy should be listed explicitly (with example images or definitions) in the main text or a dedicated table rather than referenced only by name.
- Any figures showing difficulty-score distributions should include the exact mathematical definitions of the three-component formulation in the caption or legend.
- Acronyms such as LoRA should be expanded on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive recommendation of minor revision. We address each major comment point by point below, indicating the changes we will make to the manuscript.
- Referee: [Pilot fine-tuning experiment] The central claim that chain-as-target supervision matches label-only baseline accuracy among parseable answers is load-bearing for the learning study, yet the manuscript omits the train/validation/test split ratios, the precise definition of 'parseable answers', whether accuracy is computed only on the parseable subset or overall, and any statistical test for equivalence between the two supervision regimes. Without these details it is impossible to assess the robustness of the reported match.
  Authors: We agree these details are essential for evaluating the pilot. In the revised manuscript we will specify the train/validation/test split ratios used (80/10/10), define 'parseable answers' as model outputs from which the 12-class taxonomy label can be extracted via deterministic parsing rules, clarify that accuracy is reported only on the parseable subset, and include a statistical equivalence test (two-proportion z-test) between the chain-as-target and label-only regimes. These additions will be placed in the learning study section.
  Revision: yes
- Referee: [Reasoning chain construction] Although each of the six steps is tied to a computable source artifact, the paper reports no human forensic review, inter-annotator agreement, or comparison against expert-written rationales. The claim that the chains supply 'grounded explanatory prose' that label-only supervision cannot produce therefore remains an unverified assumption rather than a demonstrated property, even though it is central to the dataset's asserted value for forensic reasoning.
  Authors: The chains are generated deterministically from upstream artifacts (edit masks, taxonomy labels, source/edited image pairs), with each of the six steps explicitly linked to a computable evidence source. This design guarantees traceability and eliminates LLM hallucination risks, providing grounding by construction rather than post-hoc verification. We did not conduct human forensic review or inter-annotator agreement in the present work. In revision we will expand the dataset construction and limitations sections to explicitly describe the deterministic grounding mechanism, contrast it with LLM-generated rationales, and note human validation as important future work. This addresses the concern without overstating current claims.
  Revision: partial
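The evaluation details promised in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' script: the parsing rule (exactly one distinct taxonomy label found in the output) is hypothetical, the label names are those appearing in the extracted chain examples, and the z-test is the standard pooled two-proportion test with a two-sided p-value.

```python
import math
import re
from typing import Iterable, Optional, Tuple

# The 12 category names as they appear in the extracted chain examples.
EDIT_LABELS = (
    "object_addition", "object_removal", "object_replacement", "attribute_change",
    "style_transfer", "photometric", "scene_transformation", "background_change",
    "text_edit", "geometric", "human_centric", "other",
)
_LABEL_PATTERN = re.compile(r"\b(" + "|".join(EDIT_LABELS) + r")\b")


def parse_label(output: str) -> Optional[str]:
    """Assumed rule: an output is parseable if it names exactly one distinct label."""
    found = set(_LABEL_PATTERN.findall(output.lower()))
    return found.pop() if len(found) == 1 else None


def parseable_accuracy(outputs: Iterable[str], gold: Iterable[str]) -> Tuple[int, int]:
    """Return (correct, parsed); unparseable outputs are excluded from both counts."""
    correct = parsed = 0
    for output, label in zip(outputs, gold):
        predicted = parse_label(output)
        if predicted is None:
            continue
        parsed += 1
        correct += int(predicted == label)
    return correct, parsed


def two_proportion_z_test(c1: int, n1: int, c2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-proportion z statistic and two-sided p-value."""
    p1, p2 = c1 / n1, c2 / n2
    pooled = (c1 + c2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Toy outputs standing in for the two supervision regimes.
chain_outputs = ["The edit is classified as object_removal.", "unclear", "text_edit"]
gold_labels = ["object_removal", "geometric", "geometric"]
correct, parsed = parseable_accuracy(chain_outputs, gold_labels)
print(correct, parsed)
print(two_proportion_z_test(50, 100, 50, 100))
```

Reporting the parsed count next to the accuracy matters because two regimes can match on the parseable subset while differing in how often they produce a parseable answer. A large p-value from this test only fails to detect a difference; a formal equivalence claim would typically need a margin-based procedure such as two one-sided tests.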
Circularity Check
No significant circularity detected
The paper's contributions center on empirical dataset construction via deterministic rules applied to existing upstream artifacts (edited/source images, masks, taxonomy labels) and a pilot fine-tuning experiment comparing chain-as-target vs. label-only supervision. No mathematical derivations, predictions, or first-principles results are claimed that reduce to inputs by construction. Difficulty scores are derived from observable features with explicit empirical analysis of rank correlations leading to a simplified formulation. The supervision comparison is a standard empirical evaluation without any fitted-parameter renaming or self-referential definitions. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked. This is a self-contained empirical paper against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- difficulty score components
axioms (1)
- Domain assumption: reasoning chains generated deterministically from triplet-grounded artifacts are faithful to visual evidence.
Example reasoning chains
One six-step chain per edit category, as extracted from the paper's appendix.

object_addition
2. The mask of changed pixels covers roughly 12% of the image and is concentrated in the lower-left region.
3. Structural change relative to the original is minor (SSIM-based score = 0.09), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as object_addition, inferred from the instruction text via a rule-based keyword match (confidence 0.80).
5. Edits of this type typically exhibit boundary discontinuities at the edge of the inserted region, and lighting or shadow inconsistencies between the new object and its surrounding scene.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.11, instruction complexity = 0.12).

object_removal [category=object_removal, scope=local, difficulty=medium, source=rule_based]
1. The edit instruction states: “get rid of the framed pictures”
2. The mask of changed pixels covers roughly 17% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.15), and the edit region is moderately concentrated.
4. The edit is classified as object_removal, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit inpainting artifacts where the removed object used to be, such as blurred or repeated texture patches that disagree with the surrounding context.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.21, instruction complexity = 0.06).

object_replacement [category=object_replacement, scope=local, difficulty=hard, source=rule_based]
1. The edit instruction states: “replace the stuffed animals with a pillow.”
2. The mask of changed pixels covers roughly 40% of the image and is centered in the image.
3. Structural change relative to the original is moderate (SSIM-based score = 0.32), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as object_replacement, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit boundary mismatches at the silhouette of the new object, plus scale or perspective inconsistencies if the replacement does not match the original object’s geometry.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.28, instruction complexity = 0.15).

attribute_change [category=attribute_change, scope=local, difficulty=easy, source=rule_based]
1. The edit instruction states: “let the apples be changed to orange slices”
2. The mask of changed pixels covers roughly 13% of the image and is concentrated in the lower-right region.
3. Structural change relative to the original is minor (SSIM-based score = 0.12), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as attribute_change, inferred from the instruction text via a rule-based keyword match (confidence 0.80).
5. Edits of this type typically exhibit color or texture discontinuities along the object’s boundary where the edited region meets its preserved surroundings, often without changes elsewhere in the image.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.10, instruction complexity = 0.08).

style_transfer [category=style_transfer, scope=global, difficulty=hard, source=dataset_label]
1. The edit instruction states: “enhance the image to a modern aesthetic by applying a vibrant, high-contrast color grade with crisp details, brightening the overall scene, and subtly smoothing any visible wear or rust on the bridge.”
3. Structural change relative to the original is substantial (SSIM-based score = 0.80), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as style_transfer, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit global texture and brushstroke patterns inconsistent with natural photography, applied uniformly across the image regardless of original content.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.56, instruction complexity = 0.58).

photometric [category=photometric, scope=global, difficulty=medium, source=dataset_label]
1. The edit instruction states: “colorize the black and white image realistically, depicting natural skin tones, jungle foliage, and gear colors, then subtly shift the overall color temperature towards a cooler tone.”
3. Structural change relative to the original is substantial (SSIM-based score = 0.64), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as photometric, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit a global histogram shift or noise overlay applied uniformly to all pixels, with the underlying image content semantically unchanged from the original.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.48, instruction complexity = 0.67).

scene_transformation [category=scene_transformation, scope=local, difficulty=medium, source=rule_based]
1. The edit instruction states: “let the cabinets be made of dark wood”
2. The mask of changed pixels covers roughly 13% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.10), and the edit region is moderately concentrated.
4. The edit is classified as scene_transformation, inferred from the instruction text via a rule-based keyword match (confidence 0.65).
5. Edits of this type typically exhibit globally consistent changes in lighting, color temperature, or weather effects that affect the whole scene coherently rather than any single object.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.19, instruction complexity = 0.08).

background_change [category=background_change, scope=local, difficulty=easy, source=rule_based]
1. The edit instruction states: “it should be a mountain in the background.”
2. The mask of changed pixels covers roughly 10% of the image and is concentrated in the lower-right region.
3. Structural change relative to the original is minor (SSIM-based score = 0.07), and the edit region is moderately concentrated.
4. The edit is classified as background_change, inferred from the instruction text via a rule-based keyword match (confidence 0.75).
5. Edits of this type typically exhibit a sharp transition between a preserved foreground subject and a newly-introduced background, sometimes with mismatched lighting or perspective at the boundary.
6. Overall, this triplet is easier than average to detect, given clear local geometry and a low-complexity instruction (difficulty score = 0.13, instruction complexity = 0.08).

text_edit [category=text_edit, scope=local, difficulty=hard, source=rule_based]
1. The edit instruction states: “change the text on the parking meter to say “NO”.”
3. Structural change relative to the original is minor (SSIM-based score = 0.02), and the edit region is diffuse or split across multiple sub-regions.
4. The edit is classified as text_edit, inferred from the instruction text via a rule-based keyword match (confidence 0.70).
5. Edits of this type typically exhibit font or rendering artifacts in the modified text region: inconsistent letter spacing, mismatched typefaces, or rendering noise distinct from the original photographic text.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.24, instruction complexity = 0.18).

geometric [category=geometric, scope=local, difficulty=hard, source=rule_based]
1. The edit instruction states: “make the piece of paper hanging on the wall a mirror”
2. The mask of changed pixels covers roughly 14% of the image and is concentrated in the upper-left region.
3. Structural change relative to the original is minor (SSIM-based score = 0.08), and the edit region is diffuse or split across multiple sub-regions.
4. The edit is classified as geometric, inferred from the instruction text via a rule-based keyword match (confidence 0.85).
5. Edits of this type typically exhibit canvas-level transformations such as cropped boundaries, scaled content, or extrapolated regions outside the original frame, rather than localized object edits.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.23, instruction complexity = 0.19).

human_centric [category=human_centric, scope=global, difficulty=hard, source=dataset_label]
1. The edit instruction states: “transform the main subject (the person playing the flute) into a detailed, expressive black ink line-art sketch, utilizing varied line weights to highlight facial features, the texture of the cap.”
2. The mask of changed pixels covers roughly 100% of the image and spans the entire image.
3. Structural change relative to the original is substantial (SSIM-based score = 0.72), and the edit region is well-concentrated in a single coherent region.
4. The edit is classified as human_centric, based on the dataset’s curated edit-type label.
5. Edits of this type typically exhibit subject-localized stylization or attribute changes confined to a person, with the surrounding scene preserved; identity-preserving transformations often introduce distinctive rendering artifacts around the face and hair.
6. Overall, this triplet is harder than average to detect, given diffuse geometry or a high-complexity instruction (difficulty score = 0.51, instruction complexity = 0.58).

other [category=other, scope=local, difficulty=medium, source=fallback]
1. The edit instruction states: “have there be a basket of fruit on the counter.”
2. The mask of changed pixels covers roughly 8% of the image and is centered in the image.
3. Structural change relative to the original is minor (SSIM-based score = 0.06), and the edit region is moderately concentrated.
4. The edit is classified as other, could not be determined confidently from the available signals; treated as an unspecified edit type.
5. Edits of this type typically exhibit edit characteristics that depend on the specific operation; without a confirmed category, look broadly for any local boundary discontinuities or global statistical shifts.
6. Overall, this triplet is of moderate detection difficulty (difficulty score = 0.16, instruction complexity = 0.10).

E Qualitative mask examples
Figures 3 and 4 compare EditSleuth’s Stage B masks against MagicBrush’s ground-truth manipulation masks across nine representative edit categories. EditSleuth’s masks consistently localize the edited region, inclu...