arxiv: 2604.11004 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI · cs.LG

Panoptic Pairwise Distortion Graph

Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords Distortion Graph · Pairwise Image Assessment · Region-Level Degradation · Scene Graph Extension · Image Quality Assessment · Multimodal Large Language Models · Comparative Assessment · PandaBench

The pith

Representing image pairs as region-level distortion graphs captures fine-grained degradations more compactly than whole-image analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes shifting comparative image assessment from whole-image methods to structured graphs that connect regions across a pair. These graphs encode distortion type, severity, direct comparisons between regions, and quality scores in one topology. The authors supply a region-annotated dataset and a benchmark that varies the difficulty of region-level tasks. Experiments reveal that current multimodal models cannot reliably interpret region degradations even when given explicit region cues, yet training on the dataset or prompting with the graph structure elicits better understanding. This establishes a new direction for interpretable, region-focused pairwise evaluation.

Core claim

We extend the intra-image scene graph to an inter-image Distortion Graph that treats paired images as a region-grounded topology and packs dense degradation data such as distortion type, severity, comparison, and quality score into a compact structure. To support the task we release PandaSet, a region-level dataset; PandaBench, a benchmark with graded region difficulties; and the Panda architecture for generating the graphs. PandaBench exposes a clear limitation in state-of-the-art multimodal large language models: they fail to understand region-level degradations even when supplied explicit region cues, while training on PandaSet or prompting with the Distortion Graph improves region-wise distortion understanding.

What carries the argument

The Distortion Graph, an inter-image extension of scene graphs that organizes regions from paired images into a topology encoding distortion type, severity, comparison, and quality score.
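To make the representation concrete, here is a minimal sketch of how such a graph could be held in memory. It is an illustration under assumptions, not the authors' implementation: the class names `RegionNode` and `DistortionGraph`, the `add_pair` helper, and the region name "person" are hypothetical. The attribute values reuse the sample model output quoted later on this page (blur / minor / 0.78 versus saturate-increase / moderate / 0.85, comparison "slightly-worse").

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegionNode:
    """One region of one image, with its degradation attributes (hypothetical layout)."""
    image: str        # "anchor" or "target"
    name: str         # region label, e.g. "person"
    distortion: str   # distortion type
    severity: str     # severity level
    score: float      # quality score, assumed to lie in [0, 1]


@dataclass
class DistortionGraph:
    """Nodes are regions; each edge links an anchor region to its target counterpart."""
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (anchor_idx, target_idx, comparison)

    def add_pair(self, anchor: RegionNode, target: RegionNode, comparison: str):
        """Add one corresponding region pair and the comparison edge between them."""
        a_idx = len(self.nodes)
        self.nodes.append(anchor)
        t_idx = len(self.nodes)
        self.nodes.append(target)
        self.edges.append((a_idx, t_idx, comparison))
        return a_idx, t_idx


dg = DistortionGraph()
dg.add_pair(
    RegionNode("anchor", "person", "blur", "minor", 0.78),
    RegionNode("target", "person", "saturate-increase", "moderate", 0.85),
    "slightly-worse",
)
```

Keeping attributes on nodes and comparisons on edges mirrors the attribute/predicate split described in the architecture caption, but the paper's actual encoding may differ.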

If this is right

  • Whole-image assessment methods can be replaced or augmented by region-structured graphs for greater compactness and interpretability.
  • Multimodal models gain region-wise distortion understanding when trained on PandaSet or prompted with Distortion Graphs.
  • PandaBench provides a graded test suite that exposes and measures limitations in current models on pairwise region tasks.
  • The graph output supports downstream uses that require explicit, localized degradation information rather than global scores.
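One downstream use hinted at in the figure excerpts is deriving a whole-image ranking from region-level predicates with a naive control rule. The sketch below shows one plausible reading, a simple majority vote. The function name `rank_pair`, the label vocabulary (only "slightly-worse" is attested on this page), and the convention that each label describes the region in image A relative to image B are all assumptions.

```python
from collections import Counter

# Assumed label vocabulary; only "slightly-worse" appears in the quoted outputs.
BETTER = {"better", "slightly-better"}
WORSE = {"worse", "slightly-worse"}


def rank_pair(comparisons):
    """Return "A", "B", or "tie" from per-region comparison labels.

    Each label is assumed to describe the region in image A relative to
    the matching region in image B.
    """
    votes = Counter()
    for label in comparisons:
        if label in BETTER:
            votes["A"] += 1
        elif label in WORSE:
            votes["B"] += 1
        # Any other label (e.g. "similar") casts no vote.
    if votes["A"] > votes["B"]:
        return "A"
    if votes["B"] > votes["A"]:
        return "B"
    return "tie"


winner = rank_pair(["slightly-worse", "better", "worse"])
```

A score-based variant (comparing mean region quality scores) would be an equally reasonable choice; the vote is used here only because it needs nothing beyond the edge labels.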

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same region-topology approach could be tested on video sequences to track how distortions evolve across frames.
  • Integrating Distortion Graphs with existing object-detection pipelines might allow automatic localization of quality issues without extra labels.
  • One could measure whether models trained on these graphs transfer better to related tasks such as image editing or restoration targeting specific regions.

Load-bearing premise

That extending scene graphs to inter-image region topologies will yield a more compact and interpretable representation for comparative assessment than whole-image methods.

What would settle it

State-of-the-art multimodal models reaching high accuracy on PandaBench without region-level training or graph prompting, or human raters finding whole-image methods equally interpretable for degradation details.
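Whether a model "reaches high accuracy" on such a benchmark comes down to region-wise classification metrics. The paper excerpt on this page mentions accuracy, precision, recall, and F1 for the categorical outputs; the sketch below computes accuracy and a macro-averaged F1 in plain Python. Macro averaging and the function names are assumptions, since the page does not say how the authors aggregate across classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of regions whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy per-region distortion labels (illustrative, not benchmark data).
truth = ["blur", "haze", "blur", "clean"]
preds = ["blur", "blur", "blur", "clean"]
acc = accuracy(truth, preds)
f1 = macro_f1(truth, preds)
```

On the toy labels, accuracy is 3/4 while macro F1 is lower because the missed "haze" class contributes an F1 of zero, which is why both numbers are worth reporting for imbalanced region labels.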

Figures

Figures reproduced from arXiv: 2604.11004 by Abdul Wahab, Bahador Rashidi, Muhammad Kamran Janjua.

Figure 1
Figure 1: DG Task Overview. Top: Given two images, PANDA learns the proposed Distortion Graph (DG). Bottom: Grounded Subgraphs illustrate how DG grounds regions in terms of distortion information. 1 INTRODUCTION In humans, perceptual decisions are often cognitively involved, deliberate, and contextual (Ding & Gold, 2013). Studies have argued that any model of such perceptual decision making should consider the repr…
Figure 2
Figure 2: Motivation. Current MLLMs (e.g., Co-Instruct (2024c)) fail at region-level understanding, struggling even when given explicit region details (name, description, bounding box). DG grounds assessment in regions, relating distortions and attributes to provide a structured view. Optionally, the graph can be fed to an MLLM for region-wise language descriptions. Scene relations (yellow) are not predicted. Best…
Figure 3
Figure 3: Architecture Diagram. Illustration of the proposed PANDA architecture to learn Distortion Graph (DG). A pair of images is fed as input, and for each region in the pair, their comparative relationship (predicates), distortion type, severity type and quality score (attributes) are predicted. across both images in the pair for one-to-one correspondence in regions, i.e., $N_R = N_R^A = N_R^T$. For exposition, we…
Figure 4
Figure 4: Emergent Results. Feeding predicted DG in prompt as chain-of-thought (CoT) results in improvement of ≈ 15% (accuracy) in region-wise distortion understanding of GPT-5 Mini. 5.1 RESULTS & DISCUSSION In tables 2 to 4, we present results on the Easy, Hard, and Medium settings of PANDABENCH. For comparison, distortion, and severity type, we measure accuracy, precision, recall, and F1 score. While for quality s…
Figure 5
Figure 5: Distortion Graph as Context. Illustrative figure analyzing showcase application wherein predicted DG is fed as context to GPT-5 Mini. Top: Sample taken from PANDABENCH Easy, Bottom: Sample taken from PANDABENCH Hard. GPT-5 Mini indeed overrides the predicted DG when the pixels disagree with DG. Analysis of Distortion Graph as Context. We analyze the improvement in fig. 4 and evaluate whether GPT-5 Mini sim…
Figure 6
Figure 6: PANDASET Summary. Left: A word cloud of region names indicating diversity of the objects in images. Right: A region-wise summary of PANDASET in terms of distortions & severity. All 15 of the distortions are uniformly distributed across the regions, and we broadly categorize the distortions in super categories: weather, camera/equipment, digital, light, and clean. Methods Comparison Distortion Severity Scor…
Figure 7
Figure 7: Design Choice Ablation. Accuracy comparison of different design choices: backbone feature extractors (solid line) and Transformer blocks (dotted line). PANDA maintains balance in size, performance, and efficiency. Whole Image vs. Region-Wise. Our findings indicate that MLLM performance is dependent on the granularity of decision-making. If whole image, i.e., global view, is considered, the performance is …
Figure 8
Figure 8: All Distortion Types. We visualize all 15 different distortion types on the same image taken from PANDASET. Each distortion degrades the image differently. Some distortions ruin the perceptual quality of the image more than others (e.g., haze, contrast decrease). along with model size in parameters in table 6. For closed-source models, we do not compare the compute cost since they are exposed through an AP…
Figure 9
Figure 9: Prompt Type (a). A template of prompt for open-source MLLMs. The tags for keywords like image, input, user, assistant, output, etc. that each method requires are added as necessary. Anchor Target Content: Provide an image quality assessment for each of the regions in image A and image B based on their bounding boxes and names. Identify the distortion present in each region, and pick one for each region fro…
Figure 10
Figure 10: Prompt Type (b). A template of prompt for closed-source MLLMs. Frontier methods have superior instruction following ability, and can reason about the regions from the prompt. tionships (predicates). PANDA was not originally trained to provide whole-image ranking, and we directly use predicted relationship predicates or region-wise scores with a naive control logic, e.g., if more regions in image A are bet…
Figure 11
Figure 11: Q-Instruct Behavior. An illustration of output from Q-Instruct (Wu et al., 2023) when prompted for multiple instructions. It is insensitive to the order of image, even when explicitly specified, misses degradation, struggles to follow instruction, and repeats irrelevant information. B.2 GENERALITY OF DISTORTION GRAPH REPRESENTATION Methods Accuracy ↑ mPLUG-Owl2 (2024) 48.5 LLaVA-1.6 (2024b) 57.0 Q-Instruc…
Figure 12
Figure 12: Co-Instruct Behavior. An illustration of output from Co-Instruct (Wu et al., 2024c) when prompted for multiple instructions. It fails to perform comparative assessment, frequently misses regions, and struggles with instruction following. Janus-Pro-7B. Unlike Q-Instruct (Wu et al., 2023) and Co-Instruct (Wu et al., 2024c), Janus-Pro-7B (Chen et al., 2025) is a general-purpose open-source MLLM designed for …
Figure 13
Figure 13: Open-Source MLLM Prompt/Output. A representative example of prompt type (a) along with output for all open-source MLLMs evaluated in this work. Anchor Target Prompt (b): Provide an image quality assessment for each of the regions in image A (first) and image B (second) ... Bounding boxes and names... Answer: 1. A: {dist: blur, sev: minor, score: 0.78}, B: {dist: saturate-increase, sev: moderate, score: 0.…
Figure 14
Figure 14: Closed-Source MLLM Prompt/Output. A representative example of prompt type (b) along with output for all closed-source MLLMs evaluated in this work. Note that, in this example, the image has 16 regions. mance is generally limited, especially on Hard split of PANDABENCH indicating a broader trend in lack of region-wise image understanding towards distortion analysis. While Seagull (Chen et al., 2024c) is a …
Figure 15
Figure 15: PANDABENCH. Representative samples from Easy, Medium, and Hard splits of PANDABENCH. In Easy split, only one distortion afflicts the entire image (and its regions, but with varied severity), while in Medium, mixed has region-wise distortions (see the person in blue jacket). In Hard split, the distortion varies by region in both images (see ground, bike, trees, etc.). Taken together, they represent a spec…
Figure 16
Figure 16: Hyperparameter Sweep. Plot of optimization objective hyperparameter sweep with cross-validation on validation set of PANDASET. Each grey point denotes an experiment that performed noticeably worse, and we label top five settings with colored × mark. A denotes anchor, T denotes Target, Dist. denotes distortion, Sev. is short for severity, and Acc. denotes accuracy. We, therefore, view PANDASET as the first…
Figure 17
Figure 17: Dense Distortion Graph Sample. An example of an image pair with several regions resulting in a dense distortion graph. Left image is Anchor (purple nodes in graph), Right image is Target (green nodes in graph). Legend is presented, and 'rels' is short for relations. tion E. We will publicly release our code, trained models, and proposed dataset and benchmark to help further scientific research on comparat…
Original abstract

In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Distortion Graph (DG) as an inter-image extension of scene graphs for region-level comparative image assessment, encoding distortion type, severity, comparisons, and quality scores in a compact topology. It contributes PandaSet (region-level dataset), PandaBench (benchmark with varying difficulties), and the Panda architecture for DG generation. The central claims are that state-of-the-art MLLMs fail to understand region-level degradations even when given explicit region cues, and that training on PandaSet or DG prompting elicits better region-wise understanding, opening a direction for structured pairwise assessment over whole-image methods.

Significance. If the empirical results hold, the structured DG representation and associated benchmark could advance fine-grained, interpretable comparative assessment by moving beyond implicit region reliance in whole-image models. The introduction of a dedicated dataset and benchmark for region-level distortions, plus demonstration of MLLM limitations, provides concrete value for the community even if the topology-specific gains require further validation.

major comments (2)
  1. [Experimental evaluation / results] The central claim that inter-image DG topology yields a compact, interpretable, and superior representation requires an ablation that isolates the contribution of graph edges (encoding cross-image comparisons, shared distortions, and quality relations) and message passing from simply supplying explicit region masks or bounding boxes. The reported MLLM improvements with DG prompting do not rule out that gains arise from region granularity alone rather than the pairwise structure.
  2. [Abstract and dataset/benchmark description] No quantitative details on PandaSet (e.g., number of image pairs, region annotations per pair, distortion category distribution) or PandaBench difficulty tiers are referenced in the abstract or high-level claims, making it impossible to assess whether the benchmark genuinely stresses region-level understanding or merely restates known MLLM weaknesses on fine-grained tasks.
minor comments (2)
  1. [Method] Notation for graph nodes/edges (e.g., how distortion severity and quality scores are encoded as attributes) should be formalized with a diagram or table early in the method section to aid readability.
  2. [Related work] The manuscript would benefit from explicit comparison to prior scene-graph or region-graph works in related work to clarify the precise novelty of the inter-image extension.
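The statistics requested in major comment 2 (number of pairs, regions per pair, distortion distribution) are cheap to tally once annotations are loaded. The sketch below assumes a record layout loosely modeled on the JSON response schema quoted in the extracted snippets further down this page (`regions`, with `a` and `b` sides each carrying a `distortion`). PandaSet's actual file format is not described here, so the field names and the `summarize` helper are hypothetical, and the toy records are not real data.

```python
from collections import Counter


def summarize(pairs):
    """Tally pair count, mean regions per pair, and distortion label frequencies."""
    regions_per_pair = [len(p["regions"]) for p in pairs]
    distortions = Counter(
        region[side]["distortion"]
        for p in pairs
        for region in p["regions"]
        for side in ("a", "b")          # count both images of the pair
    )
    return {
        "pairs": len(pairs),
        "avg_regions_per_pair": sum(regions_per_pair) / len(pairs),
        "distortion_counts": dict(distortions),
    }


# Toy annotations for illustration only.
toy_pairs = [
    {"regions": [
        {"a": {"distortion": "blur"}, "b": {"distortion": "haze"}},
        {"a": {"distortion": "blur"}, "b": {"distortion": "clean"}},
    ]},
    {"regions": [
        {"a": {"distortion": "haze"}, "b": {"distortion": "haze"}},
    ]},
]
stats = summarize(toy_pairs)
```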

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We believe the suggested revisions will improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments.

Point-by-point responses
  1. Referee: [Experimental evaluation / results] The central claim that inter-image DG topology yields a compact, interpretable, and superior representation requires an ablation that isolates the contribution of graph edges (encoding cross-image comparisons, shared distortions, and quality relations) and message passing from simply supplying explicit region masks or bounding boxes. The reported MLLM improvements with DG prompting do not rule out that gains arise from region granularity alone rather than the pairwise structure.

    Authors: We agree that isolating the contribution of the graph topology is important for validating our central claim. Our current experiments demonstrate that MLLMs struggle with region-level degradations even when provided explicit region cues, and that DG prompting improves performance. However, to more rigorously separate the effects of region granularity from the pairwise graph structure (including edges for comparisons and message passing), we will include an additional ablation study in the revised version. This ablation will compare performance using: (i) explicit region masks without any graph structure, (ii) region information with pairwise distortion comparisons but without full graph topology, and (iii) the complete Distortion Graph. We believe this will confirm that the structured topology provides benefits beyond region cues alone. revision: yes

  2. Referee: [Abstract and dataset/benchmark description] No quantitative details on PandaSet (e.g., number of image pairs, region annotations per pair, distortion category distribution) or PandaBench difficulty tiers are referenced in the abstract or high-level claims, making it impossible to assess whether the benchmark genuinely stresses region-level understanding or merely restates known MLLM weaknesses on fine-grained tasks.

    Authors: We appreciate this observation and agree that quantitative details would enhance the abstract's informativeness. In the revised manuscript, we will update the abstract to include key statistics such as the total number of image pairs in PandaSet, the average number of region annotations per pair, the distribution across distortion categories, and a brief description of the difficulty tiers in PandaBench. This will allow readers to better evaluate the benchmark's scope and its ability to challenge region-level understanding in MLLMs. revision: yes
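The three ablation conditions promised in response 1 differ only in how much structure is serialized into the model's context, which can be sketched as a single prompt builder. This is an illustration under assumptions: the condition names, the `build_prompt` function, and the line format are hypothetical (the "Region: … | Box: …" prefix imitates the prompt fragment quoted in the extracted snippets on this page), bounding boxes stand in for region masks, and the authors' actual serialization of the graph is not given.

```python
CONDITIONS = ("regions_only", "comparisons", "full_graph")


def build_prompt(regions, condition):
    """Serialize region cues at one of three levels of structure.

    regions_only : names and boxes, no graph information
    comparisons  : adds the pairwise comparison label per region
    full_graph   : adds per-image attributes (distortion, severity, score)
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition!r}")
    lines = []
    for i, r in enumerate(regions, start=1):
        parts = [f"{i}. Region: {r['name']}", f"Box: {r['box']}"]
        if condition != "regions_only":
            parts.append(f"comparison: {r['comparison']}")
        if condition == "full_graph":
            parts.append(f"A: {r['a']}")
            parts.append(f"B: {r['b']}")
        lines.append(" | ".join(parts))
    return "\n".join(lines)


example = [{
    "name": "field",
    "box": [0.0, 290.0, 1023.0, 675.0],
    "comparison": "slightly-worse",
    "a": {"dist": "blur", "sev": "minor", "score": 0.78},
    "b": {"dist": "saturate-increase", "sev": "moderate", "score": 0.85},
}]
prompts = {c: build_prompt(example, c) for c in CONDITIONS}
```

Holding the region list fixed and varying only `condition` is what lets any accuracy difference be attributed to the added structure rather than to region granularity.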

Circularity Check

0 steps flagged

No significant circularity: new task and benchmark introduced without derivations or self-referential reductions

Full rationale

The paper introduces a novel Distortion Graph task by extending intra-image scene graphs to inter-image pairs, along with PandaSet dataset, PandaBench benchmark, and Panda architecture. No equations, mathematical derivations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on empirical results showing MLLM failures on region degradations and improvements via DG prompting, without any self-citation load-bearing the central premise or ansatz smuggled through prior work. This is a standard empirical introduction of new methodology and evaluation, self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The central claim rests on the domain assumption that region-level graph structures can compactly encode distortion type, severity, and quality for image pairs, plus several newly introduced entities whose validity is asserted without external evidence in the abstract.

axioms (1)
  • domain assumption: Intra-image scene graphs can be meaningfully extended to inter-image comparisons for capturing distortion information.
    The paper builds directly on the intra-image scene graph concept but assumes the inter-image extension preserves utility for pairwise degradation analysis.
invented entities (4)
  • Distortion Graph (DG) · no independent evidence
    purpose: Compact interpretable graph structure encoding distortion type, severity, comparison, and quality score for image region pairs.
    Newly proposed representation introduced to realize the task.
  • PandaSet · no independent evidence
    purpose: Region-level dataset for training and evaluating distortion graph generation.
    New dataset contributed to support the task.
  • PandaBench · no independent evidence
    purpose: Benchmark suite with varying region-level difficulty to test models on distortion graphs.
    New benchmark introduced to demonstrate challenges for existing models.
  • Panda · no independent evidence
    purpose: Efficient architecture for generating distortion graphs from image pairs.
    New model architecture proposed to solve the task.

pith-pipeline@v0.9.0 · 5506 in / 1363 out tokens · 27588 ms · 2026-05-10T15:13:32.102112+00:00 · methodology


Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages

  1. [1]

    Q-Insight: Understanding Image Quality via Visual Reinforcement Learning. arXiv e-prints, 2025, arXiv:2503.22679

    URL https://openreview.net/forum?id=skLtdUVaJa. Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Ro...

  2. [2]

    distortion

    to generate region masks. We do zero-shot inference with PANDA trained on PANDASET, and report performance on the ranking task using the predicted quality score or the comparative rela- Published as a conference paper at ICLR 2026 Method Ranking Accuracy ↑ Inference Time ↓ Q-Insight (Li et al., 2025) 0.6970 8 hours GPT-5 Mini (OpenAI, 2025) 0.8472 N/A PANDA (Z...

  3. [3]

    regions": [{

    Region: field | Box: [x1=0.0, y1=290.0, x2=1023.0, y2=675.0] ... N. Region: ... | Box: [x1...] System: You are an image quality assessor. Respond with **JSON only** matching: {"regions": [{"index": int>=1, "a": {"distortion": one of [dist-list], "severity": 0..3, "score": 0..1}, "b": {"distortion": same set, "severity": 0..3, "score": 0..1}, "comparison":...

  4. [4]

    The colors appear to be washed out, and there is a loss of detail in the person's face and clothing

    Person: The image has a moderate level of distortion, with pixelation and blurring. The colors appear to be washed out, and there is a loss of detail in the person's face and clothing. The quality is average

  5. [5]

    The colors are also somewhat muted

    Sheep: The image has significant distortion, with pixelation and blurring that make it difficult to discern fine details. The colors are also somewhat muted. The quality of this image is poor

  6. [6]

    anchor”: {“distortion

    Mountain: Image has minimal distortion, with a clear mountain landscape in the background. The quality is good with no noticeable distortion. Region Query No Comparative Assessment, No Score, & Wrong Degradation Identification Figure 12: Co-Instruct Behavior. An illustration of output from Co-Instruct (Wu et al., 2024c) when prompted for multiple instructio...

  7. [7]

    A: {dist: blur, sev: minor, score: 0.78}, B: {dist: saturate-increase, sev: moderate, score: 0.85}, comparison: slightly-worse}

  8. [8]

    Note that, in this example, the image has 16 regions

    A: {dist: contrast-decrease, sev: minor, score: 0.88}, B: {dist: contrast-increase, severity: minor, score: 0.92}, comparison: slightly-worse} Figure 14: Closed-Source MLLM Prompt/Output. A representative example of prompt type (b) along with output for all closed-source MLLMs evaluated in this work. Note that, in this example, the image has 16 regions. mance...

  9. [9]

    Both of these datasets have region-level segmentation maps, and scene information

    and (ii) Seagull-100w (Chen et al., 2024c). Both of these datasets have region-level segmentation maps, and scene information. In PSG, since it is an intersection of COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017), scene level relationships (or predicates) are provided. While Seagull-100w provides a short description of each region, we u...