Panoptic Pairwise Distortion Graph
Pith reviewed 2026-05-10 15:13 UTC · model grok-4.3
The pith
Representing image pairs as region-level distortion graphs captures fine-grained degradations more compactly than whole-image analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We extend the intra-image scene graph to an inter-image Distortion Graph that treats paired images as a region-grounded topology and packs dense degradation data such as distortion type, severity, comparison, and quality score into a compact structure. To support the task we release PandaSet, a region-level dataset; PandaBench, a benchmark with graded region difficulties; and the Panda architecture for generating the graphs. PandaBench exposes a clear limitation in state-of-the-art multimodal large language models: they fail to understand region-level degradations even when supplied explicit region cues, while training on PandaSet or prompting with the Distortion Graph improves region-wise distortion understanding.
What carries the argument
The Distortion Graph, an inter-image extension of scene graphs that organizes regions from paired images into a topology encoding distortion type, severity, comparison, and quality score.
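A minimal sketch of how such a graph might be held in memory, assuming one inter-image edge per shared region. The class names, field names, and region labels here are illustrative and not taken from the paper; the attribute values mirror the sample outputs quoted further down this page.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RegionState:
    """Degradation attributes of one region in one image (hypothetical fields)."""
    distortion: str  # e.g. "blur"
    severity: str    # e.g. "minor", "moderate"
    score: float     # quality score in [0, 1]


@dataclass
class RegionEdge:
    """Inter-image edge linking the same region in image A and image B."""
    region: str       # region label (hypothetical, e.g. "person")
    a: RegionState
    b: RegionState
    comparison: str   # e.g. "slightly-worse"


@dataclass
class DistortionGraph:
    edges: List[RegionEdge] = field(default_factory=list)

    def add(self, region: str, a: RegionState, b: RegionState, comparison: str) -> None:
        self.edges.append(RegionEdge(region, a, b, comparison))

    def regions_where_a_scores_lower(self) -> List[str]:
        """Regions whose quality score in image A is below that in image B."""
        return [e.region for e in self.edges if e.a.score < e.b.score]


graph = DistortionGraph()
graph.add("person",
          RegionState("blur", "minor", 0.78),
          RegionState("saturate-increase", "moderate", 0.85),
          "slightly-worse")
graph.add("mountain",
          RegionState("contrast-decrease", "minor", 0.88),
          RegionState("contrast-increase", "minor", 0.92),
          "slightly-worse")
print(graph.regions_where_a_scores_lower())
```

Keeping both per-image attributes and the comparison on a single edge is one possible layout; the paper may separate nodes and edges differently.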
If this is right
- Whole-image assessment methods can be replaced or augmented by region-structured graphs for greater compactness and interpretability.
- Multimodal models gain region-wise distortion understanding when trained on PandaSet or prompted with Distortion Graphs.
- PandaBench provides a graded test suite that exposes and measures limitations in current models on pairwise region tasks.
- The graph output supports downstream uses that require explicit, localized degradation information rather than global scores.
Where Pith is reading between the lines
- The same region-topology approach could be tested on video sequences to track how distortions evolve across frames.
- Integrating Distortion Graphs with existing object-detection pipelines might allow automatic localization of quality issues without extra labels.
- One could measure whether models trained on these graphs transfer better to related tasks such as image editing or restoration targeting specific regions.
Load-bearing premise
That extending scene graphs to inter-image region topologies will yield a more compact and interpretable representation for comparative assessment than whole-image methods.
What would settle it
State-of-the-art multimodal models reaching high accuracy on PandaBench without region-level training or graph prompting, or human raters finding whole-image methods equally interpretable for degradation details.
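The first test could be tallied roughly as follows. This is a sketch under assumptions: the tier names ("easy", "hard"), the label vocabulary, and the exact-match scoring rule are placeholders, since the page does not state how PandaBench scores predictions.

```python
from collections import defaultdict


def accuracy_by_tier(records):
    """Exact-match accuracy per difficulty tier.

    records: iterable of (tier, predicted_label, true_label) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tier, predicted, true in records:
        totals[tier] += 1
        hits[tier] += int(predicted == true)
    return {tier: hits[tier] / totals[tier] for tier in totals}


# Hypothetical per-region predictions from a model with no region-level training.
records = [
    ("easy", "blur", "blur"),
    ("easy", "noise", "noise"),
    ("hard", "blur", "noise"),
    ("hard", "blur", "blur"),
]
result = accuracy_by_tier(records)
print(result)
```

Reporting accuracy per tier rather than as one pooled number would show whether a model only succeeds on the easier regions.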
Original abstract
In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Distortion Graph (DG) as an inter-image extension of scene graphs for region-level comparative image assessment, encoding distortion type, severity, comparisons, and quality scores in a compact topology. It contributes PandaSet (region-level dataset), PandaBench (benchmark with varying difficulties), and the Panda architecture for DG generation. The central claims are that state-of-the-art MLLMs fail to understand region-level degradations even when given explicit region cues, and that training on PandaSet or DG prompting elicits better region-wise understanding, opening a direction for structured pairwise assessment over whole-image methods.
Significance. If the empirical results hold, the structured DG representation and associated benchmark could advance fine-grained, interpretable comparative assessment by moving beyond implicit region reliance in whole-image models. The introduction of a dedicated dataset and benchmark for region-level distortions, plus demonstration of MLLM limitations, provides concrete value for the community even if the topology-specific gains require further validation.
major comments (2)
- [Experimental evaluation / results] The central claim that inter-image DG topology yields a compact, interpretable, and superior representation requires an ablation that isolates the contribution of graph edges (encoding cross-image comparisons, shared distortions, and quality relations) and message passing from simply supplying explicit region masks or bounding boxes. The reported MLLM improvements with DG prompting do not rule out that gains arise from region granularity alone rather than the pairwise structure.
- [Abstract and dataset/benchmark description] No quantitative details on PandaSet (e.g., number of image pairs, region annotations per pair, distortion category distribution) or PandaBench difficulty tiers are referenced in the abstract or high-level claims, making it impossible to assess whether the benchmark genuinely stresses region-level understanding or merely restates known MLLM weaknesses on fine-grained tasks.
minor comments (2)
- [Method] Notation for graph nodes/edges (e.g., how distortion severity and quality scores are encoded as attributes) should be formalized with a diagram or table early in the method section to aid readability.
- [Related work] The manuscript would benefit from explicit comparison to prior scene-graph or region-graph works in related work to clarify the precise novelty of the inter-image extension.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We believe the suggested revisions will improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments.
- Referee: [Experimental evaluation / results] The central claim that inter-image DG topology yields a compact, interpretable, and superior representation requires an ablation that isolates the contribution of graph edges (encoding cross-image comparisons, shared distortions, and quality relations) and message passing from simply supplying explicit region masks or bounding boxes. The reported MLLM improvements with DG prompting do not rule out that gains arise from region granularity alone rather than the pairwise structure.
  Authors: We agree that isolating the contribution of the graph topology is important for validating our central claim. Our current experiments demonstrate that MLLMs struggle with region-level degradations even when provided explicit region cues, and that DG prompting improves performance. However, to more rigorously separate the effects of region granularity from the pairwise graph structure (including edges for comparisons and message passing), we will include an additional ablation study in the revised version. This ablation will compare performance using: (i) explicit region masks without any graph structure, (ii) region information with pairwise distortion comparisons but without full graph topology, and (iii) the complete Distortion Graph. We believe this will confirm that the structured topology provides benefits beyond region cues alone.
  Revision: yes
- Referee: [Abstract and dataset/benchmark description] No quantitative details on PandaSet (e.g., number of image pairs, region annotations per pair, distortion category distribution) or PandaBench difficulty tiers are referenced in the abstract or high-level claims, making it impossible to assess whether the benchmark genuinely stresses region-level understanding or merely restates known MLLM weaknesses on fine-grained tasks.
  Authors: We appreciate this observation and agree that quantitative details would enhance the abstract's informativeness. In the revised manuscript, we will update the abstract to include key statistics such as the total number of image pairs in PandaSet, the average number of region annotations per pair, the distribution across distortion categories, and a brief description of the difficulty tiers in PandaBench. This will allow readers to better evaluate the benchmark's scope and its ability to challenge region-level understanding in MLLMs.
  Revision: yes
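The three ablation conditions proposed in the first response could be rendered as prompt variants along these lines. This is only a sketch: the condition names, the dictionary keys, and the decision about which fields each condition exposes are assumptions made for illustration, and the line layout loosely follows the "Region: ... | Box: ..." prompt excerpt quoted later on this page.

```python
CONDITIONS = ("regions_only", "pairwise", "full_graph")


def build_prompt(regions, condition):
    """Render region information for one hypothetical ablation condition.

    regions: list of dicts with keys "name", "box", "a", "b", "comparison".
    regions_only -> (i) region cues without any graph structure
    pairwise     -> (ii) region cues plus the pairwise comparison label
    full_graph   -> (iii) comparison plus per-image attributes for A and B
    """
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    lines = []
    for index, region in enumerate(regions, start=1):
        line = f"{index}. Region: {region['name']} | Box: {region['box']}"
        if condition in ("pairwise", "full_graph"):
            line += f" | comparison: {region['comparison']}"
        if condition == "full_graph":
            line += f" | A: {region['a']} | B: {region['b']}"
        lines.append(line)
    return "\n".join(lines)


regions = [{
    "name": "field",
    "box": [0.0, 290.0, 1023.0, 675.0],
    "a": {"dist": "blur", "sev": "minor", "score": 0.78},
    "b": {"dist": "saturate-increase", "sev": "moderate", "score": 0.85},
    "comparison": "slightly-worse",
}]
for condition in CONDITIONS:
    print(condition, "->", build_prompt(regions, condition))
```

Holding the region list fixed and varying only the rendered fields is what lets a performance gap between conditions be attributed to the added structure rather than to region granularity.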
Circularity Check
No significant circularity: new task and benchmark introduced without derivations or self-referential reductions
The paper introduces a novel Distortion Graph task by extending intra-image scene graphs to inter-image pairs, along with PandaSet dataset, PandaBench benchmark, and Panda architecture. No equations, mathematical derivations, fitted parameters, or predictions are present that could reduce to inputs by construction. Claims rest on empirical results showing MLLM failures on region degradations and improvements via DG prompting, without any self-citation load-bearing the central premise or ansatz smuggled through prior work. This is a standard empirical introduction of new methodology and evaluation, self-contained against external benchmarks rather than tautological.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Intra-image scene graphs can be meaningfully extended to inter-image comparisons for capturing distortion information.
invented entities (4)
- Distortion Graph (DG): no independent evidence
- PandaSet: no independent evidence
- PandaBench: no independent evidence
- Panda: no independent evidence
Reference graph
Works this paper leans on
- [1] URL https://openreview.net/forum?id=skLtdUVaJa. Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Ro...
- [2] to generate region masks. We do zero-shot inference with PANDA trained on PANDASET, and report performance on the ranking task using the predicted quality score or the comparative rela- (Published as a conference paper at ICLR 2026.) Method | Ranking Accuracy↑ | Inference Time↓; Q-Insight (Li et al., 2025) | 0.6970 | 8 hours; GPT-5 Mini (OpenAI, 2025) | 0.8472 | N/A; PANDA (Z...
- [3] Region: field | Box: [x1=0.0, y1=290.0, x2=1023.0, y2=675.0] ... N. Region: ... | Box: [x1...] System: You are an image quality assessor. Respond with **JSON only** matching: {"regions": [{"index": int>=1, "a": {"distortion": one of [dist-list], "severity": 0..3, "score": 0..1}, "b": {"distortion": same set, "severity": 0..3, "score": 0..1}, "comparison":...
- [4] Person: The image has a moderate level of distortion, with pixelation and blurring. The colors appear to be washed out, and there is a loss of detail in the person's face and clothing. The quality is average
- [5] Sheep: The image has significant distortion, with pixelation and blurring that make it difficult to discern fine details. The colors are also somewhat muted. The quality of this image is poor
- [6] Mountain: Image has minimal distortion, with a clear mountain landscape in the background. The quality is good with no noticeable distortion. Region Query: No Comparative Assessment, No Score, & Wrong Degradation Identification. Figure 12: Co-Instruct Behavior. An illustration of output from Co-Instruct (Wu et al., 2024c) when prompted for multiple instructio...
- [7] A: {dist: blur, sev: minor, score: 0.78}, B: {dist: saturate-increase, sev: moderate, score: 0.85}, comparison: slightly-worse}
- [8] A: {dist: contrast-decrease, sev: minor, score: 0.88}, B: {dist: contrast-increase, severity: minor, score: 0.92}, comparison: slightly-worse} Figure 14: Closed-Source MLLM Prompt/Output. A representative example of prompt type (b) along with output for all closed-source MLLMs evaluated in this work. Note that, in this example, the image has 16 regions. mance...
- [9] and (ii) Seagull-100w (Chen et al., 2024c). Both of these datasets have region-level segmentation maps, and scene information. In PSG, since it is an intersection of COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2017), scene level relationships (or predicates) are provided. While Seagull-100w provides a short description of each region, we u...