U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

Angeliki Dimitriou; Giorgos Filandrianos; Giorgos Stamou; Maria Lymperaiou; Nikolaos Chaidos

arxiv: 2604.08295 · v3 · pith:WJZ6UMSBnew · submitted 2026-04-09 · 💻 cs.AI · cs.CV


Pith reviewed 2026-05-22 10:31 UTC · model grok-4.3

classification 💻 cs.AI cs.CV

keywords conceptual counterfactual explanations · multi-resolution frameworks · graph neural networks · graph autoencoders · explainable AI · concept-based explanations · structural graphs · counterfactual generation


The pith

U-CECE delivers conceptual counterfactual explanations at three adjustable levels of detail from atomic sets to full graphs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces U-CECE as a single framework that generates explanations for AI decisions by representing concepts at different resolutions depending on the available data and compute. Atomic concepts give quick broad explanations, relational sets-of-sets handle basic interactions, and structural graphs capture complete semantic relationships. This addresses the prior trade-off where fast methods lost context while detailed graph methods required solving expensive graph edit distance problems. The framework supports both supervised graph neural networks for accuracy and unsupervised autoencoders for scale at the graph level. Experiments on image datasets show the practical balance and confirm that the resulting explanations match human and model judgments of semantic quality.

Core claim

U-CECE is a model-agnostic framework that unifies conceptual counterfactual explanations across three expressivity levels: atomic concepts for broad views, relational sets-of-sets for simple interactions, and structural graphs for full semantics. At the structural level it offers a transductive mode with supervised graph neural networks for precision and an inductive mode with unsupervised graph autoencoders for scalability, both approximating the results of exact graph edit distance. Tests on the CUB and Visual Genome datasets map the efficiency-expressivity trade-off, while human surveys and large vision-language model evaluations indicate that the structural counterfactuals are semantically equivalent to exact GED-based ground-truth explanations.

What carries the argument

The three-level expressivity hierarchy from atomic concepts through relational sets-of-sets to structural graphs, with graph neural networks and graph autoencoders handling the graph level.
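
To make the three levels concrete, here is a minimal Python sketch of how one scene could be viewed at each resolution. The concept names follow the Figure 1 vocabulary; the function names, the toy scenes, and the encoding of the relational level (each object as a set holding its concept plus its outgoing relation-target pairs) are assumptions for illustration, not the paper's implementation.

```python
# Toy scenes: node id -> concept, plus (source id, relation, target id) edges.
# Both scenes and all names below are hypothetical illustrations.
SCENE_A = ({1: "Cat", 2: "Keyboard", 3: "Table"}, {(1, "on", 3), (2, "on", 3)})
SCENE_B = ({1: "Cat", 2: "Keyboard", 3: "Table"}, {(1, "on", 2), (2, "on", 3)})


def atomic_view(objects, edges):
    """Atomic level: a flat set of concepts; relations are ignored."""
    return frozenset(objects.values())


def relational_view(objects, edges):
    """Relational level (assumed encoding): one set per object, holding its
    concept and the (relation, target concept) pairs it is the source of."""
    per_object = {oid: {concept} for oid, concept in objects.items()}
    for src, rel, dst in edges:
        per_object[src].add((rel, objects[dst]))
    return frozenset(frozenset(items) for items in per_object.values())


def structural_view(objects, edges):
    """Structural level: the full labeled graph, node identities included."""
    return dict(objects), frozenset(edges)


same_atomic = atomic_view(*SCENE_A) == atomic_view(*SCENE_B)
same_relational = relational_view(*SCENE_A) == relational_view(*SCENE_B)
print(same_atomic, same_relational)
```

In this toy case the two scenes are indistinguishable at the atomic level (both contain a cat, a keyboard, and a table) but differ once relations are kept, which is the kind of context the coarser level gives up in exchange for speed.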

If this is right

Explanations can be produced quickly with atomic concepts when compute is limited.
Relational sets-of-sets capture interactions without full graph computation.
Structural graphs yield explanations that align with human semantic judgments.
Both supervised and unsupervised neural modes make graph-level explanations practical across data regimes.
The framework adapts to different datasets without requiring exact graph edit distance at every step.
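
As a rough illustration of the cheapest case in the list above, counterfactual retrieval over atomic concepts can be sketched as a nearest-neighbor search among instances with a different label. The unit-cost set distance, the function names, and the toy data are assumptions made for this sketch; the paper's actual cost model may weight concepts differently.

```python
def set_edit_distance(a, b):
    """Unit-cost edits between two concept sets: one per added or removed concept."""
    return len(a ^ b)


def retrieve_counterfactual(query_concepts, query_label, candidates):
    """Return the closest candidate whose label differs from the query's.

    candidates is a list of (name, concept set, label) tuples.
    Returns None when no candidate has a different label.
    """
    pool = [c for c in candidates if c[2] != query_label]
    if not pool:
        return None
    return min(pool, key=lambda c: set_edit_distance(query_concepts, c[1]))


# Hypothetical toy data, not drawn from CUB or Visual Genome.
query = frozenset({"Cat", "Keyboard", "Table"})
candidates = [
    ("img_1", frozenset({"Cat", "Sofa", "Table"}), "living room"),
    ("img_2", frozenset({"Dog", "Grass", "Ball"}), "park"),
    ("img_3", frozenset({"Computer", "Keyboard", "Table"}), "office"),
]
name, concepts, label = retrieve_counterfactual(query, "office", candidates)
to_remove = query - concepts  # concepts to delete from the query
to_add = concepts - query     # concepts to insert into the query
print(name, sorted(to_remove), sorted(to_add))
```

The explanation is then the edit itself: which concepts would have to be removed and added for the query to resemble the nearest instance of another class.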

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hierarchy could be tested on non-image data such as text or sensor streams to see whether the efficiency gains hold.
Combining the levels might let systems start with atomic explanations and refine only when a user requests more detail.
Wider adoption could shorten the time between model deployment and user debugging of errors in production settings.

Load-bearing premise

The neural approximations at the structural level produce counterfactuals that remain semantically equivalent to exact graph edit distance solutions as judged by humans and large vision-language models.

What would settle it

A controlled comparison on a new dataset in which human raters or vision-language models consistently judge the GNN- or GAE-generated structural counterfactuals as less faithful than those from exact graph edit distance.

Figures

Figures reproduced from arXiv: 2604.08295 by Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou, Maria Lymperaiou, Nikolaos Chaidos.

**Figure 1.** The U-CECE Framework for Multi-Resolution Conceptual Counterfactuals. The pipeline begins with Concept Abstraction, mapping raw inputs to a symbolic vocabulary (Computer(A), Keyboard(B), Table(C), Cat(D)) and their relations (on(r)). Segmentation visualizations are for clarity, not to imply a specific abstraction pipeline. The central Expressivity Level decision directs queries through three increasingly…

**Figure 2.** Retrieval performance comparison of underlying GNN components of the U-CECE-Structural…

**Figure 3.** U-CECE expressivity tiers yielding the same counterfactual for different datasets (…

**Figure 4.** Qualitative comparison of diverging counterfactuals retrieved by U-CECE expressivity tiers for…

**Figure 5.** Qualitative comparison of diverging counterfactuals retrieved by U-CECE expressivity tiers for…

**Figure 6.** Examples of visual form layouts showing two triplets of images used in the experiments. Each…

**Figure 7.** Human evaluation results (signal dataset). The top row displays the aggregate consensus for each experiment, while the bottom row details the distribution for each individual question. … nuanced visual patterns that exceed the capacity of rigid, discrete graph edits. Ultimately, this proves that for explanations requiring perceived semantic similarity, continuous neural representations are inherently better …

**Figure 8.** Distribution of GED across the Sem-Eq-x human perception surveys. Point markers represent the mean GED from the source image for GT and GCN-retrieved counterfactuals, categorized by participant consensus: Yes (Signal), No (Signal), and Noise. Error bars denote standard deviation. The area of each marker is proportional to the total volume of user annotations in that specific category. The relationship be…

**Figure 9.** Results of LVLM-as-Judge. Top row for single step evaluation, second row for the analyze-then…

**Figure 10.** Human vs LVLM responses agreement confusion matrices directly corresponding to Fig. 9…

**Figure 11.** An example of the Pairs experimental layout. The Source Image is removed to test unanchored semantic equivalence. "Sem-Ed-Triplets" Initial Instruction: "The first image is the source, while the second and the third are proposed counterfactual images of the source. Are the proposed counterfactuals (second and third image) semantically equivalent? Focus on whether the counterfactuals are of the same sp…

Original abstract

As AI models grow more complex, explainability is essential for building trust, yet concept-based counterfactual methods still face a trade-off between expressivity and efficiency. Representing underlying concepts as atomic sets is fast but misses relational context, whereas full graph representations are more faithful but require solving the NP-hard Graph Edit Distance (GED) problem. We propose U-CECE, a unified, model-agnostic multi-resolution framework for conceptual counterfactual explanations that adapts to data regime and compute budget. U-CECE spans three levels of expressivity: atomic concepts for broad explanations, relational sets-of-sets for simple interactions, and structural graphs for full semantic structure. At the structural level, both a precision-oriented transductive mode based on supervised Graph Neural Networks (GNNs) and a scalable inductive mode based on unsupervised graph autoencoders (GAEs) are supported. Experiments on the structurally divergent CUB and Visual Genome datasets characterize the efficiency-expressivity trade-off across levels, while human surveys and LVLM-based evaluation show that the retrieved structural counterfactuals are semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations.
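
To show why exact GED becomes the bottleneck, the following is a brute-force sketch for tiny labeled graphs that tries every node mapping. It is not the paper's solver: the unit-cost model, the function name `exact_ged`, and the toy graphs are assumptions chosen to keep the example short.

```python
from itertools import permutations


def exact_ged(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force graph edit distance under unit costs.

    nodes_*: list of node labels, indexed by node id.
    edges_*: set of (source id, relation, target id) triples.
    Cost model (an assumption): 1 per node insertion, deletion, or relabel,
    and 1 per edge insertion or deletion (an edge relabel therefore costs 2).
    """
    n = max(len(nodes_a), len(nodes_b))
    # Pad the smaller graph with None so every node has a partner;
    # pairing a real node with None models an insertion or deletion.
    a = list(nodes_a) + [None] * (n - len(nodes_a))
    b = list(nodes_b) + [None] * (n - len(nodes_b))
    best = None
    for perm in permutations(range(n)):  # n! candidate node mappings
        node_cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        mapped = {(perm[s], r, perm[d]) for s, r, d in edges_a}
        edge_cost = len(mapped ^ set(edges_b))
        cost = node_cost + edge_cost
        if best is None or cost < best:
            best = cost
    return best


# Hypothetical toy graphs: the cat moves from the table onto the keyboard.
labels = ["Cat", "Keyboard", "Table"]
graph_a_edges = {(0, "on", 2), (1, "on", 2)}
graph_b_edges = {(0, "on", 1), (1, "on", 2)}
distance = exact_ged(labels, graph_a_edges, labels, graph_b_edges)
print(distance)
```

The loop visits n! mappings, so this approach stops being usable after roughly a dozen nodes. That blow-up is what motivates swapping the exact computation for learned approximations at the structural level.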

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

U-CECE layers atomic, relational, and structural counterfactuals with GNN/GAE approximations to GED, but the equivalence claim rests on preference scores rather than direct error metrics.


Hi, the main point on this paper is a three-level framework for conceptual counterfactuals that starts with simple atomic sets, moves to relational sets-of-sets, and tops out at graph structures where it swaps in either supervised GNNs or unsupervised GAEs to avoid exact GED computation. The authors position it as model-agnostic and adaptable to compute budget or data regime. That explicit hierarchy with the dual structural modes is the clearest new synthesis here; most prior concept or graph-edit work tends to lock into one regime rather than offering the choice inside one setup.

They run the levels on CUB and Visual Genome to map the efficiency-expressivity curve and back the structural outputs with human surveys plus LVLM preference checks, which is a reasonable way to show the approximations feel right to users and models.

The soft spot is the validation step for those structural approximations. The abstract claims semantic equivalence to exact GED ground truth, yet the reported evidence is preference and equivalence judgments rather than any quantitative check such as mean edit-distance deviation, graph-edit similarity, or faithfulness to the true minimal edits. Since the GNN and GAE are necessarily approximations to an NP-hard problem, some direct error characterization would make the equivalence claim more convincing and less dependent on the two chosen datasets.

This is the kind of work that fits explainable AI groups that build tools for vision or relational data and want tunable fidelity instead of a single fixed method. A reader who needs practical options for counterfactual generation under resource constraints would get something usable from it. The framework is coherent enough on its own terms to deserve a serious referee even if the quantitative side needs tightening. I would send it out for peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces U-CECE, a model-agnostic multi-resolution framework for conceptual counterfactual explanations. It defines three expressivity levels (atomic concepts, relational sets-of-sets, and structural graphs) and, at the structural level, supports a transductive supervised GNN mode for precision and an inductive unsupervised GAE mode for scalability as approximations to the NP-hard Graph Edit Distance problem. Experiments on the CUB and Visual Genome datasets, together with human surveys and LVLM evaluations, are presented to characterize the efficiency-expressivity trade-off and to claim that the structural counterfactuals are semantically equivalent to exact GED ground truth.

Significance. If the central claims hold, U-CECE would offer a practical, adaptable approach to balancing expressivity and computational cost in counterfactual explanations for vision and scene-graph tasks. The explicit multi-resolution hierarchy and dual transductive/inductive modes at the structural level represent a clear engineering contribution. The use of external human and LVLM preference data provides some evidence of utility, but the absence of direct quantitative checks on approximation quality to exact GED weakens the verifiability of the semantic-equivalence assertion.

major comments (2)

[Abstract] Abstract: the claim that structural counterfactuals are 'semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations' rests only on human surveys and LVLM preference results. No quantitative metric (mean edit-distance deviation, graph-edit similarity, or faithfulness score) comparing GNN/GAE outputs to true minimal GED edits is reported, even though the models are necessarily approximations to an NP-hard problem; this omission is load-bearing for the efficiency-expressivity claims.
[Experiments] Experiments section: the manuscript supplies no ablation details, error bounds, or direct distance-to-exact comparisons for the GNN and GAE approximations on the test sets. Without such checks it remains unclear whether observed human/LVLM preferences reflect general semantic equivalence or are artifacts of the two chosen datasets.

minor comments (1)

[Abstract] The abstract refers to 'structurally divergent CUB and Visual Genome datasets' without specifying the structural differences or how they influence the observed trade-offs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of verifiability for our approximation claims. We respond to each major comment below and will incorporate clarifications and additional analyses in the revised manuscript.


Referee: [Abstract] Abstract: the claim that structural counterfactuals are 'semantically equivalent to, and often preferred over, exact GED-based ground-truth explanations' rests only on human surveys and LVLM preference results. No quantitative metric (mean edit-distance deviation, graph-edit similarity, or faithfulness score) comparing GNN/GAE outputs to true minimal GED edits is reported, even though the models are necessarily approximations to an NP-hard problem; this omission is load-bearing for the efficiency-expressivity claims.

Authors: We agree that direct quantitative metrics comparing GNN/GAE outputs to exact minimal GED edits would strengthen verifiability. Exact GED is NP-hard and intractable at the scale of Visual Genome graphs, which is why we positioned the GNN and GAE modes as practical approximations. The human surveys and LVLM evaluations were chosen as proxies for semantic equivalence because they directly assess whether the resulting explanations convey comparable meaning to users and models. We will revise the abstract to qualify the claim as semantic equivalence under these evaluations rather than exact edit-distance equivalence, and we will add a brief discussion of computational intractability in the experiments section. These changes will be made in the next version.
Revision: yes

Referee: [Experiments] Experiments section: the manuscript supplies no ablation details, error bounds, or direct distance-to-exact comparisons for the GNN and GAE approximations on the test sets. Without such checks it remains unclear whether observed human/LVLM preferences reflect general semantic equivalence or are artifacts of the two chosen datasets.

Authors: We acknowledge the absence of explicit ablations, error bounds, and direct distance-to-exact comparisons in the current experiments section. We will expand the section to include ablation studies on the GNN and GAE components, training error bounds, and, on smaller graph subsets where exact GED remains tractable, direct quantitative comparisons of approximation quality. These additions will help demonstrate that the reported human and LVLM preferences are not artifacts of the specific CUB and Visual Genome datasets. The revisions will be incorporated in the updated manuscript.
Revision: yes
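
One way the distance-to-exact comparison discussed in this exchange could be reported is sketched below. It assumes that, for each query, the GED to the exact optimal counterfactual and the GED to the neurally retrieved counterfactual have both been computed already. The function name, the metric names, and the numbers are placeholders for illustration only; they are not results from the paper.

```python
def approximation_report(exact_geds, retrieved_geds):
    """Summarize how far retrieved counterfactuals sit from the exact optimum.

    exact_geds[i]: GED from query i to its exact GED-optimal counterfactual.
    retrieved_geds[i]: GED from query i to the counterfactual actually retrieved.
    """
    if not exact_geds or len(exact_geds) != len(retrieved_geds):
        raise ValueError("need two non-empty lists of equal length")
    gaps = [r - e for e, r in zip(exact_geds, retrieved_geds)]
    if min(gaps) < 0:
        # The exact value is a minimum over the candidate pool.
        raise ValueError("a retrieved counterfactual cannot beat the exact minimum")
    return {
        "mean_ged_gap": sum(gaps) / len(gaps),
        "exact_match_rate": sum(g == 0 for g in gaps) / len(gaps),
        "worst_gap": max(gaps),
    }


# Placeholder values for four hypothetical queries.
report = approximation_report([2, 3, 4, 5], [2, 4, 4, 7])
print(report)
```

A mean gap near zero together with a high exact-match rate would support the equivalence claim directly, independent of the human and LVLM judgments.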

Circularity Check

0 steps flagged

No circularity detected; framework is an engineering synthesis with external dataset validation


The paper introduces U-CECE as a multi-resolution framework spanning atomic, relational, and structural levels, with GNN transductive and GAE inductive modes at the structural level. No equations, derivations, or self-referential definitions are present that reduce any claimed performance or equivalence to fitted parameters or prior self-citations by construction. The central claim of semantic equivalence to GED ground truth is positioned as validated via human surveys and LVLM evaluation on external datasets (CUB, Visual Genome), rendering the derivation self-contained against independent benchmarks rather than internally forced. This matches the default expectation for non-circular engineering papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard assumptions about GNN and GAE representational power plus the claim that human and LVLM judgments can serve as proxies for semantic equivalence to exact GED.

axioms (1)

Domain assumption: Graph Neural Networks and Graph Autoencoders can approximate structural differences sufficiently well for counterfactual generation.
Invoked when moving from exact GED to the transductive and inductive structural modes.
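
In practice this assumption amounts to replacing graph edit distance with distance in a learned embedding space. A rough sketch of that retrieval step follows, assuming graph embeddings are already available. The vectors are hand-written placeholders standing in for GNN or GAE encoder output, and Euclidean distance is an illustrative choice that the paper may not use.

```python
import math


def retrieve_by_embedding(query_vec, query_label, candidates):
    """Return (name, distance) of the nearest candidate with a different label.

    candidates is a list of (name, embedding, label) tuples, where each
    embedding is assumed to come from a trained graph encoder.
    Returns None when no candidate has a different label.
    """
    pool = [c for c in candidates if c[2] != query_label]
    if not pool:
        return None
    name, vec, _ = min(pool, key=lambda c: math.dist(query_vec, c[1]))
    return name, math.dist(query_vec, vec)


# Placeholder embeddings, not produced by any real model.
graph_candidates = [
    ("g1", (0.1, 0.0), "A"),  # closest overall, but same label as the query
    ("g2", (3.0, 4.0), "B"),
    ("g3", (1.0, 0.0), "B"),
]
result = retrieve_by_embedding((0.0, 0.0), "A", graph_candidates)
print(result)
```

Each lookup costs one distance computation per candidate rather than a combinatorial search, which is where the efficiency gain comes from. Whether nearness in embedding space tracks nearness in edit distance is exactly what the axiom takes for granted.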

invented entities (1)

U-CECE multi-resolution hierarchy (no independent evidence)
purpose: To unify atomic, relational, and structural conceptual counterfactuals under one adaptive framework
Newly introduced construct whose utility is demonstrated through experiments rather than derived from prior theory.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "U-CECE spans three levels of expressivity: atomic concepts... relational sets-of-sets... structural graphs... supervised Graph Neural Networks (GNNs) and... graph autoencoders (GAEs)"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "Graph Edit Distance (GED) problem... NP-hard... approximated via GNN embeddings"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
