PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

Sooho Moon; Yunyong Ko

arxiv: 2606.08926 · v1 · pith:4OLGJHJQnew · submitted 2026-06-08 · 💻 cs.LG

PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

Sooho Moon , Yunyong Ko This is my paper

Pith reviewed 2026-06-27 17:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords knowledge graph completionmodel evaluationinteractive systempredictive sharpnesspopularity biasevaluation landscapesGUI toolkit

0 comments

The pith

PROBE-Web lets users adjust two perspectives to evaluate knowledge graph completion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Knowledge graph completion models are typically judged by fixed rank metrics such as MRR and Hits@K, yet users often need evaluations tailored to their goals. The paper introduces PROBE-Web, a web system that lets users vary predictive sharpness to emphasize confident correct predictions and popularity-bias robustness to test performance on infrequent entities. Through a GUI, the system supports conventional metric calculations, perspective-aware scoring, explainable case breakdowns, and visual landscape exploration across multiple models. A sympathetic reader would care because the approach replaces one-size-fits-all rankings with evaluations that can match specific objectives.

Core claim

PROBE-Web provides an interactive GUI for probing evaluation landscapes of KGC models by letting users adjust predictive sharpness and popularity-bias robustness, delivering four functionalities: a conventional evaluation toolkit, flexible perspective-aware evaluation, explainable case studies, and evaluation landscape exploration to help users understand model strengths and weaknesses in line with their objectives.

What carries the argument

Two adjustable perspectives—predictive sharpness and popularity-bias robustness—implemented inside a user-friendly GUI that supports simultaneous model comparison and landscape navigation.

If this is right

Users can run and compare multiple KGC models under customized evaluation settings instead of fixed metrics.
Explainable case studies make it possible to see why a model performs well or poorly on particular triples.
Landscape exploration lets users identify regions where one model outperforms others visually.
Evaluations become aligned with individual objectives rather than a single universal ranking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same GUI pattern could support additional perspectives such as temporal robustness or domain-specific bias.
Researchers might use the system as a shared platform to benchmark new KGC models under varied criteria.
Similar adjustable-evaluation interfaces could be built for related tasks like link prediction in other graph domains.

Load-bearing premise

The two chosen perspectives capture the main evaluation needs of users and the GUI surfaces differences that are meaningful for decision making.

What would settle it

A user study in which participants report that varying the two perspective sliders produces no change in model rankings or insights beyond what standard MRR and Hits@K already show.

Figures

Figures reproduced from arXiv: 2606.08926 by Sooho Moon, Yunyong Ko.

**Figure 2.** Figure 2: RT and RA functions of PROBE. Users can flexibly evaluate KGC models from diverse evaluation perspectives by adjusting 𝛼 and 𝛽. models (i.e., rank scores) and the KG information (i.e., popularity statistics) in a predefined format. Specifically, the prediction results contain the ranked candidate entities generated by KGC models for each test query. Then, PROBEWeb automatically analyze the prediction resul… view at source ↗

**Figure 3.** Figure 3: Overview of PROBEWeb that provides the four key functionalities: (F1) Conventional Evaluation Toolkit, (F2) Flexible Perspective-Aware Evaluation, (F3) Explainable Evaluation via Case Studies, and (F4) Evaluation Landscape Exploration [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗

**Figure 4.** Figure 4: (F2) Flexible perspective-aware evaluation. By inter [PITH_FULL_IMAGE:figures/full_fig_p003_4.png] view at source ↗

**Figure 5.** Figure 5: Evaluation landscape exploration. PROBEWeb visualizes the performance of KGC models in a 3-D evaluation landscape defined by predictive sharpness (𝛼) and popularity-bias robustness (𝛽). By comparing model surfaces, users can identify the relative strengths, weaknesses, and performance trends of KGC models under diverse evaluation perspectives at a glance. 2nd 5th 4th 6th 4th 6th +422 +466 +750 +418 +191 3r… view at source ↗

**Figure 6.** Figure 6: (F3) Explainable evaluation via case studies. [PITH_FULL_IMAGE:figures/full_fig_p004_6.png] view at source ↗

read the original abstract

Knowledge graph completion (KGC) models are commonly evaluated using rank-based metrics such as MRR and Hits@K, despite different users often requiring different evaluation perspectives. In this demo, we present PROBE-Web, an interactive system for probing diverse evaluation landscapes for KGC models. PROBE-Web enables users to flexibly evaluate KGC models by adjusting two critical perspectives: (P1) predictive sharpness and (P2) popularity-bias robustness. Through a user-friendly GUI, users easily evaluate multiple KGC models and analyze their strengths and weaknesses. PROBE-Web provides four key functionalities: (1) conventional evaluation toolkit, (2) flexible perspective-aware evaluation, (3) explainable case studies, and (4) evaluation landscape exploration. We believe that PROBE-Web can help users better understand KGC models aligning with their objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a demo paper for a GUI tool that packages two evaluation axes for KGC models into an interface, without new methods or evidence that the axes matter.

read the letter

The paper describes PROBE-Web, a web-based interactive system for evaluating knowledge graph completion models. It lets users adjust two perspectives—predictive sharpness and popularity-bias robustness—through a GUI, alongside standard rank metrics, case studies, and landscape views.

The system description covers four functionalities that sound straightforward to implement and use: a conventional toolkit, perspective-aware scoring, explainable examples, and visual exploration. If the interface works as claimed, it could save time for someone who wants to compare models without writing separate scripts for each view.

The soft spot is the unsupported claim that these two perspectives are the critical ones. The abstract asserts this without citing prior work on user needs, without a survey of KGC practitioners, and without any comparison to other axes such as calibration or long-tail handling. No implementation details, no validation that the sliders produce reliable differences, and no user testing appear in the provided text. The central claim therefore rests on assertion rather than evidence.

This kind of paper is for KGC researchers or engineers who might want a ready-made interface for quick explorations. It does not advance evaluation methodology or report new findings, so it is unlikely to be cited or discussed in a reading group. A serious referee is not needed for a research venue; the work fits a demo track at best, where the lack of validation would be less central.

Referee Report

2 major / 0 minor

Summary. The manuscript presents PROBE-Web, an interactive web-based system for evaluating knowledge graph completion (KGC) models. It allows users to adjust two perspectives—(P1) predictive sharpness and (P2) popularity-bias robustness—via a GUI that supports conventional rank-based metrics, perspective-aware evaluation, explainable case studies, and exploration of evaluation landscapes, with the goal of helping users align assessments with their objectives beyond standard MRR and Hits@K.

Significance. If the system is fully implemented as described and the chosen perspectives surface actionable model differences, PROBE-Web could provide a usable tool for KGC practitioners seeking customizable evaluations. The interactive, GUI-driven approach is a practical strength for accessibility.

major comments (2)

[Abstract] Abstract: The assertion that predictive sharpness and popularity-bias robustness constitute the 'two critical perspectives' is presented without any supporting justification, such as citations to prior KGC evaluation literature, a practitioner survey, or empirical comparison against other axes (e.g., calibration error or long-tail entity performance). This choice is load-bearing for the system's design and claimed utility.
[Abstract] Abstract: No implementation details, validation experiments, or evidence are supplied that adjusting the P1/P2 sliders produces meaningful, reproducible differences across models that align with real user objectives. The four listed functionalities are described at a high level only.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and outline planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: The assertion that predictive sharpness and popularity-bias robustness constitute the 'two critical perspectives' is presented without any supporting justification, such as citations to prior KGC evaluation literature, a practitioner survey, or empirical comparison against other axes (e.g., calibration error or long-tail entity performance). This choice is load-bearing for the system's design and claimed utility.

Authors: We agree that the abstract asserts these perspectives without explicit supporting citations or justification in the current text. The full manuscript draws on related KGC evaluation literature concerning bias and ranking sharpness, but we will revise the abstract and add a short paragraph in the introduction with targeted citations to prior work on these specific axes. We will not add a new practitioner survey or broad empirical comparison, as those fall outside the demo scope, but the added references will clarify the rationale for the design choice. revision: yes
Referee: [Abstract] Abstract: No implementation details, validation experiments, or evidence are supplied that adjusting the P1/P2 sliders produces meaningful, reproducible differences across models that align with real user objectives. The four listed functionalities are described at a high level only.

Authors: The paper is a system demonstration, so the abstract and main text prioritize high-level description of the GUI and functionalities. We acknowledge the absence of concrete implementation details and example outputs in the abstract. In revision we will expand the abstract slightly and add a new subsection with implementation notes, reproducible slider-adjustment examples, and qualitative evidence of model differentiation. Full quantitative validation experiments across many models are not present and would require substantial additional work; we will therefore provide illustrative case studies instead. revision: partial

Circularity Check

0 steps flagged

No circularity: system description with no derivations or fitted quantities

full rationale

The paper is a demo/system description of PROBE-Web with no equations, parameters, predictions, or derivations of any kind. The two perspectives (P1 predictive sharpness, P2 popularity-bias robustness) are presented as design choices in the abstract and functionalities list, not derived from prior results or self-citations. No load-bearing steps reduce to inputs by construction, self-citation chains, or renaming. The absence of any mathematical content makes circularity impossible; the work is self-contained as an interactive toolkit description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the contribution is a described interactive system with no underlying mathematical claims.

pith-pipeline@v0.9.1-grok · 5672 in / 892 out tokens · 14808 ms · 2026-06-27T17:12:54.413148+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 1 canonical work pages

[1]

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor Factorization for Knowledge Graph Completion. InProceedings of the 2019 Con- ference on Empirical Methods in Natural Language Processing and the 9th In- ternational Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, ...

work page doi:10.18653/v1/d19-1522 2019
[2]

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. InAdvances in Neural Information Processing Systems, Vol. 26

2013
[3]

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 32

2018
[4]

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. InProceedings of the twelfth ACM international conference on web search and data mining. 105–113

2019
[5]

Rui Li, Jianan Zhao, Chaozhuo Li, Di He, Yiqi Wang, Yuming Liu, Hao Sun, Senzhang Wang, Weiwei Deng, Yanming Shen, et al. 2022. House: Knowledge graph embedding with householder parameterization. InInternational Conference on Machine Learning. PMLR, 13209–13224. PROBEWeb: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

2022
[6]

Xuan Lin, Zhe Quan, Zhi-Jie Wang, Tengfei Ma, and Xiangxiang Zeng. 2020. KGNN: Knowledge graph neural network for drug-drug interaction prediction.. InIJCAI, Vol. 380. 2739–2745

2020
[7]

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. 2013. YAGO3: A knowledge base from multilingual Wikipedias. InCIDR

2013
[8]

Sooho Moon and Yunyong Ko. 2026. How Sharp and Bias-Robust is a Model? Dual Evaluation Perspectives on Knowledge Graph Completion. InProceedings of the nineteenth ACM international conference on web search and data mining

2026
[9]

Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel, et al. 2011. A three-way model for collective learning on multi-relational data.. InIcml, Vol. 11. 3104482– 3104584

2011
[10]

Meng Qu, Junkun Chen, Louis-Pascal Xhonneux, Yoshua Bengio, and Jian Tang
[11]

In International Conference on Learning Representations (ICLR)

RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs. In International Conference on Learning Representations (ICLR). https://openreview .net/forum?id=tGZu6DlbreV
[12]

Meng Qu and Jian Tang. 2019. Probabilistic logic neural networks for reasoning. InAdvances in Neural Information Processing Systems, Vol. 32

2019
[13]

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowl- edge Graph Embedding by Relational Rotation in Complex Space. InInternational Conference on Learning Representations (ICLR). https://openreview.net/forum?i d=HkgEQnRqYQ

2019
[14]

Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, and Yiming Yang. 2020. A Re-evaluation of Knowledge Graph Completion Methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5516–5522. doi:10 .18653/v1/2020.acl-main.489

2020
[15]

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. InProceedings of the 3rd Workshop on Continuous Vector Space Models and Their Compositionality. 57–66

2015
[16]

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. InProceedings of International Conference on Machine Learning. PMLR, 2071–2080

2016
[17]

Daniel Vella and Jean-Paul Ebejer. 2022. Few-shot learning for low-data drug discovery.Journal of Chemical Information and Modeling63, 1 (2022), 27–42

2022
[18]

Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep knowledge-aware network for news recommendation. InProceedings of the Web Conference (WWW). 1835–1844

2018
[19]

Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019. Knowl- edge graph convolutional networks for recommender systems. InThe world wide web conference. 3307–3313

2019
[20]

Fan Yang, Zhilin Yang, and William W Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. InAdvances in Neural Information Processing Systems, Vol. 30

2017
[21]

Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 535–546

2021
[22]

Xiangxiang Zeng, Xinqi Tu, Yuansheng Liu, Xiangzheng Fu, and Yansen Su
[23]

Toward better drug discovery with knowledge graph.Current Opinion in Structural Biology72 (2022), 114–126

2022
[24]

Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman, and Halil Kilicoglu. 2021. Drug repurposing for COVID-19 via knowledge graph completion.Journal of Biomedical Informatics115 (2021), 103696

2021
[25]

Yongqi Zhang and Quanming Yao. 2022. Knowledge graph reasoning with relational digraph. InProceedings of the ACM web conference 2022. 912–924

2022
[26]

Sijin Zhou, Xinyi Dai, Haokun Chen, Weinan Zhang, Kan Ren, Ruiming Tang, Xiuqiang He, and Yong Yu. 2020. Interactive recommender system via knowledge graph-enhanced reinforcement learning. InProceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. 179– 188

2020

[1] [1]

Ivana Balazevic, Carl Allen, and Timothy Hospedales. 2019. TuckER: Tensor Factorization for Knowledge Graph Completion. InProceedings of the 2019 Con- ference on Empirical Methods in Natural Language Processing and the 9th In- ternational Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, ...

work page doi:10.18653/v1/d19-1522 2019

[2] [2]

Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. InAdvances in Neural Information Processing Systems, Vol. 26

2013

[3] [3]

Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2D knowledge graph embeddings. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 32

2018

[4] [4]

Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge graph embedding based question answering. InProceedings of the twelfth ACM international conference on web search and data mining. 105–113

2019

[5] [5]

Rui Li, Jianan Zhao, Chaozhuo Li, Di He, Yiqi Wang, Yuming Liu, Hao Sun, Senzhang Wang, Weiwei Deng, Yanming Shen, et al. 2022. House: Knowledge graph embedding with householder parameterization. InInternational Conference on Machine Learning. PMLR, 13209–13224. PROBEWeb: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

2022

[6] [6]

Xuan Lin, Zhe Quan, Zhi-Jie Wang, Tengfei Ma, and Xiangxiang Zeng. 2020. KGNN: Knowledge graph neural network for drug-drug interaction prediction.. InIJCAI, Vol. 380. 2739–2745

2020

[7] [7]

Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. 2013. YAGO3: A knowledge base from multilingual Wikipedias. InCIDR

2013

[8] [8]

Sooho Moon and Yunyong Ko. 2026. How Sharp and Bias-Robust is a Model? Dual Evaluation Perspectives on Knowledge Graph Completion. InProceedings of the nineteenth ACM international conference on web search and data mining

2026

[9] [9]

Maximilian Nickel, Volker Tresp, Hans-Peter Kriegel, et al. 2011. A three-way model for collective learning on multi-relational data.. InIcml, Vol. 11. 3104482– 3104584

2011

[10] [10]

Meng Qu, Junkun Chen, Louis-Pascal Xhonneux, Yoshua Bengio, and Jian Tang

[11] [11]

In International Conference on Learning Representations (ICLR)

RNNLogic: Learning Logic Rules for Reasoning on Knowledge Graphs. In International Conference on Learning Representations (ICLR). https://openreview .net/forum?id=tGZu6DlbreV

[12] [12]

Meng Qu and Jian Tang. 2019. Probabilistic logic neural networks for reasoning. InAdvances in Neural Information Processing Systems, Vol. 32

2019

[13] [13]

Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowl- edge Graph Embedding by Relational Rotation in Complex Space. InInternational Conference on Learning Representations (ICLR). https://openreview.net/forum?i d=HkgEQnRqYQ

2019

[14] [14]

Zhiqing Sun, Shikhar Vashishth, Soumya Sanyal, Partha Talukdar, and Yiming Yang. 2020. A Re-evaluation of Knowledge Graph Completion Methods. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 5516–5522. doi:10 .18653/v1/2020.acl-main.489

2020

[15] [15]

Kristina Toutanova and Danqi Chen. 2015. Observed versus latent features for knowledge base and text inference. InProceedings of the 3rd Workshop on Continuous Vector Space Models and Their Compositionality. 57–66

2015

[16] [16]

Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. InProceedings of International Conference on Machine Learning. PMLR, 2071–2080

2016

[17] [17]

Daniel Vella and Jean-Paul Ebejer. 2022. Few-shot learning for low-data drug discovery.Journal of Chemical Information and Modeling63, 1 (2022), 27–42

2022

[18] [18]

Hongwei Wang, Fuzheng Zhang, Xing Xie, and Minyi Guo. 2018. DKN: Deep knowledge-aware network for news recommendation. InProceedings of the Web Conference (WWW). 1835–1844

2018

[19] [19]

Hongwei Wang, Miao Zhao, Xing Xie, Wenjie Li, and Minyi Guo. 2019. Knowl- edge graph convolutional networks for recommender systems. InThe world wide web conference. 3307–3313

2019

[20] [20]

Fan Yang, Zhilin Yang, and William W Cohen. 2017. Differentiable learning of logical rules for knowledge base reasoning. InAdvances in Neural Information Processing Systems, Vol. 30

2017

[21] [21]

Michihiro Yasunaga, Hongyu Ren, Antoine Bosselut, Percy Liang, and Jure Leskovec. 2021. QA-GNN: Reasoning with Language Models and Knowledge Graphs for Question Answering. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 535–546

2021

[22] [22]

Xiangxiang Zeng, Xinqi Tu, Yuansheng Liu, Xiangzheng Fu, and Yansen Su

[23] [23]

Toward better drug discovery with knowledge graph.Current Opinion in Structural Biology72 (2022), 114–126

2022

[24] [24]

Rui Zhang, Dimitar Hristovski, Dalton Schutte, Andrej Kastrin, Marcelo Fiszman, and Halil Kilicoglu. 2021. Drug repurposing for COVID-19 via knowledge graph completion.Journal of Biomedical Informatics115 (2021), 103696

2021

[25] [25]

Yongqi Zhang and Quanming Yao. 2022. Knowledge graph reasoning with relational digraph. InProceedings of the ACM web conference 2022. 912–924

2022

[26] [26]

Sijin Zhou, Xinyi Dai, Haokun Chen, Weinan Zhang, Kan Ren, Ruiming Tang, Xiuqiang He, and Yong Yu. 2020. Interactive recommender system via knowledge graph-enhanced reinforcement learning. InProceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. 179– 188

2020