Efficient Knowledge Graph Accuracy Evaluation

Bunyamin Sisman; Jun Yang; Junyang Gao; Xian Li; Xin Luna Dong; Yifan Ethan Xu

arxiv: 1907.09657 · v1 · pith:H4K5AUZ3new · submitted 2019-07-23 · 💻 cs.DB

Efficient Knowledge Graph Accuracy Evaluation

Junyang Gao , Xian Li , Yifan Ethan Xu , Bunyamin Sisman , Xin Luna Dong , Jun Yang This is my paper

Pith reviewed 2026-05-24 17:20 UTC · model grok-4.3

classification 💻 cs.DB

keywords knowledge graphaccuracy evaluationcluster samplingstratified samplingreservoir samplinghuman annotation costevolving datastatistical estimation

0 comments

The pith

Cluster sampling and stratification cut the human cost of knowledge graph accuracy evaluation by up to 60 percent on static graphs and 80 percent on evolving graphs while preserving statistical guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a sampling framework that produces statistically valid accuracy estimates for large knowledge graphs using far fewer human annotations than uniform random sampling. It motivates cluster sampling from the observed property that annotation cost is often shared across related triples, then layers weighted sampling, two-stage designs, and stratification to further control variance and cost. For graphs that change over time it adds incremental estimators based on stratified sampling and a weighted reservoir-sampling variant. The central claim is that these designs deliver the same accuracy confidence intervals at substantially lower total annotation expense, a result demonstrated on real-world data sets.

Core claim

The proposed framework forms clusters of triples that share annotation cost, applies cluster sampling within a stratified or weighted design, and thereby obtains unbiased accuracy estimates whose variance is controlled with fewer individual annotations. For evolving graphs the same cost structure is exploited by maintaining a stratified sample across updates or by using a weighted reservoir that admits efficient insertions and deletions while retaining the original sampling probabilities.

What carries the argument

Cluster sampling over cost-correlated groups of triples, combined with stratification and weighted two-stage sampling, to reduce the number of annotations required for a given precision of the accuracy estimate.

If this is right

Accuracy estimates retain the same statistical validity and confidence intervals as uniform sampling.
The same designs apply without modification to both static and incrementally updated knowledge graphs.
Total annotation cost drops by up to 60 percent for static evaluation and up to 80 percent for evolving evaluation at unchanged quality.
Stratification by triple type or source further reduces variance for the same sample size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

More frequent accuracy audits become feasible during knowledge-graph construction because each audit consumes less annotator time.
The same cost-clustering idea could be tested on other annotation tasks whose cost structure exhibits locality, such as entity linking or relation extraction.
If clusters can be discovered automatically from the graph structure rather than supplied by domain experts, the method becomes more widely deployable.
An adaptive variant that re-stratifies after an initial pilot sample could be examined to further tighten cost bounds.

Load-bearing premise

Annotation cost is sufficiently correlated within identifiable clusters that sampling whole clusters reduces total annotations without introducing bias in the accuracy estimate.

What would settle it

Running both the proposed cluster-sampling estimator and a full census on the same knowledge graph and observing whether the accuracy point estimate falls outside the claimed confidence interval more often than the nominal error rate.

Figures

Figures reproduced from arXiv: 1907.09657 by Bunyamin Sisman, Jun Yang, Junyang Gao, Xian Li, Xin Luna Dong, Yifan Ethan Xu.

**Figure 1.** Figure 1: Evaluation cost comparison of triple-level and entitylevel tasks on MOVIE. For entity-level tasks, the first triple evaluated from an entity cluster is marked as solid triangle. • Entity Identification: Besides assigning annotators an Evaluation Task to audit, we also provide a small set of related information regarding to the subject of this Task. Annotators are required to use the provided informatio… view at source ↗

**Figure 2.** Figure 2: Iterative KG accuracy evaluation framework. where E 0 is the set of distinct ids from G 0 . c1, c2 are the average cost of entity identification and relationship validation, respectively. Average costs c1 and c2 are calculated from empirical evaluation costs by human annotators over multiple evaluation tasks. In reality, the cost of evaluating triples vary by different human annotators, but in practice w… view at source ↗

**Figure 3.** Figure 3: Correlation between entity accuracy and cluster size in real-life KGs, NELL and YAGO. Entity accuracy is defined as the percentage of triples being correct in the entity cluster. 5.3 Further Optimization: Stratification In [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Cost function fitting on different evaluation tasks. where ε is a small error term from a normal distribution with mean 0 and standard deviation σ, and c ≥ 0 scales the influence of cluster size on entity accuracy. ε and c together control the correlation between Mi and ˆpi . Tuning c, ε and k, the Binomial Mixture Model (BMM) allows us to experiment with different data characteristics without the burden o… view at source ↗

**Figure 6.** Figure 6: Finding optimal second stage sample size m on NELL and MOVIE-SYN with various synthetic generated labels. Grey ribbon and Blue ribbon in plots represent the standard deviation of SRS performances and the theoretical range of TWCS annotation cost respectively. smaller than 5. Hence, when m becomes larger than 5 and no matter how large the m will be, the total number of triples we evaluated is roughly a con… view at source ↗

**Figure 7.** Figure 7: TWCS evaluation cost on various KGs with different size and different overall accuracy. However, on NELL and MOVIE, we observe that applying stratification on TWCS does not help too much and it might be even slightly worse than the basic TWCS (as on NELL). This is because, in practice, cluster size may serve as a good signal indicating similarities among entity accuracies for large clusters but not for t… view at source ↗

**Figure 8.** Figure 8: Comparing evaluation cost for various solutions on evolving KG with a single update batch. Overall accuracy after update is shown on top of bars. lion triples. On the other hand, we vary the overall accuracy from 10% to 90% on MOVIE-FULL and report the evaluation cost in [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Comparing evaluation quality of incremental evaluation solutions on evolving KG with a sequence of updates: unbiasedness vs. fault tolerance. Blue ribbon in plots represents the 5% MoE range of ground truth accuracy. timation at the beginning, while RS is more fault-tolerant, quickly jumping away from the bad estimation and converging to the groundtruth after 5 to 10 batches of updates. Based on the exper… view at source ↗

read the original abstract

Estimation of the accuracy of a large-scale knowledge graph (KG) often requires humans to annotate samples from the graph. How to obtain statistically meaningful estimates for accuracy evaluation while keeping human annotation costs low is a problem critical to the development cycle of a KG and its practical applications. Surprisingly, this challenging problem has largely been ignored in prior research. To address the problem, this paper proposes an efficient sampling and evaluation framework, which aims to provide quality accuracy evaluation with strong statistical guarantee while minimizing human efforts. Motivated by the properties of the annotation cost function observed in practice, we propose the use of cluster sampling to reduce the overall cost. We further apply weighted and two-stage sampling as well as stratification for better sampling designs. We also extend our framework to enable efficient incremental evaluation on evolving KG, introducing two solutions based on stratified sampling and a weighted variant of reservoir sampling. Extensive experiments on real-world datasets demonstrate the effectiveness and efficiency of our proposed solution. Compared to baseline approaches, our best solutions can provide up to 60% cost reduction on static KG evaluation and up to 80% cost reduction on evolving KG evaluation, without loss of evaluation quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper applies cluster sampling and variants to KG accuracy evaluation and reports 60-80% cost cuts on their data, but the gains rest on annotation cost properties that are observed rather than guaranteed.

read the letter

The paper's main point is that you can use cluster sampling plus some weighting and stratification to evaluate how accurate a big knowledge graph is, and do it with a lot less human labeling work. On their datasets they get 60 percent savings for static graphs and 80 percent for ones that change over time, without hurting the quality of the accuracy number. What stands out is that they noticed this evaluation step was mostly ignored before and then built a full framework around it, including an incremental version for changing graphs using reservoir sampling. The experiments are on real data, which gives some weight to the numbers. The part that needs watching is the dependence on how the annotation costs actually work. They say the cost function has properties seen in practice that make clustering pay off. But if the clusters don't line up with real effort on a different graph, you could end up with wider confidence intervals or smaller savings. Cluster sampling usually increases variance for the same number of items, so the cost win has to more than make up for it. The abstract does not spell out the exact sampling formulas or how they handle the error bars, so it is hard to see how robust the guarantees are. This is the kind of paper that people who actually build and maintain knowledge graphs will want to read. It gives concrete ways to spend less on evaluation while still getting reliable accuracy figures. A methods person in databases or data quality would also pick up the sampling designs. I would send it out for peer review. The practical problem is real, the approach is grounded in standard sampling theory, and the experiments provide a starting point even if more work is needed on when the cost assumptions hold.

Referee Report

2 major / 2 minor

Summary. The paper proposes an efficient sampling and evaluation framework for estimating the accuracy of large-scale knowledge graphs. Motivated by observed properties of the annotation cost function, it employs cluster sampling along with weighted, two-stage, and stratified variants to minimize human annotation costs while providing strong statistical guarantees. The framework is extended to incremental evaluation on evolving KGs via stratified sampling and a weighted reservoir sampling variant. Experiments on real-world datasets report up to 60% cost reduction for static KGs and 80% for evolving KGs without loss of evaluation quality compared to baselines.

Significance. If the claimed cost reductions hold while preserving unbiased estimators and valid confidence intervals, the work would meaningfully advance practical KG development by lowering the barrier to frequent accuracy assessments. The focus on both static and evolving scenarios, combined with experiments on real-world data, strengthens its potential impact; reproducible experimental setups would further enhance its value.

major comments (2)

[Abstract] Abstract and methods: the central claim of cost reduction without loss of evaluation quality rests on cluster sampling exploiting intra-cluster homogeneity in the annotation cost function; however, no formal characterization or sensitivity analysis of this property is provided, leaving open whether the savings generalize or are dataset-specific.
[Experiments] The assertion of 'strong statistical guarantee' requires explicit handling of variance inflation inherent to cluster sampling (relative to simple random sampling at fixed sample size); without reported comparisons of estimator variance or confidence-interval coverage against baselines, it is unclear whether quality is truly preserved.

minor comments (2)

Notation for sampling designs (e.g., cluster weights, stratification variables) should be introduced consistently with a dedicated notation table or section to aid reproducibility.
Clarify how error bars or variance estimates are computed in the reported results to allow direct verification of the 'no loss of quality' claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and commit to revisions that strengthen the statistical foundations and empirical support of the claims.

read point-by-point responses

Referee: [Abstract] Abstract and methods: the central claim of cost reduction without loss of evaluation quality rests on cluster sampling exploiting intra-cluster homogeneity in the annotation cost function; however, no formal characterization or sensitivity analysis of this property is provided, leaving open whether the savings generalize or are dataset-specific.

Authors: The motivation for cluster sampling is indeed empirical, based on the observed intra-cluster homogeneity of annotation costs (detailed in Section 3 of the manuscript). While we do not provide a formal theoretical characterization (as the degree of homogeneity depends on KG structure and may not admit a simple closed form), the experiments demonstrate consistent savings across multiple real-world KGs. In revision we will add a sensitivity analysis that varies cluster definition, intra-cluster cost correlation, and homogeneity levels to better assess generalizability. revision: partial
Referee: [Experiments] The assertion of 'strong statistical guarantee' requires explicit handling of variance inflation inherent to cluster sampling (relative to simple random sampling at fixed sample size); without reported comparisons of estimator variance or confidence-interval coverage against baselines, it is unclear whether quality is truly preserved.

Authors: Cluster sampling does introduce a design effect that can inflate variance relative to simple random sampling at the same sample size. Our estimators remain unbiased under the sampling design, and the reported confidence intervals use the corresponding variance estimators for the chosen design. However, the manuscript does not include explicit side-by-side comparisons of estimator variance or empirical coverage rates versus baselines. We will add these comparisons (including variance ratios and Monte-Carlo coverage simulations) to the experimental section in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical validation independent of inputs

full rationale

The paper proposes a sampling framework (cluster, weighted, two-stage, stratified) motivated by observed annotation cost properties and extends it to evolving KGs via stratified and reservoir sampling variants. Cost reductions (60%/80%) and quality preservation are reported from experiments on real-world datasets, not from any closed-form derivation or parameter fit that reduces to the input data by construction. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or described chain. Standard sampling theory supplies the unbiasedness/variance guarantees; the cost savings are an empirical outcome under the stated preconditions rather than a tautology. This is the normal non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard sampling theory plus the domain assumption that annotation costs vary in ways that reward cluster sampling; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Annotation cost function has properties observed in practice that justify cluster sampling for cost reduction
Explicitly stated as motivation for the sampling design.

pith-pipeline@v0.9.0 · 5735 in / 1193 out tokens · 23303 ms · 2026-05-24T17:20:46.027283+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Cost(G′) = |E′| · c1 + |G′| · c2 where E′ is the set of distinct ids

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 2 internal anchors

[1]

Efficient Knowledge Graph Accuracy Evaluation

INTRODUCTION Over the past few years, we have seen an increasing number of large-scale KGs with millions of relational facts in the format of RDF triples (subject,predicate,object). Examples include DBPe- dia [1], Y AGO [15, 5], NELL [25], Knowledge-Vault [13], etc. However, the KG construction processes are far from perfect, so these KGs may contain many...

work page internal anchor Pith review Pith/arXiv arXiv 2008
[2]

Formally, G = {t | t : (s, p,o)}

PRELIMINARIES 2.1 Knowledge Graphs We model knowledge graph G as a set of triples in the form of (subject, predicate, object), denoted by (s, p,o). Formally, G = {t | t : (s, p,o)}. For example, in tuple (/m/02mjmr, /people/person- /place of birth, /m/02hrh0 ), /m/02mjmr is the Freebase id for Barack Obama, and /m/02hrh0 is the id for Honolulu. Each entit...

work page
[3]

triple level

EV ALUATION COST MODEL Prior research typically ignores the evaluation time needed by manual annotations. In this section, we study human annotators’ performance on different evaluation tasks and propose a cost func- tion that approximates the manual annotation time. Analytically and empirically, we argue that annotating triples in groups of enti- ties is...

work page
[4]

There are two evaluation procedures

EV ALUATION FRAMEWORK In a nutshell, our evaluation framework is shown in Fig 2. There are two evaluation procedures. Static Evaluation conducts efﬁcient and high-quality accuracy evaluation on static KGs. It works as the following iterative proce- dure, sharing the same spirit as in Online Aggregation [20]: • Step 1: Sample Collector selects a small batc...

work page
[5]

There are two sampling strategies to evaluate the quality of KGs: Table 2: Notations

SAMPLING DESIGN The sampling design is the core component of the framework. There are two sampling strategies to evaluate the quality of KGs: Table 2: Notations. N Number of entity clusters in G Mi Size of the ith entity cluster M = ∑N i=1 Mi Total number of triples in G n Number of entity clusters in the sample m Maximum number of triples to draw within ...

work page
[6]

In the ﬁrst stage, we sample entity clusters using WCS

work page
[7]

More speciﬁcally, min {MIk ,m} triples are randomly drawn from the k-th sample cluster without replacement

In the second stage, only a small number of triples are se- lected randomly from clusters sampled in the ﬁrst stage. More speciﬁcally, min {MIk ,m} triples are randomly drawn from the k-th sample cluster without replacement. Drawing without replacement in the second stage greatly reduces sampling variances when cluster sizes are comparable or smaller than...

work page
[8]

EV ALUATION ON EVOLVING KG We discuss two methods to reduce the annotation cost in esti- mating µ(G + ∆) as a KG evolves from G to G + ∆, one based on reservoir sampling and the other based on stratiﬁed sampling. 6.1 Reservoir Incremental Evaluation Reservoir Incremental Evaluation (RS) is based on the reservoir sampling [28] scheme, which stochastically ...

work page
[9]

Section 7.1 elaborates the experiment setup

EXPERIMENTS In this section, we comprehensively and quantitatively evaluate the performance of all proposed methods. Section 7.1 elaborates the experiment setup. Section 7.2 focuses on the accuracy evalua- tion on static KGs. We compare the evaluation efﬁciency and esti- mation quality of various methods on different static KGs with dif- ferent data chara...

work page
[10]

Linked Data

RELATED WORK As discussed earlier, SRS is a simple but prevalent method for KG accuracy evaluation. Beyond SRS, Ojha et al. [26] were the ﬁrst to systematically tackle the problem of efﬁcient accuracy eval- uation of large-scale KGs. One key observation is that, by explor- ing dependencies (i.e., type consistency, Horn-clause coupling con- straints [7, 25...

work page
[11]

We presented a general evaluation framework that works on both static and evolving KGs

CONCLUSION In this paper, we have initiated a comprehensive study into the important problem of efﬁcient and reliable knowledge graph ac- curacy evaluation. We presented a general evaluation framework that works on both static and evolving KGs. We devised a suite of sampling techniques for efﬁcient accuracy evaluation on these two scenarios. As demonstrat...

work page
[12]

https://wiki.dbpedia.org

Dbpedia. https://wiki.dbpedia.org

work page
[13]

https://www.imdb.com

Imdb. https://www.imdb.com

work page
[14]

http://rtw.ml.cmu.edu/rtw/resources

Nell. http://rtw.ml.cmu.edu/rtw/resources

work page
[15]

https://www.wikidata.org

Wikidata. https://www.wikidata.org

work page
[16]

https://www.mpi-inf.mpg.de/departments/ databases-and-information-systems/research/ yago-naga

Yago2. https://www.mpi-inf.mpg.de/departments/ databases-and-information-systems/research/ yago-naga

work page
[17]

Acosta, A

M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Fl ¨ock, and J. Lehmann. Detecting linked data quality issues via crowdsourcing: A dbpedia study. Semantic Web, (Preprint):1–33, 2016

work page 2016
[18]

Bragg, D

J. Bragg, D. S. Weld, et al. Crowdsourcing multi-label classiﬁcation for taxonomy creation. In First AAAI conference on human computation and crowdsourcing, 2013

work page 2013
[19]

Probabilistic Similarity Logic

M. Brocheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. arXiv preprint arXiv:1203.3469, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012
[20]

Casella and R

G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Paciﬁc Grove, CA, 2002

work page 2002
[21]

Christen and K

P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In Quality measures in data mining, pages 127–151. Springer, 2007

work page 2007
[22]

X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data, pages 2201–2206. ACM, 2016

work page 2016
[23]

Dalenius and J

T. Dalenius and J. L. Hodges Jr. Minimum variance stratiﬁcation. Journal of the American Statistical Association, 54(285):88–101, 1959

work page 1959
[24]

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610. ACM, 2014

work page 2014
[25]

P. S. Efraimidis and P. G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006

work page 2006
[26]

Fabian, K

M. Fabian, K. Gjergji, W. Gerhard, et al. Yago: A core of semantic knowledge unifying wordnet and wikipedia. In 16th International World Wide Web Conference, WWW, pages 697–706, 2007

work page 2007
[27]

J. Gao, X. Li, Y . E. Xu, B. Sisman, X. L. Dong, and J. Yang. Efﬁcient knowledge graph accuracy evaluation. Technical report, Duke University, 2019. https://users.cs.duke. edu/~jygao/KG_eval_vldb_full.pdf

work page 2019
[28]

Gerber, D

D. Gerber, D. Esteves, J. Lehmann, L. B ¨uhmann, R. Usbeck, A.-C. N. Ngomo, and R. Speck. Defactotemporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web, 35:85–101, 2015

work page 2015
[29]

M. H. Hansen and W. N. Hurwitz. On the theory of sampling from ﬁnite populations. The Annals of Mathematical Statistics, 14(4):333–362, 1943

work page 1943
[30]

Heath and C

T. Heath and C. Bizer. Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1):1–136, 2011

work page 2011
[31]

J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Acm Sigmod Record, volume 26, pages 171–182. ACM, 1997

work page 1997
[32]

N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011

work page 2011
[33]

H. Li, Y . Li, F. Xu, and X. Zhong. Probabilistic error detecting in numerical linked data. In International Conference on Database and Expert Systems Applications, pages 61–75. Springer, 2015

work page 2015
[34]

S. Liu, M. dAquin, and E. Motta. Measuring accuracy of triples in knowledge graphs. In International Conference on Language, Data and Knowledge, pages 343–357. Springer, 2017

work page 2017
[35]

N. G. Marchant and B. I. Rubinstein. In search of an entity resolution oasis: optimal asymptotic sequential importance sampling. PVLDB, 10(11):1322–1333, 2017

work page 2017
[36]

Mitchell, W

T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al. Never-ending learning. Communications of the ACM, 61(5):103–115, 2018

work page 2018
[37]

Ojha and P

P. Ojha and P. Talukdar. Kgeval: Accuracy estimation of automatically constructed knowledge graphs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1741–1750, 2017

work page 2017
[38]

Ortona, V

S. Ortona, V . V . Meduri, and P. Papotti. Robust discovery of positive and negative rules in knowledge bases. In2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1168–1179. IEEE, 2018

work page 2018
[39]

J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985

work page 1985
[40]

J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 469–480. ACM, 2014

work page 2014

[1] [1]

Efficient Knowledge Graph Accuracy Evaluation

INTRODUCTION Over the past few years, we have seen an increasing number of large-scale KGs with millions of relational facts in the format of RDF triples (subject,predicate,object). Examples include DBPe- dia [1], Y AGO [15, 5], NELL [25], Knowledge-Vault [13], etc. However, the KG construction processes are far from perfect, so these KGs may contain many...

work page internal anchor Pith review Pith/arXiv arXiv 2008

[2] [2]

Formally, G = {t | t : (s, p,o)}

PRELIMINARIES 2.1 Knowledge Graphs We model knowledge graph G as a set of triples in the form of (subject, predicate, object), denoted by (s, p,o). Formally, G = {t | t : (s, p,o)}. For example, in tuple (/m/02mjmr, /people/person- /place of birth, /m/02hrh0 ), /m/02mjmr is the Freebase id for Barack Obama, and /m/02hrh0 is the id for Honolulu. Each entit...

work page

[3] [3]

triple level

EV ALUATION COST MODEL Prior research typically ignores the evaluation time needed by manual annotations. In this section, we study human annotators’ performance on different evaluation tasks and propose a cost func- tion that approximates the manual annotation time. Analytically and empirically, we argue that annotating triples in groups of enti- ties is...

work page

[4] [4]

There are two evaluation procedures

EV ALUATION FRAMEWORK In a nutshell, our evaluation framework is shown in Fig 2. There are two evaluation procedures. Static Evaluation conducts efﬁcient and high-quality accuracy evaluation on static KGs. It works as the following iterative proce- dure, sharing the same spirit as in Online Aggregation [20]: • Step 1: Sample Collector selects a small batc...

work page

[5] [5]

There are two sampling strategies to evaluate the quality of KGs: Table 2: Notations

SAMPLING DESIGN The sampling design is the core component of the framework. There are two sampling strategies to evaluate the quality of KGs: Table 2: Notations. N Number of entity clusters in G Mi Size of the ith entity cluster M = ∑N i=1 Mi Total number of triples in G n Number of entity clusters in the sample m Maximum number of triples to draw within ...

work page

[6] [6]

In the ﬁrst stage, we sample entity clusters using WCS

work page

[7] [7]

More speciﬁcally, min {MIk ,m} triples are randomly drawn from the k-th sample cluster without replacement

In the second stage, only a small number of triples are se- lected randomly from clusters sampled in the ﬁrst stage. More speciﬁcally, min {MIk ,m} triples are randomly drawn from the k-th sample cluster without replacement. Drawing without replacement in the second stage greatly reduces sampling variances when cluster sizes are comparable or smaller than...

work page

[8] [8]

EV ALUATION ON EVOLVING KG We discuss two methods to reduce the annotation cost in esti- mating µ(G + ∆) as a KG evolves from G to G + ∆, one based on reservoir sampling and the other based on stratiﬁed sampling. 6.1 Reservoir Incremental Evaluation Reservoir Incremental Evaluation (RS) is based on the reservoir sampling [28] scheme, which stochastically ...

work page

[9] [9]

Section 7.1 elaborates the experiment setup

EXPERIMENTS In this section, we comprehensively and quantitatively evaluate the performance of all proposed methods. Section 7.1 elaborates the experiment setup. Section 7.2 focuses on the accuracy evalua- tion on static KGs. We compare the evaluation efﬁciency and esti- mation quality of various methods on different static KGs with dif- ferent data chara...

work page

[10] [10]

Linked Data

RELATED WORK As discussed earlier, SRS is a simple but prevalent method for KG accuracy evaluation. Beyond SRS, Ojha et al. [26] were the ﬁrst to systematically tackle the problem of efﬁcient accuracy eval- uation of large-scale KGs. One key observation is that, by explor- ing dependencies (i.e., type consistency, Horn-clause coupling con- straints [7, 25...

work page

[11] [11]

We presented a general evaluation framework that works on both static and evolving KGs

CONCLUSION In this paper, we have initiated a comprehensive study into the important problem of efﬁcient and reliable knowledge graph ac- curacy evaluation. We presented a general evaluation framework that works on both static and evolving KGs. We devised a suite of sampling techniques for efﬁcient accuracy evaluation on these two scenarios. As demonstrat...

work page

[12] [12]

https://wiki.dbpedia.org

Dbpedia. https://wiki.dbpedia.org

work page

[13] [13]

https://www.imdb.com

Imdb. https://www.imdb.com

work page

[14] [14]

http://rtw.ml.cmu.edu/rtw/resources

Nell. http://rtw.ml.cmu.edu/rtw/resources

work page

[15] [15]

https://www.wikidata.org

Wikidata. https://www.wikidata.org

work page

[16] [16]

https://www.mpi-inf.mpg.de/departments/ databases-and-information-systems/research/ yago-naga

Yago2. https://www.mpi-inf.mpg.de/departments/ databases-and-information-systems/research/ yago-naga

work page

[17] [17]

Acosta, A

M. Acosta, A. Zaveri, E. Simperl, D. Kontokostas, F. Fl ¨ock, and J. Lehmann. Detecting linked data quality issues via crowdsourcing: A dbpedia study. Semantic Web, (Preprint):1–33, 2016

work page 2016

[18] [18]

Bragg, D

J. Bragg, D. S. Weld, et al. Crowdsourcing multi-label classiﬁcation for taxonomy creation. In First AAAI conference on human computation and crowdsourcing, 2013

work page 2013

[19] [19]

Probabilistic Similarity Logic

M. Brocheler, L. Mihalkova, and L. Getoor. Probabilistic similarity logic. arXiv preprint arXiv:1203.3469, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[20] [20]

Casella and R

G. Casella and R. L. Berger. Statistical inference, volume 2. Duxbury Paciﬁc Grove, CA, 2002

work page 2002

[21] [21]

Christen and K

P. Christen and K. Goiser. Quality and complexity measures for data linkage and deduplication. In Quality measures in data mining, pages 127–151. Springer, 2007

work page 2007

[22] [22]

X. Chu, I. F. Ilyas, S. Krishnan, and J. Wang. Data cleaning: Overview and emerging challenges. In Proceedings of the 2016 International Conference on Management of Data, pages 2201–2206. ACM, 2016

work page 2016

[23] [23]

Dalenius and J

T. Dalenius and J. L. Hodges Jr. Minimum variance stratiﬁcation. Journal of the American Statistical Association, 54(285):88–101, 1959

work page 1959

[24] [24]

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601–610. ACM, 2014

work page 2014

[25] [25]

P. S. Efraimidis and P. G. Spirakis. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006

work page 2006

[26] [26]

Fabian, K

M. Fabian, K. Gjergji, W. Gerhard, et al. Yago: A core of semantic knowledge unifying wordnet and wikipedia. In 16th International World Wide Web Conference, WWW, pages 697–706, 2007

work page 2007

[27] [27]

J. Gao, X. Li, Y . E. Xu, B. Sisman, X. L. Dong, and J. Yang. Efﬁcient knowledge graph accuracy evaluation. Technical report, Duke University, 2019. https://users.cs.duke. edu/~jygao/KG_eval_vldb_full.pdf

work page 2019

[28] [28]

Gerber, D

D. Gerber, D. Esteves, J. Lehmann, L. B ¨uhmann, R. Usbeck, A.-C. N. Ngomo, and R. Speck. Defactotemporal and multilingual deep fact validation. Web Semantics: Science, Services and Agents on the World Wide Web, 35:85–101, 2015

work page 2015

[29] [29]

M. H. Hansen and W. N. Hurwitz. On the theory of sampling from ﬁnite populations. The Annals of Mathematical Statistics, 14(4):333–362, 1943

work page 1943

[30] [30]

Heath and C

T. Heath and C. Bizer. Linked data: Evolving the web into a global data space. Synthesis lectures on the semantic web: theory and technology, 1(1):1–136, 2011

work page 2011

[31] [31]

J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Acm Sigmod Record, volume 26, pages 171–182. ACM, 1997

work page 1997

[32] [32]

N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011

work page 2011

[33] [33]

H. Li, Y . Li, F. Xu, and X. Zhong. Probabilistic error detecting in numerical linked data. In International Conference on Database and Expert Systems Applications, pages 61–75. Springer, 2015

work page 2015

[34] [34]

S. Liu, M. dAquin, and E. Motta. Measuring accuracy of triples in knowledge graphs. In International Conference on Language, Data and Knowledge, pages 343–357. Springer, 2017

work page 2017

[35] [35]

N. G. Marchant and B. I. Rubinstein. In search of an entity resolution oasis: optimal asymptotic sequential importance sampling. PVLDB, 10(11):1322–1333, 2017

work page 2017

[36] [36]

Mitchell, W

T. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, et al. Never-ending learning. Communications of the ACM, 61(5):103–115, 2018

work page 2018

[37] [37]

Ojha and P

P. Ojha and P. Talukdar. Kgeval: Accuracy estimation of automatically constructed knowledge graphs. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1741–1750, 2017

work page 2017

[38] [38]

Ortona, V

S. Ortona, V . V . Meduri, and P. Papotti. Robust discovery of positive and negative rules in knowledge bases. In2018 IEEE 34th International Conference on Data Engineering (ICDE), pages 1168–1179. IEEE, 2018

work page 2018

[39] [39]

J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985

work page 1985

[40] [40]

J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 469–480. ACM, 2014

work page 2014