SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

Avijit Shil; Suman Samui

arxiv: 2605.16650 · v1 · pith:7BYVYXIWnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-20 17:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: multi-turn dialogue evaluation, semantic knowledge graph, state tracking, consistency detection, contradiction detection, interpretable evaluation, LLM evaluator alternative

The pith

Modeling dialogues as evolving semantic knowledge graphs improves evaluation of multi-turn consistency and human correlation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces SKG-Eval to evaluate multi-turn dialogues by representing them as an incrementally built Semantic Knowledge Graph that tracks entities, relations, and commitments. It calculates signals for local relevance, historical consistency, and logical coherence using graph methods and a geometric engine to detect contradictions. The approach yields higher agreement with human judgments and better identifies long-range problems like topic drift compared to turn-isolated or LLM-based evaluators. A reader would care because it offers a more reliable, interpretable way to assess dialogue systems that maintain context over many turns.

Core claim

The central claim is that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators, achieving higher correlation with human judgments and substantially improving detection of long-range inconsistencies.

What carries the argument

The incremental Semantic Knowledge Graph updated via structured triple extraction, which carries the state tracking and enables graph-based consistency and coherence computations.
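A minimal sketch of how such an incrementally updated graph could be held in memory, assuming an extractor that already yields (subject, relation, object) tuples for each turn. The names `Triple`, `SemanticKnowledgeGraph`, `update`, and `history`, and the lowercase normalization, are illustrative assumptions and not the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    turn: int  # dialogue turn in which the triple was asserted


@dataclass
class SemanticKnowledgeGraph:
    """Hypothetical container: a flat triple log plus a per-subject index."""
    triples: list = field(default_factory=list)
    by_subject: dict = field(default_factory=lambda: defaultdict(list))

    def update(self, turn, extracted):
        """Add one turn's (subject, relation, object) tuples; return the new triples."""
        added = []
        for s, r, o in extracted:
            t = Triple(s.lower(), r.lower(), o.lower(), turn)
            self.triples.append(t)
            self.by_subject[t.subject].append(t)
            added.append(t)
        return added

    def history(self, subject, before_turn):
        """Triples about `subject` asserted strictly before `before_turn`."""
        return [t for t in self.by_subject.get(subject.lower(), [])
                if t.turn < before_turn]


skg = SemanticKnowledgeGraph()
skg.update(1, [("user", "lives_in", "Paris")])
skg.update(2, [("user", "likes", "jazz")])
skg.update(3, [("user", "lives_in", "Berlin")])
prior = skg.history("user", before_turn=3)  # state the turn-3 claim is checked against
```

Keeping the turn index on every triple is what lets a later consistency or contradiction check compare a new claim only against earlier state.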

If this is right

SKG-Eval achieves higher correlation with human judgments across multiple benchmarks.
It substantially improves detection of long-range inconsistencies in extended conversations.
The framework produces explicit contradiction certificates and deterministic scores for fixed inputs.
It enables reproducible and auditable evaluation of dialogue systems.
The length-invariant session score is computed via recency-weighted trend analysis.
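The page does not give the formula behind the recency-weighted trend analysis. The sketch below is one plausible reading, stated as an assumption: per-turn scores in [0, 1] are combined through normalized exponential recency weights (so the result does not grow with session length) and adjusted by a weighted least-squares slope. The function name `session_score` and the parameters `decay` and `trend_weight` are hypothetical.

```python
def session_score(turn_scores, decay=0.8, trend_weight=0.2):
    """Hypothetical recency-weighted session score in [0, 1].

    turn_scores: per-turn scores in [0, 1], oldest first.
    decay: weight ratio between consecutive turns (assumed value).
    trend_weight: how strongly the score trend shifts the result (assumed value).
    """
    n = len(turn_scores)
    if n == 0:
        raise ValueError("need at least one turn score")

    # The latest turn gets raw weight 1; older turns decay geometrically.
    raw = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(raw)
    weights = [w / total for w in raw]  # normalization removes length dependence

    level = sum(w * s for w, s in zip(weights, turn_scores))
    if n == 1:
        return min(1.0, max(0.0, level))

    # Weighted least-squares slope over turn positions rescaled to [0, 1].
    xs = [i / (n - 1) for i in range(n)]
    x_bar = sum(w * x for w, x in zip(weights, xs))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    cov = sum(w * (x - x_bar) * (s - level)
              for w, x, s in zip(weights, xs, turn_scores))
    slope = cov / var

    return min(1.0, max(0.0, level + trend_weight * slope))
```

Under these assumptions a session with constant per-turn quality receives the same score at any length, while a session that degrades over time scores below one that improves.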

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This structured approach might extend to evaluating consistency in long-form text generation or multi-agent conversations.
Combining the geometric contradiction engine with embedding signals could inspire similar hybrid methods in other evaluation tasks.
Testing on dialogues with high nuance or ambiguity would reveal limits of the triple extraction assumption.
The deterministic nature allows for easier integration into automated testing pipelines for conversational AI.

Load-bearing premise

The framework assumes that structured triple extraction can accurately and reliably capture entities, relations, and conversational commitments from natural language turns without significant errors or loss of nuance.

What would settle it

Running SKG-Eval and an LLM judge on a benchmark of extended dialogues with known subtle contradictions and comparing which method's scores align more closely with human annotations would test the claim; if LLM judges correlate better, the advantage of SKG-Eval would be falsified.
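The comparison described above comes down to a rank correlation between each evaluator's scores and the human annotations. Below is a dependency-free sketch of Spearman's coefficient (Pearson correlation of average ranks). The helper names are illustrative, and the score lists at the bottom are made-up placeholder values, not results from the paper.

```python
from math import sqrt


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Placeholder values for illustration only.
human = [1, 2, 3, 4, 5]
evaluator_a = [0.2, 0.1, 0.5, 0.7, 0.9]
evaluator_b = [0.5, 0.4, 0.6, 0.3, 0.8]
rho_a = spearman(human, evaluator_a)
rho_b = spearman(human, evaluator_b)
```

Whichever evaluator yields the higher coefficient on the contradiction benchmark tracks the human ranking more closely. A significance test on the difference would still be needed before drawing a conclusion.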

Figures

Figures reproduced from arXiv: 2605.16650 by Avijit Shil, Suman Samui.

**Figure 2.** Turn-wise growth of the incremental Semantic Knowledge Graph. Orange nodes and edges ...

**Figure 3.** Geometric contradiction detector cascade with revision-aware filtering. Each pair ...

**Figure 4.** Turn-wise growth of the incremental Semantic Knowledge Graph. Orange nodes and edges ...

**Figure 5.** Comparison between SKG-Eval and LLM-as-Judge scores across six diagnostic sessions.

**Figure 6.** Turn-level SKG-Eval trajectory for a representative ...

**Figure 7.** Session-level rank correlation as a function of session length ...

Original abstract

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SKG-Eval builds an incremental semantic knowledge graph to track dialogue state and catch long-range inconsistencies, but the approach stands or falls on how well structured triple extraction works in practice.

The main thing to know is that this paper introduces SKG-Eval, which maintains an evolving semantic knowledge graph of entities, relations, and commitments across dialogue turns. It combines local relevance to the current prompt, historical consistency checks that mix graph structure with embeddings, and a geometric contradiction engine that spots cross-turn conflicts without NLI models or LLM judges. The output is a length-invariant session score plus explicit contradiction certificates and deterministic results for fixed inputs.
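The abstract says the three signals are adaptively fused, but this page gives no formula for the fusion. As a purely illustrative assumption, the sketch below lets the weight of the two history-dependent signals grow with the amount of prior graph state, so that an empty graph falls back to local relevance alone. The function `fuse` and the `saturation` parameter are hypothetical.

```python
def fuse(relevance, consistency, coherence, n_prior_triples, saturation=10):
    """Hypothetical adaptive fusion of three per-turn signals, each in [0, 1].

    n_prior_triples: number of triples in the graph before this turn.
    saturation: prior-triple count at which history gets half its full weight
                (assumed value).
    """
    for value in (relevance, consistency, coherence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("signals must lie in [0, 1]")
    if n_prior_triples < 0:
        raise ValueError("n_prior_triples must be non-negative")

    # History confidence: 0 for an empty graph, approaching 1 as it grows.
    h = n_prior_triples / (n_prior_triples + saturation)

    # Relevance always counts; history-based signals count in proportion to h.
    weights = (1.0, h, h)
    total = sum(weights)
    signals = (relevance, consistency, coherence)
    return sum(w * s for w, s in zip(weights, signals)) / total
```

With this weighting, a first turn is judged on relevance only, and a relevant reply that conflicts with a large accumulated graph is penalized more than the same reply early in the session.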

Referee Report

3 major / 2 minor

Summary. The paper introduces SKG-Eval, a framework for evaluating multi-turn dialogues by modeling them as evolving Semantic Knowledge Graphs (SKGs) built through incremental structured triple extraction. It computes local relevance, historical consistency via graph and embedding signals, and logical coherence using a geometric contradiction engine, then fuses these into a length-invariant session score. The authors claim this achieves higher correlation with human judgments and better detection of long-range inconsistencies than existing LLM-based or embedding-based evaluators, while providing interpretable contradiction certificates.

Significance. If the empirical claims hold, this work offers a valuable structured alternative to opaque LLM judges for dialogue evaluation, particularly for detecting inconsistencies over long conversations. The explicit state tracking and deterministic aspects could improve reproducibility and auditability in the field. The geometric engine for contradictions is an interesting non-NLI approach.

major comments (3)

[§3.2] The description of the structured triple extraction process for updating the SKG lacks details on the method used (e.g., whether it relies on LLM prompting, rule-based parsing, or a fine-tuned model), error rates, or handling of ambiguities like coreferences. This is load-bearing for the central claim since extraction errors would directly affect historical consistency and logical coherence scores.
[§4] No quantitative results, tables, or specific benchmark details (e.g., datasets used, correlation coefficients, inconsistency detection rates) are provided to support the claims of higher human correlation and improved inconsistency detection. The abstract mentions 'across multiple benchmarks' but without numbers or comparisons, the evidence for the claims cannot be assessed.
[§5.1] The geometric contradiction engine is presented as detecting cross-turn conflicts without NLI or LLM judges, but no equations or algorithm for the geometric detection are given, making it impossible to verify how it identifies contradictions in the SKG.
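The extraction error rates requested in the first major comment are commonly reported as exact-match precision, recall, and F1 against manually annotated triples. A minimal sketch of that computation follows. The function name `triple_prf` and the exact-match criterion are assumptions, and the paper may score partial matches differently.

```python
def triple_prf(predicted, gold):
    """Exact-match precision, recall, and F1 over (subject, relation, object) triples."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotation for illustration only.
gold = [("user", "lives_in", "paris"), ("user", "likes", "jazz"), ("user", "owns", "cat")]
predicted = [("user", "lives_in", "paris"), ("user", "likes", "jazz"), ("user", "owns", "dog")]
p, r, f1 = triple_prf(predicted, gold)
```

Exact matching is strict: a triple with a correct subject and relation but a wrong object counts as both a false positive and a false negative.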

minor comments (2)

[Abstract] The term 'quasi-deterministic' is used but not clearly defined in relation to the components that might introduce non-determinism, such as any embedding models or fusion weights.
[§2] Related work section could benefit from more recent citations on knowledge graph-based dialogue systems.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments, which highlight important areas for improving clarity and evidence in the manuscript. We address each major comment below and commit to revisions that strengthen the presentation of the SKG-Eval framework without altering its core contributions.

Referee: [§3.2] The description of the structured triple extraction process for updating the SKG lacks details on the method used (e.g., whether it relies on LLM prompting, rule-based parsing, or a fine-tuned model), error rates, or handling of ambiguities like coreferences. This is load-bearing for the central claim since extraction errors would directly affect historical consistency and logical coherence scores.

Authors: We agree that §3.2 requires substantially more detail to support reproducibility and to allow assessment of error propagation. In the revised manuscript we will expand this section to specify the extraction pipeline (a hybrid of LLM-based prompting for initial triple generation followed by deterministic post-processing rules for canonicalization), report precision/recall/F1 on a manually annotated validation subset of dialogues, and describe our coreference handling via incremental entity linking against the growing SKG with explicit conflict-resolution heuristics. We will also add a short error-propagation analysis showing the sensitivity of the consistency and coherence scores to extraction noise.

revision: yes

Referee: [§4] No quantitative results, tables, or specific benchmark details (e.g., datasets used, correlation coefficients, inconsistency detection rates) are provided to support the claims of higher human correlation and improved inconsistency detection. The abstract mentions 'across multiple benchmarks' but without numbers or comparisons, the evidence for the claims cannot be assessed.

Authors: The current manuscript version does contain quantitative results in §4, but we acknowledge they are not presented with sufficient prominence or granularity. We will revise §4 to include dedicated tables reporting Pearson and Spearman correlations with human judgments on each benchmark (MultiWOZ, PersonaChat, and a custom long-range inconsistency corpus), explicit inconsistency-detection F1 scores versus LLM-as-a-judge and embedding baselines, and statistical significance tests. Dataset sizes, evaluation protocols, and per-benchmark breakdowns will be added so that the empirical claims can be directly verified.

revision: partial

Referee: [§5.1] The geometric contradiction engine is presented as detecting cross-turn conflicts without NLI or LLM judges, but no equations or algorithm for the geometric detection are given, making it impossible to verify how it identifies contradictions in the SKG.

Authors: We concur that the geometric contradiction engine must be formalized for verifiability. In the revised §5.1 we will supply the complete mathematical description: the embedding of SKG triples into a vector space, the contradiction metric defined via angular separation and distance thresholds between relation vectors, and the incremental update algorithm with pseudocode. This will make the deterministic, non-NLI nature of the engine fully transparent and reproducible.

revision: yes
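Until that formalization is available, the phrase "angular separation and distance thresholds between relation vectors" can only be illustrated under assumptions. The sketch below flags a pair of triples when their object vectors lie close together (same topic) while their relation vectors point in nearly opposite directions. The toy two-dimensional vectors, the function names, and both thresholds are hypothetical and are not taken from the paper.

```python
import math


def angle_deg(u, v):
    """Angle between two vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard rounding error
    return math.degrees(math.acos(cosine))


def is_contradiction(rel_a, rel_b, obj_a, obj_b,
                     min_angle=120.0, max_obj_dist=0.5):
    """Hypothetical test: same object region, opposing relation directions."""
    same_topic = math.dist(obj_a, obj_b) <= max_obj_dist
    opposed = angle_deg(rel_a, rel_b) >= min_angle
    return same_topic and opposed


# Toy embeddings for illustration only.
likes, dislikes, enjoys = [1.0, 0.2], [-1.0, 0.1], [0.9, 0.3]
jazz = [0.3, 0.4]
flag_opposed = is_contradiction(likes, dislikes, jazz, jazz)
flag_similar = is_contradiction(likes, enjoys, jazz, jazz)
```

Because the test is a pure function of fixed vectors and thresholds, repeated runs on the same input give the same verdict, which matches the determinism the authors claim for the engine.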

Circularity Check

0 steps flagged

No significant circularity in the SKG-Eval derivation chain

The paper introduces SKG-Eval as an externalized framework that builds an evolving semantic knowledge graph via structured triple extraction, then derives three signals (local relevance, historical consistency, logical coherence via a geometric engine) that are fused into a session score. These components are presented as new mechanisms rather than quantities defined in terms of the final evaluation outputs or fitted to the target human correlations. No equations, self-referential definitions, or load-bearing self-citations are visible in the provided abstract that would reduce the claimed improvements to tautological inputs. The derivation therefore remains self-contained with independent empirical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the assumption that natural language dialogues can be faithfully externalized into semantic knowledge graphs and that the three signals can be computed and fused without introducing new fitting parameters or circular definitions.

axioms (1)

domain assumption: Dialogues can be accurately represented as semantic knowledge graphs via structured triple extraction.
Invoked when describing incremental graph updates across turns.

invented entities (1)

Semantic Knowledge Graph (SKG): no independent evidence
purpose: To serve as an explicit, evolving state representation for tracking entities, relations, and commitments in multi-turn dialogue.
Core modeling construct introduced to replace flat or turn-isolated representations.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We argue for an alternative: explicit, externalized state... SKG-Eval... models dialogue as an evolving Semantic Knowledge Graph

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] relevance: how well does the response address the user’s prompt?

[2] coherence: is the response internally consistent and logically structured?

[3] naturalness: does it read like a fluent, well-formed answer?

[4] relevance
groundedness: are the facts in the response well-supported and free of obvious errors? User prompt: <<< {prompt} >>> Assistant response: <<< {response} >>> Reply with ONLY a single line of JSON: {"relevance": <int>, "coherence": <int>, "naturalness": <int>, "groundedness": <int>} A.3 ECoh-Style Coherence Prompt You are a turn-level dialogue coherence judg...

[5] EXCLUSIVE-OBJECT CONFLICT

[6] SAME-TYPE EXCLUSIVE CONFLICT

[7] I read fiction books every night
RESIDUAL SEMANTIC DRIFT The ordering reflects contradiction reliability. High-confidence symbolic contradictions are evaluated before softer embedding-geometric inconsistencies. C Formal Detector Definitions C.1 Negation Reversal The NEGFLIP detector activates when exactly one relation contains a negation marker from the predefined set M¬. The current contra...

[8] Triple extraction errors

[9] Entity-linking ambiguity

[10] Missing antonym coverage

[11] Over-fragmented semantic graphs

[12] False semantic drift penalties Most common failure source. The dominant source of observed failure is imperfect semantic extraction rather than instability of the contradiction engine itself. In particular, extractor fragmentation occasionally produces semantically incomplete triples, which may reduce relation alignment quality and suppress downstream co...
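Entry [7] quotes the one detector rule that survives extraction intact: the NEGFLIP detector activates when exactly one of two relations contains a negation marker from the predefined set M¬. The sketch below implements only that stated condition. The contents of the marker set and the whitespace tokenization are assumptions, since the excerpt is truncated before any further conditions.

```python
# Hypothetical stand-in for the paper's predefined marker set M¬.
NEGATION_MARKERS = frozenset({"not", "no", "never", "n't"})


def has_negation(relation):
    """True if the relation phrase contains a negation marker."""
    # Split contractions such as "doesn't" into "does" + "n't" before matching.
    tokens = relation.lower().replace("n't", " n't").split()
    return any(token in NEGATION_MARKERS for token in tokens)


def negflip_activates(relation_a, relation_b):
    """Exactly one of the two relations is negated (exclusive or)."""
    return has_negation(relation_a) != has_negation(relation_b)
```

For example, `negflip_activates("likes", "does not like")` is true, while two affirmative or two negated relations leave the detector inactive. Whether the paper additionally requires the non-negated parts of the relations to match cannot be told from the truncated excerpt.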
