arxiv: 2606.27742 · v1 · pith:UOYOVN5Tnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI

KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems

Pith reviewed 2026-06-29 04:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords text-to-cypher · knowledge graphs · enterprise search · LoRA fine-tuning · supervised fine-tuning · data generation · natural language interfaces · Cypher queries

The pith

The KG2Cypher pipeline generates validated Text-Cypher pairs from existing knowledge graphs and uses them to train accurate text-to-Cypher models for enterprise use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline that starts from an existing enterprise knowledge graph and produces training data for text-to-Cypher conversion. It builds Cypher queries from observed graph facts, has LLMs generate the corresponding natural-language questions, validates the pairs with an LLM judge and human review, and then uses them for supervised fine-tuning with LoRA. On Korean enterprise queries, this raises execution-result F1 in both evaluated domains and reaches 95.2% exact match in an 11-class setting. The method targets the high cost of building natural-language interfaces for private graphs by generating data from the graph itself instead of relying on manual annotation alone.

Core claim

KG2Cypher constructs an executable Cypher query from observed graph facts and uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference, achieving 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1 in an 11-class setting.
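
The serving side of this claim can be pictured with a minimal sketch. Everything named here is an assumption for illustration: the `SCHEMAS` dictionary, the class names, the schema patterns, the prompt wording, and the `build_prompt` helper are hypothetical, and this summary does not give the paper's actual prompt format. The sketch only shows the idea of picking a schema snippet by query class and listing retrieved entity candidates in the prompt.

```python
# Hypothetical per-class schema snippets (illustrative, not from the paper).
SCHEMAS = {
    "broadcast_program": "(:Program)-[:AIRED_ON]->(:Channel)",
    "company": "(:Company)-[:HAS_CEO]->(:Person)",
}


def build_prompt(question: str, query_class: str, candidates: list[str]) -> str:
    """Assemble a prompt from the class-specific schema and retrieved entities."""
    schema = SCHEMAS[query_class]  # class-conditioned: only this class's schema is shown
    entity_lines = "\n".join(f"- {name}" for name in candidates)
    return (
        f"Schema:\n{schema}\n\n"
        f"Candidate entities:\n{entity_lines}\n\n"
        f"Question: {question}\n"
        "Cypher:"
    )


prompt = build_prompt("Who is the CEO of Acme?", "company", ["Acme", "Acme Holdings"])
print(prompt)
```

Restricting the prompt to one class's schema keeps it short and removes labels the query cannot use, which is one plausible reason to condition on the class.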

What carries the argument

The data-centric pipeline that generates Text-Cypher pairs by deriving Cypher from graph facts and reverse-generating questions with LLMs, followed by validation and LoRA SFT training.
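
As a rough illustration of deriving Cypher from a graph fact, the sketch below turns one observed edge into a parameterized query. The `Fact` shape, the `name` property, and the single query template are assumptions made for this example; the paper's actual templates and schema are not described in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One observed edge in the graph (hypothetical shape)."""
    head_label: str
    head_name: str
    relation: str
    tail_label: str


def fact_to_cypher(fact: Fact) -> tuple[str, dict]:
    """Build a parameterized Cypher query asking for the tail of an observed edge."""
    query = (
        f"MATCH (h:{fact.head_label} {{name: $name}})"
        f"-[:{fact.relation}]->(t:{fact.tail_label}) "
        "RETURN t.name"
    )
    return query, {"name": fact.head_name}


query, params = fact_to_cypher(Fact("Company", "Acme", "HAS_CEO", "Person"))
print(query)
print(params)
```

Because the query is built from an edge that exists in the graph, it should return at least one row there, which is what makes the generated pair checkable by execution before a question is written for it.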

If this is right

  • LoRA SFT raises execution-result F1 from 0.806 to 0.950 on broadcast-program queries.
  • Execution-result F1 improves from 0.70 to 0.92 on company queries.
  • In an 11-class setting, KG2Cypher achieves 95.2% exact match and a 99.9% execution rate.
  • The system handles short search-style queries and schema paraphrases in Korean enterprise settings.
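
One way to read the F1 figures in the bullets above is as the share of remaining error that fine-tuning removes. The helper below is an illustrative calculation on the reported numbers, not a metric the paper itself reports.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Fraction of the remaining gap to a perfect score (1.0) that was closed."""
    return (after - before) / (1.0 - before)


# F1 values as reported in the bullets above.
reported = [("broadcast-program", 0.806, 0.950), ("company", 0.70, 0.92)]
for name, before, after in reported:
    gain = after - before
    rer = relative_error_reduction(before, after)
    print(f"{name}: +{gain:.3f} absolute, {rer:.1%} of the remaining error removed")
```

On both query sets the reported lift removes roughly three quarters of the remaining F1 error.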

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could lower the barrier for companies to deploy natural language query interfaces on their internal graphs.
  • The validation step might be adaptable to other structured query languages like SQL.
  • Scaling the pipeline could enable zero-shot or few-shot adaptations to new graph schemas.

Load-bearing premise

The LLM judge combined with human validation produces sufficiently unbiased Text-Cypher pairs that training improves real execution performance rather than fitting to validation artifacts.

What would settle it

Evaluate the trained model on a new enterprise graph with unseen query classes or a different schema structure, and check whether execution-result F1 drops significantly below 0.9.

Figures

Figures reproduced from arXiv: 2606.27742 by Hyemin Lee, Junghyuk Seo, Minjun Choi, Sujin Mo, Yerin Kim, Youngjoong Ko.

Figure 1: Overview of KG2Cypher. Left: KG-grounded data construction from graph facts to validated Text…
Figure 2: LLM judge calibration. Human comments guide automatic scoring-guide revision, and faithfulness MAE falls below the 0.27 target. Lower MAE is better.

Accompanying text, 4.2 LLM Judge Calibration: Before using the LLM judge to validate generated Text-Cypher pairs, we calibrate its scoring prompt on 200 human-annotated samples. The judge scores faithfulness, fluency, and completeness, and calibration checks agreement with human…
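
The Figure 2 caption refers to a faithfulness MAE target of 0.27 between judge and human scores. A minimal sketch of such a gate follows. The 1-5 score scale and the toy scores are assumptions for illustration, not data from the paper, and the paper's gate may combine MAE with other criteria.

```python
def mean_absolute_error(judge: list[float], human: list[float]) -> float:
    """Average absolute gap between judge scores and human scores."""
    if len(judge) != len(human) or not judge:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(abs(j - h) for j, h in zip(judge, human)) / len(judge)


MAE_TARGET = 0.27  # target stated in the Figure 2 caption

# Toy faithfulness scores on an assumed 1-5 scale (not data from the paper).
human_scores = [5, 4, 4, 2, 5, 3, 1, 4]
judge_scores = [5, 4, 3, 2, 5, 3, 1, 5]

mae = mean_absolute_error(judge_scores, human_scores)
passes_gate = mae < MAE_TARGET
print(f"MAE={mae:.3f}, passes gate: {passes_gate}")
```

In this toy case the judge disagrees with the human by one point on two of eight samples, so the MAE is 0.25 and the gate passes.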
Original abstract

Enterprise Knowledge Graphs (KGs) are increasingly used for internal search, analytics, and question answering, but building natural-language interfaces for private enterprise graphs remains costly. We present KG2Cypher, a data-centric pipeline for building enterprise text-to-Cypher systems from existing KGs. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. We evaluate KG2Cypher in Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult. LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher achieves 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents KG2Cypher, a data-centric pipeline that constructs executable Cypher queries from observed enterprise KG facts, generates corresponding natural-language questions via LLMs, validates the resulting Text-Cypher pairs with an LLM judge plus human review, and uses the filtered pairs for class-conditioned LoRA SFT. The trained model is deployed with schema prompting and entity retrieval. On Korean enterprise broadcast-program and company queries, LoRA SFT raises execution-result F1 from 0.806 to 0.950 and from 0.70 to 0.92 respectively; in an 11-class setting the system reports 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.

Significance. If the reported gains prove robust to the validation process, the work offers a practical, low-annotation route to text-to-Cypher systems for private KGs, directly addressing the cost barrier noted in the abstract. The emphasis on execution-result metrics rather than surface match and the concrete before-and-after numbers constitute a strength; the absence of error bars or ablations, however, leaves the magnitude of improvement difficult to interpret.

major comments (3)
  1. [Pipeline description / validation step] The validation subsection (described in the pipeline overview) provides no prompt template, decision threshold, or inter-annotator agreement statistics for the LLM judge. Because the central claim is that LoRA SFT on judge-filtered pairs produces the observed F1 lifts (0.806→0.950 and 0.70→0.92), the lack of these quantities makes it impossible to rule out that the model is fitting to judge-specific artifacts rather than improving Cypher generation.
  2. [Experiments / results tables] The experiments section reports aggregate F1 and exact-match figures but contains no ablation that isolates the contribution of the LLM-judge filter versus the human-validation step, nor any error bars or statistical significance tests on the before-and-after deltas. These omissions are load-bearing for the claim that the pipeline reliably improves execution accuracy.
  3. [Evaluation setup] No analysis is given of whether the test queries share the same synthetic generation process as the training pairs; if they do, the high execution rate (99.9%) and F1 (0.964) could be explained by distributional overlap rather than generalization.
minor comments (2)
  1. [Abstract and §4] The abstract and results tables use “execution-result F1” without an explicit definition or reference to the precise matching criterion (e.g., whether partial result overlap is credited).
  2. [Figures 2–3 and Table 2] Figure captions and table footnotes do not indicate the number of human validators or the exact protocol used for the final human-validation pass.
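
Minor comment 1 asks which matching criterion lies behind execution-result F1. One common reading, assumed here and not confirmed by the paper, is a set-based F1 over the rows returned by the predicted and gold queries, which does credit partial overlap. The function name and the handling of empty results are choices made for this sketch.

```python
def execution_result_f1(predicted_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Set-based F1 between the rows returned by the predicted and gold queries."""
    predicted, gold = set(predicted_rows), set(gold_rows)
    if not predicted and not gold:
        return 1.0  # both queries return nothing: treated as a match (assumption)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = [("Alice",), ("Bob",), ("Carol",)]
pred = [("Alice",), ("Bob",), ("Dave",)]
print(execution_result_f1(pred, gold))
```

Under this definition a query that returns two of three gold rows plus one wrong row scores 2/3, whereas a strict exact-result criterion would score it 0. That gap is why the definition matters when comparing against the exact-match figure.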

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the validation details, experimental reporting, and evaluation setup. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [Pipeline description / validation step] The validation subsection (described in the pipeline overview) provides no prompt template, decision threshold, or inter-annotator agreement statistics for the LLM judge. Because the central claim is that LoRA SFT on judge-filtered pairs produces the observed F1 lifts (0.806→0.950 and 0.70→0.92), the lack of these quantities makes it impossible to rule out that the model is fitting to judge-specific artifacts rather than improving Cypher generation.

    Authors: We will add the LLM judge prompt template and decision threshold to the revised manuscript (in the validation subsection or an appendix) to support reproducibility. The training data additionally underwent a human validation step after the LLM judge, which we will emphasize as a safeguard against potential judge-specific artifacts. We did not compute inter-annotator agreement for the LLM judge and will acknowledge this as a limitation. revision: yes

  2. Referee: [Experiments / results tables] The experiments section reports aggregate F1 and exact-match figures but contains no ablation that isolates the contribution of the LLM-judge filter versus the human-validation step, nor any error bars or statistical significance tests on the before-and-after deltas. These omissions are load-bearing for the claim that the pipeline reliably improves execution accuracy.

    Authors: We will add error bars by reporting means and standard deviations from multiple training runs with different random seeds. A dedicated ablation isolating the LLM-judge filter from the subsequent human-validation step is not feasible within the revision timeline due to computational constraints; we will explicitly note this limitation and the role of the combined validation process in the revised text. revision: partial

  3. Referee: [Evaluation setup] No analysis is given of whether the test queries share the same synthetic generation process as the training pairs; if they do, the high execution rate (99.9%) and F1 (0.964) could be explained by distributional overlap rather than generalization.

    Authors: The test queries were collected from real enterprise user logs and are independent of the synthetic generation process used to create the training pairs. We will add an explicit statement and brief description of the test-query collection process in the evaluation setup section to clarify the absence of distributional overlap. revision: yes
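
The error bars promised in response 2 could be reported along the following lines. This is a sketch only: the per-seed F1 values are placeholders, not results from the paper, and `summarize_runs` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_runs(scores: list[float]) -> tuple[float, float]:
    """Return the mean and sample standard deviation of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Placeholder F1 values for three hypothetical seeds (not results from the paper).
f1_by_seed = [0.94, 0.95, 0.96]
avg, sd = summarize_runs(f1_by_seed)
print(f"F1 = {avg:.3f} ± {sd:.3f} over {len(f1_by_seed)} seeds")
```

With only a handful of seeds the standard deviation is itself a noisy estimate, so the number of runs should be reported next to it.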

Circularity Check

0 steps flagged

Empirical pipeline with held-out execution metrics shows no circularity

The paper presents a data-centric pipeline for generating Text-Cypher pairs via synthesis, LLM question generation, LLM-judge filtering, and human validation, followed by LoRA SFT and evaluation. All reported metrics (execution-result F1, exact match, execution rate) are computed on held-out queries against actual Cypher execution results on the enterprise KG. No equations, derivations, or self-citations are invoked that reduce any claimed result to a fitted parameter or prior self-result by construction. The evaluation is externally grounded in execution outcomes rather than internal consistency with the generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on the assumption that LLMs can both generate and judge query-question pairs at scale without systematic bias; no free parameters or invented entities are introduced beyond standard fine-tuning.

axioms (1)
  • domain assumption: LLMs can generate natural-language questions from Cypher queries and judge pair quality with sufficient reliability for downstream training.
    Invoked in the data-construction and validation stages described in the abstract.
