KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems

Hyemin Lee; Junghyuk Seo; Minjun Choi; Sujin Mo; Yerin Kim; Youngjoong Ko

arxiv: 2606.27742 · v1 · pith:UOYOVN5Tnew · submitted 2026-06-26 · 💻 cs.CL · cs.AI

KG2Cypher: Data-Centric Pipeline for Building Enterprise Text-to-Cypher Systems

Minjun Choi , Yerin Kim , Junghyuk Seo , Sujin Mo , Hyemin Lee , Youngjoong Ko This is my paper

Pith reviewed 2026-06-29 04:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords text-to-cypherknowledge graphsenterprise searchLoRA fine-tuningsupervised fine-tuningdata generationnatural language interfacesCypher queries

0 comments

The pith

KG2Cypher pipeline generates validated Text-Cypher pairs from knowledge graphs to train accurate text-to-Cypher models for enterprise use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a pipeline that starts with existing enterprise knowledge graphs to create training data for text-to-Cypher conversion. It builds Cypher queries from observed facts, has LLMs generate corresponding natural language questions, validates them with an LLM judge and humans, then uses the pairs for supervised fine-tuning with LoRA. This approach improves execution accuracy on Korean enterprise queries, reaching over 95% exact match in multi-class settings. The method addresses the high cost of building natural language interfaces for private graphs by leveraging data-centric generation rather than manual annotation.

Core claim

KG2Cypher constructs an executable Cypher query from observed graph facts and uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference, achieving 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1 in an 11-class setting.

What carries the argument

The data-centric pipeline that generates Text-Cypher pairs by deriving Cypher from graph facts and reverse-generating questions with LLMs, followed by validation and LoRA SFT training.

If this is right

LoRA SFT raises execution-result F1 from 0.806 to 0.950 on broadcast-program queries.
Execution-result F1 improves from 0.70 to 0.92 on company queries.
In 11-class setting, achieves 95.2% exact match and 99.9% execution rate.
The system handles short search-style queries and schema paraphrases in Korean enterprise settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could lower the barrier for companies to deploy natural language query interfaces on their internal graphs.
The validation step might be adaptable to other structured query languages like SQL.
Scaling the pipeline could enable zero-shot or few-shot adaptations to new graph schemas.

Load-bearing premise

The LLM judge combined with human validation produces sufficiently unbiased Text-Cypher pairs that training improves real execution performance rather than fitting to validation artifacts.

What would settle it

Evaluating the trained model on a new enterprise graph with unseen query classes or different schema structures and measuring if execution-result F1 drops significantly below 0.9.

Figures

Figures reproduced from arXiv: 2606.27742 by Hyemin Lee, Junghyuk Seo, Minjun Choi, Sujin Mo, Yerin Kim, Youngjoong Ko.

**Figure 2.** Figure 2: LLM judge calibration. Human comments guide automatic scoring-guide revision, and faithfulness MAE falls below the 0.27 target. Lower MAE is better. 4.2 LLM Judge Calibration Before using the LLM judge to validate generated Text-Cypher pairs, we calibrate its scoring prompt on 200 human-annotated samples. The judge scores faithfulness, fluency, and completeness, and calibration checks agreement with human… view at source ↗

read the original abstract

Enterprise Knowledge Graphs (KGs) are increasingly used for internal search, analytics, and question answering, but building natural-language interfaces for private enterprise graphs remains costly. We present KG2Cypher, a data-centric pipeline for building enterprise text-to-Cypher systems from existing KGs. KG2Cypher first constructs an executable Cypher query from observed graph facts and then uses LLMs to generate its associated natural-language question. The resulting Text-Cypher pairs are validated with an LLM judge and human validation, and are converted into candidate-aware SFT data. The trained generator is served with class-conditioned schema prompting, entity retrieval, and LoRA-based inference. We evaluate KG2Cypher in Korean enterprise settings, where short search-style queries and schema paraphrases make language grounding difficult. LoRA SFT improves execution-result F1 from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. In an 11-class setting, KG2Cypher achieves 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KG2Cypher delivers a practical pipeline for synthetic Text-Cypher data on enterprise KGs with clear metric lifts, but the LLM-judge step leaves open whether gains reflect real generalization or just consistency with the validator.

read the letter

The paper's core contribution is a data pipeline that starts from observed graph facts to build executable Cypher, then uses LLMs to generate matching natural-language questions, followed by LLM-judge plus human filtering and LoRA fine-tuning. It adds class-conditioned schema prompting at inference time and tests this on Korean enterprise graphs with short, paraphrase-heavy queries.

It does a few things solidly. The before-and-after numbers are concrete: execution-result F1 rises from 0.806 to 0.950 on broadcast-program queries and from 0.70 to 0.92 on company queries. The 11-class setting reports 95.2% exact match, 99.9% execution rate, and 0.964 F1. The approach is explicitly data-centric and tailored to private KGs where labeled pairs are scarce, which matches real deployment constraints.

The main soft spot is the validation stage. The stress-test concern holds: without reported judge prompt details, agreement rates with humans, or checks that the test split is independent of the generation process, the LoRA gains could partly come from the model learning the judge's acceptance patterns rather than better Cypher logic. No error bars, no ablation on the filtering step, and no breakdown of how many pairs were rejected appear in the abstract. That makes the central claim plausible but not fully pinned down.

This paper is for practitioners who need to stand up text-to-Cypher interfaces on their own enterprise graphs, especially in non-English settings. Applied researchers working on LLM data synthesis for structured queries will get usable details. It is coherent on its own terms and shows honest engagement with the practical problem, so it deserves a serious referee even if the evaluation needs tightening on the judge artifacts.

Referee Report

3 major / 2 minor

Summary. The paper presents KG2Cypher, a data-centric pipeline that constructs executable Cypher queries from observed enterprise KG facts, generates corresponding natural-language questions via LLMs, validates the resulting Text-Cypher pairs with an LLM judge plus human review, and uses the filtered pairs for class-conditioned LoRA SFT. The trained model is deployed with schema prompting and entity retrieval. On Korean enterprise broadcast-program and company queries, LoRA SFT raises execution-result F1 from 0.806 to 0.950 and from 0.70 to 0.92 respectively; in an 11-class setting the system reports 95.2% exact match, 99.9% execution rate, and 0.964 execution-result F1.

Significance. If the reported gains prove robust to the validation process, the work offers a practical, low-annotation route to text-to-Cypher systems for private KGs, directly addressing the cost barrier noted in the abstract. The emphasis on execution-result metrics rather than surface match and the concrete before-and-after numbers constitute a strength; the absence of error bars or ablations, however, leaves the magnitude of improvement difficult to interpret.

major comments (3)

[Pipeline description / validation step] The validation subsection (described in the pipeline overview) provides no prompt template, decision threshold, or inter-annotator agreement statistics for the LLM judge. Because the central claim is that LoRA SFT on judge-filtered pairs produces the observed F1 lifts (0.806→0.950 and 0.70→0.92), the lack of these quantities makes it impossible to rule out that the model is fitting to judge-specific artifacts rather than improving Cypher generation.
[Experiments / results tables] Experiments section reports aggregate F1 and exact-match figures but contains no ablation that isolates the contribution of the LLM-judge filter versus the human-validation step, nor any error bars or statistical significance tests on the before-and-after deltas. These omissions are load-bearing for the claim that the pipeline reliably improves execution accuracy.
[Evaluation setup] No analysis is given of whether the test queries share the same synthetic generation process as the training pairs; if they do, the high execution rate (99.9%) and F1 (0.964) could be explained by distributional overlap rather than generalization.

minor comments (2)

[Abstract and §4] The abstract and results tables use “execution-result F1” without an explicit definition or reference to the precise matching criterion (e.g., whether partial result overlap is credited).
[Figures 2–3 and Table 2] Figure captions and table footnotes do not indicate the number of human validators or the exact protocol used for the final human-validation pass.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the validation details, experimental reporting, and evaluation setup. We address each major comment below and indicate planned revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Pipeline description / validation step] The validation subsection (described in the pipeline overview) provides no prompt template, decision threshold, or inter-annotator agreement statistics for the LLM judge. Because the central claim is that LoRA SFT on judge-filtered pairs produces the observed F1 lifts (0.806→0.950 and 0.70→0.92), the lack of these quantities makes it impossible to rule out that the model is fitting to judge-specific artifacts rather than improving Cypher generation.

Authors: We will add the LLM judge prompt template and decision threshold to the revised manuscript (in the validation subsection or an appendix) to support reproducibility. The training data additionally underwent a human validation step after the LLM judge, which we will emphasize as a safeguard against potential judge-specific artifacts. We did not compute inter-annotator agreement for the LLM judge and will acknowledge this as a limitation. revision: yes
Referee: [Experiments / results tables] Experiments section reports aggregate F1 and exact-match figures but contains no ablation that isolates the contribution of the LLM-judge filter versus the human-validation step, nor any error bars or statistical significance tests on the before-and-after deltas. These omissions are load-bearing for the claim that the pipeline reliably improves execution accuracy.

Authors: We will add error bars by reporting means and standard deviations from multiple training runs with different random seeds. A dedicated ablation isolating the LLM-judge filter from the subsequent human-validation step is not feasible within the revision timeline due to computational constraints; we will explicitly note this limitation and the role of the combined validation process in the revised text. revision: partial
Referee: [Evaluation setup] No analysis is given of whether the test queries share the same synthetic generation process as the training pairs; if they do, the high execution rate (99.9%) and F1 (0.964) could be explained by distributional overlap rather than generalization.

Authors: The test queries were collected from real enterprise user logs and are independent of the synthetic generation process used to create the training pairs. We will add an explicit statement and brief description of the test-query collection process in the evaluation setup section to clarify the absence of distributional overlap. revision: yes

Circularity Check

0 steps flagged

Empirical pipeline with held-out execution metrics shows no circularity

full rationale

The paper presents a data-centric pipeline for generating Text-Cypher pairs via synthesis, LLM question generation, LLM-judge filtering, and human validation, followed by LoRA SFT and evaluation. All reported metrics (execution-result F1, exact match, execution rate) are computed on held-out queries against actual Cypher execution results on the enterprise KG. No equations, derivations, or self-citations are invoked that reduce any claimed result to a fitted parameter or prior self-result by construction. The evaluation is externally grounded in execution outcomes rather than internal consistency with the generation process.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The pipeline rests on the assumption that LLMs can both generate and judge query-question pairs at scale without systematic bias; no free parameters or invented entities are introduced beyond standard fine-tuning.

axioms (1)

domain assumption LLMs can generate natural-language questions from Cypher queries and judge pair quality with sufficient reliability for downstream training
Invoked in the data-construction and validation stages described in the abstract.

pith-pipeline@v0.9.1-grok · 5757 in / 1274 out tokens · 46485 ms · 2026-06-29T04:44:05.942912+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 7 canonical work pages · 3 internal anchors

[1]

InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing: Industry Track, pages 1890–1905, Suzhou, China

Mind the query: A benchmark dataset towards Text2Cypher task. InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing: Industry Track, pages 1890–1905, Suzhou, China. Association for Computational Linguistics. Mohnish Dubey, Debayan Banerjee, Abdelrahman Ab- delkawi, and Jens Lehmann

2025
[2]

InThe Semantic Web – ISWC 2019, pages 69–78

Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia. InThe Semantic Web – ISWC 2019, pages 69–78. Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor

2019
[3]

InProceedings of the 2018 International Conference on Management of Data, pages 1433–1445

Cypher: An evolving query language for property graphs. InProceedings of the 2018 International Conference on Management of Data, pages 1433–1445. Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou

2018
[4]

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8753–8772

Beyond seen data: Improving KBQA generalization through schema-guided logical form generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8753–8772. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaug...

2025
[5]

The Llama 3 Herd of Models

The llama 3 herd of models. Preprint, arXiv:2407.21783. Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su

work page internal anchor Pith review Pith/arXiv arXiv
[6]

InProceedings of the Web Con- ference 2021, WWW ’21, page 3477–3488

Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. InProceedings of the Web Con- ference 2021, WWW ’21, page 3477–3488. ACM. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

2021
[7]

LoRA: Low-Rank Adaptation of Large Language Models

Lora: Low-rank adaptation of large language models.Preprint, arXiv:2106.09685. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Text2Cypher: Bridging natural language and graph databases

Text2cypher: Bridg- ing natural language and graph databases.Preprint, arXiv:2412.10064. Makbule Gulcin Ozsoy and William Tai

work page arXiv
[9]

Text2cypher across languages: Evaluating and fine- tuning llms.arXiv preprint arXiv:2506.21445, arXiv:2506.21445. Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jangwon Park, Chisung Song, Jun- seong Kim, Yongsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, ...

work page arXiv
[10]

CoRR , volume =

Klue: Korean language understanding evaluation.Preprint, arXiv:2105.09680. Torsten Scholak, Nathan Schucher, and Dzmitry Bah- danau

work page arXiv
[11]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Picard: Parsing incrementally for constrained auto-regressive decoding from language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Aman Tiwari, Shiva Krishna Reddy Malay, Vikas Yadav, Masoud Hashemi, and Sathwik Tejaswi Madhusud- han

2021
[12]

SynthCypher: A fully synthetic data generation framework for text-to-Cypher querying in knowledge graphs.arXiv preprint arXiv:2412.12612, 2024

Auto-cypher: Improving llms on cypher generation via llm-supervised generation-verification framework.Preprint, arXiv:2412.12612. Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson

work page arXiv
[13]

InProceedings of the 2018 Conference on Empirical Methods in Natural Lan- guage Processing

Spider: A large-scale human-labeled dataset for complex and cross-domain semantic pars- ing and text-to-sql task. InProceedings of the 2018 Conference on Empirical Methods in Natural Lan- guage Processing. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph ...

2018
[14]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Judg- ing llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685. Victor Zhong, Caiming Xiong, and Richard Socher

work page internal anchor Pith review Pith/arXiv arXiv
[15]

In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Seq2sql: Generating structured queries from natural language using reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. A LLM Judge Metric Validation We use synthetic failure simulations to check why the calibration gate uses MAE, adjacent agreement, and catch rate together. Each simulation con...

2017

[1] [1]

InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing: Industry Track, pages 1890–1905, Suzhou, China

Mind the query: A benchmark dataset towards Text2Cypher task. InProceedings of the 2025 Conference on Em- pirical Methods in Natural Language Processing: Industry Track, pages 1890–1905, Suzhou, China. Association for Computational Linguistics. Mohnish Dubey, Debayan Banerjee, Abdelrahman Ab- delkawi, and Jens Lehmann

2025

[2] [2]

InThe Semantic Web – ISWC 2019, pages 69–78

Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia. InThe Semantic Web – ISWC 2019, pages 69–78. Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor

2019

[3] [3]

InProceedings of the 2018 International Conference on Management of Data, pages 1433–1445

Cypher: An evolving query language for property graphs. InProceedings of the 2018 International Conference on Management of Data, pages 1433–1445. Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou

2018

[4] [4]

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8753–8772

Beyond seen data: Improving KBQA generalization through schema-guided logical form generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8753–8772. Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaug...

2025

[5] [5]

The Llama 3 Herd of Models

The llama 3 herd of models. Preprint, arXiv:2407.21783. Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

InProceedings of the Web Con- ference 2021, WWW ’21, page 3477–3488

Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. InProceedings of the Web Con- ference 2021, WWW ’21, page 3477–3488. ACM. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

2021

[7] [7]

LoRA: Low-Rank Adaptation of Large Language Models

Lora: Low-rank adaptation of large language models.Preprint, arXiv:2106.09685. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gon- zalez, Hao Zhang, and Ion Stoica

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Text2Cypher: Bridging natural language and graph databases

Text2cypher: Bridg- ing natural language and graph databases.Preprint, arXiv:2412.10064. Makbule Gulcin Ozsoy and William Tai

work page arXiv

[9] [9]

Text2cypher across languages: Evaluating and fine- tuning llms.arXiv preprint arXiv:2506.21445, arXiv:2506.21445. Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Jiyoon Han, Jangwon Park, Chisung Song, Jun- seong Kim, Yongsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, ...

work page arXiv

[10] [10]

CoRR , volume =

Klue: Korean language understanding evaluation.Preprint, arXiv:2105.09680. Torsten Scholak, Nathan Schucher, and Dzmitry Bah- danau

work page arXiv

[11] [11]

InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Picard: Parsing incrementally for constrained auto-regressive decoding from language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Aman Tiwari, Shiva Krishna Reddy Malay, Vikas Yadav, Masoud Hashemi, and Sathwik Tejaswi Madhusud- han

2021

[12] [12]

SynthCypher: A fully synthetic data generation framework for text-to-Cypher querying in knowledge graphs.arXiv preprint arXiv:2412.12612, 2024

Auto-cypher: Improving llms on cypher generation via llm-supervised generation-verification framework.Preprint, arXiv:2412.12612. Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson

work page arXiv

[13] [13]

InProceedings of the 2018 Conference on Empirical Methods in Natural Lan- guage Processing

Spider: A large-scale human-labeled dataset for complex and cross-domain semantic pars- ing and text-to-sql task. InProceedings of the 2018 Conference on Empirical Methods in Natural Lan- guage Processing. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph ...

2018

[14] [14]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Judg- ing llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685. Victor Zhong, Caiming Xiong, and Richard Socher

work page internal anchor Pith review Pith/arXiv arXiv

[15] [15]

In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Seq2sql: Generating structured queries from natural language using reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. A LLM Judge Metric Validation We use synthetic failure simulations to check why the calibration gate uses MAE, adjacent agreement, and catch rate together. Each simulation con...

2017