PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems
Pith reviewed 2026-06-27 18:49 UTC · model grok-4.3
The pith
PIPE-Cypher turns live enterprise graphs into balanced, executable Text-to-Cypher benchmarks using local models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PIPE-Cypher is a local benchmark-generation pipeline that turns a live property graph and optional seed queries into balanced NL-to-Cypher benchmarks through schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge.
What carries the argument
The PIPE-Cypher pipeline, which uses constrained generation, execution validation, diversity controls, and a calibrated local LLM judge to produce executable and diverse NL-Cypher pairs.
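One way to picture how the gates combine is as an ordered ledger in which a candidate is accepted only if every gate passes. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `Ledger` class, the `run_gates` function, and the two toy gates are hypothetical names introduced for illustration, and the real gates (read-only, syntax, schema, execution, judge) are far richer.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical gate signature: a gate inspects a candidate and returns
# (passed, evidence). The paper's actual gate interfaces are not shown here.
Gate = Callable[[dict], Tuple[bool, str]]


@dataclass
class Ledger:
    """Ordered record of (gate name, passed, evidence) for one candidate."""
    entries: List[Tuple[str, bool, str]] = field(default_factory=list)

    @property
    def accepted(self) -> bool:
        # Accepted only if at least one gate ran and none failed.
        return bool(self.entries) and all(ok for _, ok, _ in self.entries)


def run_gates(candidate: dict, gates: List[Tuple[str, Gate]]) -> Ledger:
    """Run gates in order, recording evidence and stopping at the first failure."""
    ledger = Ledger()
    for name, gate in gates:
        ok, evidence = gate(candidate)
        ledger.entries.append((name, ok, evidence))
        if not ok:
            break
    return ledger


def non_empty_gate(candidate: dict) -> Tuple[bool, str]:
    """Toy stand-in for execution validation: require at least one result row."""
    rows = candidate.get("rows", [])
    return len(rows) > 0, f"observed rows: {len(rows)}"


def judge_gate(candidate: dict) -> Tuple[bool, str]:
    """Toy stand-in for the LLM judge: reads a precomputed boolean verdict."""
    verdict = bool(candidate.get("judge_pass", False))
    return verdict, f"judge_pass={verdict}"


# Assumed gate order for this sketch: cheap deterministic checks before the judge.
GATES = [("Exec", non_empty_gate), ("Judge", judge_gate)]
```

Stopping at the first failure keeps the expensive judge call off candidates that already failed a deterministic check, which matches the page's statement that the judge runs only after deterministic validation and live execution.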
If this is right
- Benchmarks can be regenerated as graphs change over time or workloads shift.
- Zero-shot transfer of models to new graphs is weak.
- A few-shot approach using schema-specific example banks improves performance for compatible model families.
- The pipeline supports audited ablation studies and calibration against human labels.
Where Pith is reading between the lines
- Similar automated pipelines could be adapted for other graph query languages beyond Cypher.
- Incorporating ongoing agent tool calls or analyst logs as seeds could keep benchmarks aligned with actual usage over time.
- The approach might lower the cost of maintaining up-to-date benchmarks in production graph deployments.
Load-bearing premise
The constrained generation, execution validation, diversity controls, and calibrated local LLM judge can reliably produce valid, diverse, balanced, and executable NL-Cypher pairs that reflect real enterprise usage without introducing bias or invalid queries.
What would settle it
Finding that many generated queries fail to execute on the graph or that human reviewers disagree with the local judge on validity and diversity.
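Both settling conditions can be turned into numbers. The sketch below is illustrative only: it assumes each generated query has a boolean execution outcome and that the judge and a human reviewer each assign one categorical label per example. Cohen's kappa is swapped in here as a standard chance-corrected agreement statistic; the paper's own calibration metric is not stated on this page.

```python
from collections import Counter
from typing import Sequence


def execution_failure_rate(executed_ok: Sequence[bool]) -> float:
    """Fraction of generated queries that failed to execute on the graph."""
    if not executed_ok:
        raise ValueError("need at least one query outcome")
    return sum(1 for ok in executed_ok if not ok) / len(executed_ok)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(judge)
    # Observed agreement: share of examples with identical labels.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labeled independently
    # according to their own label frequencies.
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially total.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A high failure rate or a kappa near zero would count against the load-bearing premise; where the thresholds should sit is a judgment call that this sketch does not settle.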
Original abstract
Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PIPE-Cypher, a local pipeline for automatically generating balanced NL-to-Cypher benchmarks from live enterprise property graphs and optional seed queries. It integrates schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge (using Qwen3.5-9B). The authors report producing 3,000 accepted examples for FinBench/SNB, completing three audited ablation suites, calibrating the judge with human labels, and evaluating 11 downstream models. The central claim is that this approach renders Text2Cypher benchmarking repeatable and adaptive to evolving graphs, users, and workloads, with evidence that zero-shot transfer is weak while schema-specific few-shot examples help compatible model families.
Significance. If the pipeline components reliably produce valid, diverse, executable, and workload-reflective pairs without systematic bias, the work would be significant for enterprise graph query systems. Standard benchmarks often fail to capture domain-specific schemas, terminology, and governance constraints; a repeatable, local generation process that evolves with the graph addresses a practical deployment gap. The reported use of local models, human calibration, ablation suites, and downstream evaluations on FinBench/SNB strengthens the contribution by providing concrete evidence of discriminativeness and reproducibility.
Minor comments (3)
- The abstract refers to 'FinBench/SNB' without definition or citation; this should be expanded on first use in the introduction or a dedicated background section.
- The description of diversity controls and balance across query types and difficulty levels would benefit from explicit metrics or pseudocode in the methods section to allow replication.
- Table or figure captions for the ablation suites and model evaluation results should include precise definitions of the reported metrics (e.g., exact success rate or accuracy definitions) for clarity.
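As an illustration of the kind of explicit balance metric the second comment asks for, the sketch below computes normalized Shannon entropy over the cells of a query-type by difficulty grid. This is a hypothetical metric chosen for illustration, not one the manuscript is known to use; the function name `normalized_entropy` and the sample labels are assumptions.

```python
import math
from collections import Counter
from typing import Hashable, Optional, Sequence


def normalized_entropy(labels: Sequence[Hashable],
                       n_categories: Optional[int] = None) -> float:
    """Shannon entropy of the label distribution divided by log(k).

    Returns 1.0 for a perfectly uniform distribution over k categories and
    0.0 when every example falls in one category. Pass n_categories to count
    categories that are expected but absent from the sample.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    k = n_categories if n_categories is not None else len(counts)
    if k < len(counts):
        raise ValueError("n_categories is smaller than the observed categories")
    if k <= 1:
        return 1.0  # a single possible category is trivially balanced
    total = len(labels)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return max(0.0, entropy / math.log(k))


# Hypothetical sample: (query type, difficulty) cells for four accepted examples.
cells = [
    ("Simple Retrieval", "easy"),
    ("Simple Retrieval", "easy"),
    ("Ranking Topk", "medium"),
    ("Ranking Topk", "medium"),
]
balance = normalized_entropy(cells)
```

Reporting this value per dataset, alongside the raw per-cell counts, would let a reader check the balance claim without rerunning the pipeline.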
Simulated Author's Rebuttal
We thank the referee for the detailed summary of our work and the recommendation for minor revision. The report does not enumerate any specific major comments requiring point-by-point rebuttal.
Circularity Check
No significant circularity identified
The manuscript presents PIPE-Cypher as an engineering pipeline that assembles existing techniques (schema profiling, constrained generation, execution validation, diversity controls, redaction, and a calibrated local LLM judge) to produce NL-Cypher pairs from a live graph. No equations, fitted parameters, or derived predictions appear; the central claim is simply that the described safeguards yield 3,000 accepted examples plus ablation and human-calibration results. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the pipeline itself. The derivation is therefore self-contained construction rather than any reduction of outputs to inputs by definition.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
CypherBench: Towards precise retrieval over full-scale modern knowledge graphs in the LLM era. Preprint, arXiv:2412.18702. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, and 1 others. 2025. Gemma 3 technical repo...
arXiv 2025
-
[2]
The LDBC financial benchmark. Preprint, arXiv:2306.15975. Qwen. 2026. Qwen3.5-9B. Hugging Face model repository. Checkpoint source. Accessed: 2026-06-06. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natu- r...
arXiv 2026
-
[3]
Generating cross-model analytics workloads using LLMs. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4303–4307. ACM. Zijie Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin, and Xiaofan Zhang. 2025. SyntheT2C: Generating synthetic data for fine-tuning large language models on the Text2...
Pith/arXiv arXiv 2025
-
[4]
and Auto-Cypher (Tiwari et al., 2025) Outcome-aware reverse grounding binds slots through live Cypher before NL realization. Execution validation in Auto-Cypher (Tiwari et al., 2025) and Mind the Query (Chauhan et al., 2025) Execution is one gate in a ledger with direction, read-only, value, non-empty, and judge evidence. Schema/value checks in Mind th...
2025
-
[5]
SHA-256:10b25b
Template generation. Stage: Workload proposal. SHA-256:10b25b. Contract. Schema-only labels, relationships, properties, and categorical values; Realistic enterprise analyst wording; At most two typed slots and JSON-only output
-
[6]
SHA-256:48e949
Reverse binding. Stage: Graph grounding. SHA-256:48e949. Contract. Read-only MATCH/WHERE/RETURN DISTINCT/LIMIT only; Slot variables named exactly as requested; Forward relationship directions from the schema
-
[7]
SHA-256:61c557
Cypher generation. Stage: Candidate query. SHA-256:61c557. Contract. Only schema-visible constructs and observed directions; RETURN DISTINCT for set returns and exact equality for quoted values; Context columns, categorical hints, placeholderized retrieval, and no writes
-
[8]
SHA-256:56c419
Repair. Stage: Validation feedback. SHA-256:56c419. Contract. Preserve question intent while fixing validation or execution issues; Keep query read-only and schema-grounded; Return only corrected Cypher
-
[9]
SHA-256:421c7b
LLM judge. Stage: Quality gate. SHA-256:421c7b. Contract. Inputs include question, Cypher, relevant schema excerpt, execution rows, and validation summary; Strict JSON scores for ambiguity, semantic alignment, schema use, and difficulty; Categorical values constrain query literals, not observed result-row values; Pass only useful, unambiguous enterprise ...
-
[10]
SHA-256:4c07ff
Downstream Text2Cypher. Stage: Model evaluation. SHA-256:4c07ff. Contract. Read-only Cypher only; Schema-visible constructs and exact direction preservation; RETURN DISTINCT, count/ranking rules, and no explanations. I.1 LLM Judge Prompt Used in Reported Runs. The judge runs only after deterministic validation and live execution. For each reviewed example,...
-
[11]
Relationships:TRANSFER_TO
FinBench / Boolean Existence / easy. Question: Does account '187743809466009406' have any outgoing transfer? MATCH (src:Account {accountId: '187743809466009406'}) -[:TRANSFER_TO]-> (:Account) RETURN DISTINCT COUNT(DISTINCT src) > 0 AS HasOutgoingTransfer Structure: single_hop, aggregation. Relationships: TRANSFER_TO. Gates: RO/Syn/Schema/Exec/Judge. Result s...
-
[12]
FinBench / Complex Aggregation / medium. Question: What is the total transferred amount from accounts owned by person 'Gwar'? MATCH (p:Person {personName: 'Gwar'}) -[:OWN_ACCOUNT]-> (src:Account)-[t:TRANSFER_TO]-> (:Account) RETURN DISTINCT SUM(t.amount) AS TotalTransferredAmount Structure: join_heavy, aggregation. Relationships: OWN_ACCOUNT, TRANSFER_TO. ...
-
[13]
Relationships:OWN_ACCOUNT, TRANSFER_TO
FinBench / Complex Retrieval / easy. Question: Which accounts received transfers from accounts owned by person 'Zof'? MATCH (p:Person {personName: 'Zof'}) -[:OWN_ACCOUNT]-> (src:Account) -[:TRANSFER_TO]-> (dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlocked LIMIT 300 Structure: join_heavy, bou...
-
[14]
FinBench / Negation Difference / medium. Question: Which accounts owned by person 'Kant' have not sent any transfers? MATCH (p:Person {personName: 'Kant'}) -[:OWN_ACCOUNT]-> (a:Account) WHERE NOT (a) -[:TRANSFER_TO]-> (:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure: join_heavy, ...
-
[15]
Relationships:OWN_ACCOUNT, TRANSFER_TO
FinBench / Path Temporal / medium. Question: Which accounts can receive money within two transfer hops from accounts owned by person 'Sossamon'? MATCH (p:Person {personName: 'Sossamon'}) -[:OWN_ACCOUNT]-> (src:Account) -[:TRANSFER_TO *1..2]-> (dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlock...
-
[16]
Relationships:OWN_ACCOUNT, TRANSFER_TO
FinBench / Ranking Topk / medium. Question: For accounts owned by person 'Barry', which account sent the highest total transfer amount? MATCH (p:Person {personName: 'Barry'}) -[:OWN_ACCOUNT]-> (src:Account)-[t:TRANSFER_TO]-> (:Account) WITH src, SUM(t.amount) AS totalAmount RETURN DISTINCT src.accountId AS AccountId, src.accountType AS AccountType, src.is...
-
[17]
Relationships:OWN_ACCOUNT
FinBench / Simple Aggregation / easy. Question: How many accounts are owned by person 'Kaewsuktae'? MATCH (p:Person {personName: 'Kaewsuktae'}) -[:OWN_ACCOUNT]-> (a:Account) RETURN DISTINCT COUNT(DISTINCT a) AS AccountCount Structure: single_hop, aggregation. Relationships: OWN_ACCOUNT. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {AccountCount: 1}; obse...
-
[18]
Relationships:OWN_ACCOUNT
FinBench / Simple Retrieval / easy. Question: Which accounts are owned by person 'Barry'? MATCH (p:Person {personName: 'Barry'}) -[:OWN_ACCOUNT]-> (a:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure: single_hop, bounded_result. Relationships: OWN_ACCOUNT. Gates: RO/Syn/Schema/Exec/J...
-
[19]
Relationships:registered_address
ICIJ Offshore Leaks / Boolean Existence / medium. Question: Does offshore entity 'ENTITY_VALUE_1' have a registered address? MATCH (e:Entity {name: 'ENTITY_VALUE_1'}) OPTIONAL MATCH (e) -[:registered_address]-> (addr:Address) RETURN DISTINCT COUNT(addr) > 0 AS HasRegisteredAddress Structure: single_hop, aggregation, optional. Relationships: registered_addr...
-
[20]
Relationships:officer_of
ICIJ Offshore Leaks / Complex Aggregation / easy. Question: How many distinct officers are connected to entities in jurisdiction 'JURISDICTION_VALUE_1'? MATCH (o:Officer) -[:officer_of]-> (e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) RETURN DISTINCT COUNT(DISTINCT o) AS OfficerCount Structure: single_hop, aggregation. Relationships: officer_of. Gates...
-
[21]
Relationships:registered_address
ICIJ Offshore Leaks / Complex Retrieval / easy. Question: Which officers share a registered address with offshore entity 'ENTITY_VALUE_1'? MATCH (e:Entity {name: 'ENTITY_VALUE_1'}) -[:registered_address]-> (addr:Address) <-[:registered_address]- (o:Officer) RETURN DISTINCT o.node_id AS OfficerId, o.name AS OfficerName, addr.address AS RegisteredAddress L...
-
[22]
Relationships:registered_address
ICIJ Offshore Leaks / Negation Difference / easy. Question: Which offshore entities in jurisdiction 'JURISDICTION_VALUE_1' do not have a registered address? MATCH (e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) WHERE NOT (e) -[:registered_address]-> (:Address) RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction LIM...
-
[23]
Relationships:officer_of
ICIJ Offshore Leaks / Path Temporal / medium. Question: Which officers share offshore entities with officer 'OFFICER_VALUE_1', and when did each connection start? MATCH (src:Officer)-[srcRel:officer_of]-> (entity:Entity) <-[dstRel:officer_of]- (dst:Officer) WHERE trim(src.name) = 'OFFICER_VALUE_1' AND dst <> src RETURN DISTINCT dst.node_id AS OfficerId...
-
[24]
Relationships:none
ICIJ Offshore Leaks / Ranking Topk / medium. Question: Which jurisdictions have the most offshore entities? MATCH (e:Entity) WHERE e.jurisdiction IS NOT NULL WITH e.jurisdiction AS jurisdiction, COUNT(DISTINCT e) AS entityCount RETURN DISTINCT jurisdiction, entityCount ORDER BY entityCount DESC LIMIT 10 Structure: node_scan, aggregation, order_rank, negati...
-
[25]
Relationships:officer_of
ICIJ Offshore Leaks / Simple Aggregation / easy. Question: How many offshore entities are connected to officer 'OFFICER_VALUE_1'? MATCH (o:Officer) -[:officer_of]-> (e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT COUNT(DISTINCT e) AS OffshoreEntityCount Structure: single_hop, aggregation. Relationships: officer_of. Gates: RO/Syn/Schema...
-
[26]
Relationships:officer_of
ICIJ Offshore Leaks / Simple Retrieval / easy. Question: Which offshore entities is officer 'OFFICER_VALUE_1' connected to? MATCH (o:Officer)-[r:officer_of]-> (e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction, r.link AS Link LIMIT 300 Structure: single_hop, bounded_r...
-
[27]
Relationships:LIKES
SNB / Boolean Existence / medium. Question: Does person with id 6597069766828 like any post? MATCH (p:Person {id: 6597069766828}) OPTIONAL MATCH (p) -[:LIKES]-> (post:Post) RETURN DISTINCT COUNT(post) > 0 AS LikesAnyPost Structure: single_hop, aggregation, optional. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {LikesAnyPost: True}; ob...
-
[28]
SNB / Complex Aggregation / medium. Question: How many distinct posts are in forums joined by person with id 6597069766845? MATCH (forum:Forum) -[:HAS_MEMBER]-> (p:Person {id: 6597069766845}) MATCH (forum) -[:CONTAINER_OF]-> (post:Post) RETURN DISTINCT COUNT(DISTINCT post) AS JoinedForumPostCount Structure: join_heavy, aggregation. Relationships: CONTAINER_O...
-
[29]
Relationships:CONTAINER_OF, HAS_MEMBER, HAS_TAG
SNB / Complex Retrieval / easy. Question: Which people are members of forums containing posts tagged 'Manuel_Noriega'? MATCH (forum:Forum) -[:HAS_MEMBER]-> (p:Person), (forum) -[:CONTAINER_OF]-> (post:Post) -[:HAS_TAG]-> (tag:Tag {name: 'Manuel_Noriega'}) RETURN DISTINCT p.id AS PersonId LIMIT 200 Structure: join_heavy, bounded_result. Relationships: CONTA...
-
[30]
Relationships:HAS_CREATOR
SNB / Negation Difference / easy. Question: Which person records are not linked from any message record through :HAS_CREATOR? MATCH (p:Person) WHERE NOT EXISTS((:Message) -[:HAS_CREATOR]-> (p)) RETURN DISTINCT p.id AS PersonId, p.firstName AS PersonFirstName, p.lastName AS PersonLastName Structure: single_hop, negation. Relationships: HAS_CREATOR. Gates: RO/S...
-
[31]
Relationships:KNOWS
SNB / Path Temporal / easy. Question: Which people are within two knows hops of person with id 4398046511136? MATCH (src:Person {id: 4398046511136}) -[:KNOWS*1..2]-> (dst:Person) RETURN DISTINCT dst.id AS PersonId, dst.firstName AS FirstName, dst.lastName AS LastName LIMIT 200 Structure: single_hop, path, bounded_result. Relationships: KNOWS. Gates: RO/Syn/Sc...
-
[32]
Relationships:IS_LOCATED_IN
SNB / Ranking Topk / medium. Question: Which city records are linked from the most organisation records through :IS_LOCATED_IN? MATCH (s:Organisation) -[:IS_LOCATED_IN]-> (e:City) WITH e, COUNT(DISTINCT s) AS relatedCount RETURN DISTINCT e.id AS TargetId, e.name AS TargetName, relatedCount ORDER BY relatedCount DESC LIMIT 10 Structure: single_hop, aggregat...
-
[33]
Relationships:HAS_TAG
SNB / Simple Aggregation / easy. Question: How many posts are tagged 'Vietnam'? MATCH (post:Post) -[:HAS_TAG]-> (tag:Tag {name: 'Vietnam'}) RETURN DISTINCT COUNT(DISTINCT post) AS PostCount Structure: single_hop, aggregation. Relationships: HAS_TAG. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostCount: 1}; observed rows: 1
-
[34]
Relationships:LIKES
SNB / Simple Retrieval / easy. Question: Which post IDs did person with id 4398046511124 like? MATCH (p:Person {id: 4398046511124}) -[:LIKES]-> (post:Post) RETURN DISTINCT post.id AS PostId LIMIT 200 Structure: single_hop, bounded_result. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostId: 343597385744}; observed rows: 5
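Every example above lists RO (read-only) as its first gate. A deterministic read-only check of that kind might look like the sketch below. It is a conservative keyword screen written for illustration and is not the paper's governance code: `is_read_only` and the keyword list are assumptions, the screen ignores backtick-quoted identifiers, and it does not inspect `CALL` procedures, which can also write.

```python
import re

# Cypher clauses that modify the graph or schema. This list is an assumption
# of the sketch and errs on the side of rejecting a query.
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "DETACH", "SET",
                  "REMOVE", "DROP", "FOREACH")

# Single- and double-quoted string literals, allowing backslash escapes.
_SINGLE = r"'(?:\\.|[^'\\])*'"
_DOUBLE = r'"(?:\\.|[^"\\])*"'
_STRING_LITERAL = re.compile(_SINGLE + "|" + _DOUBLE)

_WRITE_PATTERN = re.compile(
    r"\b(?:" + "|".join(WRITE_KEYWORDS) + r")\b", re.IGNORECASE
)


def is_read_only(cypher: str) -> bool:
    """Return True if no write keyword appears outside string literals."""
    # Blank out literals first so a value such as 'DELETE' is not a false hit.
    without_literals = _STRING_LITERAL.sub("''", cypher)
    return _WRITE_PATTERN.search(without_literals) is None


# The SNB Simple Aggregation example from the list above passes the screen.
example = (
    "MATCH (post:Post) -[:HAS_TAG]-> (tag:Tag {name: 'Vietnam'}) "
    "RETURN DISTINCT COUNT(DISTINCT post) AS PostCount"
)
example_is_read_only = is_read_only(example)
```

A keyword screen can reject a valid query that uses a variable named like a write clause, so a parser-based check would be the stricter choice where one is available.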