PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

Anish Raghavendra; Suraj Ranganath

arXiv: 2606.08481 · v1 · pith:U2KFAJT4new · submitted 2026-06-07 · 💻 cs.LG · cs.AI · cs.DB · cs.SE

Pith reviewed 2026-06-27 18:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DB · cs.SE

keywords Text-to-Cypher · benchmark generation · property graphs · enterprise graphs · local LLM · Cypher queries · NL-to-Cypher · graph schema profiling

The pith

PIPE-Cypher turns live enterprise graphs into balanced, executable Text-to-Cypher benchmarks using local models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Enterprise property graphs differ in schema, terminology, and usage patterns, so generic benchmarks are a poor fit for any single deployment. The paper presents PIPE-Cypher, a pipeline that automatically generates tailored NL-to-Cypher benchmarks from a live graph and optional seed queries. It combines schema profiling, constrained generation, execution validation, diversity controls, and a calibrated local LLM judge so that the resulting pairs are valid, diverse, and balanced. The authors used it to export 3,000 accepted FinBench/SNB examples and to evaluate 11 local models, and they report that the resulting benchmark is discriminative. If the method works, Text2Cypher benchmarking becomes a repeatable process that evolves with the graph.

Core claim

PIPE-Cypher is a local benchmark-generation pipeline that turns a live property graph and optional seed queries into balanced NL-to-Cypher benchmarks through schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge.
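
One of the stages named above, deterministic Cypher governance, can be illustrated with a small read-only gate. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function name `is_read_only` and the keyword deny-list are hypothetical, and the abstract does not spell out the actual governance rules.

```python
import re

# Hypothetical deny-list of Cypher write clauses (an assumption for this sketch;
# the paper's real governance rules are not given in the abstract).
WRITE_KEYWORDS = ("CREATE", "MERGE", "DELETE", "SET", "REMOVE", "DROP")

# Matches simple single- or double-quoted string literals (no escape handling).
_STRING_LITERAL = re.compile(r"'[^']*'|\"[^\"]*\"")


def is_read_only(cypher: str) -> bool:
    """Return True if no write keyword appears outside string literals."""
    # Blank out literals first so a value such as 'DELETE' is not flagged.
    stripped = _STRING_LITERAL.sub("''", cypher)
    for keyword in WRITE_KEYWORDS:
        if re.search(r"\b" + keyword + r"\b", stripped, flags=re.IGNORECASE):
            return False
    return True


query = (
    "MATCH (p:Person {personName: 'Barry'})-[:OWN_ACCOUNT]->(a:Account) "
    "RETURN DISTINCT a.accountId AS AccountId LIMIT 300"
)
print(is_read_only(query))
print(is_read_only("MATCH (a:Account) SET a.isBlocked = true"))
```

A keyword scan like this is deliberately conservative: it may reject a harmless query that uses `set` as a variable name, which is usually an acceptable trade-off for a gate that must never let a write through.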

What carries the argument

The PIPE-Cypher pipeline, which uses constrained generation, execution validation, diversity controls, and a calibrated local LLM judge to produce executable and diverse NL-Cypher pairs.

If this is right

Benchmarks can be regenerated as graphs change over time or workloads shift.
Zero-shot transfer of models to new graphs is weak.
A few-shot approach using schema-specific example banks improves performance for compatible model families.
The pipeline supports audited ablation studies and calibration against human labels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar automated pipelines could be adapted for other graph query languages beyond Cypher.
Incorporating ongoing agent tool calls or analyst logs as seeds could keep benchmarks aligned with actual usage over time.
The approach might lower the cost of maintaining up-to-date benchmarks in production graph deployments.

Load-bearing premise

The constrained generation, execution validation, diversity controls, and calibrated local LLM judge can reliably produce valid, diverse, balanced, and executable NL-Cypher pairs that reflect real enterprise usage without introducing bias or invalid queries.

What would settle it

Finding that many generated queries fail to execute on the graph or that human reviewers disagree with the local judge on validity and diversity.
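
The second half of this test, disagreement between human reviewers and the local judge, is commonly quantified with Cohen's kappa, which corrects raw agreement for agreement expected by chance. A minimal Python sketch, assuming one categorical label per example from each rater; the function name and the pass/fail labels are illustrative, and the abstract does not say which calibration statistic the paper reports.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two raters over the same examples."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of examples where both raters agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge, human))
```

In this toy input the raters agree on three of four examples (0.75 observed), chance agreement is 0.5, and kappa is therefore 0.5.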

Figures

Figures reproduced from arXiv: 2606.08481 by Anish Raghavendra, Suraj Ranganath.

**Figure 2.** Target-100 ablation yield on FinBench and ...

**Figure 4.** Three-suite ablation stability over the original ...

**Figure 6.** Full 3,000-example export distribution. The ...

**Figure 5.** Full-run rejection taxonomy before bench...

**Figure 7.** FinBench schema graph used in the reported ...

**Figure 8.** SNB schema graph used in the reported runs.

**Figure 9.** ICIJ Offshore Leaks schema graph used for ...

**Figure 11.** Execution accuracy for 11 completed local downstream models under zero-shot and few-shot control ...

**Figure 12.** Diversity diagnostics for the full 3,000-...

**Figure 13.** Cypher strategy coverage by workload category. Category balancing does not by itself guarantee operator ...

**Figure 14.** Downstream outcomes by gold Cypher strategy on the full held-out test split. The local Qwen3.5-9B ...

Original abstract

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.
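
The requirement that a benchmark stay balanced across query types and difficulty levels can be made concrete with a per-bucket cap. The sketch below is one simple way to do this in Python and is an assumption, not the paper's balancing procedure: the `category` and `difficulty` field names and the `per_bucket` parameter are hypothetical, while the category labels are taken from the example items listed on this page.

```python
from collections import Counter, defaultdict


def balance(examples, per_bucket):
    """Keep at most `per_bucket` examples for each (category, difficulty) pair."""
    buckets = defaultdict(list)
    for example in examples:
        key = (example["category"], example["difficulty"])
        if len(buckets[key]) < per_bucket:
            buckets[key].append(example)
    # Emit buckets in sorted key order so the output is deterministic.
    return [ex for key in sorted(buckets) for ex in buckets[key]]


def bucket_counts(examples):
    """Count examples per (category, difficulty) bucket."""
    return Counter((ex["category"], ex["difficulty"]) for ex in examples)


accepted = [
    {"category": "Simple Retrieval", "difficulty": "easy", "id": 1},
    {"category": "Simple Retrieval", "difficulty": "easy", "id": 2},
    {"category": "Simple Retrieval", "difficulty": "easy", "id": 3},
    {"category": "Ranking Topk", "difficulty": "medium", "id": 4},
]
balanced = balance(accepted, per_bucket=2)
print(bucket_counts(balanced))
```

A cap only trims over-represented buckets; under-filled buckets would still need more generation, which is presumably why a pipeline of this kind couples balancing with its generation loop.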

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PIPE-Cypher describes a pipeline that turns a live graph into a custom Text2Cypher benchmark using schema profiling, constrained generation, execution checks, and a calibrated local LLM judge.

The main point is that this paper gives a repeatable way to build enterprise-specific benchmarks for turning natural language into Cypher queries on property graphs that change over time.

It combines schema profiling, reverse-query grounding, constrained generation, deterministic governance, execution validation, redaction, diversity controls, and a local LLM judge. The authors report producing 3,000 accepted examples on FinBench/SNB, running three ablation suites, calibrating the judge against human labels, and testing 11 downstream models. They also note that zero-shot transfer is weak while few-shot with schema-specific examples helps some model families.

The work is new in packaging those steps into one local process that can update as the graph and workloads evolve. The concrete outputs and the human calibration step give it more weight than a pure description.

The soft spots are in the level of detail. The abstract does not show error rates from the judge, quantitative diversity metrics, or how often constrained generation actually fails and gets filtered. It is not clear how closely the generated queries match real user logs versus what the pipeline invents. Those gaps make it hard to judge whether the outputs stay balanced and free of bias without seeing the full methods and results sections.

This paper is for people working on Text2Cypher systems or graph database tooling who need benchmarks that fit their own schema and terminology. It deserves a serious referee because the problem is practical, the pipeline has multiple built-in checks, and the reported scale and ablations provide something concrete to evaluate.

Referee Report

0 major / 3 minor

Summary. The manuscript presents PIPE-Cypher, a local pipeline for automatically generating balanced NL-to-Cypher benchmarks from live enterprise property graphs and optional seed queries. It integrates schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge (using Qwen3.5-9B). The authors report producing 3,000 accepted examples for FinBench/SNB, completing three audited ablation suites, calibrating the judge with human labels, and evaluating 11 downstream models. The central claim is that this approach renders Text2Cypher benchmarking repeatable and adaptive to evolving graphs, users, and workloads, with evidence that zero-shot transfer is weak while schema-specific few-shot examples help compatible model families.

Significance. If the pipeline components reliably produce valid, diverse, executable, and workload-reflective pairs without systematic bias, the work would be significant for enterprise graph query systems. Standard benchmarks often fail to capture domain-specific schemas, terminology, and governance constraints; a repeatable, local generation process that evolves with the graph addresses a practical deployment gap. The reported use of local models, human calibration, ablation suites, and downstream evaluations on FinBench/SNB strengthens the contribution by providing concrete evidence of discriminativeness and reproducibility.

minor comments (3)

The abstract refers to 'FinBench/SNB' without definition or citation; this should be expanded on first use in the introduction or a dedicated background section.
The description of diversity controls and balance across query types and difficulty levels would benefit from explicit metrics or pseudocode in the methods section to allow replication.
Table or figure captions for the ablation suites and model evaluation results should include precise definitions of the reported metrics (e.g., exact success rate or accuracy definitions) for clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary of our work and the recommendation for minor revision. The report does not enumerate any specific major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript presents PIPE-Cypher as an engineering pipeline that assembles existing techniques (schema profiling, constrained generation, execution validation, diversity controls, redaction, and a calibrated local LLM judge) to produce NL-Cypher pairs from a live graph. No equations, fitted parameters, or derived predictions appear; the central claim is simply that the described safeguards yield 3,000 accepted examples plus ablation and human-calibration results. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the pipeline itself. The work is therefore a self-contained construction rather than a reduction of outputs to inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be extracted. The central claim implicitly rests on unstated assumptions about the reliability of local LLMs for generation and judging and the effectiveness of the listed controls, but these are not detailed.

pith-pipeline@v0.9.1-grok · 5793 in / 1258 out tokens · 24941 ms · 2026-06-27T18:49:58.844540+00:00 · methodology

Reference graph

Works this paper leans on

34 extracted references · 1 linked inside Pith

[1] CypherBench: Towards precise retrieval over full-scale modern knowledge graphs in the LLM era. Preprint, arXiv:2412.18702. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, and 1 others. 2025. Gemma 3 technical repo...
[2] The LDBC financial benchmark. Preprint, arXiv:2306.15975. Qwen. 2026. Qwen3.5-9B. Hugging Face model repository. Checkpoint source. Accessed: 2026-06-06. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natur...
[3] Generating cross-model analytics workloads using LLMs. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 4303–4307. ACM. Zijie Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin, and Xiaofan Zhang. 2025. SyntheT2C: Generating synthetic data for fine-tuning large language models on the Text2...
[4] and Auto-Cypher (Tiwari et al., 2025) Outcome-aware reverse grounding binds slots through live Cypher before NL realization. Execution validation in Auto-Cypher (Tiwari et al., 2025) and Mind the Query (Chauhan et al., 2025) Execution is one gate in a ledger with direction, read-only, value, non-empty, and judge evidence. Schema/value checks in Mind th...
[5] Template generation. Stage: Workload proposal. SHA-256: 10b25b. Contract: Schema-only labels, relationships, properties, and categorical values; Realistic enterprise analyst wording; At most two typed slots and JSON-only output
[6] Reverse binding. Stage: Graph grounding. SHA-256: 48e949. Contract: Read-only MATCH/WHERE/RETURN DISTINCT/LIMIT only; Slot variables named exactly as requested; Forward relationship directions from the schema
[7] Cypher generation. Stage: Candidate query. SHA-256: 61c557. Contract: Only schema-visible constructs and observed directions; RETURN DISTINCT for set returns and exact equality for quoted values; Context columns, categorical hints, placeholderized retrieval, and no writes
[8] Repair. Stage: Validation feedback. SHA-256: 56c419. Contract: Preserve question intent while fixing validation or execution issues; Keep query read-only and schema-grounded; Return only corrected Cypher
[9] LLM judge. Stage: Quality gate. SHA-256: 421c7b. Contract: Inputs include question, Cypher, relevant schema excerpt, execution rows, and validation summary; Strict JSON scores for ambiguity, semantic alignment, schema use, and difficulty; Categorical values constrain query literals, not observed result-row values; Pass only useful, unambiguous enterprise ...
[10] Downstream Text2Cypher. Stage: Model evaluation. SHA-256: 4c07ff. Contract: Read-only Cypher only; Schema-visible constructs and exact direction preservation; RETURN DISTINCT, count/ranking rules, and no explanations. I.1 LLM Judge Prompt Used in Reported Runs. The judge runs only after deterministic validation and live execution. For each reviewed example,...
[11] FinBench / Boolean Existence / easy. Question: Does account '187743809466009406' have any outgoing transfer? Cypher: MATCH (src:Account {accountId: '187743809466009406'})-[:TRANSFER_TO]->(:Account) RETURN DISTINCT COUNT(DISTINCT src) > 0 AS HasOutgoingTransfer Structure: single_hop, aggregation. Relationships: TRANSFER_TO. Gates: RO/Syn/Schema/Exec/Judge. Result s...
[12] FinBench / Complex Aggregation / medium. Question: What is the total transferred amount from accounts owned by person 'Gwar'? Cypher: MATCH (p:Person {personName: 'Gwar'})-[:OWN_ACCOUNT]->(src:Account)-[t:TRANSFER_TO]->(:Account) RETURN DISTINCT SUM(t.amount) AS TotalTransferredAmount Structure: join_heavy, aggregation. Relationships: OWN_ACCOUNT, TRANSFER_TO. ...
[13] Relationships: OWN_ACCOUNT, TRANSFER_TO. FinBench / Complex Retrieval / easy. Question: Which accounts received transfers from accounts owned by person 'Zof'? Cypher: MATCH (p:Person {personName: 'Zof'})-[:OWN_ACCOUNT]->(src:Account)-[:TRANSFER_TO]->(dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlocked LIMIT 300 Structure: join_heavy, bou...
[14] FinBench / Negation Difference / medium. Question: Which accounts owned by person 'Kant' have not sent any transfers? Cypher: MATCH (p:Person {personName: 'Kant'})-[:OWN_ACCOUNT]->(a:Account) WHERE NOT (a)-[:TRANSFER_TO]->(:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure: join_heavy, ...
[15] Relationships: OWN_ACCOUNT, TRANSFER_TO. FinBench / Path Temporal / medium. Question: Which accounts can receive money within two transfer hops from accounts owned by person 'Sossamon'? Cypher: MATCH (p:Person {personName: 'Sossamon'})-[:OWN_ACCOUNT]->(src:Account)-[:TRANSFER_TO*1..2]->(dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlock...
[16] Relationships: OWN_ACCOUNT, TRANSFER_TO. FinBench / Ranking Topk / medium. Question: For accounts owned by person 'Barry', which account sent the highest total transfer amount? Cypher: MATCH (p:Person {personName: 'Barry'})-[:OWN_ACCOUNT]->(src:Account)-[t:TRANSFER_TO]->(:Account) WITH src, SUM(t.amount) AS totalAmount RETURN DISTINCT src.accountId AS AccountId, src.accountType AS AccountType, src.is...
[17] FinBench / Simple Aggregation / easy. Question: How many accounts are owned by person 'Kaewsuktae'? Cypher: MATCH (p:Person {personName: 'Kaewsuktae'})-[:OWN_ACCOUNT]->(a:Account) RETURN DISTINCT COUNT(DISTINCT a) AS AccountCount Structure: single_hop, aggregation. Relationships: OWN_ACCOUNT. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {AccountCount: 1}; obse...
[18] FinBench / Simple Retrieval / easy. Question: Which accounts are owned by person 'Barry'? Cypher: MATCH (p:Person {personName: 'Barry'})-[:OWN_ACCOUNT]->(a:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure: single_hop, bounded_result. Relationships: OWN_ACCOUNT. Gates: RO/Syn/Schema/Exec/J...
[19] Relationships: registered_address. ICIJ Offshore Leaks / Boolean Existence / medium. Question: Does offshore entity 'ENTITY_VALUE_1' have a registered address? Cypher: MATCH (e:Entity {name: 'ENTITY_VALUE_1'}) OPTIONAL MATCH (e)-[:registered_address]->(addr:Address) RETURN DISTINCT COUNT(addr) > 0 AS HasRegisteredAddress Structure: single_hop, aggregation, optional. Relationships: registered_addr...
[20] ICIJ Offshore Leaks / Complex Aggregation / easy. Question: How many distinct officers are connected to entities in jurisdiction 'JURISDICTION_VALUE_1'? Cypher: MATCH (o:Officer)-[:officer_of]->(e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) RETURN DISTINCT COUNT(DISTINCT o) AS OfficerCount Structure: single_hop, aggregation. Relationships: officer_of. Gates...
[21] Relationships: registered_address. ICIJ Offshore Leaks / Complex Retrieval / easy. Question: Which officers share a registered address with offshore entity 'ENTITY_VALUE_1'? Cypher: MATCH (e:Entity {name: 'ENTITY_VALUE_1'})-[:registered_address]->(addr:Address)<-[:registered_address]-(o:Officer) RETURN DISTINCT o.node_id AS OfficerId, o.name AS OfficerName, addr.address AS RegisteredAddress L...
[22] Relationships: registered_address. ICIJ Offshore Leaks / Negation Difference / easy. Question: Which offshore entities in jurisdiction 'JURISDICTION_VALUE_1' do not have a registered address? Cypher: MATCH (e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) WHERE NOT (e)-[:registered_address]->(:Address) RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction LIM...
[23] Relationships: officer_of. ICIJ Offshore Leaks / Path Temporal / medium. Question: Which officers share offshore entities with officer 'OFFICER_VALUE_1', and when did each connection start? Cypher: MATCH (src:Officer)-[srcRel:officer_of]->(entity:Entity)<-[dstRel:officer_of]-(dst:Officer) WHERE trim(src.name) = 'OFFICER_VALUE_1' AND dst <> src RETURN DISTINCT dst.node_id AS OfficerId...
[24] Relationships: none. ICIJ Offshore Leaks / Ranking Topk / medium. Question: Which jurisdictions have the most offshore entities? Cypher: MATCH (e:Entity) WHERE e.jurisdiction IS NOT NULL WITH e.jurisdiction AS jurisdiction, COUNT(DISTINCT e) AS entityCount RETURN DISTINCT jurisdiction, entityCount ORDER BY entityCount DESC LIMIT 10 Structure: node_scan, aggregation, order_rank, negati...
[25] ICIJ Offshore Leaks / Simple Aggregation / easy. Question: How many offshore entities are connected to officer 'OFFICER_VALUE_1'? Cypher: MATCH (o:Officer)-[:officer_of]->(e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT COUNT(DISTINCT e) AS OffshoreEntityCount Structure: single_hop, aggregation. Relationships: officer_of. Gates: RO/Syn/Schema...
[26] Relationships: officer_of. ICIJ Offshore Leaks / Simple Retrieval / easy. Question: Which offshore entities is officer 'OFFICER_VALUE_1' connected to? Cypher: MATCH (o:Officer)-[r:officer_of]->(e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction, r.link AS Link LIMIT 300 Structure: single_hop, bounded_r...
[27] SNB / Boolean Existence / medium. Question: Does person with id 6597069766828 like any post? Cypher: MATCH (p:Person {id: 6597069766828}) OPTIONAL MATCH (p)-[:LIKES]->(post:Post) RETURN DISTINCT COUNT(post) > 0 AS LikesAnyPost Structure: single_hop, aggregation, optional. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {LikesAnyPost: True}; ob...
[28] SNB / Complex Aggregation / medium. Question: How many distinct posts are in forums joined by person with id 6597069766845? Cypher: MATCH (forum:Forum)-[:HAS_MEMBER]->(p:Person {id: 6597069766845}) MATCH (forum)-[:CONTAINER_OF]->(post:Post) RETURN DISTINCT COUNT(DISTINCT post) AS JoinedForumPostCount Structure: join_heavy, aggregation. Relationships: CONTAINER_O...
[29] Relationships: CONTAINER_OF, HAS_MEMBER, HAS_TAG. SNB / Complex Retrieval / easy. Question: Which people are members of forums containing posts tagged 'Manuel_Noriega'? Cypher: MATCH (forum:Forum)-[:HAS_MEMBER]->(p:Person), (forum)-[:CONTAINER_OF]->(post:Post)-[:HAS_TAG]->(tag:Tag {name: 'Manuel_Noriega'}) RETURN DISTINCT p.id AS PersonId LIMIT 200 Structure: join_heavy, bounded_result. Relationships: CONTA...
[30] SNB / Negation Difference / easy. Question: Which person records are not linked from any message record through :HAS_CREATOR? Cypher: MATCH (p:Person) WHERE NOT EXISTS((:Message)-[:HAS_CREATOR]->(p)) RETURN DISTINCT p.id AS PersonId, p.firstName AS PersonFirstName, p.lastName AS PersonLastName Structure: single_hop, negation. Relationships: HAS_CREATOR. Gates: RO/S...
[31] SNB / Path Temporal / easy. Question: Which people are within two knows hops of person with id 4398046511136? Cypher: MATCH (src:Person {id: 4398046511136})-[:KNOWS*1..2]->(dst:Person) RETURN DISTINCT dst.id AS PersonId, dst.firstName AS FirstName, dst.lastName AS LastName LIMIT 200 Structure: single_hop, path, bounded_result. Relationships: KNOWS. Gates: RO/Syn/Sc...
[32] Relationships: IS_LOCATED_IN. SNB / Ranking Topk / medium. Question: Which city records are linked from the most organisation records through :IS_LOCATED_IN? Cypher: MATCH (s:Organisation)-[:IS_LOCATED_IN]->(e:City) WITH e, COUNT(DISTINCT s) AS relatedCount RETURN DISTINCT e.id AS TargetId, e.name AS TargetName, relatedCount ORDER BY relatedCount DESC LIMIT 10 Structure: single_hop, aggregat...
[33] SNB / Simple Aggregation / easy. Question: How many posts are tagged 'Vietnam'? Cypher: MATCH (post:Post)-[:HAS_TAG]->(tag:Tag {name: 'Vietnam'}) RETURN DISTINCT COUNT(DISTINCT post) AS PostCount Structure: single_hop, aggregation. Relationships: HAS_TAG. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostCount: 1}; observed rows: 1
[34] SNB / Simple Retrieval / easy. Question: Which post IDs did person with id 4398046511124 like? Cypher: MATCH (p:Person {id: 4398046511124})-[:LIKES]->(post:Post) RETURN DISTINCT post.id AS PostId LIMIT 200 Structure: single_hop, bounded_result. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostId: 343597385744}; observed rows: 5

[1] [1]

Preprint, arXiv:2412.18702

CypherBench: Towards precise retrieval over full-scale modern knowledge graphs in the LLM era. Preprint, arXiv:2412.18702. Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, and 1 others. 2025. Gemma 3 technical repo...

arXiv 2025

[2] [2]

The LDBC financial benchmark.Preprint, arXiv:2306.15975. Qwen. 2026. Qwen3.5-9B. Hugging Face model repos- itory. Checkpoint source. Accessed: 2026-06-06. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natu- r...

arXiv 2026

[3] [3]

InProceedings of the 33rd ACM Inter- national Conference on Information and Knowledge Management, pages 4303–4307

Generating cross-model analytics workloads using LLMs. InProceedings of the 33rd ACM Inter- national Conference on Information and Knowledge Management, pages 4303–4307. ACM. Zijie Zhong, Linqing Zhong, Zhaoze Sun, Qingyun Jin, Zengchang Qin, and Xiaofan Zhang. 2025. Syn- theT2C: Generating synthetic data for fine-tuning large language models on the Text2...

Pith/arXiv arXiv 2025

[4] [4]

and Auto- Cypher (Tiwari et al., 2025) Outcome-aware reverse grounding binds slots through live Cypher be- fore NL realization. Execution validation in Auto-Cypher (Tiwari et al., 2025) and Mind the Query (Chauhan et al., 2025) Execution is one gate in a ledger with direction, read-only, value, non-empty, and judge evidence. Schema/value checks in Mind th...

2025

[5] [5]

SHA-256:10b25b

Template generation.Stage: Workload pro- posal. SHA-256:10b25b. Contract.Schema-only labels, relationships, properties, and categorical values; Realistic enterprise analyst wording; At most two typed slots and JSON-only output

[6] [6]

SHA-256:48e949

Reverse binding.Stage: Graph grounding. SHA-256:48e949. Contract.Read-only MATCH/WHERE/RETURN DIS- TINCT/LIMIT only; Slot variables named exactly as requested; Forward relationship directions from the schema

[7] [7]

SHA-256:61c557

Cypher generation.Stage: Candidate query. SHA-256:61c557. Contract.Only schema-visible constructs and observed directions; RETURN DISTINCT for set returns and exact equality for quoted val- ues; Context columns, categorical hints, place- holderized retrieval, and no writes

[8] [8]

SHA- 256:56c419

Repair.Stage: Validation feedback. SHA- 256:56c419. Contract.Preserve question intent while fix- ing validation or execution issues; Keep query read-only and schema-grounded; Return only corrected Cypher

[9] [9]

SHA-256: 421c7b

LLM judge.Stage: Quality gate. SHA-256: 421c7b. Contract.Inputs include question, Cypher, relevant schema excerpt, execution rows, and validation summary; Strict JSON scores for ambiguity, semantic alignment, schema use, and difficulty; Categorical values constrain query literals, not observed result-row val- ues; Pass only useful, unambiguous enterprise ...

[10] [10]

SHA-256:4c07ff

Downstream Text2Cypher.Stage: Model evaluation. SHA-256:4c07ff. Contract.Read-only Cypher only; Schema- visible constructs and exact direction preser- vation; RETURN DISTINCT, count/ranking rules, and no explanations I.1 LLM Judge Prompt Used in Reported Runs The judge runs only after deterministic validation and live execution. For each reviewed example,...

[11] [11]

Relationships:TRANSFER_TO

FinBench / Boolean Existence / easy.Ques- tion:Does account ’187743809466009406’ have any outgoing transfer? MATCH (src:Account {accountId: ’187743809466009406’}) -[:TRANSFER_TO]-> (:Account) RETURN DISTINCT COUNT(DISTINCT src) > 0 AS HasOutgoingTransfer Structure:single_hop, aggregation. Relationships:TRANSFER_TO. Gates:RO/Syn/Schema/Exec/Judge. Result s...

[12] [12]

FinBench / Complex Aggregation / medium. Question:What is the total transferred amount from accounts owned by person ’Gwar’? MATCH (p:Person {personName: ’Gwar’}) -[:OWN_ACCOUNT]-> (src:Account)-[t:TRANSFER_TO]-> (:Account) RETURN DISTINCT SUM(t.amount) AS TotalTransferredAmount Structure:join_heavy, aggregation. Relationships:OWN_ACCOUNT, TRANS- FER_TO. ...

[13] [13]

Relationships:OWN_ACCOUNT, TRANS- FER_TO

FinBench / Complex Retrieval / easy.Ques- tion:Which accounts received transfers from accounts owned by person ’Zof’? MATCH (p:Person {personName: ’Zof’}) -[:OWN_ACCOUNT]-> (src:Account) -[:TRANSFER_TO]-> (dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlocked LIMIT 300 Structure:join_heavy, bou...

[14] [14]

FinBench / Negation Difference / medium. Question:Which accounts owned by person ’Kant’ have not sent any transfers? MATCH (p:Person {personName: ’Kant’}) -[:OWN_ACCOUNT]-> (a:Account) WHERE NOT (a) -[:TRANSFER_TO]-> (:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure:join_heavy, ...

[15] [15]

Relationships:OWN_ACCOUNT, TRANS- FER_TO

FinBench / Path Temporal / medium.Ques- tion:Which accounts can receive money within two transfer hops from accounts owned by person ’Sossamon’? MATCH (p:Person {personName: ’Sossamon’}) -[:OWN_ACCOUNT]-> (src:Account) -[:TRANSFER_TO *1..2]-> (dst:Account) RETURN DISTINCT dst.accountId AS AccountId, dst.accountType AS AccountType, dst.isBlocked AS IsBlock...

[16] [16]

Relationships:OWN_ACCOUNT, TRANS- FER_TO

FinBench / Ranking Topk / medium.Ques- tion:For accounts owned by person ’Barry’, which account sent the highest total transfer amount? MATCH (p:Person {personName: ’Barry’}) -[:OWN_ACCOUNT]-> (src:Account)-[t:TRANSFER_TO]-> (:Account) WITH src, SUM(t.amount) AS totalAmount RETURN DISTINCT src.accountId AS AccountId, src.accountType AS AccountType, src.is...

[17] [17]

Relationships:OWN_ACCOUNT

FinBench / Simple Aggregation / easy.Ques- tion:How many accounts are owned by per- son ’Kaewsuktae’? MATCH (p:Person {personName: ’Kaewsuktae’}) -[:OWN_ACCOUNT]-> (a:Account) RETURN DISTINCT COUNT(DISTINCT a) AS AccountCount Structure:single_hop, aggregation. Relationships:OWN_ACCOUNT. Gates:RO/Syn/Schema/Exec/Judge. Result sample:{AccountCount: 1}; obse...

[18] [18]

Relationships:OWN_ACCOUNT

FinBench / Simple Retrieval / easy.Ques- tion:Which accounts are owned by person ’Barry’? MATCH (p:Person {personName: ’Barry’}) -[:OWN_ACCOUNT]-> (a:Account) RETURN DISTINCT a.accountId AS AccountId, a.accountType AS AccountType, a.isBlocked AS IsBlocked LIMIT 300 Structure:single_hop, bounded_result. Relationships:OWN_ACCOUNT. Gates:RO/Syn/Schema/Exec/J...

[19]

Relationships: registered_address

ICIJ Offshore Leaks / Boolean Existence / medium. Question: Does offshore entity 'ENTITY_VALUE_1' have a registered address? MATCH (e:Entity {name: 'ENTITY_VALUE_1'}) OPTIONAL MATCH (e)-[:registered_address]->(addr:Address) RETURN DISTINCT COUNT(addr) > 0 AS HasRegisteredAddress Structure: single_hop, aggregation, optional. Relationships: registered_addr...

[20]

Relationships: officer_of

ICIJ Offshore Leaks / Complex Aggregation / easy. Question: How many distinct officers are connected to entities in jurisdiction 'JURISDICTION_VALUE_1'? MATCH (o:Officer)-[:officer_of]->(e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) RETURN DISTINCT COUNT(DISTINCT o) AS OfficerCount Structure: single_hop, aggregation. Relationships: officer_of. Gates...

[21]

Relationships: registered_address

ICIJ Offshore Leaks / Complex Retrieval / easy. Question: Which officers share a registered address with offshore entity 'ENTITY_VALUE_1'? MATCH (e:Entity {name: 'ENTITY_VALUE_1'})-[:registered_address]->(addr:Address)<-[:registered_address]-(o:Officer) RETURN DISTINCT o.node_id AS OfficerId, o.name AS OfficerName, addr.address AS RegisteredAddress L...

[22]

Relationships: registered_address

ICIJ Offshore Leaks / Negation Difference / easy. Question: Which offshore entities in jurisdiction 'JURISDICTION_VALUE_1' do not have a registered address? MATCH (e:Entity {jurisdiction: 'JURISDICTION_VALUE_1'}) WHERE NOT (e)-[:registered_address]->(:Address) RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction LIM...

[23]

Relationships: officer_of

ICIJ Offshore Leaks / Path Temporal / medium. Question: Which officers share offshore entities with officer 'OFFICER_VALUE_1', and when did each connection start? MATCH (src:Officer)-[srcRel:officer_of]->(entity:Entity)<-[dstRel:officer_of]-(dst:Officer) WHERE trim(src.name) = 'OFFICER_VALUE_1' AND dst <> src RETURN DISTINCT dst.node_id AS OfficerId...

[24]

Relationships: none

ICIJ Offshore Leaks / Ranking Topk / medium. Question: Which jurisdictions have the most offshore entities? MATCH (e:Entity) WHERE e.jurisdiction IS NOT NULL WITH e.jurisdiction AS jurisdiction, COUNT(DISTINCT e) AS entityCount RETURN DISTINCT jurisdiction, entityCount ORDER BY entityCount DESC LIMIT 10 Structure: node_scan, aggregation, order_rank, negati...

[25]

Relationships: officer_of

ICIJ Offshore Leaks / Simple Aggregation / easy. Question: How many offshore entities are connected to officer 'OFFICER_VALUE_1'? MATCH (o:Officer)-[:officer_of]->(e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT COUNT(DISTINCT e) AS OffshoreEntityCount Structure: single_hop, aggregation. Relationships: officer_of. Gates: RO/Syn/Schema...

[26]

Relationships: officer_of

ICIJ Offshore Leaks / Simple Retrieval / easy. Question: Which offshore entities is officer 'OFFICER_VALUE_1' connected to? MATCH (o:Officer)-[r:officer_of]->(e:Entity) WHERE trim(o.name) = 'OFFICER_VALUE_1' RETURN DISTINCT e.node_id AS EntityId, e.name AS EntityName, e.jurisdiction AS Jurisdiction, r.link AS Link LIMIT 300 Structure: single_hop, bounded_r...

[27]

Relationships: LIKES

SNB / Boolean Existence / medium. Question: Does person with id 6597069766828 like any post? MATCH (p:Person {id: 6597069766828}) OPTIONAL MATCH (p)-[:LIKES]->(post:Post) RETURN DISTINCT COUNT(post) > 0 AS LikesAnyPost Structure: single_hop, aggregation, optional. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {LikesAnyPost: True}; ob...

[28]

SNB / Complex Aggregation / medium. Question: How many distinct posts are in forums joined by person with id 6597069766845? MATCH (forum:Forum)-[:HAS_MEMBER]->(p:Person {id: 6597069766845}) MATCH (forum)-[:CONTAINER_OF]->(post:Post) RETURN DISTINCT COUNT(DISTINCT post) AS JoinedForumPostCount Structure: join_heavy, aggregation. Relationships: CONTAINER_O...

[29]

Relationships: CONTAINER_OF, HAS_MEMBER, HAS_TAG

SNB / Complex Retrieval / easy. Question: Which people are members of forums containing posts tagged 'Manuel_Noriega'? MATCH (forum:Forum)-[:HAS_MEMBER]->(p:Person), (forum)-[:CONTAINER_OF]->(post:Post)-[:HAS_TAG]->(tag:Tag {name: 'Manuel_Noriega'}) RETURN DISTINCT p.id AS PersonId LIMIT 200 Structure: join_heavy, bounded_result. Relationships: CONTA...

[30]

Relationships: HAS_CREATOR

SNB / Negation Difference / easy. Question: Which person records are not linked from any message record through :HAS_CREATOR? MATCH (p:Person) WHERE NOT EXISTS((:Message)-[:HAS_CREATOR]->(p)) RETURN DISTINCT p.id AS PersonId, p.firstName AS PersonFirstName, p.lastName AS PersonLastName Structure: single_hop, negation. Relationships: HAS_CREATOR. Gates: RO/S...

[31]

Relationships: KNOWS

SNB / Path Temporal / easy. Question: Which people are within two knows hops of person with id 4398046511136? MATCH (src:Person {id: 4398046511136})-[:KNOWS*1..2]->(dst:Person) RETURN DISTINCT dst.id AS PersonId, dst.firstName AS FirstName, dst.lastName AS LastName LIMIT 200 Structure: single_hop, path, bounded_result. Relationships: KNOWS. Gates: RO/Syn/Sc...

[32]

Relationships: IS_LOCATED_IN

SNB / Ranking Topk / medium. Question: Which city records are linked from the most organisation records through :IS_LOCATED_IN? MATCH (s:Organisation)-[:IS_LOCATED_IN]->(e:City) WITH e, COUNT(DISTINCT s) AS relatedCount RETURN DISTINCT e.id AS TargetId, e.name AS TargetName, relatedCount ORDER BY relatedCount DESC LIMIT 10 Structure: single_hop, aggregat...

[33]

Relationships: HAS_TAG

SNB / Simple Aggregation / easy. Question: How many posts are tagged 'Vietnam'? MATCH (post:Post)-[:HAS_TAG]->(tag:Tag {name: 'Vietnam'}) RETURN DISTINCT COUNT(DISTINCT post) AS PostCount Structure: single_hop, aggregation. Relationships: HAS_TAG. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostCount: 1}; observed rows: 1

[34]

Relationships: LIKES

SNB / Simple Retrieval / easy. Question: Which post IDs did person with id 4398046511124 like? MATCH (p:Person {id: 4398046511124})-[:LIKES]->(post:Post) RETURN DISTINCT post.id AS PostId LIMIT 200 Structure: single_hop, bounded_result. Relationships: LIKES. Gates: RO/Syn/Schema/Exec/Judge. Result sample: {PostId: 343597385744}; observed rows: 5
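
Each entry above pairs a Cypher query with a "Relationships" listing, so the two can be cross-checked mechanically. The following is a minimal sketch of such a check, not the pipeline's own implementation. The helper names `relationship_types` and `matches_listing` are hypothetical, and the regular expression assumes simple relationship patterns such as `[:HAS_TAG]`, `[t:TRANSFER_TO]`, or `[:KNOWS*1..2]`. It does not handle alternations like `[:A|B]` or bracketed list literals. The three queries are copied from the ICIJ Ranking Topk, SNB Simple Aggregation, and SNB Complex Retrieval entries.

```python
import re

# Hypothetical helper pattern: captures the type name inside a relationship
# pattern such as [:HAS_TAG], [t:TRANSFER_TO], or [:KNOWS*1..2].
# Assumption: no type alternations ([:A|B]) and no bracketed list literals.
_REL_TYPE = re.compile(r"\[\s*\w*\s*:\s*(\w+)")


def relationship_types(cypher: str) -> set[str]:
    """Return the set of relationship type names used in a Cypher string."""
    return set(_REL_TYPE.findall(cypher))


def matches_listing(cypher: str, listed: str) -> bool:
    """Compare the query's relationship types with a comma-separated listing."""
    expected = {name.strip() for name in listed.split(",") if name.strip()}
    expected.discard("none")  # the catalog writes "none" for node-only queries
    return relationship_types(cypher) == expected


# Queries copied from the example entries above.
simple_aggregation = (
    "MATCH (post:Post)-[:HAS_TAG]->(tag:Tag {name: 'Vietnam'}) "
    "RETURN DISTINCT COUNT(DISTINCT post) AS PostCount"
)
complex_retrieval = (
    "MATCH (forum:Forum)-[:HAS_MEMBER]->(p:Person), "
    "(forum)-[:CONTAINER_OF]->(post:Post)-[:HAS_TAG]->"
    "(tag:Tag {name: 'Manuel_Noriega'}) "
    "RETURN DISTINCT p.id AS PersonId LIMIT 200"
)
ranking_topk = (
    "MATCH (e:Entity) WHERE e.jurisdiction IS NOT NULL "
    "WITH e.jurisdiction AS jurisdiction, COUNT(DISTINCT e) AS entityCount "
    "RETURN DISTINCT jurisdiction, entityCount "
    "ORDER BY entityCount DESC LIMIT 10"
)

if __name__ == "__main__":
    print(matches_listing(simple_aggregation, "HAS_TAG"))
    print(matches_listing(complex_retrieval, "CONTAINER_OF, HAS_MEMBER, HAS_TAG"))
    print(matches_listing(ranking_topk, "none"))
```

A regex-based check like this is only a lightweight sanity test on the catalog metadata. A production check would more plausibly rely on a Cypher parser or on the database's own query plan.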