RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation

Abhas Ricky; Minseok Jung; Muhammad Rameez Chatni

arxiv: 2605.22937 · v1 · pith:SPNZYYBZnew · submitted 2026-05-21 · 💻 cs.CL

Pith reviewed 2026-05-25 05:58 UTC · model grok-4.3

keywords: Text2Cypher · in-context learning · query generation · inference-time scaling · Cypher queries · Neo4j · reflection · error feedback

The pith

Feeding database error messages back into prompts cuts Cypher query execution errors by 41-50% at five samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares two ways to spend extra inference compute when generating Cypher queries from text. Independent Scaling draws new samples without memory of prior failures. Reflection-Augmented Scaling inserts the database's own error messages into the next prompt through in-context learning. On three Neo4j datasets and five code models, the reflection method reduces the Query Execution Error Rate by 41-50% at five attempts, compared with 32-38% for independent resampling. The work treats execution errors as usable signals rather than waste.

Core claim

Across three Neo4j datasets and five code-specialized language models, Reflection-Augmented Scaling that conditions each new attempt on prior execution feedback via in-context learning reduces the Query Execution Error Rate by 41--50% at n=5, outperforming Independent Scaling at 32--38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.
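To make the headline percentages concrete, here is a minimal sketch of how such a reduction could be computed. It assumes the figures are relative reductions against the single-attempt error rate, which the text here does not spell out. The function name `relative_reduction` and the sample rates are hypothetical and not taken from the paper.

```python
def relative_reduction(qer_baseline: float, qer_scaled: float) -> float:
    """Drop in error rate, expressed as a fraction of the baseline error rate."""
    if not 0.0 < qer_baseline <= 1.0:
        raise ValueError("baseline error rate must be in (0, 1]")
    return (qer_baseline - qer_scaled) / qer_baseline


# Hypothetical rates for illustration only: 20% of queries fail at one
# attempt, 11% still fail after scaling to five attempts.
print(round(relative_reduction(0.20, 0.11), 2))  # 0.45
```

Under this reading, a 45% reduction means the error rate fell to a little over half of its single-attempt value, not that 45% of all queries were repaired.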

What carries the argument

Reflection-Augmented Scaling (RAS), which re-uses database-generated syntax error messages as in-context examples for the next generation attempt.
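The difference between the two procedures can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `generate` and `execute` are hypothetical callables standing in for a language model and a database driver, the prompt wording in `build_prompt` is invented, and the toy stubs (including the error text) exist only to make the sketch runnable.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# generate(prompt) returns a Cypher string;
# execute(query) returns None on success, or the database error text on failure.
Generate = Callable[[str], str]
Execute = Callable[[str], Optional[str]]


def independent_scaling(question: str, generate: Generate,
                        execute: Execute, n: int = 5) -> Optional[str]:
    """Memoryless resampling: every attempt sees only the question."""
    for _ in range(n):
        query = generate(question)
        if execute(query) is None:
            return query
    return None


def build_prompt(question: str, history: List[Tuple[str, str]]) -> str:
    """Append each failed query and its error message to the prompt."""
    parts = [f"Question: {question}"]
    for i, (query, error) in enumerate(history, 1):
        parts.append(f"Attempt {i}:\n{query}\nDatabase error:\n{error}")
    parts.append("Write a Cypher query that executes without error.")
    return "\n\n".join(parts)


def reflection_augmented_scaling(question: str, generate: Generate,
                                 execute: Execute, n: int = 5) -> Optional[str]:
    """Each new attempt is conditioned on all prior failures and their errors."""
    history: List[Tuple[str, str]] = []
    for _ in range(n):
        query = generate(build_prompt(question, history))
        error = execute(query)
        if error is None:
            return query
        history.append((query, error))
    return None


# Toy stand-ins: the "model" only fixes its typo once it has seen an error,
# and the "database" rejects any query without a RETURN keyword.
def toy_generate(prompt: str) -> str:
    if "Database error" in prompt:
        return "MATCH (p:Person) RETURN p.name"
    return "MATCH (p:Person) RETRN p.name"


def toy_execute(query: str) -> Optional[str]:
    return None if " RETURN " in query else "made-up error: invalid input 'RETRN'"


print(independent_scaling("List people", toy_generate, toy_execute))
# None
print(reflection_augmented_scaling("List people", toy_generate, toy_execute))
# MATCH (p:Person) RETURN p.name
```

The toy model is deliberately extreme: it never succeeds without feedback, so it shows the mechanism rather than a realistic gap between the two methods.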

If this is right

RAS produces more executable queries than memoryless resampling at identical sample budgets.
Execution feedback improves results across multiple code-specialized models without extra training.
The gain holds on three different Neo4j graph datasets.
Error messages can be incorporated through standard in-context learning rather than custom engineering.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same feedback loop could be applied to SQL or other executable languages that return structured errors.
Combining RAS with model fine-tuning on query pairs might produce further gains.
In production graph applications, the method could lower the rate of rejected queries without changing the underlying model.
The approach might extend to other structured generation tasks where an external verifier returns error text.

Load-bearing premise

Database error messages are consistent enough and model-interpretable enough to serve as reliable feedback without introducing new failure modes.

What would settle it

Run the same models and datasets but replace each error message with a random string of equal length and measure whether RAS still beats independent resampling.
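One way to build the placebo feedback for this test is sketched below. It matches length in characters, which is an assumption: a token-matched control would need the tokenizer of the model under test. The helper name `placebo_message` is hypothetical.

```python
import random
import string


def placebo_message(error_message: str, rng: random.Random) -> str:
    """Return random text with the same character count as the real error."""
    alphabet = string.ascii_letters + string.digits + " "
    return "".join(rng.choice(alphabet) for _ in error_message)


rng = random.Random(0)  # fixed seed so the control is reproducible
real = "made-up error text standing in for a database message"
fake = placebo_message(real, rng)
print(len(fake) == len(real))  # True
```

Swapping `fake` for `real` in the reflection prompt keeps prompt growth roughly constant while removing the semantic content of the feedback.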

Figures

Figures reproduced from arXiv: 2605.22937 by Abhas Ricky, Minseok Jung, Muhammad Rameez Chatni.

**Figure 1.** Our method iteratively generates queries and executes them against the database. When execution fails, the system incorporates execution feedback through reflection-based in-context learning (ICL) to refine subsequent generations. Increasing the inference scale expands the reflection context and improves the probability of producing executable queries models improve outputs using signals from prior attem…

**Figure 2.** Reflection-augmented inference-time scaling for Text2Cypher. Syntax and schema errors surface to users as execution failures that break interactive query workflows, making them a distinct and user-visible failure mode. Left: under single-pass generation, a Cypher query is produced and executed against the graph database; execution failures trigger execution-aware reflection via in-context learning. Right: …

**Figure 3.** Query Error Rate (QER) across three graph datasets under inference-time scaling. Q@1 denotes baseline

Original abstract

Inference-time scaling can reduce errors in structured query generation, but methods for allocating compute to query code generation remain underexplored. We study Text2Cypher, where language models generate Cypher queries that execute against property graph databases. Non-executable queries constitute a distinct syntactic failure separate from semantic inaccuracy: a syntax error triggers a system-generated error message from the database. These error messages are typically discarded at inference time rather than leveraged through in-context learning (ICL). We compare two inference methods: Independent Scaling (IS), which performs memoryless resampling, and Reflection-Augmented Scaling (RAS), which conditions each new attempt on prior execution feedback via ICL. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41--50% at n=5, outperforming IS at 32--38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RAS beats independent resampling on Cypher error rates by feeding database messages back into ICL, but the experiments leave open whether the gains come from the feedback itself or from unablated prompt changes.

The core result is that conditioning each new Cypher sample on the prior execution error via in-context learning reduces the fraction of non-executable queries more than drawing independent samples. Across three Neo4j datasets and five code models, the reflection method reaches 41-50% error reduction at n=5 while the memoryless baseline reaches only 32-38%. The paper treats syntax errors as a distinct, fixable failure mode rather than generic noise, and it shows that the database's own messages can be turned into usable context without extra training. That comparison is the actual contribution and it is easy to understand. The experiments are run on real datasets and multiple models, which gives the numbers some weight.

The soft spot is exactly the one the stress-test note flags: nothing in the abstract or the reported setup checks whether the error messages are consistently informative, whether models actually attend to them, or whether simply lengthening the prompt would produce similar gains. There are also no variance numbers, no temperature controls described, and no ablation that removes the reflection step while keeping prompt length fixed. If those checks are missing from the full paper, the claimed advantage rests on a single untested assumption about feedback quality.

This work is aimed at people already running inference-time scaling on code or query generation where an executor returns structured errors. It is narrow but the method is cheap to try, so it deserves a serious referee who can ask for the missing ablations and statistical detail rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces Reflection-Augmented Scaling (RAS), an inference-time scaling method for Text2Cypher that appends Neo4j-generated execution error messages to in-context learning prompts for subsequent query generation attempts. It contrasts this with Independent Scaling (IS), which performs memoryless resampling, and reports that RAS reduces Query Execution Error Rate by 41--50% at n=5 versus 32--38% for IS, across three Neo4j datasets and five code-specialized language models. The central thesis is that database error messages constitute actionable feedback that can be directly leveraged via ICL to improve executability more efficiently than scaling independent samples.

Significance. If the empirical results are robust, the work provides evidence that inference-time compute allocation can be made more effective by conditioning on execution feedback rather than resampling, with potential implications for other structured generation tasks where compilers or databases return informative error signals. The approach is conceptually straightforward and does not introduce new parameters or training.

major comments (3)

[Results] Results section (and abstract): the reported reductions of 41--50% (RAS) versus 32--38% (IS) at n=5 are presented as point estimates without standard deviations across runs, statistical significance tests, or details on variance due to sampling temperature or random seeds, so the magnitude and reliability of the claimed advantage cannot be assessed.
[Experimental setup] Experimental setup and evaluation: no ablation isolates whether gains derive from the semantic content of the error messages versus incidental factors such as prompt length, ordering of examples, or total tokens; this directly bears on the central assumption that Neo4j error messages supply consistent, non-noisy, model-usable feedback across the five models and three datasets.
[Methods] Methods: the description of how error messages are incorporated into ICL prompts lacks controls or analysis for cases where messages are cryptic, model-specific in utility, or ignored, leaving open the possibility that observed differences arise from unablated prompt-engineering variables rather than reflection.

minor comments (2)

Clarify the precise definition of n=5 (number of attempts, samples, or beam size) and whether the same prompt template and temperature are used for both RAS and IS.
The abstract states quantitative improvements but the main text should include a table or figure with per-dataset, per-model breakdowns to allow readers to verify consistency of the 41--50% range.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful feedback highlighting needs for statistical rigor, targeted ablations, and methodological controls. We address each point below and will revise the manuscript accordingly.

Referee: [Results] Results section (and abstract): the reported reductions of 41--50% (RAS) versus 32--38% (IS) at n=5 are presented as point estimates without standard deviations across runs, statistical significance tests, or details on variance due to sampling temperature or random seeds, so the magnitude and reliability of the claimed advantage cannot be assessed.

Authors: We agree that point estimates alone limit evaluation of reliability. In the revision we will report standard deviations over at least three independent runs with different random seeds, include temperature settings, and add statistical significance tests (paired t-test or McNemar’s test) for the error-rate comparisons between RAS and IS.
Revision: yes

Referee: [Experimental setup] Experimental setup and evaluation: no ablation isolates whether gains derive from the semantic content of the error messages versus incidental factors such as prompt length, ordering of examples, or total tokens; this directly bears on the central assumption that Neo4j error messages supply consistent, non-noisy, model-usable feedback across the five models and three datasets.

Authors: This is a fair criticism; the original experiments lack such an ablation. We will add one in the revision by comparing RAS against a control that appends length-matched placeholder strings instead of real error messages, thereby isolating the contribution of semantic feedback from prompt-length or ordering effects.
Revision: yes

Referee: [Methods] Methods: the description of how error messages are incorporated into ICL prompts lacks controls or analysis for cases where messages are cryptic, model-specific in utility, or ignored, leaving open the possibility that observed differences arise from unablated prompt-engineering variables rather than reflection.

Authors: We will expand the Methods section with explicit prompt templates, preprocessing steps for error messages, and a short analysis of message types across the five models. We will also report any observed cases where messages appear to be ignored. These additions will clarify that the reported gains are not solely attributable to unexamined prompt variables.
Revision: yes
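The paired significance test mentioned in the first response can be sketched with the standard library alone. This is an exact two-sided McNemar test on per-question outcomes, assuming both methods are run on the same questions. The function name `mcnemar_exact` and the outcome lists are hypothetical illustrations, not data from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(ras_ok: Sequence[bool], is_ok: Sequence[bool]) -> float:
    """Exact two-sided McNemar p-value from paired success flags."""
    if len(ras_ok) != len(is_ok):
        raise ValueError("outcome lists must be paired and equal in length")
    # Only discordant pairs carry information about which method is better.
    b = sum(1 for r, i in zip(ras_ok, is_ok) if r and not i)  # RAS only
    c = sum(1 for r, i in zip(ras_ok, is_ok) if i and not r)  # IS only
    n = b + c
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail at the smaller count, doubled for two sides.
    tail = sum(comb(n, j) for j in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes for 12 questions: RAS alone succeeds on 8,
# IS alone succeeds on 1, and both succeed on 3.
ras = [True] * 8 + [False] + [True] * 3
ind = [False] * 8 + [True] + [True] * 3
print(mcnemar_exact(ras, ind))  # 0.0390625
```

Because the test uses only questions where the two methods disagree, it speaks to the paired comparison directly instead of comparing two aggregate error rates.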

Circularity Check

0 steps flagged

No circularity: empirical comparison of distinct inference procedures

The paper reports an empirical evaluation of two inference-time methods (Independent Scaling vs Reflection-Augmented Scaling) on three external Neo4j datasets using five code models, with Query Execution Error Rate measured directly from execution outcomes. No equations, fitted parameters, predictions derived from inputs by construction, or load-bearing self-citations appear in the provided text. The central claim rests on externally measured performance differences between procedurally distinct sampling strategies rather than any self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the central claim rests on the empirical comparison alone.

Appendix A.1 (Experimental Results): Table 4 provides a granular breakdown of the Query Execution Error Rate (QER) across all evaluat...
