Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

Xinyi Huang

arxiv: 2604.10516 · v3 · submitted 2026-04-12 · 💻 cs.CL

Pith reviewed 2026-05-10 15:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords: knowledge retrieval, code dependencies, multi-step reasoning, large language models, data analysis, graph-based retrieval, retrieval-augmented generation

The pith

Retrieving knowledge along code dependency graphs works better than similarity search for multi-step LLM data tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models need precise domain knowledge to handle multi-step data analysis, yet typical retrieval uses only text or embedding similarity. This paper claims that executable code contains the real structure of how computations connect, so knowledge should be pulled along function-call dependency paths instead. SGKR extracts semantic input and output tags from the query, traces the relevant paths in a pre-built code graph, and assembles the linked knowledge plus function code into a focused context. When this context is given to an LLM or coding agent, the generated solutions are more often correct. The result is presented as a direct improvement over both no-retrieval and similarity baselines on standard multi-step benchmarks.

Core claim

SGKR induces a graph from function-call dependencies in domain code. Given a question, it extracts semantic input and output tags, locates the dependency paths that connect those tags, extracts the corresponding subgraph, and supplies the knowledge descriptions together with the actual function implementations as structured context for the LLM to generate code solutions.
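
The retrieval step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the graph contents, the `tag:` and `fn:` node prefixes, the edge direction (producer to consumer), and the helper names `reachable`, `reverse`, and `path_subgraph` are all hypothetical. The sketch keeps the nodes that lie on at least one path from an input tag to an output tag by intersecting forward and backward reachability.

```python
from collections import deque

# Hypothetical dependency graph: an edge u -> v means "v consumes what u produces".
# Tag nodes are prefixed with "tag:", function nodes with "fn:"; all names are invented.
GRAPH = {
    "tag:transactions": ["fn:load_payments"],
    "fn:load_payments": ["fn:filter_by_merchant"],
    "fn:filter_by_merchant": ["fn:compute_fee"],
    "fn:compute_fee": ["tag:total_fee"],
    "fn:plot_volume": ["tag:volume_chart"],
    "tag:total_fee": [],
    "tag:volume_chart": [],
}


def reachable(graph, start):
    """Return every node reachable from start (inclusive) via breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Build a copy of the graph with every edge flipped."""
    rev = {node: [] for node in graph}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def path_subgraph(graph, input_tag, output_tag):
    """Nodes lying on at least one path from input_tag to output_tag."""
    forward = reachable(graph, input_tag)
    backward = reachable(reverse(graph), output_tag)
    return forward & backward


nodes = path_subgraph(GRAPH, "tag:transactions", "tag:total_fee")
functions = sorted(n for n in nodes if n.startswith("fn:"))
print(functions)
# ['fn:compute_fee', 'fn:filter_by_merchant', 'fn:load_payments']
```

In this toy graph the unrelated `fn:plot_volume` is left out because it sits on no path between the two tags, whatever its textual similarity to the query might be.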

What carries the argument

A graph built from function-call dependencies, together with the extraction of semantic input/output tags and the selection of connecting dependency paths to form a task-specific subgraph.

Load-bearing premise

Extracting semantic input and output tags and then tracing dependency paths in the code graph will surface exactly the knowledge needed for the multi-step task without missing critical pieces or adding noise.

What would settle it

A multi-step data benchmark in which the extracted dependency paths either omit required functions or include many irrelevant ones, so that solution accuracy is no higher than that of the similarity-based baseline.

Figures

Figures reproduced from arXiv: 2604.10516 by Xinyi Huang.

**Figure 2.** Framework of SGKR, the structure-grounded code retrieval system. In the offline phase, code is parsed via AST to build a semantic graph and a function-call dependency graph, while external knowledge is inserted into relevant functions as comments. Extracted semantic input–output (I/O) tags are added as nodes to augment the graph. At inference time, I/O tags from a query guide the retrieval of …
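
The offline phase named in the caption can be approximated with Python's standard `ast` module. The sketch below is an illustration only: the toy module in `SOURCE`, the helper name `build_call_graph`, and the use of docstrings to carry knowledge are assumptions (the caption says knowledge is inserted as comments, which `ast` discards, so docstrings stand in here). It resolves only direct calls by bare name between top-level functions.

```python
import ast

# Toy domain module; function names and the "Knowledge:" docstring convention are invented.
SOURCE = '''
def load_payments(path):
    """Knowledge: payments.csv holds one row per transaction."""
    return open(path).read().splitlines()

def compute_fee(rows):
    """Knowledge: fee = fixed amount + rate * transaction value."""
    return len(rows)

def total_fee(path):
    rows = load_payments(path)
    return compute_fee(rows)
'''


def build_call_graph(source):
    """Map each top-level function to the module functions it calls, plus its docstring."""
    tree = ast.parse(source)
    defined = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = {}
    for name, node in defined.items():
        callees = set()
        for sub in ast.walk(node):
            # Only direct calls by bare name are resolved; methods and aliases are ignored.
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name):
                if sub.func.id in defined:
                    callees.add(sub.func.id)
        graph[name] = sorted(callees)
    docs = {name: ast.get_docstring(node) for name, node in defined.items()}
    return graph, docs


call_graph, knowledge = build_call_graph(SOURCE)
print(call_graph)
# {'load_payments': [], 'compute_fee': [], 'total_fee': ['compute_fee', 'load_payments']}
```

A real codebase would also need method calls, imports across modules, and aliasing handled, none of which this sketch attempts.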

Original abstract

Selecting the right knowledge is critical when using large language models (LLMs) to solve domain-specific data analysis tasks. However, most retrieval-augmented approaches rely primarily on lexical or embedding similarity, which is often a weak proxy for the task-critical knowledge needed for multi-step reasoning. In many such tasks, the relevant knowledge is not merely textually related to the query, but is instead grounded in executable code and the dependency structure through which computations are carried out. To address this mismatch, we propose SGKR (Structure-Grounded Knowledge Retrieval), a retrieval framework that organizes domain knowledge with a graph induced by function-call dependencies. Given a question, SGKR extracts semantic input and output tags, identifies dependency paths connecting them, and constructs a task-relevant subgraph. The associated knowledge and corresponding function implementations are then assembled as a structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks show that SGKR consistently improves solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.
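
The final assembly step in the abstract, where knowledge and function implementations are combined into a structured context, might look roughly like the following. The prompt layout, the dictionary keys, and the function name `assemble_context` are assumptions made for illustration; the abstract does not specify the format SGKR uses.

```python
def assemble_context(question, functions):
    """Render retrieved functions as one prompt string, in the order given."""
    parts = [f"Question: {question}", "", "Relevant knowledge and code:"]
    for i, fn in enumerate(functions, start=1):
        parts.append(f"[{i}] {fn['name']}")
        parts.append(f"Knowledge: {fn['knowledge']}")
        parts.append(fn["code"].rstrip())
        parts.append("")
    return "\n".join(parts).rstrip() + "\n"


# Invented example entries, listed in dependency order (producer before consumer).
retrieved = [
    {
        "name": "load_payments",
        "knowledge": "payments.csv holds one row per transaction.",
        "code": "def load_payments(path):\n    return open(path).read().splitlines()\n",
    },
    {
        "name": "compute_fee",
        "knowledge": "fee = fixed amount + rate * transaction value.",
        "code": "def compute_fee(rows):\n    return len(rows)\n",
    },
]

prompt = assemble_context("What is the total fee for merchant X?", retrieved)
print(prompt)
```

Listing functions in dependency order is a design choice of this sketch: it lets the model read each step before the step that consumes its output.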

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes SGKR, a retrieval framework that organizes domain knowledge using a graph based on function-call dependencies in code. For a given question, it extracts semantic input and output tags, identifies connecting dependency paths to build a task-relevant subgraph, and assembles the knowledge and function implementations into structured context for LLM-based code generation. Experiments on multi-step data analysis benchmarks demonstrate consistent improvements in solution correctness over no-retrieval and similarity-based retrieval baselines for both vanilla LLMs and coding agents.

Significance. If the results hold, this work could have notable significance in retrieval-augmented generation for complex reasoning tasks. By grounding retrieval in code dependency structures rather than pure similarity, it offers a promising direction for improving LLM performance on domain-specific multi-step data analysis. The reported consistent gains over baselines highlight potential practical benefits, and the method's independence from fitted parameters is a strength.

major comments (2)

[Abstract (method description)] The effectiveness of SGKR relies on the reliability of semantic input/output tag extraction and dependency path identification to surface task-critical knowledge. However, no precision or recall figures, error analysis, or ablation studies isolating this component are mentioned, which is critical to substantiate that the improvements stem from the structure-grounded approach rather than incidental factors.
[Abstract (experiments claim)] The abstract reports consistent benchmark improvements but lacks details on exact experimental controls, statistical tests, or potential confounds in tag extraction and subgraph construction. This leaves the support for the central claim at a moderate level and requires clarification to confirm robustness.

minor comments (1)

[Abstract] Consider specifying the particular multi-step data analysis benchmarks used and the quantitative magnitude of the improvements to provide readers with better context on the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive review and the recommendation for major revision. We value the emphasis on validating the core components of SGKR and ensuring experimental robustness. We address each major comment below and commit to revisions that strengthen the manuscript without altering its central claims.

Point-by-point responses

Referee: [Abstract (method description)] The effectiveness of SGKR relies on the reliability of semantic input/output tag extraction and dependency path identification to surface task-critical knowledge. However, no precision or recall figures, error analysis, or ablation studies isolating this component are mentioned, which is critical to substantiate that the improvements stem from the structure-grounded approach rather than incidental factors.

Authors: We agree that the reliability of semantic tag extraction and dependency path identification is foundational, and the absence of isolated metrics leaves room for alternative explanations of the gains. While the manuscript demonstrates end-to-end improvements, we will revise by adding a dedicated component analysis subsection. This will include precision/recall figures for tag extraction on a held-out annotated query set, a categorized error analysis of path identification failures (e.g., missing dependencies due to ambiguous tags), and an ablation replacing the dependency subgraph with a top-k similarity selection of equivalent size. These changes will directly isolate the contribution of the structure-grounded mechanism.

Revision: yes
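
The precision/recall figures promised in this response could be computed along the following lines. This is a hedged sketch: the tag strings, the query identifiers, and the choice of micro-averaging are assumptions, and the numbers are invented for illustration, not results from the paper.

```python
def micro_precision_recall(predicted, gold):
    """Micro-averaged precision and recall over per-query tag sets."""
    tp = fp = fn = 0
    for query_id, gold_tags in gold.items():
        pred_tags = predicted.get(query_id, set())
        tp += len(pred_tags & gold_tags)   # tags extracted and annotated
        fp += len(pred_tags - gold_tags)   # tags extracted but not annotated
        fn += len(gold_tags - pred_tags)   # annotated tags that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented annotations for two queries; not data from the paper.
gold = {
    "q1": {"in:transactions", "out:total_fee"},
    "q2": {"in:merchants", "out:fraud_rate"},
}
predicted = {
    "q1": {"in:transactions", "out:total_fee"},
    "q2": {"in:merchants", "out:total_fee"},
}

precision, recall = micro_precision_recall(predicted, gold)
print(precision, recall)
# 0.75 0.75
```
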
Referee: [Abstract (experiments claim)] The abstract reports consistent benchmark improvements but lacks details on exact experimental controls, statistical tests, or potential confounds in tag extraction and subgraph construction. This leaves the support for the central claim at a moderate level and requires clarification to confirm robustness.

Authors: We acknowledge that the abstract and current experimental description provide limited transparency on controls and statistical rigor, which weakens the evidential support. In the revision, we will expand the experiments section with: explicit controls (identical LLM, fixed prompts with content-only variations, and consistent tag extraction across all baselines); standard deviations and results over multiple runs with varied seeds; statistical significance via McNemar's test on paired correctness outcomes; and discussion of confounds such as context length differences, with mitigations like length-normalized prompts. These additions will confirm that gains are attributable to SGKR rather than setup artifacts.

Revision: yes
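
For reference, McNemar's test on paired correctness outcomes can be written with the standard library alone. The sketch below uses the exact (binomial) form of the test, which is one common variant and an assumption here since the response does not say which form would be used; the outcome vectors are invented and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(baseline_correct, method_correct):
    """Two-sided exact McNemar p-value from paired boolean outcomes."""
    # b: baseline right, method wrong; c: baseline wrong, method right.
    b = sum(x and not y for x, y in zip(baseline_correct, method_correct))
    c = sum(y and not x for x, y in zip(baseline_correct, method_correct))
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis the discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented per-task correctness for ten tasks; not results from the paper.
baseline = [True, False, False, False, True, False, True, True, False, False]
method = [True, True, True, True, True, True, False, True, True, True]

b, c, p_value = mcnemar_exact(baseline, method)
print(b, c, p_value)
# 1 6 0.125
```

Only the discordant pairs (tasks where exactly one system is correct) enter the statistic, which is why the test suits paired comparisons on the same benchmark items.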

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper defines SGKR as an independent retrieval framework that builds a code-dependency graph, extracts semantic I/O tags, identifies connecting paths, and assembles a subgraph for LLM context. This construction is described procedurally without reference to fitted parameters, self-citations, or prior results from the same authors. Performance claims rest on external benchmark comparisons (no-retrieval and similarity baselines) rather than any reduction of the method or its outputs to the inputs by construction. No equations, uniqueness theorems, or ansatzes are invoked that collapse the central claim into a tautology or renamed fit. The derivation is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the premise that code dependency structures encode reasoning paths; no explicit free parameters or invented entities are described, but tag extraction and path identification are treated as reliable operations.

axioms (2)

Domain assumption: Semantic input and output tags can be accurately extracted from questions to seed relevant dependency paths.
Invoked in the description of how SGKR identifies paths connecting tags to build the subgraph.

Domain assumption: Function-call dependencies in domain codebases correspond to the computational steps needed for multi-step data reasoning.
Central to organizing knowledge via the induced graph.
