Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL

arxiv: 2604.09470 · v1 · submitted 2026-04-10 · 💻 cs.CL

Vishnu Murali, Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, Vamse Kumar Subbiah

Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3

classification 💻 cs.CL

keywords text-to-JQL · agentic LLMs · Jira Query Language · semantic retrieval · execution-based benchmark · natural language interfaces · categorical value grounding · live query evaluation

The pith

Equipping LLMs with live Jira execution and semantic retrieval tools raises their accuracy in converting natural language to JQL queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a benchmark of 100,000 natural-language to JQL pairs drawn from a live Jira instance containing over 200,000 issues. It shows that ordinary single-pass language models reach only 43.4 percent execution accuracy on short queries because they lack access to live data and cannot reliably resolve instance-specific values such as component names or fix versions. The authors then test an agentic system that lets models execute queries directly and uses a semantic retrieval tool to ground categorical mentions through embedding similarity. This combination improves accuracy for seven of the nine models examined and produces especially large gains on value resolution. The remaining errors trace mainly to inherent language ambiguities rather than missing data access.

Core claim

Single-pass LLMs average 43.4 percent execution accuracy on the benchmark. Augmenting them with live query execution and a semantic retrieval tool for categorical values improves seven of nine frontier models, with a 9 percent relative gain on the hardest variant. In an ablation, the semantic tool alone lifts categorical-value accuracy from 48.7 percent to 71.7 percent and component-field accuracy from 16.9 percent to 66.2 percent. Dominant remaining failures are semantic ambiguities such as issue-type disambiguation and text-field selection.

What carries the argument

Agentic Jackal, a tool-augmented agent that adds live query execution via the Jira MCP server and JiraAnchor, an embedding-based similarity search tool for resolving natural-language mentions of categorical values.
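As a rough illustration of the value-grounding idea, the sketch below ranks candidate field values by cosine similarity to a natural-language mention and returns the top K. It is a minimal sketch under stated assumptions, not the authors' implementation: the character n-gram `embed` function stands in for whatever embedding model JiraAnchor uses, the `anchor` function name is hypothetical, and the component names are invented rather than fetched from a live Jira instance.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> Counter:
    """Toy embedding: bag of character n-grams (a stand-in for a real embedding model)."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def anchor(mention: str, candidates: list[str], k: int = 10) -> list[tuple[str, float]]:
    """Rank candidate field values by similarity to the mention; return the top k."""
    query = embed(mention)
    scored = [(value, cosine(query, embed(value))) for value in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical component names; a real tool would fetch these from the live instance.
components = ["Core: Networking", "Payments API", "Build System", "Documentation"]
top = anchor("networking", components, k=2)
print(top[0][0])
```

The point of returning exact stored names is that the model can paste one directly into a JQL clause instead of guessing a spelling that does not exist in the instance.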

If this is right

Single-pass models average only 43.4 percent execution accuracy on short natural-language queries.
The agentic approach improves accuracy for seven of the nine models tested.
JiraAnchor alone raises categorical-value accuracy from 48.7 percent to 71.7 percent.
Component-field accuracy rises from 16.9 percent to 66.2 percent when semantic retrieval is added.
Semantic ambiguities such as issue-type choice and text-field selection become the main remaining error sources once data access is solved.
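The headline numbers mix two kinds of gain: the 9 percent figure is stated as a relative gain, while the ablation results are given as before and after accuracies. The short calculation below is only a reading aid that shows both views for the reported ablation pairs; the helper names are hypothetical.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting accuracy, in percent."""
    return round(100 * (after - before) / before, 1)


# Accuracies (in percent) as reported for the JiraAnchor ablation.
reported = {
    "categorical values": (48.7, 71.7),
    "component field": (16.9, 66.2),
}
for name, (before, after) in reported.items():
    print(name, point_gain(before, after), relative_gain(before, after))
```

Read this way, the component-field result is a gain of 49.3 percentage points, which is the same quantity the Figure 5 caption reports as an improvement of 49.3%.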

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable agentic designs that pair live execution with semantic grounding could be tested on other structured query languages such as SQL or search-engine filters.
The released benchmark and transcripts make it possible to measure whether the observed gains hold when queries become longer or contain more nested Boolean conditions.
Future systems may need separate mechanisms for resolving field-selection and issue-type ambiguities beyond the value-grounding tools introduced here.

Load-bearing premise

The 100,000 pairs and the chosen live Jira instance represent the range of real user queries and the categorical value distributions found in typical organizations.

What would settle it

Re-running the full evaluation on a different Jira instance with its own categorical distributions or on a fresh set of longer and more paraphrased queries and finding that the agentic gains disappear would show the improvements do not generalize.

Figures

Figures reproduced from arXiv: 2604.09470 by Anmol Gulati, Elias Lumer, Kevin Frank, Sindy Campagna, Vamse Kumar Subbiah, Vishnu Murali.

**Figure 1.** Naive single-turn approach (left) compared to Agentic Jackal (right). The naive approach generates JQL in a single pass without tool access. The agent iteratively refines queries using JiraSearch and JiraAnchor (both available in Experiments 1 and 2), looping until it produces a verified output. The JiraAnchor tool enables the agent to resolve ambiguous user queries by retrieving relevant unique values in …

**Figure 2.** Overview of the two evaluation datasets. Left: Jackal-1K, a 1,000-query stratified subset of the Jackal benchmark with four NL variants per query, used in Experiment 1. Right: The Field-Value Evaluation Set, a targeted set of 2-clause queries covering three categorical fields, used in Experiment 2. 3. Datasets 3.1. Jackal-1K We draw our primary evaluation data from Jackal (Frank et al., 2025), the first op…

**Figure 3.** JiraAnchor pipeline. Given a field name and natural language value mention, the tool fetches candidate values from the live Jira instance, computes embedding-based similarity scores, and returns top-10 matches (K=10) ranked by cosine similarity. (K=10) with their exact names, ready for direct use in JQL. In our experiments we use the embedding strategy, which computes cosine similarity between the query an…

**Figure 4.** Average execution accuracy by query variant (9 models). Short NL shows the largest agentic gain (improvement of 3.9%), while Sem. Exact is near ceiling for both conditions (improvement of 2.3%). Metrics. All experiments use execution accuracy (EX) as the primary metric. A predicted JQL query is correct if, when executed against the live Jira instance, it returns the same set of issue keys as the gold query…

**Figure 5.** Average execution accuracy by categorical field (9 models). JiraAnchor yields the largest gain on component queries (improvement of 49.3%), where baseline accuracy without value grounding collapses to 16.9%. with gains concentrated on Short NL (improvement of 3.9%). The best individual result is Gemini 3 Flash, which improves from 62.8% to 71.0% overall, an improvement of 8.2%, driven by a large gain on S…

**Figure 6.** Error taxonomy for Agentic Jackal failures by query variant. Each bar shows the proportion of failures attributed to six error categories. On the hardest variants (Short NL and Sem. Similar), semantic interpretation errors dominate: Issue Type Interp. (e.g., “bugs” as issuetype=Bug vs. text search), Text Field Selection (summary vs. description), and Version Confusion (affectedVersion vs. fixVersion). Thes…
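The execution-accuracy (EX) metric used throughout, under which a predicted query counts as correct when it returns the same set of issue keys as the gold query, can be sketched as follows. This is an illustrative sketch rather than the released evaluation code: the function names and issue keys are hypothetical, and treating two empty result sets as a match is an assumption the source does not address.

```python
def execution_match(predicted_keys, gold_keys) -> bool:
    """A prediction counts as correct when both queries return the same set of issue keys."""
    return set(predicted_keys) == set(gold_keys)


def execution_accuracy(results) -> float:
    """Fraction of (predicted_keys, gold_keys) pairs that match exactly as sets."""
    if not results:
        return 0.0
    correct = sum(execution_match(pred, gold) for pred, gold in results)
    return correct / len(results)


# Hypothetical issue keys standing in for results returned by a live instance.
results = [
    (["QTBUG-1", "QTBUG-2"], ["QTBUG-2", "QTBUG-1"]),  # same set, different order: correct
    (["QTBUG-1"], ["QTBUG-1", "QTBUG-3"]),             # missing an issue: incorrect
    ([], []),                                          # both empty: counted as a match here
]
ex = execution_accuracy(results)
print(f"EX = {ex:.3f}")
```

Comparing result sets rather than query strings means two differently written JQL queries are scored as equivalent whenever they retrieve the same issues.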

Original abstract

Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
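The loop the paper describes (iterative refinement with a search tool and JiraAnchor until a verified output is produced) might look roughly like the sketch below. Every name here is a hypothetical stand-in: `propose` replaces the LLM's choice of next step, the two tool callables replace the Jira MCP server and JiraAnchor, and treating "executed without error" as the verification criterion is an assumption, not something the source specifies.

```python
from typing import Callable, Optional


def run_agent(
    question: str,
    propose: Callable[[str, list], dict],     # stand-in for the LLM's next-step choice
    jira_search: Callable[[str], list],       # runs JQL; assumed to raise ValueError on bad queries
    jira_anchor: Callable[[str, str], list],  # returns candidate exact values for a mention
    max_turns: int = 5,
) -> Optional[str]:
    """Ask the policy for a tool call, run it, and feed the observation back."""
    history: list = []
    for _ in range(max_turns):
        action = propose(question, history)
        if action["tool"] == "anchor":
            observation = jira_anchor(action["field"], action["mention"])
        else:
            try:
                jira_search(action["jql"])
            except ValueError as err:
                observation = f"error: {err}"
            else:
                return action["jql"]  # query executed, so treat it as the final output
        history.append((action, observation))
    return None  # turn budget exhausted without an executable query


# Scripted stand-ins so the sketch runs without an LLM or a Jira instance.
def scripted_policy(question: str, history: list) -> dict:
    if not history:
        return {"tool": "anchor", "field": "component", "mention": "networking"}
    exact = history[-1][1][0]
    return {"tool": "search", "jql": f'component = "{exact}"'}


def fake_anchor(field: str, mention: str) -> list:
    return ["Core: Networking"]


def fake_search(jql: str) -> list:
    if "Core: Networking" not in jql:
        raise ValueError("unknown component")
    return ["QTBUG-1"]


final_jql = run_agent("networking issues", scripted_policy, fake_search, fake_anchor)
print(final_jql)
```

Capping the number of turns keeps a confused policy from looping indefinitely; the cap of five is arbitrary.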

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Jackal gives a usable new execution benchmark for text-to-JQL plus a grounded agent that lifts accuracy on most models via live runs and semantic lookup.

This paper's main addition is the Jackal benchmark of 100k validated natural-language to JQL pairs drawn from a live Jira instance with over 200k issues, plus the Agentic Jackal system that adds live query execution through the Jira MCP server and the JiraAnchor embedding tool for resolving categorical values like components or versions. Single-pass LLMs average 43.4% execution accuracy on the short queries in the set, and the agentic version improves seven of the nine models tested, with a 9% relative gain on the hardest variant. The ablation that isolates JiraAnchor shows a clean lift in categorical-value accuracy from 48.7% to 71.7% and component-field accuracy from 16.9% to 66.2%. They release the pairs, transcripts, and evaluation code, which makes the claims checkable. The failure-mode discussion is also direct: semantic ambiguities around issue types and text fields remain the main blockers once value lookup is handled. The work stays empirical and avoids overclaiming.

The limits are straightforward. Results are tied to one Jira instance, so the value distributions and ambiguity patterns are specific to that data. Queries are short, and it is unclear how performance would change with longer or more nested requests. The pair-generation and validation steps are not described in enough detail in the abstract to judge coverage or selection effects. Absolute accuracies are still low, and the gains, while consistent, are incremental rather than transformative.

This is for engineers and researchers who build LLM tooling for enterprise query languages or project-management systems. It supplies a concrete testbed and a working agent pattern that others can run or extend. I would send it to peer review because the released materials let referees verify the numbers and the central empirical claims hold together on their own terms.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Jackal benchmark, the first large-scale execution-based text-to-JQL dataset with 100,000 validated NL-JQL pairs on a live Jira instance containing over 200,000 issues. It proposes Agentic Jackal, a tool-augmented agent that provides LLMs with live query execution via the Jira MCP server and JiraAnchor, an embedding-based semantic retrieval tool for resolving instance-specific categorical values. Baselines show single-pass frontier LLMs average 43.4% execution accuracy on short queries; the agentic setup improves 7 of 9 models (9.0% relative gain on the hardest variant), and an ablation isolating JiraAnchor lifts categorical-value accuracy from 48.7% to 71.7% and component-field accuracy from 16.9% to 66.2%. The work identifies semantic ambiguities (e.g., issue-type disambiguation) as remaining failure modes and releases the benchmark, transcripts, and code.

Significance. If the empirical results hold under the released artifacts, the paper demonstrates that single-pass LLMs are fundamentally limited on text-to-JQL by lack of instance-specific knowledge and inability to verify execution, while tool use with live execution and semantic grounding yields substantial, targeted gains. The public release of the 100k validated pairs, all agent transcripts, and evaluation code is a clear strength that enables reproducibility and community follow-up on domain-specific semantic parsing.

major comments (2)

[Abstract and benchmark construction section] The description of how the 100,000 NL-JQL pairs were generated, filtered, and validated against the live Jira instance (including any sampling strategy or bias mitigation for the >200k issues) is insufficiently detailed to assess representativeness of real-world queries and categorical distributions; this directly affects confidence in the 43.4% baseline and ablation gains.
[Experiments / ablation subsection] The controlled ablation isolating JiraAnchor reports clear lifts (categorical-value accuracy 48.7% → 71.7%, component-field 16.9% → 66.2%), but the exact protocol for constructing the ablation test set and ensuring no leakage from the full 100k pairs should be stated explicitly to confirm the gains are attributable to semantic retrieval rather than data artifacts.

minor comments (2)

[Abstract] Define 'most linguistically challenging variant' with reference to a specific table, metric, or query-length statistic rather than leaving it implicit.
[Throughout] Expand acronyms (JQL, MCP, NL) on first use and ensure consistent notation for accuracy metrics across text and tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment, recognition of the benchmark's significance, and recommendation for minor revision. The comments are constructive and will improve the clarity of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.

Referee: [Abstract and benchmark construction section] The description of how the 100,000 NL-JQL pairs were generated, filtered, and validated against the live Jira instance (including any sampling strategy or bias mitigation for the >200k issues) is insufficiently detailed to assess representativeness of real-world queries and categorical distributions; this directly affects confidence in the 43.4% baseline and ablation gains.

Authors: We agree that the benchmark construction section would benefit from expanded detail on generation, filtering, validation, sampling, and bias mitigation to better demonstrate representativeness. The manuscript describes validation of the 100k pairs on the live instance with >200k issues, but we will revise this section to explicitly cover: (1) the query generation approach and how natural language paraphrases were produced, (2) filtering criteria for executability and syntactic validity, (3) the sampling strategy across issue types, fields, and categorical distributions to reflect real-world usage, and (4) any bias mitigation steps such as stratified sampling. These additions will strengthen confidence in the reported baselines. revision: yes
Referee: [Experiments / ablation subsection] The controlled ablation isolating JiraAnchor reports clear lifts (categorical-value accuracy 48.7% → 71.7%, component-field 16.9% → 66.2%), but the exact protocol for constructing the ablation test set and ensuring no leakage from the full 100k pairs should be stated explicitly to confirm the gains are attributable to semantic retrieval rather than data artifacts.

Authors: We thank the referee for this observation. The ablation isolates JiraAnchor by comparing performance with and without the semantic retrieval tool on queries involving categorical values. To prevent leakage, the ablation test set was drawn from a held-out subset of the benchmark, and the JiraAnchor embedding index was constructed exclusively from issue metadata excluding any test queries. We will revise the experiments/ablation subsection to state this protocol explicitly, including the test set size, the exact split procedure, and confirmation of no overlap with the retrieval index, thereby verifying that the reported accuracy lifts are attributable to the semantic grounding mechanism. revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper introduces a new execution-based benchmark (Jackal) with 100k NL-JQL pairs and evaluates single-pass vs. agentic LLM performance via live Jira execution and ablations on JiraAnchor. All reported metrics (43.4% baseline accuracy, 9.0% relative gain, categorical-value lift from 48.7% to 71.7%) are direct empirical measurements on the constructed dataset and live instance; no equations, fitted parameters, or predictions are claimed, and no derivation chain reduces to self-definition or self-citation. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

Central claims rest on a newly created benchmark and two new tools (JiraAnchor and the agent loop) whose effectiveness is demonstrated empirically rather than derived from prior axioms; no free parameters are fitted to produce the headline numbers.

axioms (2)

domain assumption: Embedding similarity search can reliably surface instance-specific categorical values from natural language mentions.
Invoked to justify JiraAnchor design and the observed accuracy jump.
domain assumption: Live query execution provides an objective, instance-grounded correctness signal for generated JQL.
Underpins the entire execution-accuracy metric and benchmark validation.

invented entities (3)

Jackal benchmark (no independent evidence)
purpose: Large-scale set of validated NL-JQL pairs for reproducible evaluation
Newly constructed dataset; no prior equivalent cited.
JiraAnchor (no independent evidence)
purpose: Embedding-based semantic retrieval tool for resolving categorical field values
Novel component introduced to address value-resolution failures.
Agentic Jackal (no independent evidence)
purpose: Tool-augmented LLM agent combining execution and semantic search
New system architecture proposed and evaluated.

pith-pipeline@v0.9.0 · 5640 in / 1806 out tokens · 62015 ms · 2026-05-10T16:51:51.928970+00:00 · methodology
