Agentic Jackal: Live Execution and Semantic Value Grounding for Text-to-JQL
Pith reviewed 2026-05-10 16:51 UTC · model grok-4.3
The pith
Equipping LLMs with live Jira execution and semantic retrieval tools raises their accuracy in converting natural language to JQL queries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Single-pass LLMs average 43.4 percent execution accuracy on the benchmark. Augmenting them with live query execution and a semantic retrieval tool for categorical values improves seven of nine frontier models, with a 9 percent relative gain on the hardest variant. In an ablation, the semantic tool alone lifts categorical-value accuracy from 48.7 percent to 71.7 percent and component-field accuracy from 16.9 percent to 66.2 percent. Dominant remaining failures are semantic ambiguities such as issue-type disambiguation and text-field selection.
What carries the argument
Agentic Jackal, a tool-augmented agent that adds live query execution via the Jira MCP server and JiraAnchor, an embedding-based similarity search tool for resolving natural-language mentions of categorical values.
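To make the value-grounding idea concrete, here is a minimal sketch of similarity search over categorical values. It is not the paper's JiraAnchor implementation: the character-trigram `embed` function is a toy stand-in for a learned embedding model, and the component names are invented for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground_value(mention: str, candidates: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank the instance's actual categorical values by similarity to the mention."""
    query = embed(mention)
    scored = [(value, cosine(query, embed(value))) for value in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical component names; a real index would be built from the Jira instance.
COMPONENTS = ["Core: Networking", "GUI: Widgets", "Build System", "Documentation"]
ranked = ground_value("networking", COMPONENTS)
best_value = ranked[0][0]
```

The point of the design is that the model never has to guess the exact stored name: it passes a loose mention and receives candidates that actually exist in the instance.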
If this is right
- Single-pass models remain limited to 43.4 percent execution accuracy on short natural-language queries.
- The agentic approach improves accuracy for seven of the nine models tested.
- JiraAnchor alone raises categorical-value accuracy from 48.7 percent to 71.7 percent.
- Component-field accuracy rises from 16.9 percent to 66.2 percent when semantic retrieval is added.
- Semantic ambiguities such as issue-type choice and text-field selection become the main remaining error sources once data access is solved.
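The figures above mix two kinds of improvement: the 9 percent figure is stated as a relative gain, while the ablation results are given as before-and-after accuracies. A short sketch of the arithmetic, using only the ablation numbers quoted above, shows how far apart the absolute (percentage-point) and relative views can be. The helper name `gains` is illustrative.

```python
def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


categorical = gains(48.7, 71.7)  # categorical-value accuracy
component = gains(16.9, 66.2)    # component-field accuracy
```

Under this arithmetic the categorical-value lift is 23.0 points (about 47.2 percent relative) and the component-field lift is 49.3 points (about 291.7 percent relative), so the two framings should not be compared directly.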
Where Pith is reading between the lines
- Comparable agentic designs that pair live execution with semantic grounding could be tested on other structured query languages such as SQL or search-engine filters.
- The released benchmark and transcripts make it possible to measure whether the observed gains hold when queries become longer or contain more nested Boolean conditions.
- Future systems may need separate mechanisms for resolving field-selection and issue-type ambiguities beyond the value-grounding tools introduced here.
Load-bearing premise
The 100,000 pairs and the chosen live Jira instance represent the range of real user queries and the categorical value distributions found in typical organizations.
What would settle it
Re-running the full evaluation on a different Jira instance with its own categorical distributions or on a fresh set of longer and more paraphrased queries and finding that the agentic gains disappear would show the improvements do not generalize.
Original abstract
Translating natural language into Jira Query Language (JQL) requires resolving ambiguous field references, instance-specific categorical values, and complex Boolean predicates. Single-pass LLMs cannot discover which categorical values (e.g., component names or fix versions) actually exist in a given Jira instance, nor can they verify generated queries against a live data source, limiting accuracy on paraphrased or ambiguous requests. No open, execution-based benchmark exists for mapping natural language to JQL. We introduce Jackal, the first large-scale, execution-based text-to-JQL benchmark comprising 100,000 validated NL-JQL pairs on a live Jira instance with over 200,000 issues. To establish baselines on Jackal, we propose Agentic Jackal, a tool-augmented agent that equips LLMs with live query execution via the Jira MCP server and JiraAnchor, a semantic retrieval tool that resolves natural-language mentions of categorical values through embedding-based similarity search. Among 9 frontier LLMs evaluated, single-pass models average only 43.4% execution accuracy on short natural-language queries, highlighting that text-to-JQL remains an open challenge. The agentic approach improves 7 of 9 models, with a 9.0% relative gain on the most linguistically challenging variant; in a controlled ablation isolating JiraAnchor, categorical-value accuracy rises from 48.7% to 71.7%, with component-field accuracy jumping from 16.9% to 66.2%. Our analysis identifies inherent semantic ambiguities, such as issue-type disambiguation and text-field selection, as the dominant failure modes rather than value-resolution errors, pointing to concrete directions for future work. We publicly release the benchmark, all agent transcripts, and evaluation code to support reproducibility.
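Execution-based evaluation, as the abstract describes it, scores a prediction by running it rather than by comparing query strings. The following is a minimal sketch, assuming a prediction counts as correct when it returns exactly the same set of issue keys as the gold query. The paper's precise matching rule is not given in this summary, and `run_query` with its result table is a hypothetical in-memory stand-in for a live Jira call.

```python
from typing import Callable

# Hypothetical in-memory "instance": JQL string -> issue keys it would return.
FAKE_RESULTS = {
    'project = QTBUG AND status = "Open"': {"QTBUG-1", "QTBUG-2"},
    'status = "Open" AND project = QTBUG': {"QTBUG-1", "QTBUG-2"},
    "project = QTBUG": {"QTBUG-1", "QTBUG-2", "QTBUG-3"},
}


def run_query(jql: str) -> set[str]:
    """Stand-in for live execution; unknown queries return no issues."""
    return FAKE_RESULTS.get(jql, set())


def execution_accuracy(pairs: list[tuple[str, str]], execute: Callable[[str], set[str]]) -> float:
    """Fraction of (predicted, gold) pairs whose result sets are identical."""
    if not pairs:
        return 0.0
    correct = sum(execute(pred) == execute(gold) for pred, gold in pairs)
    return correct / len(pairs)


pairs = [
    # Clauses reordered, same results: counted as correct.
    ('status = "Open" AND project = QTBUG', 'project = QTBUG AND status = "Open"'),
    # Missing the status filter, so too broad: counted as wrong.
    ("project = QTBUG", 'project = QTBUG AND status = "Open"'),
]
accuracy = execution_accuracy(pairs, run_query)
```

Comparing result sets rather than strings is what lets syntactically different but equivalent queries score as correct.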
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Jackal benchmark, the first large-scale execution-based text-to-JQL dataset with 100,000 validated NL-JQL pairs on a live Jira instance containing over 200,000 issues. It proposes Agentic Jackal, a tool-augmented agent that provides LLMs with live query execution via the Jira MCP server and JiraAnchor, an embedding-based semantic retrieval tool for resolving instance-specific categorical values. Baselines show single-pass frontier LLMs average 43.4% execution accuracy on short queries; the agentic setup improves 7 of 9 models (9.0% relative gain on the hardest variant), and an ablation isolating JiraAnchor lifts categorical-value accuracy from 48.7% to 71.7% and component-field accuracy from 16.9% to 66.2%. The work identifies semantic ambiguities (e.g., issue-type disambiguation) as remaining failure modes and releases the benchmark, transcripts, and code.
Significance. If the empirical results hold under the released artifacts, the paper demonstrates that single-pass LLMs are fundamentally limited on text-to-JQL by lack of instance-specific knowledge and inability to verify execution, while tool use with live execution and semantic grounding yields substantial, targeted gains. The public release of the 100k validated pairs, all agent transcripts, and evaluation code is a clear strength that enables reproducibility and community follow-up on domain-specific semantic parsing.
major comments (2)
- [Abstract and benchmark construction section] The description of how the 100,000 NL-JQL pairs were generated, filtered, and validated against the live Jira instance (including any sampling strategy or bias mitigation for the >200k issues) is insufficiently detailed to assess representativeness of real-world queries and categorical distributions; this directly affects confidence in the 43.4% baseline and ablation gains.
- [Experiments / ablation subsection] The controlled ablation isolating JiraAnchor reports clear lifts (categorical-value accuracy 48.7% → 71.7%, component-field 16.9% → 66.2%), but the exact protocol for constructing the ablation test set and ensuring no leakage from the full 100k pairs should be stated explicitly to confirm the gains are attributable to semantic retrieval rather than data artifacts.
minor comments (2)
- [Abstract] Define 'most linguistically challenging variant' with reference to a specific table, metric, or query-length statistic rather than leaving it implicit.
- [Throughout] Expand acronyms (JQL, MCP, NL) on first use and ensure consistent notation for accuracy metrics across text and tables.
Simulated Author's Rebuttal
We thank the referee for the positive assessment, recognition of the benchmark's significance, and recommendation for minor revision. The comments are constructive and will improve the clarity of our work. We address each major comment below and will incorporate the requested details in the revised manuscript.
- Referee: [Abstract and benchmark construction section] The description of how the 100,000 NL-JQL pairs were generated, filtered, and validated against the live Jira instance (including any sampling strategy or bias mitigation for the >200k issues) is insufficiently detailed to assess representativeness of real-world queries and categorical distributions; this directly affects confidence in the 43.4% baseline and ablation gains.
  Authors: We agree that the benchmark construction section would benefit from expanded detail on generation, filtering, validation, sampling, and bias mitigation to better demonstrate representativeness. The manuscript describes validation of the 100k pairs on the live instance with >200k issues, but we will revise this section to explicitly cover: (1) the query generation approach and how natural language paraphrases were produced, (2) filtering criteria for executability and syntactic validity, (3) the sampling strategy across issue types, fields, and categorical distributions to reflect real-world usage, and (4) any bias mitigation steps such as stratified sampling. These additions will strengthen confidence in the reported baselines.
  revision: yes
- Referee: [Experiments / ablation subsection] The controlled ablation isolating JiraAnchor reports clear lifts (categorical-value accuracy 48.7% → 71.7%, component-field 16.9% → 66.2%), but the exact protocol for constructing the ablation test set and ensuring no leakage from the full 100k pairs should be stated explicitly to confirm the gains are attributable to semantic retrieval rather than data artifacts.
  Authors: We thank the referee for this observation. The ablation isolates JiraAnchor by comparing performance with and without the semantic retrieval tool on queries involving categorical values. To prevent leakage, the ablation test set was drawn from a held-out subset of the benchmark, and the JiraAnchor embedding index was constructed exclusively from issue metadata excluding any test queries. We will revise the experiments/ablation subsection to state this protocol explicitly, including the test set size, the exact split procedure, and confirmation of no overlap with the retrieval index, thereby verifying that the reported accuracy lifts are attributable to the semantic grounding mechanism.
  revision: yes
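The protocol described in this response (a held-out test subset, plus a retrieval index built only from issue metadata) can be checked mechanically. Below is a minimal sketch, assuming each benchmark pair has a unique ID and that index entries are plain strings. The split fraction, seed, and sample data are illustrative and are not the authors' actual procedure.

```python
import random


def split_holdout(pair_ids: list[str], test_fraction: float, seed: int = 0) -> tuple[set[str], set[str]]:
    """Shuffle deterministically and split IDs into (dev, held-out test) sets."""
    ids = sorted(pair_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    cut = len(ids) - n_test
    return set(ids[:cut]), set(ids[cut:])


def leaked_queries(test_queries: list[str], index_entries: list[str]) -> list[str]:
    """Return test queries whose normalized text appears verbatim in the index."""
    indexed = {entry.strip().lower() for entry in index_entries}
    return [q for q in test_queries if q.strip().lower() in indexed]


pair_ids = [f"pair-{i}" for i in range(1000)]
dev_ids, test_ids = split_holdout(pair_ids, test_fraction=0.1)

# Hypothetical index built from issue metadata (component names), not from queries.
index_entries = ["Core: Networking", "Build System"]
test_queries = ["networking bugs fixed in 5.0", "open build issues"]
leaks = leaked_queries(test_queries, index_entries)
```

A verbatim-match check like this only catches the crudest form of leakage; paraphrased overlap would need a similarity threshold, which is beyond this sketch.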
Circularity Check
No significant circularity
The paper introduces a new execution-based benchmark (Jackal) with 100k NL-JQL pairs and evaluates single-pass vs. agentic LLM performance via live Jira execution and ablations on JiraAnchor. All reported metrics (43.4% baseline accuracy, 9.0% relative gain, categorical-value lift from 48.7% to 71.7%) are direct empirical measurements on the constructed dataset and live instance; no equations, fitted parameters, or predictions are claimed, and no derivation chain reduces to self-definition or self-citation. The work is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Embedding similarity search can reliably surface instance-specific categorical values from natural language mentions
- domain assumption: Live query execution provides an objective, instance-grounded correctness signal for generated JQL
invented entities (3)
- Jackal benchmark: no independent evidence
- JiraAnchor: no independent evidence
- Agentic Jackal: no independent evidence
Reference graph
Works this paper leans on
-
[1]
CHESS: Contextual Harnessing for Efficient SQL Synthesis
URL https://openreview.net/forum?id=2GmDdhBdDk. Pourreza, M. and Rafiei, D. DIN-SQL: Decomposed in-context learning of text-to-SQL with self-correction. In Advances in Neural Information Processing Systems, 2023. Talaei, S., Pourreza, M., Chang, Y.-C., Mirhoseini, A., and Saberi, A. CHESS: Contextual harnessing for efficient SQL synthesis. arXiv preprin...
-
[2]
JQL: project = QTBUG AND issuetype = Bug AND status = "Open" NL: I’m reviewing all open bugs in the QTBUG project so we can track unresolved issues before the next sprint
-
[3]
Given this JQL: {jql} OUTPUT FORMAT: Only respond with the natural language
JQL: created >= -5d AND project = PYSIDE NL: I want to look at issues reported in the last 5 days in the PYSIDE project to see what’s newly come in. Given this JQL: {jql} OUTPUT FORMAT: Only respond with the natural language. Do not include any additional text or explanations. B.2. Short Natural Language (Short NL) Task: Convert a JQL query into a concise...
-
[5]
Do not include any additional text or explanations
JQL: resolution = Duplicate NL: Duplicate issues Given this JQL: {jql} OUTPUT FORMAT: Only respond with the natural language. Do not include any additional text or explanations. B.3. Semantically Similar Task: Convert the following JQL query into a natural language sentence that expresses the same intent, but uses different wording. Do not directly reuse ...
-
[6]
JQL: status = "Open" NL: Tickets that are still in progress
-
[7]
Do not include any additional text or explanations
JQL: resolution = Duplicate NL: Issues already reported before Given this JQL: {jql} OUTPUT FORMAT: Only respond with the natural language. Do not include any additional text or explanations. B.4. Semantically Exact Task: Translate a JQL query into a natural language sentence that mirrors the JQL structure and wording as closely as possible. Do not par...
-
[8]
JQL: project = QTBUG AND issuetype = Bug AND status = "Open" NL: Project is QTBUG, issue type is Bug, and status is Open
-
[9]
JQL: priority = "P1: Critical" NL: Priority is P1: Critical Given this JQL: {jql} OUTPUT FORMAT: Only respond with the natural language. Do not include any additional text or explanations. C. Agentic System Prompt The following is an abridged version of the schema-grounded system prompt provided to the agent at every turn. The prompt encodes the complete ...
-
[10]
Generate valid JQL (Jira Query Language) using only fields from the SCHEMA below
-
[11]
ALWAYS use `jira_search` to execute the final query - never return JQL as plain text
-
[12]
Don’t ask clarifying questions - make reasonable assumptions when details are ambiguous C.2. Field Disambiguation Rules (Excerpt) FIELD RULES (for ambiguous natural language): Priority: - Hierarchy: P0 (Blocker) > P1 (Critical) > P2 (Important) > P3 (Somewhat important) > P4 (Low) > P5 (Not important) - Exact value names ("P4", "Critical") -> match exactl...
-
[13]
Call JiraAnchor to find the exact value name
-
[14]
Then call jira_search with that exact value in JQL CALL JiraAnchor when: - Version numbers: "5.0", "3.2.1" -> search fixVersion/affectedVersion - Technical terms that could be components: "Networking", "Payments" - Tag-like terms: "tech-debt", "flaky-test" -> search labels DO NOT use for existence checks ("issues without labels" -> labels IS EMPTY) C.5. E...
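The two-step rule in the excerpt above (resolve the exact value name first, then search with it) can be sketched as follows. Both `jira_anchor` and `build_clause` are hypothetical stubs rather than the real tool interfaces, the lookup table is invented for illustration, and the quoting helper assumes JQL's backslash escaping inside double-quoted strings.

```python
from typing import Optional

# Hypothetical lookup standing in for JiraAnchor's semantic retrieval.
KNOWN_VALUES = {
    ("component", "networking"): "Core: Networking",
    ("labels", "tech-debt"): "tech-debt",
}


def jira_anchor(field: str, mention: str) -> Optional[str]:
    """Stub: return the exact value name for a mention, or None if unresolved."""
    return KNOWN_VALUES.get((field, mention.lower()))


def quote(value: str) -> str:
    """Quote a JQL string value, escaping embedded backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_clause(field: str, mention: Optional[str]) -> str:
    """Build one JQL clause; existence checks bypass value grounding."""
    if mention is None:
        # "issues without labels" -> labels IS EMPTY, no retrieval call needed.
        return f"{field} IS EMPTY"
    exact = jira_anchor(field, mention)
    # Fall back to the raw mention when retrieval finds nothing.
    return f"{field} = {quote(exact if exact is not None else mention)}"


grounded = build_clause("component", "Networking")
empty_check = build_clause("labels", None)
```

In a real agent the resulting clause would be passed to the search tool for execution; here the sketch stops at the JQL string so it stays runnable offline.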