From Natural Language to PromQL: A Catalog-Driven Framework with Dynamic Temporal Resolution for Cloud-Native Observability
Pith reviewed 2026-05-15 10:50 UTC · model grok-4.3
The pith
A catalog-driven framework translates natural language questions into executable PromQL queries for cloud observability in about 1.1 seconds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is a catalog-driven framework that combines a statically curated base of approximately 2,000 metrics with runtime discovery of hardware-specific signals, runs a multi-stage query pipeline with intent classification, category-aware metric routing, and multi-dimensional semantic scoring, and applies dynamic temporal resolution to map natural language time expressions onto PromQL syntax. The framework reports sub-second metric discovery, with the full pipeline completing in approximately 1.1 seconds.
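A minimal sketch of how such a multi-stage pipeline could be wired together, assuming a pre-built metric catalog; the catalog entries, keyword table, and function names below are illustrative stand-ins, not the paper's implementation.

```python
# Illustrative sketch of a catalog-driven NL-to-PromQL pipeline; names and
# catalog entries are hypothetical, not taken from the paper.
import re
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    category: str
    description: str

# Toy stand-in for the ~2,000-metric hybrid catalog.
CATALOG = [
    Metric("node_cpu_seconds_total", "cluster", "cpu time consumed per node"),
    Metric("DCGM_FI_DEV_GPU_UTIL", "gpu", "gpu utilization reported per device"),
    Metric("http_request_duration_seconds", "serving", "model serving request latency"),
]

# Stage 1: intent classification (keyword-based here; the paper's classifier is unspecified).
INTENT_KEYWORDS = {"gpu": "gpu", "cpu": "cluster", "latency": "serving", "request": "serving"}

def classify_intent(question: str) -> str:
    q = question.lower()
    for keyword, category in INTENT_KEYWORDS.items():
        if keyword in q:
            return category
    return "cluster"

# Stage 2: category-aware routing restricts candidates to the routed category.
def route(category: str) -> list[Metric]:
    return [m for m in CATALOG if m.category == category]

# Stage 3: semantic scoring (token overlap stands in for multi-dimensional scoring).
def score(question: str, candidates: list[Metric]) -> Metric:
    tokens = set(re.findall(r"\w+", question.lower()))
    return max(candidates, key=lambda m: len(tokens & set(m.description.split())))

def to_promql(question: str, window: str = "5m") -> str:
    metric = score(question, route(classify_intent(question)))
    return f"avg_over_time({metric.name}[{window}])"

print(to_promql("How high is GPU utilization right now?"))
# -> avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])
```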
What carries the argument
The hybrid metrics catalog combined with the multi-stage pipeline that performs intent classification, category-aware routing, and semantic scoring to generate PromQL from natural language.
Load-bearing premise
The multi-stage pipeline will consistently produce correct PromQL queries for diverse natural language inputs across production scenarios.
What would settle it
Running the system on a benchmark set of 100 varied natural language questions with corresponding ground-truth PromQL queries and measuring the accuracy rate of generated queries.
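A sketch of what that settling experiment could look like: a small harness that scores generated queries by exact match against ground truth and by execution success against a Prometheus endpoint. The benchmark pairs, the Prometheus URL, and the `to_promql` callable are assumptions made for illustration, not artifacts from the paper.

```python
# Hypothetical evaluation harness: exact-match and execution-success rates
# over ground-truth (question, PromQL) pairs. All data below is illustrative.
import requests

BENCHMARK = [
    ("What is average GPU utilization over the last hour?",
     "avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])"),
    ("How many pods are not ready?",
     'count(kube_pod_status_ready{condition="false"})'),
]

PROMETHEUS_URL = "http://localhost:9090"  # assumed local Prometheus instance

def executes(query: str) -> bool:
    """Execution success: Prometheus answers HTTP 200 with status 'success'."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=5)
    return resp.status_code == 200 and resp.json().get("status") == "success"

def evaluate(to_promql) -> dict:
    """`to_promql` is the system under test: a callable mapping a question to a PromQL string."""
    exact, executed = 0, 0
    for question, gold in BENCHMARK:
        predicted = to_promql(question)
        exact += predicted == gold
        executed += executes(predicted)
    n = len(BENCHMARK)
    return {"exact_match": exact / n, "execution_success": executed / n}
```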
Original abstract
Modern cloud-native platforms expose thousands of time series metrics through systems like Prometheus, yet formulating correct queries in domain-specific languages such as PromQL remains a significant barrier for platform engineers and site reliability teams. We present a catalog-driven framework that translates natural language questions into executable PromQL queries, bridging the gap between human intent and observability data. Our approach introduces three contributions: (1) a hybrid metrics catalog that combines a statically curated base of approximately 2,000 metrics with runtime discovery of hardware-specific signals across GPU vendors, (2) a multi-stage query pipeline with intent classification, category-aware metric routing, and multi-dimensional semantic scoring, and (3) a dynamic temporal resolution mechanism that interprets diverse natural language time expressions and maps them to appropriate PromQL duration syntax. We integrate the framework with the Model Context Protocol (MCP) to enable tool-augmented LLM interactions across multiple providers. The catalog-driven approach achieves sub-second metric discovery through pre-computed category indices, with the full pipeline completing in approximately 1.1 seconds via the catalog path. The system has been deployed on production Kubernetes clusters managing AI inference workloads, where it supports natural language querying across approximately 2,000 metrics spanning cluster health, GPU utilization, and model-serving performance.
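The third contribution above, dynamic temporal resolution, maps phrases such as "last 2 hours" or "past week" onto PromQL duration syntax. A minimal sketch of one way such a resolver could work, assuming a small table of unit aliases; the paper's actual mechanism may differ.

```python
# Illustrative temporal resolver: natural-language time expressions -> PromQL durations.
# The phrase patterns and the default window are assumptions, not the paper's rules.
import re

UNIT_ALIASES = {"second": "s", "minute": "m", "hour": "h", "day": "d", "week": "w"}
NAMED_SPANS = {"this week": "1w", "past week": "1w"}

def resolve_duration(text: str, default: str = "5m") -> str:
    t = text.lower()
    # Explicit spans such as "last 2 hours" or "past 30 minutes".
    match = re.search(r"(?:last|past)\s+(\d+)\s*(second|minute|hour|day|week)s?", t)
    if match:
        return f"{match.group(1)}{UNIT_ALIASES[match.group(2)]}"
    # Named spans such as "past week".
    for phrase, duration in NAMED_SPANS.items():
        if phrase in t:
            return duration
    return default

assert resolve_duration("GPU utilization over the last 2 hours") == "2h"
assert resolve_duration("error rate for the past week") == "1w"
assert resolve_duration("current memory usage") == "5m"
```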
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a catalog-driven framework for translating natural language questions into PromQL queries for Prometheus-based observability in cloud-native environments. It describes a hybrid metrics catalog with ~2000 metrics, a multi-stage pipeline involving intent classification, category-aware routing, and semantic scoring, plus dynamic temporal resolution for time expressions. The system integrates with the Model Context Protocol and reports sub-second metric discovery and ~1.1 second full pipeline latency, with deployment on production Kubernetes clusters for AI workloads.
Significance. If the translation accuracy is high, this work could meaningfully improve accessibility of observability data for non-expert users in production environments. The pre-computed category indices for fast lookup and the handling of dynamic temporal expressions address practical pain points. The production deployment note suggests real-world applicability. However, the lack of any reported accuracy or correctness metrics makes it difficult to gauge the actual impact or reliability of the approach compared to simpler LLM-based methods.
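The "pre-computed category indices" credited with fast lookup amount to grouping the catalog by category once at startup, so each query touches only its routed slice rather than all ~2,000 metrics. A minimal sketch under that assumption; the real index structure is not described in the text reviewed.

```python
# Illustrative pre-computed category index: built once, then lookups avoid
# scanning the full catalog on every query.
from collections import defaultdict

def build_category_index(catalog: list[dict]) -> dict[str, list[dict]]:
    """One-time pass over the catalog grouping metrics by category."""
    index: dict[str, list[dict]] = defaultdict(list)
    for metric in catalog:
        index[metric["category"]].append(metric)
    return index

catalog = [
    {"name": "DCGM_FI_DEV_GPU_UTIL", "category": "gpu"},
    {"name": "node_memory_Active_bytes", "category": "cluster"},
    {"name": "http_request_duration_seconds", "category": "serving"},
]

index = build_category_index(catalog)          # built at startup
gpu_candidates = index["gpu"]                  # per-query lookup instead of a full scan
print([m["name"] for m in gpu_candidates])     # -> ['DCGM_FI_DEV_GPU_UTIL']
```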
major comments (1)
- [Abstract] The manuscript states performance numbers (sub-second metric discovery through pre-computed category indices, full pipeline completion in approximately 1.1 seconds) and a production deployment on Kubernetes clusters, but supplies no evaluation methodology, test-set size, accuracy metrics (e.g., exact-match rate or execution success rate), baseline comparisons against direct LLM generation, or failure cases. This leaves the central claim that the multi-stage pipeline (intent classification + category-aware routing + multi-dimensional semantic scoring) produces correct, executable PromQL queries unsupported by visible evidence.
minor comments (1)
- The manuscript would benefit from a dedicated evaluation section that defines success criteria, reports quantitative results, and includes representative examples of natural-language inputs and generated PromQL outputs.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for identifying the need for stronger empirical support in the manuscript. We address the major comment below and will revise the paper accordingly to include the requested evaluation details.
Point-by-point responses
- Referee: [Abstract] The manuscript states performance numbers (sub-second metric discovery through pre-computed category indices, full pipeline completion in approximately 1.1 seconds) and a production deployment on Kubernetes clusters, but supplies no evaluation methodology, test-set size, accuracy metrics (e.g., exact-match rate or execution success rate), baseline comparisons against direct LLM generation, or failure cases. This leaves the central claim that the multi-stage pipeline (intent classification + category-aware routing + multi-dimensional semantic scoring) produces correct, executable PromQL queries unsupported by visible evidence.
Authors: We agree that the manuscript as submitted lacks a dedicated evaluation of query correctness. The latency figures (sub-second metric discovery and ~1.1 s end-to-end) and the production deployment statement are based on runtime measurements from our Kubernetes clusters, but no test-set size, accuracy rates, baseline comparisons, or failure analysis appear in the current text. This is a genuine gap that leaves the correctness claim under-supported. In the revised version we will add a new Evaluation section that reports: (1) the size and composition of the natural-language test set, (2) exact-match and execution-success rates against ground-truth PromQL, (3) a direct-LLM baseline comparison, (4) categorized failure cases, and (5) the methodology used to obtain these figures. The latency numbers will be explicitly tied to the same experimental setup. revision: yes
Circularity Check
No circularity: system architecture description with no derivations or fitted quantities
full rationale
The manuscript presents a catalog-driven NL-to-PromQL framework as an engineered pipeline (intent classification, category routing, semantic scoring, dynamic temporal resolution) together with latency measurements and a production deployment note. No equations, parameters fitted to data subsets, uniqueness theorems, or self-citation chains appear in the provided text. All performance claims (sub-second discovery, ~1.1 s pipeline) are reported as direct observations rather than predictions derived from prior quantities inside the paper. The central claim therefore does not reduce to any of the enumerated circularity patterns and remains self-contained as a systems contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Intent classification, category-aware routing, and multi-dimensional semantic scoring will reliably select correct metrics and generate valid PromQL for arbitrary natural language inputs.
Forward citations
Cited by 1 Pith paper
- AI Observability for Large Language Model Systems: A Multi-Layer Analysis of Monitoring Approaches from Confidence Calibration to Infrastructure Tracing. This analysis synthesizes recent LLM observability research into a five-layer framework and identifies the integration of model signals with infrastructure anomalies as the central open problem.
Reference graph
Works this paper leans on
- [1] Anthropic. Model Context Protocol. https://modelcontextprotocol.io, 2024. Open standard for AI tool integration.
- [2] Mohsen Seyedkazemi Ardebili and Andrea Bartolini. KubeIntellect: A modular LLM-orchestrated agent framework for end-to-end Kubernetes management. arXiv preprint arXiv:2509.02449, 2025.
- [3] Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, and Tianyin Xu. STRATUS: A multi-agent system for autonomous reliability engineering of modern clouds. In Advances in Neural Information Processing Systems (NeurIPS), 2025.
- [4] Xinyi Hou, Shenao Wang, Yifan Zhang, Ziluo Xue, Yanjie Zhao, Cai Fu, and Haoyu Wang. SMCP: Secure Model Context Protocol. arXiv preprint arXiv:2602.01129, 2026.
- [5] Fangyu Lei, Jixuan Huang, Yuxin Zhong, Yifan Qin, et al. Spider 2.0: Evaluating language models on real-world enterprise text-to-SQL workflows. In Proceedings of the International Conference on Learning Representations (ICLR), 2025. arXiv:2411.07763.
- [6] Jinyang Li, Binyuan Hui, et al. BIRD: A big benchmark for large-scale database grounded text-to-SQL evaluation. Advances in Neural Information Processing Systems, 36, 2024.
- [7] Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, and Unso Eun Seo Jo. Agent Bain vs. Agent McKinsey: A new text-to-SQL benchmark for the business domain. arXiv preprint arXiv:2510.07309, 2025.
- [8] Prometheus Authors. Prometheus -- monitoring system and time series database. https://prometheus.io, 2015. Cloud Native Computing Foundation project.
- [9] Chen Wang, Tingzhou Yuan, Cancan Hua, Lu Chang, Xiao Yang, and Zhimin Qiu. Integrating large language models with cloud-native observability for automated root cause analysis and remediation. Preprints, 2025. doi:10.20944/preprints202512.1926.v1.
- [10] Yong Xiang, Charley Peter Chen, Liyi Zeng, Wei Yin, Xin Liu, Hu Li, and Wei Xu. Simplifying root cause analysis in Kubernetes with StateGraph and LLM. arXiv preprint arXiv:2506.02490, 2025.
- [11] Chenxi Zhang, Bicheng Zhang, Dingyu Yang, Xin Peng, Miao Chen, Senyu Xie, Gang Chen, Wei Bi, and Wei Li. PromCopilot: Simplifying Prometheus metric querying in cloud native online service systems via large language models. arXiv preprint arXiv:2503.03114, 2025. doi:10.48550/arXiv.2503.03114.