
arxiv: 2602.17001 · v2 · pith:FRUDRZYFnew · submitted 2026-02-19 · 💻 cs.AI · cs.CL · cs.DB

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Pith reviewed 2026-05-21 12:07 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.DB
keywords natural language querying · time series databases · neuro-symbolic methods · search then verify · NLQTSBench · temporal pattern retrieval

The pith

Sonar-TS retrieves time series events by first searching candidates with SQL then verifying them with generated Python programs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Sonar-TS to let non-expert users pose natural language questions about events, intervals, and summaries in large time series databases. Existing text-to-SQL approaches cannot handle continuous patterns such as shapes or anomalies, while time series models cannot process ultra-long histories. Sonar-TS works by using a feature index to locate candidate windows through SQL queries, then producing Python programs that check those windows directly against the raw data. The authors also release NLQTSBench, the first large-scale benchmark for this type of querying. The result is a pipeline that succeeds on complex temporal questions where prior methods fall short.

Core claim

Sonar-TS is a neuro-symbolic framework for natural language querying over time series databases. It follows a Search-Then-Verify pipeline: SQL queries against a feature index first "ping" candidate windows, and generated Python programs then "lock on" to those candidates and verify them against the raw signals. The framework is evaluated on the new NLQTSBench benchmark.

What carries the argument

Search-Then-Verify pipeline that combines SQL-based candidate retrieval from a feature index with generated Python program verification for morphological intents such as shapes or anomalies.
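
A minimal sketch of the two-stage idea, not the authors' implementation. It assumes a toy in-memory SQLite feature table with one row of summary statistics per fixed-size window; the table name `feature`, the columns `avg_val` and `max_val`, the window size, and the plateau rule are all hypothetical choices made for illustration. The SQL stage cheaply over-selects candidates, and the Python stage checks each candidate against the raw values.

```python
import sqlite3
import numpy as np

WINDOW = 24  # points per window (hypothetical: hourly data, daily windows)

def build_feature_table(conn, series):
    """Offline step: store one row of cheap summary stats per window."""
    conn.execute("CREATE TABLE feature (window_id INTEGER, avg_val REAL, max_val REAL)")
    for i in range(len(series) // WINDOW):
        w = series[i * WINDOW:(i + 1) * WINDOW]
        conn.execute("INSERT INTO feature VALUES (?, ?, ?)",
                     (i, float(w.mean()), float(w.max())))

def search(conn, level):
    """Search stage: SQL over the small feature table, never the raw data."""
    sql = "SELECT window_id FROM feature WHERE max_val > ? ORDER BY window_id"
    return [row[0] for row in conn.execute(sql, (level,))]

def longest_run_above(window, level):
    """Length of the longest consecutive run of values above `level`."""
    best = run = 0
    for v in window:
        run = run + 1 if v > level else 0
        best = max(best, run)
    return best

def verify(series, window_id, level, min_len):
    """Verify stage: inspect the raw values of one candidate window."""
    w = series[window_id * WINDOW:(window_id + 1) * WINDOW]
    return longest_run_above(w, level) >= min_len

# Toy data: a 10-point plateau in window 3 and a lone spike in window 7.
series = np.zeros(10 * WINDOW)
series[3 * WINDOW + 4:3 * WINDOW + 14] = 8.0
series[7 * WINDOW + 5] = 9.0

conn = sqlite3.connect(":memory:")
build_feature_table(conn, series)
candidates = search(conn, 5.0)                                   # [3, 7]
matches = [w for w in candidates if verify(series, w, 5.0, 6)]   # [3]
print(candidates, matches)
```

The spike window passes the coarse SQL filter but fails verification, which is the behavior the pipeline relies on: the index only needs to avoid missing true matches, while the raw-data check removes false candidates.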

If this is right

  • Non-experts can extract specific temporal events from massive records without writing queries or code.
  • Queries about shapes, anomalies, and other morphological features become practical on histories too long for direct processing.
  • NLQTSBench supplies a common testbed for measuring progress on natural language time series retrieval.
  • The hybrid SQL-plus-code design scales verification cost with the number of candidates rather than the full data length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage pattern could be adapted to natural language queries over other ordered data such as video frames or sensor streams.
  • Stronger code-generation models would likely raise verification accuracy on edge-case patterns, since the verify stage depends on the correctness of the generated programs.
  • Better feature indexes could shrink the set of candidates passed to the Python stage and lower overall latency.

Load-bearing premise

The Python programs generated on the fly will correctly decide whether each candidate window matches the user's intended continuous pattern even when the full history is extremely long.

What would settle it

A test set from NLQTSBench in which the generated verification programs systematically accept windows that lack the queried shape or anomaly, or reject windows that contain it.
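
One way to make this test concrete is to score the verifier's accept/reject decisions against ground-truth labels. The sketch below is an assumption about how such a check could be set up, not a procedure from the paper; the function name and the input format (parallel lists of booleans, one pair per candidate window) are hypothetical.

```python
def verifier_error_rates(labels, decisions):
    """Compare verifier decisions with ground truth for each candidate window.

    labels[i]    -- True if window i really contains the queried pattern
    decisions[i] -- True if the generated program accepted window i
    """
    pairs = list(zip(labels, decisions))
    false_accepts = sum(1 for y, d in pairs if d and not y)
    false_rejects = sum(1 for y, d in pairs if y and not d)
    negatives = sum(1 for y, _ in pairs if not y)
    positives = sum(1 for y, _ in pairs if y)
    return {
        "false_accept_rate": false_accepts / negatives if negatives else 0.0,
        "false_reject_rate": false_rejects / positives if positives else 0.0,
    }

rates = verifier_error_rates(
    labels=[True, True, False, False, False],
    decisions=[True, False, True, False, False],
)
print(rates)  # one of three negatives accepted, one of two positives rejected
```

Systematically high values of either rate on a held-out set would count against the load-bearing premise above.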

Figures

Figures reproduced from arXiv: 2602.17001 by Chang Xu, Ming Jin, Shirui Pan, Shiyu Wang, Xiping Liu, Yiji Zhao, Yuxuan Liang, Zhao Tan.

Figure 1: Comparison of querying paradigms. While Text-to-SQL fails to express morphological intents and Time Series Models are limited by context length, Sonar-TS adopts a “Search-Then-Verify” pipeline: it uses SQL to search a symbolic index for candidates and Python to verify them on raw data.
Figure 2: The hierarchical taxonomy of tasks in NLQTSBench. The benchmark ranges from Level 1 (Basic Operations), which tests numerical filtering, to Level 2 (Pattern Recognition) for morphological grounding, Level 3 (Semantic Reasoning) for logical composition, and finally Level 4 (Insight Synthesis) for narrative reporting.
Figure 3: The overview of the Sonar-TS framework. The workflow is organized into three stages: (1) Offline Data Processing constructs compact multi-scale Feature Tables to serve as a queryable index; (2) Online Querying, where the Task Planner and Code Generator synthesize SQL for rapid candidate search and Python for exact verification, supported by a closed-loop Prompt Cold Start mechanism that evolves analysis in…
Figure 4: Case Study. Text-to-SQL (Left) lacks morphological expressivity, and TS Models (Right) fail the logical constraint. Sonar-TS (Middle) succeeds via Search-Then-Verify.
Figure 5: The human verification interface. Annotators inspect both global context and local details (where the injected signal in orange is overlaid on the raw data in blue) to validate the ground truth.
Figure 6: Visualization of Multi-Scale SAX Representations. The framework discretizes time series data across hierarchical granularities to support pattern matching at different resolutions: (Top) The Daily View captures high-frequency local fluctuations; (Middle) The Monthly View summarizes intermediate trends; (Bottom) The Yearly View abstracts long-term seasonality. The colored horizontal bars represent the assig…
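
The SAX discretization behind Figure 6 can be sketched as follows. This is an illustrative reading of the standard technique (Piecewise Aggregate Approximation, Z-normalization, then mapping to symbols via equiprobable breakpoints of the standard normal), using the alphabet size of 5 mentioned in the extracted appendix text. The function names and the example series are assumptions, not the authors' code.

```python
import numpy as np
from statistics import NormalDist

def paa(values, w):
    """Piecewise Aggregate Approximation: mean of w near-equal segments."""
    segments = np.array_split(np.asarray(values, dtype=float), w)
    return np.array([seg.mean() for seg in segments])

def sax(values, w, alphabet="abcde"):
    """Return a SAX word of length w over the given alphabet."""
    reduced = paa(values, w)
    std = reduced.std()
    # Z-normalize the PAA vector; a constant window maps to all zeros.
    z = (reduced - reduced.mean()) / std if std > 0 else np.zeros_like(reduced)
    a = len(alphabet)
    # Breakpoints split the standard normal into `a` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    idx = np.searchsorted(breakpoints, z)
    return "".join(alphabet[i] for i in idx)

ramp = np.arange(120)      # steadily rising series
print(sax(ramp, w=12))     # aaabbccddeee
```

A rising series yields a non-decreasing word, so a shape query such as "steady increase" could in principle be expressed as a string pattern over the stored words.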
Original abstract

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to assist non-expert users in retrieving meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Sonar-TS, a neuro-symbolic Search-Then-Verify framework for natural language querying over time series databases (NLQ4TSDB). It first applies SQL queries against a feature index to retrieve candidate windows, then uses LLM-generated Python programs to verify those candidates against raw signals for complex morphological intents such as shapes and anomalies. The paper presents NLQTSBench as a new large-scale benchmark and states that experiments demonstrate Sonar-TS successfully handles queries where traditional Text-to-SQL and time-series methods fail, positioning the work as the first systematic study of NLQ4TSDB.

Significance. If the verify stage is shown to scale reliably, Sonar-TS could meaningfully advance accessible querying of massive temporal datasets by non-experts. The introduction of NLQTSBench supplies a concrete evaluation standard that future work can build upon.

major comments (2)
  1. [Abstract] The statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.
  2. [Verification stage description] Verification component: the claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.
minor comments (1)
  1. [Abstract] Consider adding one concrete example query and its expected output to illustrate the morphological intents that defeat pure Text-to-SQL approaches.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, agreeing where additional quantitative support is warranted and outlining the revisions we will make.

  1. Referee: [Abstract] The statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.

    Authors: We agree that the abstract would benefit from explicit quantitative support to ground the claims. The full manuscript (Section 5) reports precision/recall metrics, runtime comparisons against Text-to-SQL and time-series baselines, and overall success rates on NLQTSBench, showing Sonar-TS outperforming alternatives on morphological queries. We will revise the abstract to include key figures (e.g., accuracy improvements and failure modes of baselines) while preserving brevity.
    Revision: yes

  2. Referee: [Verification stage description] Verification component: the claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.

    Authors: We acknowledge that dedicated metrics for the verification stage would strengthen the neuro-symbolic claims. While overall pipeline results are presented, we will add a new subsection in the experiments (or appendix) reporting success rates of the LLM-generated verification programs, false-negative analysis on shape/anomaly detection, and scalability tests across varying history lengths on raw signals.
    Revision: yes

Circularity Check

0 steps flagged

Sonar-TS introduces a new neuro-symbolic Search-Then-Verify framework with no circular derivation


The paper proposes Sonar-TS as an original construction: a feature-index SQL search stage followed by LLM-generated Python verification programs on raw signals. No equations, fitted parameters, or predictions appear. The central claims rest on the empirical performance of this pipeline on the newly introduced NLQTSBench benchmark rather than on any self-referential definitions, self-citation chains, or renamings of prior results. The derivation is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated premise that SQL-based feature indexing plus LLM-generated Python verification can be made accurate and efficient for arbitrary morphological queries; no free parameters or invented entities are mentioned.

axioms (2)
  • Domain assumption: Existing Text-to-SQL methods cannot handle continuous morphological intents such as shapes or anomalies.
    Stated directly in the abstract as the motivation for the new approach.
  • Domain assumption: Time series models cannot scale to ultra-long histories.
    Stated directly in the abstract as a limitation of prior work.

pith-pipeline@v0.9.0 · 5738 in / 1286 out tokens · 63525 ms · 2026-05-21T12:07:47.548052+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

  1. [1]

    Pourreza, M

    doi: 10.14778/2824032.2824078. Pourreza, M. and Rafiei, D. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36: 36339–36348, 2023. Pourreza, M., Li, H., Sun, R., Chung, Y., Talaei, S., Kakkar, G. T., Gan, Y., Saberi, A., Ozcan, F., and Arik, S. Chase-sql: Multi-path reason...

  2. [2]

    These slots are dynamically filled using distribution-aware sampling (e.g., percentiles) to instantiate concrete query intents

    Template Formulation: We define semantic templates containing parameter slots (e.g., {window size}, {threshold}). These slots are dynamically filled using distribution-aware sampling (e.g., percentiles) to instantiate concrete query intents.

  3. [3]

    Depending on the task level, G branches into two distinct strategies to produce(X, A): •Strategy I: Direct Extraction(for Level 1) prioritizes fidelity to raw data

    Data Synthesis: This is the core generation phase. Depending on the task level, G branches into two distinct strategies to produce (X, A): • Strategy I: Direct Extraction (for Level 1) prioritizes fidelity to raw data. • Strategy II: Signal Injection (for Levels 2-4) prioritizes controllability via mathematical signal superimposition. ...

  4. [4]

    We apply automated SNR checks to ensure pattern distinctness, followed by expert human review to filter out ambiguous cases or artifacts

    Human Visual Verification: Finally, synthesized samples undergo a rigorous audit. We apply automated SNR checks to ensure pattern distinctness, followed by expert human review to filter out ambiguous cases or artifacts. A.2.2. Template Formulation. In the first stage of the pipeline, we define semantic templates that encapsulate specific analytical intents. ...

  5. [5]

    Significant Outlier Audit (ignoring minor noise). Args: Scenario Generation → Chained injection of 2–4 distinct primitives (e.g., Linear Trend → Stable → Oscillation); Ground Truth → Structured Fact Sheet defining the exact start/end and type of every stage. A.2.3. Data Synthesis. We...

  6. [6]

    These primitives are parametric and continuous: • Transient Patterns: Modeled via Gaussian kernels, e.g., $f_{\text{spike}}(t) = A \cdot e^{-\lambda t^2}$

    Atomic Primitives (P). We define a library of differentiable base functions $f(t)$ to model fundamental visual concepts. These primitives are parametric and continuous: • Transient Patterns: Modeled via Gaussian kernels, e.g., $f_{\text{spike}}(t) = A \cdot e^{-\lambda t^2}$. • State Shifts: Modeled via dual-Sigmoid activations, e.g., $f_{\text{box}}(t) = \sigma(k(t - t_s)) - \sigma(k(t - t_e))$. • Oscillations: Modeled...

  7. [7]

    Additive Composition Model. Complex scenarios are synthesized by recursively composing these primitives. The synthetic series $X_{\text{syn}}$ is formulated as: $X_{\text{syn}}(t) = X_{\text{bg}}(t) + \alpha \cdot T(f(t))$ (5), where $X_{\text{bg}}$ is a background window selected for stationarity (minimizing variance), $T$ represents temporal transformations (scaling, shifting), and $\alpha$ is an adaptive gain factor. The ...

  8. [8]

    • Scalars: We calculate the Relative Accuracy to handle varying scales

    Scalar & Timestamp Accuracy. For atomic retrieval tasks (Level 1) returning numerical values or timestamps, we employ a tolerance-based accuracy metric. • Scalars: We calculate the Relative Accuracy to handle varying scales. Given a ground truth $y$ and prediction $\hat{y}$: $\text{Score}_{\text{scalar}}(y, \hat{y}) = \max\left(0,\ 1 - \frac{|y - \hat{y}|}{|y| + \epsilon}\right)$ (6), where $\epsilon = 10^{-9}$ prevents division by zero. • Timestamp...

  9. [9]

    Non-overlapping intervals yield a score of 0

    Interval Intersection over Union (IoU). For time range identification tasks (Level 1 & 2), we evaluate the temporal overlap between the predicted interval $I_p = [t^p_{\text{start}}, t^p_{\text{end}}]$ and the ground truth $I_g = [t^g_{\text{start}}, t^g_{\text{end}}]$: $\text{IoU}(I_p, I_g) = \frac{\text{duration}(I_p \cap I_g)}{\text{duration}(I_p \cup I_g)}$ (7), where $\text{duration}(\cdot)$ denotes the time difference in seconds. Non-overlapping interva...

  10. [10]

    Set F1-Score. For tasks requiring a list of discrete dates (e.g., Top-K search in Level 3), we treat the output as an unordered set and compute the F1-score: $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (8), where Precision is the fraction of correct dates in the prediction, and Recall is the fraction of ground truth dates successfully retrieved.

  11. [11]

    rapid rise

    Composite Report Score (Level 4). Evaluating natural language reports is challenging. Instead of relying on generic text metrics like BLEU or ROUGE, we utilize a structure-aware Composite Score. Crucially, as defined in Level 4 (see §A.2.2), our prompts explicitly enforce a strict output schema and a controlled vocabulary (e.g., mandating phrases like “rapi...

  12. [12]

    , cn} within a window into a vector of length w, denoted as ¯C={¯c 1,

    Piecewise Aggregate Approximation (PAA). We first reduce the dimensionality of the raw time series sequence $C = \{c_1, \ldots, c_n\}$ within a window into a vector of length $w$, denoted as $\bar{C} = \{\bar{c}_1, \ldots, \bar{c}_w\}$. The choice of $w$ adapts to the temporal granularity to preserve semantic interpretability: • Yearly View: $w = 12$, aligning with months. • Monthly View: $w$ is dyn...

  13. [13]

    shape matching

    Symbolic Mapping. The PAA vector $\bar{C}$ is Z-normalized to have a mean of zero and a standard deviation of one. We then map each coefficient $\bar{c}_i$ to a symbol $s_i$ from an alphabet $\Sigma$ of size $\alpha$ (in our implementation, $\alpha = 5$, $\Sigma = \{\text{'a'}, \text{'b'}, \text{'c'}, \text{'d'}, \text{'e'}\}$). The mapping is defined by a set of breakpoints $\beta = \{\beta_0, \ldots, \beta_\alpha\}$, which divide the area under the Normal distr...

  14. [14]

    High IO” raw data table (Wide) and the “Low IO

    Task Planner Prompt. The Planner is responsible for selecting the execution mode and the data source. As shown in the prompt below, the model is explicitly guided to distinguish between the “High IO” raw data table (Wide) and the “Low IO” feature tables (Long) to optimize retrieval efficiency. Task Planner Prompt: You are a Task Planner for Time Series Anal...

  15. [15]

    Use ONLY when the time range is explicitly known and precise values are needed

    Raw Data (data): High I/O cost. Use ONLY when the time range is explicitly known and precise values are needed

  16. [16]

    Contains precomputed stats (avg, std) and shape descriptors (sax)

    Feature Tables (feature *): Low I/O cost. Contains precomputed stats (avg, std) and shape descriptors (sax). Use for searching patterns or scanning large historical ranges. — EXPERIENCES — {experiences} — User Question — {Input Question} — PLANNING STRATEGY — Analyze the query constraints to decide the Pipeline Mode: Mode A: DIRECT ACCESS (Fetch→Compute) ...

  17. [17]

    It operates in two modes: Generation (synthesizing the initial code) and Refinement (fixing errors based on runtime feedback)

    Code Generator Prompt.Following the plan, the Code Generator constructs the executable queries. It operates in two modes: Generation (synthesizing the initial code) and Refinement (fixing errors based on runtime feedback). In the generation phase, strict environmental constraints and a domain-specific operator library are enforced to ground the model’s lo...

  18. [18]

    The variable "conn" is already defined and connected to the database

    Execution Environment: You have access to pandas (pd), numpy (np), and scipy. The variable "conn" is already defined and connected to the database. - DO NOT create a new connection. DO NOT close "conn". - Execute SQL using: df = pd.read_sql_query(sql, conn)

  19. [19]

    If a task cannot be solved by these operators, synthesize standard pandas/numpy logic

    Operator Library (Module: sonar_ops): Prioritize using the following pre-defined atomic operators for verification. If a task cannot be solved by these operators, synthesize standard pandas/numpy logic. - detect_period(data, max_lag): Estimates the dominant cycle length. - find_best_match(query, search, metric): Finds the most similar subsequence via DTW....

  20. [20]

    It analyzes the full execution trajectory (including the initial plan, the challenges faced (error history), and the final working code) to extract a single, high-value insight

    Experience Summarizer Prompt. This module acts as a "Technical Lead" conducting a post-mortem. It analyzes the full execution trajectory (including the initial plan, the challenges faced (error history), and the final working code) to extract a single, high-value insight. This ensures that the system learns not just from success, but from the corrections app...

  21. [21]

    The model is instructed to perform semantic operations, namely Add (if new), Merge (if similar), or Discard (if redundant), ensuring the knowledge base remains compact and highly relevant

    Experience Updater Prompt. To prevent the experience list from growing indefinitely, the Updater functions as a "Knowledge Curator." It takes the newly extracted insight and merges it into the existing Global Experience Pool. The model is instructed to perform semantic operations, namely Add (if new), Merge (if similar), or Discard (if redundant), ensuring the knowledge ...

  22. [22]

    These insights range from low-level syntax corrections to high-level robustness strategies

    Experience Snapshot.Finally, to illustrate the nature of the learned knowledge, we display a subset of the actual experiences. These insights range from low-level syntax corrections to high-level robustness strategies. Snapshot of Injected Experiences • For structural trend reporting, do not calculate a global slope. Instead, use a divide-and-conquer appr...
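
The scalar, interval, and set metrics quoted in entries 8 to 10 above can be sketched in a few lines. This is an illustrative reading of those formulas, not the benchmark's evaluation code; the function names are hypothetical, and intervals are assumed to be `(start, end)` pairs of numbers in seconds.

```python
def scalar_score(y, y_hat, eps=1e-9):
    """Relative accuracy: max(0, 1 - |y - y_hat| / (|y| + eps))."""
    return max(0.0, 1.0 - abs(y - y_hat) / (abs(y) + eps))

def interval_iou(pred, truth):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

def set_f1(pred, truth):
    """F1 between two unordered collections of discrete items (e.g. dates)."""
    pred, truth = set(pred), set(truth)
    hits = len(pred & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)

print(scalar_score(100.0, 90.0))            # close to 0.9
print(interval_iou((0, 10), (5, 15)))       # 5 / 15
print(set_f1(["2024-01-01", "2024-01-02"],
             ["2024-01-02", "2024-01-03"])) # 0.5
```

Non-overlapping intervals score 0 under this definition, matching the statement in entry 9.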