Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases
Pith reviewed 2026-05-21 12:07 UTC · model grok-4.3
The pith
Sonar-TS retrieves time series events by first searching candidates with SQL then verifying them with generated Python programs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sonar-TS is a neuro-symbolic framework for natural language querying over time series databases that follows a Search-Then-Verify pipeline: a feature index first pings candidate windows via SQL, after which generated Python programs lock onto and verify those candidates against the raw signals, evaluated on the new NLQTSBench benchmark.
What carries the argument
A Search-Then-Verify pipeline that combines SQL-based candidate retrieval from a feature index with verification by generated Python programs, targeting morphological intents such as shapes or anomalies.
If this is right
- Non-experts can extract specific temporal events from massive records without writing queries or code.
- Queries about shapes, anomalies, and other morphological features become practical on histories too long for direct processing.
- NLQTSBench supplies a common testbed for measuring progress on natural language time series retrieval.
- The hybrid SQL-plus-code design scales verification cost with the number of candidates rather than the full data length.
Where Pith is reading between the lines
- The same two-stage pattern could be adapted to natural language queries over other ordered data such as video frames or sensor streams.
- Stronger code-generation models would directly raise verification accuracy on edge-case patterns.
- Better feature indexes could shrink the set of candidates passed to the Python stage and lower overall latency.
Load-bearing premise
The Python programs generated on the fly will correctly decide whether each candidate window matches the user's intended continuous pattern even when the full history is extremely long.
What would settle it
A test set from NLQTSBench in which the generated verification programs systematically accept windows that lack the queried shape or anomaly, or reject windows that contain it.
Original abstract
Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Sonar-TS, a neuro-symbolic Search-Then-Verify framework for natural language querying over time series databases (NLQ4TSDB). It first applies SQL queries against a feature index to retrieve candidate windows, then uses LLM-generated Python programs to verify those candidates against raw signals for complex morphological intents such as shapes and anomalies. The paper presents NLQTSBench as a new large-scale benchmark and states that experiments demonstrate Sonar-TS successfully handles queries where traditional Text-to-SQL and time-series methods fail, positioning the work as the first systematic study of NLQ4TSDB.
Significance. If the verify stage is shown to scale reliably, Sonar-TS could meaningfully advance accessible querying of massive temporal datasets by non-experts. The introduction of NLQTSBench supplies a concrete evaluation standard that future work can build upon.
major comments (2)
- [Abstract] Abstract: the statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.
- [Verification stage description] Verification component: the claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.
minor comments (1)
- [Abstract] Abstract: consider adding one concrete example query and its expected output to illustrate the morphological intents that defeat pure Text-to-SQL approaches.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, agreeing where additional quantitative support is warranted and outlining the revisions we will make.
- Referee: [Abstract] Abstract: the statement that 'experiments demonstrate effectiveness' and that Sonar-TS 'effectively navigates complex temporal queries where traditional methods fail' is unsupported by any quantitative metrics, error rates, runtime figures, or analysis of the verification programs, leaving the central claim without visible evidence.
  Authors: We agree that the abstract would benefit from explicit quantitative support to ground the claims. The full manuscript (Section 5) reports precision/recall metrics, runtime comparisons against Text-to-SQL and time-series baselines, and overall success rates on NLQTSBench, showing Sonar-TS outperforming alternatives on morphological queries. We will revise the abstract to include key figures (e.g., accuracy improvements and failure modes of baselines) while preserving brevity.
  Revision: yes
- Referee: [Verification stage description] Verification component: the claim that LLM-generated Python programs reliably 'lock on' to shapes, anomalies, and other continuous patterns on ultra-long histories is load-bearing for the neuro-symbolic pipeline, yet no success rates, false-negative analysis, or scalability results on raw signals are reported.
  Authors: We acknowledge that dedicated metrics for the verification stage would strengthen the neuro-symbolic claims. While overall pipeline results are presented, we will add a new subsection in the experiments (or appendix) reporting success rates of the LLM-generated verification programs, false-negative analysis on shape/anomaly detection, and scalability tests across varying history lengths on raw signals.
  Revision: yes
Circularity Check
Sonar-TS introduces a new neuro-symbolic Search-Then-Verify framework with no circular derivation
The paper proposes Sonar-TS as an original construction: a feature-index SQL search stage followed by LLM-generated Python verification programs on raw signals. No equations, fitted parameters, or predictions appear. The central claims rest on the empirical performance of this pipeline on the newly introduced NLQTSBench benchmark rather than on any self-referential definitions, self-citation chains, or renamings of prior results. The derivation is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Existing Text-to-SQL methods cannot handle continuous morphological intents such as shapes or anomalies.
- domain assumption: Time series models cannot scale to ultra-long histories.
Reference graph
Works this paper leans on
-
[1]
doi: 10.14778/2824032.2824078. Pourreza, M. and Rafiei, D. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems, 36:36339–36348, 2023. Pourreza, M., Li, H., Sun, R., Chung, Y., Talaei, S., Kakkar, G. T., Gan, Y., Saberi, A., Ozcan, F., and Arik, S. Chase-sql: Multi-path reason...
-
[2]
Template Formulation: We define semantic templates containing parameter slots (e.g., {window size}, {threshold}). These slots are dynamically filled using distribution-aware sampling (e.g., percentiles) to instantiate concrete query intents.
-
[3]
Data Synthesis: This is the core generation phase. Depending on the task level, G branches into two distinct strategies to produce (X, A):
  • Strategy I: Direct Extraction (for Level 1) prioritizes fidelity to raw data.
  • Strategy II: Signal Injection (for Levels 2-4) prioritizes controllability via mathematical signal superimposition. ...
-
[4]
Human Visual Verification: Finally, synthesized samples undergo a rigorous audit. We apply automated SNR checks to ensure pattern distinctness, followed by expert human review to filter out ambiguous cases or artifacts.
A.2.2. TEMPLATE FORMULATION
In the first stage of the pipeline, we define semantic templates that encapsulate specific analytical intents. ...
-
[5]
Significant Outlier Audit (ignoring minor noise). Args: Scenario Generation → Chained injection of 2–4 distinct primitives (e.g., Linear Trend → Stable → Oscillation); Ground Truth → Structured Fact Sheet defining the exact start/end and type of every stage.
A.2.3. DATA SYNTHESIS
We...
-
[6]
Atomic Primitives (P). We define a library of differentiable base functions $f(t)$ to model fundamental visual concepts. These primitives are parametric and continuous:
  • Transient Patterns: Modeled via Gaussian kernels, e.g., $f_{\text{spike}}(t) = A \cdot e^{-\lambda t^2}$.
  • State Shifts: Modeled via dual-Sigmoid activations, e.g., $f_{\text{box}}(t) = \sigma(k(t - t_s)) - \sigma(k(t - t_e))$.
  • Oscillations: Modeled...
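The two primitives whose formulas survive in this excerpt can be sketched in a few lines of Python. This is only an illustration of the stated formulas; the function and parameter names (`f_spike`, `f_box`, `amplitude`, `lam`, `t_start`, `t_end`) are assumptions for readability, not the authors' code.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def f_spike(t: float, amplitude: float, lam: float) -> float:
    """Gaussian-kernel transient: A * exp(-lambda * t^2), centered at t = 0."""
    return amplitude * math.exp(-lam * t * t)


def f_box(t: float, k: float, t_start: float, t_end: float) -> float:
    """Dual-sigmoid state shift: sigma(k(t - t_s)) - sigma(k(t - t_e)).

    The value is close to 1 between t_start and t_end and close to 0
    outside; k controls how sharp the two edges are.
    """
    return sigmoid(k * (t - t_start)) - sigmoid(k * (t - t_end))
```

Because both functions are smooth in their parameters, the height, width, and edge sharpness of an injected pattern can be varied continuously.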
-
[7]
Additive Composition Model. Complex scenarios are synthesized by recursively composing these primitives. The synthetic series $X_{\text{syn}}$ is formulated as:
$X_{\text{syn}}(t) = X_{\text{bg}}(t) + \alpha \cdot T(f(t))$ (5)
where $X_{\text{bg}}$ is a background window selected for stationarity (minimizing variance), $T$ represents temporal transformations (scaling, shifting), and $\alpha$ is an adaptive gain factor. The ...
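A minimal sketch of Eq. (5), assuming the temporal transformation T is a shift and stretch of the time axis and that the gain α is simply passed in by the caller. How the paper adapts α is cut off in this excerpt, so that part is not modeled here. All names below are hypothetical.

```python
import math


def gaussian_spike(t: float) -> float:
    """Unit-height Gaussian primitive, used here only as an example f(t)."""
    return math.exp(-t * t)


def inject(background, primitive, alpha, shift=0.0, scale=1.0):
    """Return X_syn(t) = X_bg(t) + alpha * T(f(t)) for t = 0, 1, 2, ...

    T is modeled (as an assumption) by evaluating the primitive at
    (t - shift) / scale, i.e. moving and stretching it along the time axis.
    """
    return [
        x + alpha * primitive((t - shift) / scale)
        for t, x in enumerate(background)
    ]


background = [1.0] * 5  # a flat, stationary background window
synthetic = inject(background, gaussian_spike, alpha=2.0, shift=2.0)
```

Because the pattern is added to a known background at a known position, the ground-truth location and type of each injected event are available by construction.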
-
[8]
Scalar & Timestamp Accuracy. For atomic retrieval tasks (Level 1) returning numerical values or timestamps, we employ a tolerance-based accuracy metric.
  • Scalars: We calculate the Relative Accuracy to handle varying scales. Given a ground truth $y$ and prediction $\hat{y}$:
  $\text{Score}_{\text{scalar}}(y, \hat{y}) = \max\left(0,\ 1 - \frac{|y - \hat{y}|}{|y| + \epsilon}\right)$ (6)
  where $\epsilon = 10^{-9}$ prevents division by zero.
  • Timestamp...
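The scalar part of this metric, Eq. (6), translates directly into code. The function name `score_scalar` is an illustrative choice; the timestamp variant is truncated in this excerpt and is therefore not sketched.

```python
def score_scalar(y: float, y_hat: float, eps: float = 1e-9) -> float:
    """Relative accuracy from Eq. (6): max(0, 1 - |y - y_hat| / (|y| + eps))."""
    return max(0.0, 1.0 - abs(y - y_hat) / (abs(y) + eps))
```

A prediction that is off by 10% of the true value scores about 0.9, and any prediction whose error exceeds the magnitude of the true value is clipped to 0.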
-
[9]
Interval Intersection over Union (IoU). For time range identification tasks (Level 1 & 2), we evaluate the temporal overlap between the predicted interval $I_p = [t^p_{\text{start}}, t^p_{\text{end}}]$ and the ground truth $I_g = [t^g_{\text{start}}, t^g_{\text{end}}]$:
$\text{IoU}(I_p, I_g) = \frac{\text{duration}(I_p \cap I_g)}{\text{duration}(I_p \cup I_g)}$ (7)
where duration(·) denotes the time difference in seconds. Non-overlapping intervals yield a score of 0 ...
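Eq. (7) can be sketched with standard-library datetimes. Representing an interval as a `(start, end)` tuple is an assumption made for this example, as is the function name `interval_iou`.

```python
from datetime import datetime


def interval_iou(pred, truth) -> float:
    """Temporal IoU of two (start, end) datetime intervals, in seconds."""
    p_start, p_end = pred
    g_start, g_end = truth
    # The overlap is negative when the intervals are disjoint; clip it to zero.
    inter = max(0.0, (min(p_end, g_end) - max(p_start, g_start)).total_seconds())
    union = (
        (p_end - p_start).total_seconds()
        + (g_end - g_start).total_seconds()
        - inter
    )
    return inter / union if union > 0 else 0.0


pred = (datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 10))
truth = (datetime(2024, 1, 1, 5), datetime(2024, 1, 1, 15))
overlap_score = interval_iou(pred, truth)  # 5 h shared out of 15 h covered
```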
-
[10]
Set F1-Score. For tasks requiring a list of discrete dates (e.g., Top-K search in Level 3), we treat the output as an unordered set and compute the F1-score:
$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (8)
where Precision is the fraction of correct dates in the prediction, and Recall is the fraction of ground truth dates successfully retrieved.
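A short sketch of Eq. (8) over unordered sets of date strings. Returning 0 when either set is empty or nothing matches is a convention assumed here; the excerpt does not say how the paper handles those cases.

```python
def set_f1(predicted, truth) -> float:
    """F1 between two collections treated as unordered sets (Eq. 8)."""
    pred, gold = set(predicted), set(truth)
    hits = len(pred & gold)
    if hits == 0:
        # Assumed convention: no overlap (or an empty set) scores 0.
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


score = set_f1(["2024-01-01", "2024-01-02"], ["2024-01-02", "2024-01-03"])
```

Order and duplicates in the predicted list do not affect the score, which matches the "unordered set" treatment described above.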
-
[11]
Composite Report Score (Level 4). Evaluating natural language reports is challenging. Instead of relying on generic text metrics like BLEU or ROUGE, we utilize a structure-aware Composite Score. Crucially, as defined in Level 4 (see §A.2.2), our prompts explicitly enforce a strict output schema and a controlled vocabulary (e.g., mandating phrases like "rapi...
-
[12]
Piecewise Aggregate Approximation (PAA). We first reduce the dimensionality of the raw time series sequence $C = \{c_1, \ldots, c_n\}$ within a window into a vector of length $w$, denoted as $\bar{C} = \{\bar{c}_1, \ldots, \bar{c}_w\}$. The choice of $w$ adapts to the temporal granularity to preserve semantic interpretability:
  • Yearly View: $w = 12$, aligning with months.
  • Monthly View: $w$ is dyn...
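The PAA step can be illustrated as a segment-wise mean. The sketch below splits the window into `w` nearly equal contiguous segments using integer boundaries, which is one common way to handle lengths that are not a multiple of `w`; whether the paper uses this or a calendar-aligned split is not visible in this excerpt.

```python
def paa(series, w: int):
    """Reduce a sequence of n values to w segment means (PAA)."""
    n = len(series)
    if not 1 <= w <= n:
        raise ValueError("w must satisfy 1 <= w <= len(series)")
    means = []
    for i in range(w):
        # Integer boundaries give segments whose sizes differ by at most one.
        lo = i * n // w
        hi = (i + 1) * n // w
        segment = series[lo:hi]
        means.append(sum(segment) / len(segment))
    return means


reduced = paa([1, 2, 3, 4, 5, 6], 3)
```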
-
[13]
Symbolic Mapping. The PAA vector $\bar{C}$ is Z-normalized to have a mean of zero and a standard deviation of one. We then map each coefficient $\bar{c}_i$ to a symbol $s_i$ from an alphabet $\Sigma$ of size $\alpha$ (in our implementation, $\alpha = 5$, $\Sigma = \{\text{'a'}, \text{'b'}, \text{'c'}, \text{'d'}, \text{'e'}\}$). The mapping is defined by a set of breakpoints $\beta = \{\beta_0, \ldots, \beta_\alpha\}$, which divide the area under the Normal distr...
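A hedged sketch of the symbolic mapping, assuming the standard SAX convention that the breakpoints cut the standard Normal density into α equiprobable regions (the excerpt is truncated before it says so explicitly). The use of the population standard deviation and the handling of a constant vector are further assumptions of this example.

```python
from statistics import NormalDist, mean, pstdev


def sax_word(paa_vector, alphabet: str = "abcde") -> str:
    """Z-normalize a PAA vector and map each coefficient to a symbol."""
    alpha = len(alphabet)
    # Interior breakpoints beta_1 .. beta_{alpha-1}; beta_0 and beta_alpha
    # are -infinity and +infinity and need no explicit representation.
    breakpoints = [NormalDist().inv_cdf(i / alpha) for i in range(1, alpha)]
    mu, sigma = mean(paa_vector), pstdev(paa_vector)
    if sigma == 0:
        # Assumed convention: a flat window maps to the middle symbol.
        return alphabet[alpha // 2] * len(paa_vector)
    word = []
    for c in paa_vector:
        z = (c - mu) / sigma
        index = sum(z >= b for b in breakpoints)  # breakpoints at or below z
        word.append(alphabet[index])
    return "".join(word)


word = sax_word([-2.0, -0.5, 0.0, 0.5, 2.0])
```

The resulting string is short enough to store in a feature table column and to search with ordinary string predicates.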
-
[14]
Task Planner Prompt. The Planner is responsible for selecting the execution mode and the data source. As shown in the prompt below, the model is explicitly guided to distinguish between the "High IO" raw data table (Wide) and the "Low IO" feature tables (Long) to optimize retrieval efficiency.
Task Planner Prompt
You are a Task Planner for Time Series Anal...
-
[15]
Raw Data (data): High I/O cost. Use ONLY when the time range is explicitly known and precise values are needed
-
[16]
Feature Tables (feature_*): Low I/O cost. Contains precomputed stats (avg, std) and shape descriptors (sax). Use for searching patterns or scanning large historical ranges.
--- EXPERIENCES ---
{experiences}
--- User Question ---
{Input Question}
--- PLANNING STRATEGY ---
Analyze the query constraints to decide the Pipeline Mode:
Mode A: DIRECT ACCESS (Fetch → Compute) ...
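To make the split between the cheap feature tables and the expensive raw table concrete, here is a toy search-then-verify loop on an in-memory SQLite database. Everything in it is hypothetical: the `feature_window` table and its columns, the thresholds, and the spike check are illustrative stand-ins, not the schema or the verification programs used by Sonar-TS.

```python
import sqlite3
from statistics import mean, median, pstdev

WINDOW = 6
raw = [
    10, 10, 10, 10, 10, 10,  # window 0: flat
    10, 12, 9, 11, 10, 12,   # window 1: mild noise
    10, 10, 30, 10, 10, 10,  # window 2: single sharp spike
    10, 10, 10, 30, 30, 30,  # window 3: level shift, not a spike
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (ts INTEGER PRIMARY KEY, value REAL)")
conn.execute(
    "CREATE TABLE feature_window ("
    "window_id INTEGER PRIMARY KEY, ts_start INTEGER, ts_end INTEGER, "
    "avg REAL, std REAL)"
)
conn.executemany("INSERT INTO data VALUES (?, ?)", list(enumerate(raw)))
for w in range(len(raw) // WINDOW):
    chunk = raw[w * WINDOW:(w + 1) * WINDOW]
    conn.execute(
        "INSERT INTO feature_window VALUES (?, ?, ?, ?, ?)",
        (w, w * WINDOW, (w + 1) * WINDOW - 1, mean(chunk), pstdev(chunk)),
    )


def search(conn, min_std):
    """Search stage: cheap SQL filter over precomputed window statistics."""
    rows = conn.execute(
        "SELECT window_id, ts_start, ts_end FROM feature_window "
        "WHERE std > ? ORDER BY window_id",
        (min_std,),
    )
    return rows.fetchall()


def verify_spike(values, min_height):
    """Verify stage: exactly one point rises far above the window median."""
    base = median(values)
    return sum(v - base > min_height for v in values) == 1


def search_then_verify(conn, min_std=5.0, min_height=15.0):
    matches = []
    for window_id, ts_start, ts_end in search(conn, min_std):
        # Raw points are fetched only for windows that survived the search.
        rows = conn.execute(
            "SELECT value FROM data WHERE ts BETWEEN ? AND ? ORDER BY ts",
            (ts_start, ts_end),
        )
        if verify_spike([v for (v,) in rows], min_height):
            matches.append(window_id)
    return matches


candidates = [row[0] for row in search(conn, 5.0)]
matches = search_then_verify(conn)
```

In this toy data the SQL filter keeps windows 2 and 3 because both have high variance, and the raw-signal check then rejects window 3, whose variance comes from a level shift rather than a spike. Raw reads therefore grow with the number of candidates, not with the length of the history.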
-
[17]
Code Generator Prompt. Following the plan, the Code Generator constructs the executable queries. It operates in two modes: Generation (synthesizing the initial code) and Refinement (fixing errors based on runtime feedback). In the generation phase, strict environmental constraints and a domain-specific operator library are enforced to ground the model’s lo...
-
[18]
Execution Environment: You have access to pandas (pd), numpy (np), and scipy. The variable "conn" is already defined and connected to the database.
  - DO NOT create a new connection. DO NOT close "conn".
  - Execute SQL using: df = pd.read_sql_query(sql, conn)
-
[19]
Operator Library (Module: sonar_ops): Prioritize using the following pre-defined atomic operators for verification. If a task cannot be solved by these operators, synthesize standard pandas/numpy logic.
  - detect_period(data, max_lag): Estimates the dominant cycle length.
  - find_best_match(query, search, metric): Finds the most similar subsequence via DTW. ...
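The excerpt gives only the signatures of these operators, not their implementations. As a rough idea of what an operator like `detect_period(data, max_lag)` might do, the sketch below estimates the dominant cycle length from the autocorrelation function. This is an assumed technique and a hypothetical stand-in, not the `sonar_ops` code.

```python
import math


def detect_period(data, max_lag):
    """Estimate the dominant cycle length via autocorrelation.

    Returns the lag (in samples) of the highest autocorrelation found after
    the function first drops below zero, or None if no cycle is detected.
    """
    n = len(data)
    mu = sum(data) / n
    x = [v - mu for v in data]
    energy = sum(v * v for v in x)
    if energy == 0:
        return None  # constant series: no cycle to detect
    max_lag = min(max_lag, n - 1)
    acf = [
        sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        for lag in range(1, max_lag + 1)
    ]
    # Skip the initial decay from lag 0, where smooth signals are trivially
    # self-similar, by starting at the first negative autocorrelation.
    first_negative = next((i for i, r in enumerate(acf) if r < 0), None)
    if first_negative is None:
        return None
    best = max(range(first_negative, len(acf)), key=acf.__getitem__)
    return best + 1  # acf[0] corresponds to lag 1


sine = [math.sin(2 * math.pi * t / 20) for t in range(200)]
period = detect_period(sine, max_lag=50)
```

Giving generated code a small set of vetted operators like this means it can call a tested routine rather than re-deriving signal-processing logic on every query.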
-
[20]
Experience Summarizer Prompt. This module acts as a "Technical Lead" conducting a post-mortem. It analyzes the full execution trajectory (including the initial plan, the challenges faced in the error history, and the final working code) to extract a single, high-value insight. This ensures that the system learns not just from success, but from the corrections app...
-
[21]
Experience Updater Prompt. To prevent the experience list from growing indefinitely, the Updater functions as a "Knowledge Curator." It takes the newly extracted insight and merges it into the existing Global Experience Pool. The model is instructed to perform semantic operations (Add if new, Merge if similar, or Discard if redundant), ensuring the knowledge ...
-
[22]
Experience Snapshot. Finally, to illustrate the nature of the learned knowledge, we display a subset of the actual experiences. These insights range from low-level syntax corrections to high-level robustness strategies.
Snapshot of Injected Experiences
  • For structural trend reporting, do not calculate a global slope. Instead, use a divide-and-conquer appr...