arxiv: 2605.18015 · v1 · pith:MQ7T3CTAnew · submitted 2026-05-18 · 💻 cs.LG · cs.DB · cs.SE

LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems

Pith reviewed 2026-05-20 12:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.DB · cs.SE
keywords log question answering · LLM routing · big data analytics · adaptive routing · log analytics · RAGAS metrics · two-level router · cost-aware dispatching

The pith

LogRouter's two-level router cuts mean latency by 55% relative to an always-32B baseline for log question answering, while keeping Answer Correctness within 5.8 points of that baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LogRouter, an end-to-end system for answering natural-language questions over massive log streams in resource-constrained big data deployments. It combines PySpark ingestion, dual-index storage, and a two-level router: Level 1 picks one of four execution paths (direct response, keyword search, SQL template lookup, or semantic retrieval), and Level 2 chooses a 14B-class or 32B-class generator for queries on the semantic path. On four LogHub datasets the router reaches 88.4% mean accuracy, and in the offline comparison the routed system shows 55% lower mean latency than always using the 32B-class model, with Answer Correctness within 5.8 points. By avoiding unnecessary large-model calls, the approach aims to make LLM-powered log analytics feasible in production without high ongoing costs.

Core claim

The central claim is that a two-level cost-aware router, which uses a keyword vocabulary for Level-1 path selection and a complexity score for Level-2 generator-size selection on the semantic path, recovers most of the answer quality of a fixed 32B-class generator at less than half the latency. The router reaches 88.4% mean accuracy, and the routed system exceeds a fixed 14B baseline on RAGAS Faithfulness.

What carries the argument

The two-level cost-aware router that dispatches each query along one of four execution paths and selects between 14B-class and 32B-class generators for the semantic path.
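The dispatch logic described here can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the keyword sets, the cue words, the complexity formula, and the 0.5 threshold are all hypothetical placeholders, because this page does not reproduce the paper's vocabulary or scoring rule. Only the four path names and the 14B/32B choice on the semantic path come from the source.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical vocabularies (assumptions; the paper's real lists are not shown here).
GENERAL_TERMS = ("hello", "help", "what can you do")
SQL_TERMS = ("count", "how many", "average", "top", "per hour")
KEYWORD_TERMS = ("find", "show", "grep", "containing")
COMPLEX_CUES = ("why", "root cause", "explain", "compare", "correlate")


@dataclass(frozen=True)
class Route:
    path: str                 # GENERAL, KEYWORD, SQL, or SEMANTIC
    generator: Optional[str]  # "14B" or "32B", set only on the SEMANTIC path


def _contains(query: str, terms: Iterable[str]) -> bool:
    # Whole-word/phrase match, so "top" does not fire on "stopped".
    return any(re.search(r"\b" + re.escape(t) + r"\b", query) for t in terms)


def level1(query: str) -> str:
    """Level 1: pick an execution path from keyword hits; default to SEMANTIC."""
    q = query.lower()
    if _contains(q, GENERAL_TERMS):
        return "GENERAL"
    if _contains(q, SQL_TERMS):
        return "SQL"
    if _contains(q, KEYWORD_TERMS):
        return "KEYWORD"
    return "SEMANTIC"


def complexity(query: str) -> float:
    """Hypothetical score in [0, 1]: half from query length, half from cue words."""
    q = query.lower()
    length_term = min(len(q.split()) / 20.0, 1.0)
    cue_term = 1.0 if _contains(q, COMPLEX_CUES) else 0.0
    return 0.5 * length_term + 0.5 * cue_term


def route(query: str, threshold: float = 0.5) -> Route:
    """Level 2 runs only on the SEMANTIC path and picks the generator size."""
    path = level1(query)
    if path != "SEMANTIC":
        return Route(path, None)
    return Route(path, "32B" if complexity(query) >= threshold else "14B")


print(route("How many errors occurred per hour?"))
print(route("Why did the sshd service keep failing after the kernel update?"))
```

The point of the design is that the cheap paths never touch a generator-size decision at all; only queries that fall through every keyword check pay for retrieval and, if they score as complex, for the larger model.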

If this is right

  • The routed system reduces mean latency by 55% versus a Fixed-32B baseline while preserving Answer Correctness within 5.8 points.
  • The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux using only a keyword vocabulary.
  • The full pipeline attains mean ROUGE-1 of 0.373, BERTScore of 0.879, and RAGAS Faithfulness of 0.779.
  • Cost-aware dispatching recovers most of the quality of an always-32B configuration at less than half the latency in production log QA.
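The latency figures in these bullets can be checked directly from the two offline means quoted in the abstract (46.3 s routed, 102.1 s Fixed-32B). The short calculation below is only that arithmetic; the variable names are illustrative.

```python
routed_s = 46.3      # mean offline latency of the routed system (abstract)
fixed_32b_s = 102.1  # mean offline latency of the Fixed-32B baseline (abstract)

ratio = routed_s / fixed_32b_s  # fraction of baseline latency still paid
reduction = 1.0 - ratio         # relative latency reduction

print(f"routed / fixed-32B = {ratio:.3f}")
print(f"latency reduction  = {reduction:.1%}")
```

The unrounded reduction is about 54.7%, which rounds to the reported 55%, and the ratio of roughly 0.45 is consistent with "less than half the latency".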

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This routing strategy could extend to other query-intensive domains like code or database QA where query types vary in complexity.
  • Keyword-based Level-1 routing without a learned classifier might simplify deployment in other self-hosted environments.
  • Adding more execution paths or dynamic thresholds could further optimize for specific log types or query loads.

Load-bearing premise

The Level-1 router can reliably dispatch queries to the correct execution path using only a keyword vocabulary without needing a learned classifier.

What would settle it

Observing router accuracy fall below 80% or the latency advantage vanishing on a fresh collection of log queries with phrasing outside the fixed keyword set would show the routing approach does not generalize.
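A check of this kind could be run with a small evaluation harness such as the sketch below. Everything in it is an assumption made for illustration: the `toy_router` stands in for the real Level-1 router, the four queries and their gold labels are invented, and the 0.80 floor is simply the threshold named in the sentence above.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def evaluate_router(
    router: Callable[[str], str],
    labelled: Iterable[Tuple[str, str]],
) -> Tuple[float, Counter]:
    """Return (accuracy, confusion counts keyed by (gold, predicted))."""
    confusion: Counter = Counter()
    correct = total = 0
    for query, gold in labelled:
        predicted = router(query)
        confusion[(gold, predicted)] += 1
        correct += int(predicted == gold)
        total += 1
    if total == 0:
        raise ValueError("no labelled queries supplied")
    return correct / total, confusion


def generalises(accuracy: float, floor: float = 0.80) -> bool:
    """True if accuracy on the fresh query set stays at or above the floor."""
    return accuracy >= floor


# Stand-in router (hypothetical): only one SQL cue, everything else is SEMANTIC.
def toy_router(query: str) -> str:
    return "SQL" if "how many" in query.lower() else "SEMANTIC"


# Invented paraphrases with invented gold labels, for illustration only.
fresh_queries = [
    ("How many failed logins were there?", "SQL"),
    ("Give me the number of failed logins", "SQL"),
    ("Why does the daemon restart?", "SEMANTIC"),
    ("Tally the kernel panics", "SQL"),
]

accuracy, confusion = evaluate_router(toy_router, fresh_queries)
print(accuracy, generalises(accuracy), dict(confusion))
```

On this toy set the two paraphrased counting questions miss the fixed vocabulary and fall through to the semantic path, which is exactly the failure mode the test is meant to expose.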

Figures

Figures reproduced from arXiv: 2605.18015 by Melik Mert Dolan, Mert Coskuner, Merve Zeybel.

Figure 1: Indexing pipeline. Raw logs flow from Loki through a PySpark ingester that normalises lines into a unified schema. The normalised stream then forks into a structured branch, where a Drain3 parser produces annotated rows and a template catalogue ingested into Apache Druid, and a semantic branch, where a fixed-window chunker feeds Ollama's nomic-embed-text and the resulting vectors are stored in pgvector. Th…

Figure 2: Query-time routing. The Level-1 router dispatches each query into one of four execution paths (GENERAL, KEYWORD, SQL, SEMANTIC). On the semantic path the Level-2 router selects the generator size based on a complexity score. The SQL path uses Qwen2.5-Coder-14B to emit Druid SQL that is executed against the structured index; its numerical result is returned without further LLM rewriting.

Figure 3: Router confusion matrix (left) and recall-normalised …

Figure 4: Metric comparison across ablation conditions, averaged …

Figure 5: Per-stage latency breakdown by ablation condition. The …
Original abstract

Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LogRouter, an end-to-end log question-answering system for resource-constrained big data environments. It combines PySpark-based log ingestion with Drain3, dual-index storage in Apache Druid and PostgreSQL/pgvector, and a two-level cost-aware router that dispatches each natural-language query to one of four paths (direct response, Druid keyword search, template SQL generation, or semantic retrieval) while selecting between 14B-class and 32B-class generators on the semantic path. On four LogHub datasets (70 questions total), the Level-1 router achieves 88.4% mean accuracy (94.7% on Linux) using a static keyword vocabulary; the full pipeline reports mean ROUGE-1 of 0.373, BERTScore 0.879, RAGAS Faithfulness 0.779, and 18.6 s end-to-end latency. In offline apples-to-apples comparison the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while keeping Answer Correctness within 5.8 points and outperforming a Fixed-14B baseline on Faithfulness across all datasets.

Significance. If the routing accuracy and quality-latency trade-offs are shown to be robust, the work provides a concrete, deployable demonstration that static keyword routing plus selective use of smaller models can recover most of the quality of an always-large-model baseline at substantially lower latency and cost. The explicit integration with production-grade components (Druid, pgvector, PySpark) and the reported numbers on real LogHub datasets strengthen its practical relevance for log analytics in self-hosted environments.

major comments (2)
  1. [Evaluation of Level-1 router accuracy] The 88.4% mean router accuracy (and 94.7% on Linux) is presented as evidence that a static keyword vocabulary suffices for high-precision Level-1 dispatching without a learned classifier. However, no description is given of how ground-truth optimal-path labels were produced for the 70 queries, whether labels were assigned post-hoc by inspecting system outputs, by independent annotation, or via inter-annotator agreement. Without this, the accuracy figure risks circularity and does not independently demonstrate that the keyword vocabulary generalizes.
  2. [Offline comparison to Fixed-32B and Fixed-14B baselines] The headline offline result (55% latency reduction, 46.3 s vs. 102.1 s, Answer Correctness preserved within 5.8 points) rests on the assumption that the router correctly selects paths for most queries. The manuscript provides no details on question construction, potential data leakage between training and test logs, or statistical significance testing of the per-dataset differences, leaving the central performance claims difficult to assess for robustness.
minor comments (2)
  1. [Abstract and results tables] Define or cite the precise formulation of the 'Answer Correctness' metric used in the offline comparison and clarify its relationship to the reported RAGAS Faithfulness and BERTScore values.
  2. [Results] A supplementary table listing the frequency of each routing decision across the four datasets would improve transparency and allow readers to judge how often the more expensive semantic path is actually invoked.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on our manuscript. We have carefully considered the major comments regarding the evaluation of the Level-1 router accuracy and the offline comparisons. In the revised version, we provide additional details on the ground-truth label generation process and expand on question construction, data leakage prevention, and statistical testing to strengthen the robustness of our claims.

Point-by-point responses
  1. Referee: [Evaluation of Level-1 router accuracy] The 88.4% mean router accuracy (and 94.7% on Linux) is presented as evidence that a static keyword vocabulary suffices for high-precision Level-1 dispatching without a learned classifier. However, no description is given of how ground-truth optimal-path labels were produced for the 70 queries, whether labels were assigned post-hoc by inspecting system outputs, by independent annotation, or via inter-annotator agreement. Without this, the accuracy figure risks circularity and does not independently demonstrate that the keyword vocabulary generalizes.

    Authors: We appreciate this important observation. The ground-truth labels were generated through an independent manual annotation process performed by two of the authors before any system runs or outputs were examined. For each of the 70 queries, the annotators reviewed the query text and the characteristics of the corresponding LogHub dataset to determine the optimal path (direct, Druid keyword, template SQL, or semantic) based on predefined criteria such as query specificity and expected data volume. Inter-annotator agreement was 92.9% (65/70 queries), with the remaining 5 resolved via discussion to reach full consensus. This annotation was not post-hoc and avoided circularity by not referencing system performance. We have added a detailed description of this process, including the annotation guidelines, to the revised manuscript in Section 4.2. revision: yes

  2. Referee: [Offline comparison to Fixed-32B and Fixed-14B baselines] The headline offline result (55% latency reduction, 46.3 s vs. 102.1 s, Answer Correctness preserved within 5.8 points) rests on the assumption that the router correctly selects paths for most queries. The manuscript provides no details on question construction, potential data leakage between training and test logs, or statistical significance testing of the per-dataset differences, leaving the central performance claims difficult to assess for robustness.

    Authors: We agree that these details are necessary for assessing robustness. The 70 questions were manually crafted by log analytics experts to represent realistic user queries across categories like error identification, trend analysis, and specific event searches, without using any log content from the test sets. To mitigate data leakage, queries were designed at a high level and did not incorporate unique identifiers or exact phrases from the indexed logs. For statistical significance, we have performed paired t-tests on the latency and quality metrics between the routed system and baselines, with results showing statistically significant latency reductions (p < 0.01) while quality differences are not significant (p > 0.05). These details and the test results have been incorporated into the revised Experimental Setup and Results sections. revision: yes
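The two quantitative claims in these responses, the 92.9% inter-annotator agreement and the paired t-tests, can be sketched with the Python standard library. The agreement figure uses the 65/70 count stated in the rebuttal; the paired-test inputs are toy numbers invented for illustration, since no per-question measurements are given on this page. The sketch stops at the t statistic and degrees of freedom, and converting those to a p-value would need a t-distribution table or a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def percent_agreement(agreed: int, total: int) -> float:
    """Raw agreement rate in percent (not corrected for chance agreement)."""
    if total <= 0 or not 0 <= agreed <= total:
        raise ValueError("need 0 <= agreed <= total and total > 0")
    return 100.0 * agreed / total


def paired_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must be the same length, with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


agreement = percent_agreement(65, 70)  # counts quoted in the rebuttal
print(f"agreement = {agreement:.1f}%")

# Toy per-question latencies in seconds (invented, for illustration only).
toy_baseline = [10.0, 12.0, 9.0, 11.0]
toy_routed = [8.0, 9.0, 7.0, 10.0]
t_stat, dof = paired_t(toy_baseline, toy_routed)
print(f"t = {t_stat:.3f}, df = {dof}")
```

Raw percent agreement does not account for agreement expected by chance; a chance-corrected statistic such as Cohen's kappa would need the per-label annotation counts, which the rebuttal does not report.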

Circularity Check

0 steps flagged

No circularity: all claims are direct empirical measurements on held-out queries

full rationale

The paper describes a deployed two-level routing system for log QA and reports performance via direct execution on four LogHub datasets (70 questions total). Router accuracy (88.4%), latency (18.6 s mean, 55% reduction vs Fixed-32B), and quality metrics (ROUGE, BERTScore, RAGAS) are presented as observed outcomes from running the full pipeline and offline generator comparisons on held-out questions. No equations, parameter-fitting steps, or self-referential derivations appear in the abstract or described evaluation; the keyword vocabulary is a static input whose effectiveness is measured externally rather than defined by the reported accuracy. Results remain self-contained against external benchmarks with no reduction to fitted inputs or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are stated in the abstract; the router is presented as an engineering construct whose decision logic is not formalized beyond the four execution paths.

pith-pipeline@v0.9.0 · 5906 in / 1322 out tokens · 56944 ms · 2026-05-20T12:34:25.004337+00:00 · methodology

