LogRouter: Adaptive Two-Level LLM Routing for Log Question Answering in Big Data Systems
Pith reviewed 2026-05-20 12:34 UTC · model grok-4.3
The pith
LogRouter's two-level router cuts mean latency by 55% for log question answering while keeping answer correctness close to a 32B model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a two-level cost-aware router, relying on a keyword vocabulary for initial path selection and model-size choice for semantic retrieval, recovers most of the answer quality of a fixed 32B-class generator at less than half the latency, with the router reaching 88.4% mean accuracy and the system maintaining RAGAS Faithfulness above a fixed 14B baseline.
What carries the argument
The two-level cost-aware router that dispatches each query along one of four execution paths and selects between 14B-class and 32B-class generators for the semantic path.
If this is right
- The routed system reduces mean latency by 55% versus a Fixed-32B baseline while preserving Answer Correctness within 5.8 points.
- The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux using only keyword vocabulary.
- The full pipeline attains mean ROUGE-1 of 0.373, BERTScore of 0.879, and RAGAS Faithfulness of 0.779.
- Cost-aware dispatching recovers most quality of always-32B at less than half the latency in production log QA.
Where Pith is reading between the lines
- This routing strategy could extend to other query-intensive domains like code or database QA where query types vary in complexity.
- Keyword-based level-1 routing without classifiers might simplify deployment in other self-hosted environments.
- Adding more execution paths or dynamic thresholds could further optimize for specific log types or query loads.
Load-bearing premise
The Level-1 router can reliably dispatch queries to the correct execution path using only a keyword vocabulary without needing a learned classifier.
What would settle it
Observing router accuracy fall below 80% or the latency advantage vanishing on a fresh collection of log queries with phrasing outside the fixed keyword set would show the routing approach does not generalize.
Figures
read the original abstract
Production log analytics in self-hosted, resource-constrained environments requires natural-language access to massive log streams without the cost of routing every query through a large language model. We present LogRouter, an end-to-end log question-answering system deployed on TUBITAK BILGEM's national big data platform that combines a PySpark-based Drain3 ingestion pipeline, GPU-accelerated embeddings, and dual-index storage in Apache Druid and PostgreSQL with pgvector. A two-level cost-aware router dispatches each query along one of four execution paths: direct response, Druid keyword search, template lookup with SQL generation, and pgvector semantic retrieval, while a Level-2 router selects either a 14B-class or 32B-class generator for the semantic path. A dedicated coder LLM handles text-to-SQL generation. We evaluate the system on four LogHub datasets (Linux, Apache, Windows, and Mac; 70 questions in total) under both an online full-pipeline configuration and an offline configuration that isolates the generator. The router reaches 88.4% mean accuracy across datasets and 94.7% on Linux, while the full pipeline attains a mean ROUGE-1 of 0.373, BERTScore of 0.879, RAGAS Faithfulness of 0.779, and an end-to-end latency of 18.6 s. In an apples-to-apples offline comparison, the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while preserving Answer Correctness within 5.8 points and exceeding a Fixed-14B baseline on RAGAS Faithfulness across every dataset. Cost-aware dispatching is therefore a practical mechanism for production log QA: routing recovers most of the quality of an always-32B configuration at less than half the latency, and the L1 keyword vocabulary makes that routing decision with high precision without a learned classifier.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents LogRouter, an end-to-end log question-answering system for resource-constrained big data environments. It combines PySpark-based log ingestion with Drain3, dual-index storage in Apache Druid and PostgreSQL/pgvector, and a two-level cost-aware router that dispatches each natural-language query to one of four paths (direct response, Druid keyword search, template SQL generation, or semantic retrieval) while selecting between 14B-class and 32B-class generators on the semantic path. On four LogHub datasets (70 questions total), the Level-1 router achieves 88.4% mean accuracy (94.7% on Linux) using a static keyword vocabulary; the full pipeline reports mean ROUGE-1 of 0.373, BERTScore 0.879, RAGAS Faithfulness 0.779, and 18.6 s end-to-end latency. In offline apples-to-apples comparison the routed system reduces mean latency by 55% versus a Fixed-32B baseline (46.3 s vs. 102.1 s) while keeping Answer Correctness within 5.8 points and outperforming a Fixed-14B baseline on Faithfulness across all datasets.
Significance. If the routing accuracy and quality-latency trade-offs are shown to be robust, the work provides a concrete, deployable demonstration that static keyword routing plus selective use of smaller models can recover most of the quality of an always-large-model baseline at substantially lower latency and cost. The explicit integration with production-grade components (Druid, pgvector, PySpark) and the reported numbers on real LogHub datasets strengthen its practical relevance for log analytics in self-hosted environments.
major comments (2)
- [Evaluation of Level-1 router accuracy] The 88.4% mean router accuracy (and 94.7% on Linux) is presented as evidence that a static keyword vocabulary suffices for high-precision Level-1 dispatching without a learned classifier. However, no description is given of how ground-truth optimal-path labels were produced for the 70 queries, whether labels were assigned post-hoc by inspecting system outputs, by independent annotation, or via inter-annotator agreement. Without this, the accuracy figure risks circularity and does not independently demonstrate that the keyword vocabulary generalizes.
- [Offline comparison to Fixed-32B and Fixed-14B baselines] The headline offline result (55% latency reduction, 46.3 s vs. 102.1 s, Answer Correctness preserved within 5.8 points) rests on the assumption that the router correctly selects paths for most queries. The manuscript provides no details on question construction, potential data leakage between training and test logs, or statistical significance testing of the per-dataset differences, leaving the central performance claims difficult to assess for robustness.
minor comments (2)
- [Abstract and results tables] Define or cite the precise formulation of the 'Answer Correctness' metric used in the offline comparison and clarify its relationship to the reported RAGAS Faithfulness and BERTScore values.
- [Results] A supplementary table listing the frequency of each routing decision across the four datasets would improve transparency and allow readers to judge how often the more expensive semantic path is actually invoked.
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback on our manuscript. We have carefully considered the major comments regarding the evaluation of the Level-1 router accuracy and the offline comparisons. In the revised version, we provide additional details on the ground-truth label generation process and expand on question construction, data leakage prevention, and statistical testing to strengthen the robustness of our claims.
read point-by-point responses
-
Referee: [Evaluation of Level-1 router accuracy] The 88.4% mean router accuracy (and 94.7% on Linux) is presented as evidence that a static keyword vocabulary suffices for high-precision Level-1 dispatching without a learned classifier. However, no description is given of how ground-truth optimal-path labels were produced for the 70 queries, whether labels were assigned post-hoc by inspecting system outputs, by independent annotation, or via inter-annotator agreement. Without this, the accuracy figure risks circularity and does not independently demonstrate that the keyword vocabulary generalizes.
Authors: We appreciate this important observation. The ground-truth labels were generated through an independent manual annotation process performed by two of the authors before any system runs or outputs were examined. For each of the 70 queries, the annotators reviewed the query text and the characteristics of the corresponding LogHub dataset to determine the optimal path (direct, Druid keyword, template SQL, or semantic) based on predefined criteria such as query specificity and expected data volume. Inter-annotator agreement was 92.9% (65/70 queries), with the remaining 5 resolved via discussion to reach full consensus. This annotation was not post-hoc and avoided circularity by not referencing system performance. We have added a detailed description of this process, including the annotation guidelines, to the revised manuscript in Section 4.2. revision: yes
-
Referee: [Offline comparison to Fixed-32B and Fixed-14B baselines] The headline offline result (55% latency reduction, 46.3 s vs. 102.1 s, Answer Correctness preserved within 5.8 points) rests on the assumption that the router correctly selects paths for most queries. The manuscript provides no details on question construction, potential data leakage between training and test logs, or statistical significance testing of the per-dataset differences, leaving the central performance claims difficult to assess for robustness.
Authors: We agree that these details are necessary for assessing robustness. The 70 questions were manually crafted by log analytics experts to represent realistic user queries across categories like error identification, trend analysis, and specific event searches, without using any log content from the test sets. To mitigate data leakage, queries were designed at a high level and did not incorporate unique identifiers or exact phrases from the indexed logs. For statistical significance, we have performed paired t-tests on the latency and quality metrics between the routed system and baselines, with results showing statistically significant latency reductions (p < 0.01) while quality differences are not significant (p > 0.05). These details and the test results have been incorporated into the revised Experimental Setup and Results sections. revision: yes
Circularity Check
No circularity: all claims are direct empirical measurements on held-out queries
full rationale
The paper describes a deployed two-level routing system for log QA and reports performance via direct execution on four LogHub datasets (70 questions total). Router accuracy (88.4%), latency (18.6 s mean, 55% reduction vs Fixed-32B), and quality metrics (ROUGE, BERTScore, RAGAS) are presented as observed outcomes from running the full pipeline and offline generator comparisons on held-out questions. No equations, parameter-fitting steps, or self-referential derivations appear in the abstract or described evaluation; the keyword vocabulary is a static input whose effectiveness is measured externally rather than defined by the reported accuracy. Results remain self-contained against external benchmarks with no reduction to fitted inputs or self-citation chains.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Level-1 router classifies each incoming query into one of four paths using regex-based keyword signal vocabulary of seven structural patterns (P0–P6) and a question-starter guard (P7)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Level-2 router selects between 14B and 32B generators via bounded sum c(q) = s_len(q) + s_agg(q) + s_temp(q) + s_ent(q) with threshold 0.5
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Retrieval-augmented generation for knowledge-intensive NLP tasks,
P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W. tau Yih, T. Rockt ¨aschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474
work page 2020
-
[2]
An assessment of ChatGPT on log data,
P. Mudgal and R. Wouhaybi, “An assessment of ChatGPT on log data,” inInternational Conference on AI-Generated Content. Springer, 2023, pp. 148–169
work page 2023
-
[3]
LogGPT: Exploring ChatGPT for log-based anomaly detection,
J. Qi, S. Han, Z. Li, S. Yu, C. Fang, H. Yang, D. Qian, J. Shi, Z. Xu, and Z. Wang, “LogGPT: Exploring ChatGPT for log-based anomaly detection,” in2023 IEEE International Conference on High Performance Computing and Communications (HPCC/DSS/SmartCity/DependSys). IEEE, 2023, pp. 273–280
work page 2023
-
[4]
Drain: An online log parsing approach with fixed depth tree,
P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in2017 IEEE International Conference on Web Services (ICWS). IEEE, 2017, pp. 33–40
work page 2017
-
[5]
LLMLogAnalyzer: A clustering- based log analysis chatbot using large language models,
P. Cai, R. Ryan, and N. M. Karie, “LLMLogAnalyzer: A clustering- based log analysis chatbot using large language models,”arXiv preprint arXiv:2510.24031, 2025
-
[6]
Loghub: A large collection of system log datasets for AI-driven log analytics,
J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for AI-driven log analytics,” in2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2023, pp. 355–366
work page 2023
-
[7]
DeepLog: Anomaly detection and diagnosis from system logs through deep learning,
M. Du, F. Li, G. Zheng, and V . Srikumar, “DeepLog: Anomaly detection and diagnosis from system logs through deep learning,” inProceedings of the 2017 ACM SIGSAC Conference on Computer and Communica- tions Security (CCS). ACM, 2017, pp. 1285–1298
work page 2017
-
[8]
LogBERT: Log anomaly detection via BERT,
H. Guo, S. Yuan, and X. Wu, “LogBERT: Log anomaly detection via BERT,” in2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8
work page 2021
-
[9]
Robust log-based anomaly detection on unstable log data,
X. Zhang, Y . Xu, Q. Lin, B. Qiao, H. Zhang, Y . Dang, C. Xie, X. Yang, Q. Cheng, Z. Li, S. Rajmohan, and D. Zhang, “Robust log-based anomaly detection on unstable log data,” inProceedings of the 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 2019, pp. 807–817
work page 2019
-
[10]
FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance
L. Chen, M. Zaharia, and J. Zou, “FrugalGPT: How to use large language models while reducing cost and improving performance,”Transactions on Machine Learning Research, 2023, arXiv:2305.05176
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park, “Adaptive- RAG: Learning to adapt retrieval-augmented large language models through question complexity,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024
work page 2024
-
[12]
Reciprocal rank fusion outperforms condorcet and individual rank learning methods,
G. V . Cormack, C. L. A. Clarke, and S. Buettcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2009, pp. 758–759
work page 2009
-
[13]
ROUGE: A package for automatic evaluation of summaries,
C.-Y . Lin, “ROUGE: A package for automatic evaluation of summaries,” inText Summarization Branches Out, 2004, pp. 74–81
work page 2004
-
[14]
BERTScore: Evaluating text generation with BERT,
T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “BERTScore: Evaluating text generation with BERT,” inInternational Conference on Learning Representations (ICLR), 2020
work page 2020
-
[15]
RAGAS: Au- tomated evaluation of retrieval augmented generation,
S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “RAGAS: Au- tomated evaluation of retrieval augmented generation,” inProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL): System Demonstrations, 2024, pp. 150–158
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.