Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU
Pith reviewed 2026-05-08 03:33 UTC · model grok-4.3
The pith
Adaptive ToR routes multi-intent queries via a complexity index to single-step or hierarchical tree paths, yielding higher accuracy at lower cost than fixed-depth retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adaptive ToR computes a Query Complexity Index from weighted linguistic signals and routes each query to either a rapid single-step path or an adaptive-depth hierarchical path. For complex queries, the Tree-Based Retrieval module recursively decomposes the query into focused sub-queries; the Adaptive Pruning Module applies two-stage filtering, quantitative similarity gating plus semantic relevance evaluation, to suppress exponential node growth; and a Retrieval Reranking Layer uses a deduplicator-first pipeline with global LLM rescoring. On the NLU++ benchmark of 2,693 multi-intent queries across Banking and Hotel domains, the system records 29.07 percent Subset Accuracy and 71.79 percent Micro-F1, a 9.7 percent relative improvement over fixed-depth baselines, while reducing latency by 37.6 percent, LLM invocations by 43.0 percent, and token consumption by 9.8 percent.
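The routing step can be sketched as follows. This is a minimal illustration only: the feature names, weights, and threshold are assumptions for the sketch, since the paper's abstract does not specify how the Query Complexity Index is parameterized.

```python
# Hypothetical sketch of complexity-aware routing: the Query Complexity
# Index (QCI) is modeled as a weighted sum of linguistic signals, and
# queries above a threshold go to the hierarchical tree path. Feature
# names, weights, and the 0.5 threshold are illustrative, not from the
# paper.

def complexity_index(features: dict[str, float],
                     weights: dict[str, float]) -> float:
    """Weighted sum of linguistic signals (e.g. clause count,
    conjunction density, number of detected intent spans)."""
    return sum(weights[name] * features.get(name, 0.0) for name in weights)

def route(features: dict[str, float],
          weights: dict[str, float],
          threshold: float = 0.5) -> str:
    """Route to the rapid single-step path or the adaptive-depth
    hierarchical path based on the QCI."""
    qci = complexity_index(features, weights)
    return "hierarchical" if qci > threshold else "single_step"

# Illustrative normalized features for a simple and a complex query.
weights = {"clauses": 0.4, "conjunctions": 0.3, "intent_spans": 0.3}
simple_q = {"clauses": 0.2, "conjunctions": 0.0, "intent_spans": 0.3}
complex_q = {"clauses": 0.9, "conjunctions": 0.8, "intent_spans": 1.0}
```

Under these toy weights, `simple_q` scores 0.17 and takes the single-step path, while `complex_q` scores 0.90 and is routed to hierarchical decomposition.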
What carries the argument
The Query Tree Classifier that computes a Query Complexity Index from weighted linguistic signals to route queries dynamically to single-step or hierarchical retrieval paths.
Load-bearing premise
The Query Complexity Index must correctly identify which queries benefit from hierarchical decomposition, without routing errors that the two-stage pruning cannot recover from, and the filtering must preserve all relevant intent information.
What would settle it
A direct comparison on the same queries showing that forced deeper-tree retrieval raises subset accuracy more than the reported mixed routing, or that the pruning step removes ground-truth intents in a measurable fraction of cases.
read the original abstract
Multi-intent natural language understanding requires retrieval systems that simultaneously achieve high accuracy and computational efficiency, yet existing approaches apply either uniform single-step retrieval that compromises recall or fixed-depth hierarchical decomposition that introduces excessive latency regardless of query complexity. This paper proposes Adaptive Tree-of-Retrieval (Adaptive ToR), a complexity-aware retrieval architecture that dynamically configures retrieval topology based on query characteristics. The system integrates four components: (1) a Query Tree Classifier computing a Query Complexity Index from weighted linguistic signals to route queries to either a rapid single-step path or an adaptive-depth hierarchical path; (2) a Tree-Based Retrieval module that recursively decomposes complex queries into focused sub-queries calibrated to predicted complexity; (3) an Adaptive Pruning Module employing two-stage filtering combining quantitative similarity gating with semantic relevance evaluation to suppress exponential node growth; and (4) a Retrieval Reranking Layer featuring a deduplicator-first pipeline and global LLM rescoring for production efficiency. Evaluation on the NLU++ benchmark (2,693 multi-intent queries across Banking and Hotel domains) yields 29.07% Subset Accuracy and 71.79% Micro-F1, a 9.7% relative improvement over fixed-depth baselines, while reducing latency by 37.6%, LLM invocations by 43.0%, and token consumption by 9.8%. Depth-wise analysis reveals that 26.92% of queries resolve within three seconds (2.45s mean latency) via single-step routing (d=0: 37.9% Subset Accuracy, 74.8% Micro-F1), while token consumption scales by 4.9x across depths, validating complexity-aware resource allocation and establishing Pareto-optimal balance across accuracy, latency, and computational efficiency.
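The two-stage pruning described in the abstract can be sketched as below. The thresholds, the child cap, and the relevance judge are illustrative assumptions, not the paper's actual implementation; in the system the second stage would be a semantic evaluation (e.g. an LLM judge) rather than a simple predicate.

```python
# Illustrative sketch of two-stage adaptive pruning: stage 1 gates
# candidate tree nodes on a quantitative similarity score; stage 2 keeps
# only nodes that a (mocked) semantic relevance judge accepts, capping
# how many children survive each expansion to suppress exponential node
# growth. The 0.6 threshold and max_children=3 cap are hypothetical.
from typing import Callable

def prune(nodes: list[tuple[str, float]],
          is_relevant: Callable[[str], bool],
          sim_threshold: float = 0.6,
          max_children: int = 3) -> list[str]:
    # Stage 1: quantitative similarity gating.
    gated = [(text, sim) for text, sim in nodes if sim >= sim_threshold]
    # Stage 2: semantic relevance evaluation, best-first, keeping at
    # most max_children nodes per expansion.
    kept = [text for text, _ in sorted(gated, key=lambda x: -x[1])
            if is_relevant(text)]
    return kept[:max_children]

# Candidate sub-queries with cosine-style similarity scores (made up).
candidates = [("check card fees", 0.82), ("weather today", 0.31),
              ("book a room", 0.75), ("transfer money", 0.68),
              ("cancel booking", 0.66), ("lost card", 0.61)]
survivors = prune(candidates, is_relevant=lambda t: "weather" not in t)
```

With these toy inputs, the off-topic candidate is dropped at stage 1 and the cap limits the surviving children to the three highest-scoring relevant sub-queries, keeping node growth linear rather than exponential in depth.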
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Adaptive ToR, a complexity-aware retrieval architecture for multi-intent NLU that uses a Query Tree Classifier to compute a Query Complexity Index and route queries to either single-step (d=0) or adaptive-depth hierarchical paths, combined with tree-based decomposition, two-stage adaptive pruning, and reranking. On the NLU++ benchmark (2,693 queries), it reports 29.07% Subset Accuracy and 71.79% Micro-F1 (9.7% relative gain over fixed-depth baselines), with 37.6% latency reduction, 43.0% fewer LLM invocations, and 9.8% lower token use; depth-wise results show 26.92% of queries handled at d=0 with 37.9% Subset Accuracy.
Significance. If the empirical results hold after addressing validation gaps, the work would be significant for efficient multi-intent NLU systems by demonstrating how query-complexity routing can deliver Pareto-optimal tradeoffs between accuracy and computational cost, moving beyond uniform single-step or fixed-depth methods.
major comments (3)
- [Abstract] Abstract and depth-wise analysis: no accuracy, precision, or error-rate metrics are reported for the Query Tree Classifier's depth predictions or Query Complexity Index. This is load-bearing for the central claim, as the 9.7% Subset Accuracy lift and efficiency gains are attributed to adaptive routing between d=0 and deeper paths; without classifier performance data, the improvements could stem primarily from the pruning/reranking modules rather than adaptivity, and misrouted queries would erode the claimed benefits.
- [Evaluation] Evaluation description: the abstract states specific numerical improvements (29.07% Subset Accuracy, 71.79% Micro-F1, 37.6% latency reduction) but provides no details on baseline definitions, statistical significance tests, error bars, data splits, or how the Query Complexity Index was trained/validated. This leaves the central empirical claims weakly supported.
- [Method] Adaptive Pruning Module description: the claim that two-stage filtering (quantitative similarity gating + semantic relevance) suppresses exponential growth while preserving all relevant information lacks supporting ablations or error analysis showing information loss rates across depths.
minor comments (2)
- [Abstract] The depth-wise analysis states '26.92% of queries resolve within three seconds (2.45s mean latency)' for d=0; clarify whether the 2.45s figure applies only to the single-step subset or the overall system.
- [Evaluation] Notation for 'd=0' and depth scaling (token consumption scales by 4.9x across depths) should be defined explicitly in the main text with a table or figure reference.
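For reference, the two headline metrics can be computed as follows for multi-label intent predictions. This is the standard formulation of both metrics, not code from the paper; the example intent labels are made up.

```python
# Subset Accuracy counts a query correct only when the predicted intent
# set matches the gold set exactly; Micro-F1 pools true/false positives
# over all intent labels. This is why the two diverge sharply on
# multi-intent data (e.g. 29.07% vs 71.79% in the reported results).

def subset_accuracy(gold: list[set[str]], pred: list[set[str]]) -> float:
    """Fraction of queries whose predicted intent set is exactly right."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def micro_f1(gold: list[set[str]], pred: list[set[str]]) -> float:
    """F1 over label decisions pooled across all queries."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy example: one exact match, one spurious intent, one missed intent.
gold = [{"pay", "balance"}, {"book"}, {"cancel", "refund"}]
pred = [{"pay", "balance"}, {"book", "pay"}, {"cancel"}]
```

On this toy set, Subset Accuracy is 1/3 (only the first query's set matches exactly) while Micro-F1 is 0.8, since most individual label decisions are correct.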
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to strengthen the empirical support and transparency of our claims.
read point-by-point responses
- Referee: [Abstract] Abstract and depth-wise analysis: no accuracy, precision, or error-rate metrics are reported for the Query Tree Classifier's depth predictions or Query Complexity Index. This is load-bearing for the central claim, as the 9.7% Subset Accuracy lift and efficiency gains are attributed to adaptive routing between d=0 and deeper paths; without classifier performance data, the improvements could stem primarily from the pruning/reranking modules rather than adaptivity, and misrouted queries would erode the claimed benefits.
Authors: We acknowledge that direct performance metrics for the Query Tree Classifier (accuracy, precision, recall, or error rates on depth predictions and the Query Complexity Index) are not reported, which limits the ability to fully isolate the contribution of adaptive routing. The provided depth-wise breakdowns offer indirect support, but we agree this is insufficient. In the revised manuscript, we will add a dedicated evaluation subsection reporting the classifier's accuracy, F1 scores, and misrouting analysis on a held-out validation set, along with a breakdown of how routing decisions correlate with the observed gains. This will clarify that improvements stem from complexity-aware routing rather than pruning or reranking alone. revision: yes
- Referee: [Evaluation] Evaluation description: the abstract states specific numerical improvements (29.07% Subset Accuracy, 71.79% Micro-F1, 37.6% latency reduction) but provides no details on baseline definitions, statistical significance tests, error bars, data splits, or how the Query Complexity Index was trained/validated. This leaves the central empirical claims weakly supported.
Authors: We agree that the evaluation section lacks sufficient detail to fully substantiate the claims. While the manuscript describes the NLU++ benchmark and fixed-depth comparisons, we will expand it to explicitly define all baselines (single-step retrieval and fixed-depth variants at d=1,2,3), report statistical significance via paired t-tests with p-values, include error bars from multiple random seeds or cross-validation, detail the train/validation/test splits, and describe the supervised training of the Query Complexity Index using annotated linguistic features. These additions will be incorporated to improve reproducibility and rigor. revision: yes
- Referee: [Method] Adaptive Pruning Module description: the claim that two-stage filtering (quantitative similarity gating + semantic relevance) suppresses exponential growth while preserving all relevant information lacks supporting ablations or error analysis showing information loss rates across depths.
Authors: This is a valid observation. The method describes the two-stage pruning but does not include ablations or quantitative analysis of information preservation. We will add ablation studies in the experiments section comparing the full module to variants omitting each stage, and report metrics such as recall of relevant sub-queries (via overlap with ground-truth intents) and information loss rates at different depths. This will provide concrete evidence that relevant information is retained while controlling computational growth. revision: yes
Circularity Check
No significant circularity; the empirical results are presented without self-referential derivations.
full rationale
The paper describes an empirical retrieval architecture with four components (Query Tree Classifier, Tree-Based Retrieval, Adaptive Pruning, Reranking) and reports benchmark metrics (29.07% Subset Accuracy, 71.79% Micro-F1, latency reductions) as direct evaluation outcomes on NLU++ data. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. Depth-wise breakdowns and complexity routing are presented as observed results rather than derived by construction from inputs. The central claims rest on external benchmark evaluation, not internal reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- Query Complexity Index weights
invented entities (2)
- Query Complexity Index: no independent evidence
- Adaptive Pruning Module: no independent evidence
Reference graph
Works this paper leans on
- [1] A. Singh, H. Tiwari, A. Kedia, A. Mishra, and T. Chakraborty, "Agentic retrieval-augmented generation: A survey on agentic RAG," 2025. [Online]. Available: https://arxiv.org/abs/2501.09136
- [2] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474.
- [3] H.-K. Yoo and N. Moon, "ToR-RAG: A tree-of-retrieval-based retrieval-augmented generation for complex question processing," Journal of The Korea Society of Computer and Information, vol. 30, no. 10, pp. 23–31, Oct. 2025.
- [4] S. Jeong, S. Baek, S. Cho, S. J. Hwang, and J. C. Park, "Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity," in Proc. NAACL, Mexico City, Mexico, 2024, pp. 7036–7050.
- [5] S. Yan, J. Gu, Y. Zhu, and Z. Ling, "Corrective retrieval augmented generation," 2024. [Online]. Available: https://arxiv.org/abs/2401.15884
- [6] I. Casanueva, T. Temčinas, D. Gerz, M. Henderson, and I. Vulić, "NLU++: A multi-label, slot-rich, generalisable dataset for natural language understanding in task-oriented dialogue," in Findings of NAACL 2022, Seattle, USA, 2022, pp. 1998–2013.
- [7] H.-K. Yoo, "Design and verification of Adaptive ToR structure via query complexity awareness," Ph.D. thesis, Hoseo University, Seoul, Republic of Korea, February 2026. [Online]. Available: http://www.dcollection.net/handler/hoseo/200000961944 [in Korean].
- [8] H.-K. Yoo, W. Kim, and N. Moon, "ToR-Lite: A lightweight semantic query decomposition for multi-hop retrieval-augmented generation in cloud-based AI systems," Applied Sciences, vol. 16, no. 8, article 3966, 2026. https://doi.org/10.3390/app16083966
- [9] G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, "Atlas: Few-shot learning with retrieval augmented language models," Journal of Machine Learning Research, vol. 24, no. 251, pp. 1–43, 2023.
- [10] L. Zheng, W. L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica, "Judging LLM-as-a-judge with MT-bench and chatbot arena," in Advances in Neural Information Processing Systems (NeurIPS), vol. 37, 2024, pp. 46595–46623.
- [11] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, "Self-RAG: Learning to retrieve, generate, and critique through self-reflection," in Proc. ICLR, Vienna, Austria, 2024.
- [12] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, and J. Larson, "From local to global: A graph RAG approach to query-focused summarization," 2024. [Online]. Available: https://arxiv.org/abs/2404.16130
- [13] L. Wang, N. Yang, and F. Wei, "Query2doc: Query expansion with large language models," in Proc. EMNLP, Singapore, 2023, pp. 9414–9423.
- [14] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, "Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions," in Proc. ACL, Toronto, Canada, 2023, pp. 10014–10037.
- [15] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, "Measuring and narrowing the compositionality gap in language models," in Findings of EMNLP 2023, Singapore, 2023, pp. 5687–5711.
- [16] X. Li, W. Zhang, Y. Chen, and H. Liu, "CFT-RAG: An entity tree based retrieval augmented generation algorithm with Cuckoo Filter," 2025. [Online]. Available: https://arxiv.org/abs/2501.15098
- [17] M. Fatehkia, J. Kim, and S. Lee, "T-RAG: Lessons from the LLM trenches," 2024. [Online]. Available: https://arxiv.org/abs/2402.07483
- [18] J. Yang and X. Huang, "A tree-based RAG-agent recommendation system: A case study in medical test data." [Online]. Available: https://arxiv.org/abs/2501.02727
- [20] K. Chen, A. K. Kakolyris, R. Nadig, N. Mansouri Ghiasi, M. Frouzakis, H. Mao, J. Gómez-Luna, M. Alser, and O. Mutlu, "REIS: A high-performance and energy-efficient retrieval system with in-storage processing," in Proc. ISCA, Buenos Aires, Argentina, 2025, pp. 1171–1192.
- [21] O. Khattab, K. Santhanam, X. L. Li, D. Hall, P. Liang, C. Potts, and M. Zaharia, "Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP," 2023. [Online]. Available: https://arxiv.org/abs/2212.14024
- [22] R. Oruche, V. Guda, and P. Pathak, "Recent advancements in human-centered dialog systems: A survey," ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.