A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

Phongsakon Mark Konrad; Riccardo Terrenzi; Serkan Ayvaz; Tim Lukas Adam

arxiv: 2604.16394 · v1 · submitted 2026-03-28 · 💻 cs.IR · cs.AI

Pith reviewed 2026-05-14 21:08 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords agentic retrieval · hybrid search · dataset search · LLM orchestration · reference architecture · reciprocal rank fusion · metadata augmentation · ReAct agent

The pith

A reference architecture for agentic hybrid retrieval combines BM25 lexical search with dense embeddings via reciprocal rank fusion, orchestrated by an LLM agent that plans queries, evaluates results, and reranks candidates while augmenting metadata records offline with LLM-generated pseudo-queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to reposition dataset search as a software architecture problem rather than a pure information-retrieval task. It claims that an LLM-orchestrated hybrid system, which repeatedly plans queries, checks result sufficiency, and fuses lexical and embedding rankings, can close the gap between underspecified natural-language queries and sparse provider metadata. An offline step that generates pseudo-queries for each dataset record further reduces vocabulary mismatch before any user query arrives. The design is deliberately bounded and auditable so that nondeterministic LLM behavior can still be governed and observed. Two concrete styles are compared: a single ReAct agent and a multi-agent horizontal setup with feedback control, with explicit analysis of trade-offs in modifiability, observability, performance, and governance.

Core claim

By treating dataset search as an architecture problem, the authors introduce a bounded reference architecture that augments each metadata record offline with LLM-generated pseudo-queries, then runs hybrid retrieval (BM25 plus dense embeddings fused by reciprocal rank fusion) under the control of an LLM agent that plans, evaluates sufficiency, and reranks; the architecture is instantiated in both single-agent and multi-agent forms and equipped with an evaluation framework of seven variants that isolate each design decision.
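The fusion step named in the core claim can be illustrated with a minimal sketch of reciprocal rank fusion. The dataset ids and the two input rankings are invented for illustration, and the constant `k = 60` is the value commonly used following the original RRF paper; the summary above does not say which constant the authors chose.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document receives sum(1 / (k + rank)) over the lists that contain
    it, with ranks starting at 1. k=60 is an assumed default here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical outputs of a BM25 retriever and a dense retriever.
bm25_ranking = ["ds_air", "ds_traffic", "ds_noise"]
dense_ranking = ["ds_traffic", "ds_weather", "ds_air"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}\t{score:.5f}")
```

Because RRF uses only rank positions, it needs no score normalization between the lexical and dense retrievers, which is one reason it is a common choice for hybrid search.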

What carries the argument

The LLM agent that repeatedly plans queries, evaluates result sufficiency, and reranks candidates, combined with offline pseudo-query augmentation of the indexes and reciprocal rank fusion of BM25 and dense scores.
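A bounded version of this plan, retrieve, evaluate, rerank loop might look like the sketch below. It is not the authors' implementation: the function names, the iteration cap, and the toy stand-ins for the LLM and the retriever are all assumptions chosen to show how a hard bound and an audit trace can wrap nondeterministic steps.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    """Audit log of every loop iteration (hypothetical structure)."""
    steps: List[dict] = field(default_factory=list)


def run_bounded_agent(user_query, plan, retrieve, is_sufficient, rerank,
                      max_iterations=3):
    """Run plan -> retrieve -> sufficiency check at most max_iterations times."""
    trace = Trace()
    candidates = []
    for iteration in range(1, max_iterations + 1):
        query = plan(user_query, candidates)           # LLM call in a real system
        for doc in retrieve(query):                    # hybrid retrieval in a real system
            if doc not in candidates:
                candidates.append(doc)
        sufficient = is_sufficient(user_query, candidates)
        trace.steps.append({"iteration": iteration, "query": query,
                            "n_candidates": len(candidates),
                            "sufficient": sufficient})
        if sufficient:
            break
    return rerank(user_query, candidates), trace


# Toy stand-ins so the sketch runs without an LLM or a search index.
TOY_INDEX = {
    "air quality": ["ds_air"],
    "air quality measurements": ["ds_air", "ds_pm25", "ds_ozone"],
}

def toy_plan(user_query, candidates):
    return user_query if not candidates else user_query + " measurements"

def toy_retrieve(query):
    return TOY_INDEX.get(query, [])

def toy_is_sufficient(user_query, candidates):
    return len(candidates) >= 3

def toy_rerank(user_query, candidates):
    return sorted(candidates)


results, trace = run_bounded_agent("air quality", toy_plan, toy_retrieve,
                                   toy_is_sufficient, toy_rerank)
print(results)
print(trace.steps)
```

The iteration cap guarantees termination even if the sufficiency judgment never returns true, and the trace records each planned query so a run can be inspected afterwards.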

If this is right

The seven-variant evaluation framework isolates the contribution of each architectural choice so future work can measure incremental gains.
Explicit governance tactics bound the nondeterministic LLM components, making the system auditable for production dataset catalogs.
Single ReAct versus multi-agent horizontal styles produce different quality-attribute profiles for modifiability and observability.
Offline metadata augmentation becomes a reusable preprocessing step that can be applied to any existing retrieval index.
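As a rough illustration of the offline augmentation step, the sketch below attaches pseudo-queries to a metadata record and builds the text that would be fed to both indexes. The record fields, the `index_text` key, and the `stub_generator` function (which stands in for an LLM call) are hypothetical; the summary does not describe the authors' record schema or prompts.

```python
def augment_record(record, generate_pseudo_queries, max_queries=3):
    """Return a copy of a metadata record extended with pseudo-queries.

    generate_pseudo_queries is any callable mapping a record to a list of
    query strings; in a real pipeline it would wrap an LLM call.
    """
    pseudo = generate_pseudo_queries(record)[:max_queries]
    augmented = dict(record)  # leave the provider's record untouched
    augmented["pseudo_queries"] = pseudo
    # Text that would be indexed by both the lexical and the dense retriever.
    parts = [record.get("title", ""), record.get("description", ""), *pseudo]
    augmented["index_text"] = " ".join(p for p in parts if p)
    return augmented


def stub_generator(record):
    """Deterministic placeholder for the LLM augmentor (assumption)."""
    title = record["title"].lower()
    return [f"where can I find {title}", f"data about {title}"]


record = {"id": "ds_air", "title": "Air Quality",
          "description": "Hourly PM2.5 readings."}
augmented = augment_record(record, stub_generator)
print(augmented["index_text"])
```

Keeping the pseudo-queries in a separate field, rather than overwriting the description, makes it possible to audit or regenerate them without losing the provider-authored metadata.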

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounded orchestration pattern could be tested on other sparse-metadata domains such as scientific publication search or open-data portals.
If the agent reliably detects insufficiency, the architecture naturally supports iterative query refinement loops that current one-shot retrievers lack.
The reference design supplies a concrete template for inserting governance checkpoints into any LLM-driven retrieval pipeline.

Load-bearing premise

The assumption that an LLM agent can reliably judge whether retrieved results are sufficient and that the offline pseudo-queries will reduce vocabulary mismatch without adding new errors.

What would settle it

Running the seven defined system variants on a standard dataset-search benchmark and finding no measurable lift in standard retrieval metrics (such as nDCG or recall) when the LLM orchestration or pseudo-query augmentation is added.
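The metrics mentioned here can be computed as in the following sketch, which uses one common formulation of nDCG (linear gain with a log2 rank discount) together with recall at a cutoff. The relevance judgments and rankings are invented; no benchmark results are implied.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded gains."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k, where relevance maps doc id -> graded judgment (0 if absent)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(relevance.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevance, k):
    """Fraction of relevant docs (judgment > 0) found in the top k."""
    relevant = {d for d, g in relevance.items() if g > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


# Hypothetical judgments for one query and two system variants.
relevance = {"a": 2, "b": 1}
baseline_run = ["c", "b", "a"]
augmented_run = ["a", "b", "c"]

for name, run in [("baseline", baseline_run), ("augmented", augmented_run)]:
    print(name, round(ndcg_at_k(run, relevance, 3), 4),
          recall_at_k(run, relevance, 2))
```

Comparing such per-variant scores across a full query set is what would show whether orchestration or augmentation yields a measurable lift.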

Figures

Figures reproduced from arXiv: 2604.16394 by Phongsakon Mark Konrad, Riccardo Terrenzi, Serkan Ayvaz, Tim Lukas Adam.

**Figure 2.** Offline metadata augmentation pipeline.

**Figure 3.** Multi Specialized-Agent Pipeline.

Original abstract

Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a software-architecture problem and propose a bounded, auditable reference architecture for agentic hybrid retrieval that combines BM25 lexical search with dense-embedding retrieval via reciprocal rank fusion (RRF), orchestrated by a large language model (LLM) agent that repeatedly plans queries, evaluates the sufficiency of results, and reranks candidates. To reduce the vocabulary mismatch between user intent and provider-authored metadata, we introduce an offline metadata augmentation step in which an LLM generates pseudo-queries for each dataset record, augmenting both retrieval indexes before query time. Two architectural styles are examined: a single ReAct agent and a multi-agent horizontal architecture with Feedback Control. Their quality-attribute tradeoffs are analyzed with respect to modifiability, observability, performance, and governance. An evaluation framework comprising seven system variants is defined to isolate the contribution of each architectural decision. The architecture is presented as an extensible reference design for the software architecture community, incorporating explicit governance tactics to bound and audit nondeterministic LLM components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean reference architecture for LLM-driven hybrid retrieval in dataset search with two agent styles and a seven-variant evaluation plan, but the LLM sufficiency checks lack any concrete rubric or prompt details.

The main takeaway is a reference architecture that treats dataset search as an engineering problem: it wires BM25 and dense retrieval together via reciprocal rank fusion, lets an LLM agent plan queries and decide when results are good enough, and adds offline pseudo-query generation to close metadata vocabulary gaps. The authors contrast a single ReAct agent with a multi-agent feedback-control version, weighing the two against modifiability, observability, performance, and governance, and lay out seven system variants to isolate the contribution of each design decision.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a reference architecture for agentic hybrid retrieval in dataset search. It combines BM25 lexical search with dense-embedding retrieval using reciprocal rank fusion (RRF), orchestrated by an LLM agent that plans queries, evaluates result sufficiency, and reranks candidates. An offline step generates pseudo-queries for metadata augmentation. Two styles are examined: single ReAct agent and multi-agent with Feedback Control. Quality attributes like modifiability, observability, performance, and governance are analyzed, and an evaluation framework with seven variants is defined to isolate contributions of each decision. The architecture is presented as extensible with governance tactics to bound nondeterminism.

Significance. If the architecture can be realized with concrete, auditable LLM controls, the work would offer a useful extensible reference design for the software-architecture and information-retrieval communities. The explicit treatment of quality-attribute tradeoffs and governance tactics for nondeterministic components provides a structured way to address vocabulary mismatch in sparse dataset metadata, even if immediate empirical gains remain to be demonstrated.

major comments (2)

[Evaluation Framework] The manuscript defines an evaluation framework comprising seven system variants to isolate the contribution of each architectural decision, yet supplies no retrieval metrics, ablation results, or error analysis. This leaves the central claim that the architecture improves dataset search unverified.
[Agentic Orchestration] The LLM agent's repeated evaluation of result sufficiency (in both ReAct and multi-agent variants) is described as a core loop, but no explicit decision procedure, scoring rubric, threshold, or prompt template is supplied. Without these, the governance tactics cannot enforce the claimed bounds on nondeterminism.

minor comments (1)

[Abstract] The abstract is concise but could state the number of variants and the four quality attributes earlier to better orient readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments. Below we provide point-by-point responses to the major comments and describe the revisions we intend to make in the next version of the manuscript.

Referee: The manuscript defines an evaluation framework comprising seven system variants to isolate the contribution of each architectural decision, yet supplies no retrieval metrics, ablation results, or error analysis. This leaves the central claim that the architecture improves dataset search unverified.

Authors: The manuscript is framed as a reference-architecture contribution whose primary deliverables are the bounded design, the two orchestration styles, the quality-attribute analysis, and the seven-variant evaluation framework itself. No empirical claim of improvement is made; the framework is defined precisely so that future work can isolate each decision through controlled ablations. We will add an explicit scope statement in the abstract, introduction, and conclusion clarifying that empirical validation lies outside the present paper and is reserved for subsequent studies.

revision: partial

Referee: The LLM agent's repeated evaluation of result sufficiency (in both ReAct and multi-agent variants) is described as a core loop, but no explicit decision procedure, scoring rubric, threshold, or prompt template is supplied. Without these, the governance tactics cannot enforce the claimed bounds on nondeterminism.

Authors: We agree that the sufficiency-evaluation loop requires concrete specification to make the governance tactics fully auditable. In the revision we will add (1) the exact prompt template used for the sufficiency judgment, (2) a deterministic decision procedure that thresholds on result cardinality, RRF aggregate score, and a binary metadata-coverage flag, and (3) a short scoring rubric. These additions will be placed in a new subsection on governance controls and will be referenced from the ReAct and multi-agent descriptions.

revision: yes
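A deterministic check of the kind described in this simulated response could be sketched as follows. Every threshold value, the reading of "aggregate score" as a plain sum, and all names are assumptions for illustration; neither the paper summary nor the rebuttal specifies them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SufficiencyPolicy:
    """Hypothetical thresholds; real values would need tuning and review."""
    min_results: int = 5
    min_aggregate_rrf: float = 0.05


def is_sufficient(rrf_scores, metadata_covered, policy=SufficiencyPolicy()):
    """Return (decision, per-check outcomes) for one retrieval round.

    rrf_scores: fused scores of the current candidates.
    metadata_covered: binary flag, e.g. whether required metadata fields
    are present on the top candidates (assumed meaning).
    """
    checks = {
        "cardinality": len(rrf_scores) >= policy.min_results,
        "aggregate_rrf": sum(rrf_scores) >= policy.min_aggregate_rrf,
        "metadata_coverage": bool(metadata_covered),
    }
    # All checks must pass; the dict is returned so the decision is auditable.
    return all(checks.values()), checks


decision, checks = is_sufficient([0.03, 0.03, 0.02, 0.02, 0.02], True)
print(decision, checks)
```

Returning the individual check outcomes alongside the decision lets a governance layer log why a round was judged sufficient or not, independently of any LLM output.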

Circularity Check

0 steps flagged

No circularity: forward architectural proposal with no derivations or fitted predictions

The paper presents a reference architecture for agentic hybrid retrieval combining BM25, dense embeddings, RRF, and LLM-orchestrated planning without any equations, parameter fitting, or predictive derivations. All elements (ReAct/multi-agent styles, offline pseudo-query augmentation, governance tactics) are introduced as explicit design decisions rather than outputs derived from the same data or self-referential loops. No self-citations serve as load-bearing uniqueness theorems, and the evaluation framework isolates architectural variants without reducing claims to fitted inputs. The design is therefore self-contained as a software-architecture contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The architecture rests on domain assumptions about LLM capabilities rather than new mathematical axioms or fitted parameters. No free parameters or invented entities are introduced.

axioms (2)

domain assumption: LLM agents can repeatedly plan queries, evaluate result sufficiency, and rerank candidates effectively enough to improve retrieval.
Invoked in the description of the agent orchestration step.

domain assumption: Offline LLM-generated pseudo-queries reduce vocabulary mismatch between user intent and provider metadata.
Central justification for the metadata augmentation step.

pith-pipeline@v0.9.0 · 5510 in / 1447 out tokens · 33913 ms · 2026-05-14T21:08:34.746556+00:00 · methodology


[1] [1]

Google dataset search: Building a search engine for datasets in an open web ecosystem,

N. Noy, M. Burgess, and D. Brickley, “Google dataset search: Building a search engine for datasets in an open web ecosystem,” inProceedings of The Web Conference (WWW), 2019, pp. 1365–1375. 2https://openai.com/ 3https://www.anthropic.com/ 4https://www.google.com/ 5https://www.kimi.com/ 6https://qwen.ai

work page 2019

[2] [2]

Auctus: A dataset search engine for data discovery and augmentation,

S. Castelo, R. Rampin, A. Santos, A. Freire, and J. Freire, “Auctus: A dataset search engine for data discovery and augmentation,”Proceedings of the VLDB Endowment, vol. 14, no. 12, pp. 2791–2794, 2021

work page 2021

[3] [3]

Dataset search: A survey,

A. Chapman and E. Simperl, “Dataset search: A survey,” inProceedings of the 2019 International Conference on Information and Knowledge Management (CIKM), 2019

work page 2019

[4] [4]

ACM Computing Surveys56(4), 1–37 (Apr 2024).https://doi.org/10.1145/3626521

N. W. Paton, J. Chen, and Z. Wu, “Dataset discovery and exploration: A survey,”ACM Comput. Surv., vol. 56, no. 4, pp. 102:1–102:37, 2024. [Online]. Available: https://doi.org/10.1145/3626521

work page doi:10.1145/3626521 2024

[5] [5]

Keywords are not always the key: A metadata field analysis for natural language search on open data portals,

L.-Y . Gan, A. Das, J. Walker, and E. Simperl, “Keywords are not always the key: A metadata field analysis for natural language search on open data portals,”arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2509.14457

work page arXiv 2025

[6] [6]

Is ChatGPT good at search? investigating large language models as re-ranking agents,

W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, “Is ChatGPT good at search? investigating large language models as re-ranking agents,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023, pp. 14 918–14 937

work page 2023

[7] [7]

Towards responsible generative AI: A reference architecture for designing foundation model based agents,

Q. Lu, L. Zhu, X. Xu, Z. Xing, S. Harrer, and J. Whittle, “Towards responsible generative AI: A reference architecture for designing foundation model based agents,”arXiv, 2024. [Online]. Available: https://arxiv.org/abs/2311.13148

work page arXiv 2024

[8] [8]

It took longer than I was expecting: Why is dataset search still so hard?

M. Hulsebos, W. Lin, S. Shankar, and A. G. Parameswaran, “It took longer than I was expecting: Why is dataset search still so hard?” inProceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics (HILDA@SIGMOD). ACM, 2024, pp. 1–4. [Online]. Available: https://doi.org/10.1145/3665939.3665959

work page doi:10.1145/3665939.3665959 2024

[9] [9]

Contrastive trajectory similarity learning with dual-feature attention

S. Galhotra, Y . Gong, and R. C. Fernandez, “Metam: Goal-oriented data discovery,” in39th IEEE International Conference on Data Engineering (ICDE). IEEE, 2023, pp. 2780–2793. [Online]. Available: https://doi.org/10.1109/ICDE55515.2023.00213

work page doi:10.1109/icde55515.2023.00213 2023

[10] [10]

In41st IEEE International Conference on Data Engineering, ICDE 2025, Hong Kong, May 19-23, 2025

M. Esmailoghli, C. Schnell, R. J. Miller, and Z. Abedjan, “BLEND: A unified data discovery system,” in41st IEEE International Conference on Data Engineering (ICDE). IEEE, 2025, pp. 737–750. [Online]. Available: https://doi.org/10.1109/ICDE65448.2025.00061

work page doi:10.1109/icde65448.2025.00061 2025

[11] [11]

Retrieval-Augmented Generation for Large Language Models: A Survey

Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,”arXiv, 2023. [Online]. Available: https://arxiv.org/abs/2312.10997

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Cluster-based partial dense retrieval fused with sparse text retrieval,

Y . Yang, P. Carlson, S. He, Y . Qiao, and T. Yang, “Cluster-based partial dense retrieval fused with sparse text retrieval,” inProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’24), Jul. 2024. [Online]. Available: https://doi.org/10.1145/3626772.3657972

work page doi:10.1145/3626772.3657972 2024

[13] [13]

Query rewriting in retrieval-augmented large language models,

X. Ma, Y . Gong, P. He, H. Zhao, and N. Duan, “Query rewriting in retrieval-augmented large language models,” inProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Dec. 2023, pp. 5303–5315. [Online]. Available: https://aclanthology. org/2023.emnlp-main.322/

work page 2023

[14] [14]

Rag-fusion: a new take on retrieval-augmented generation,

Z. Rackauckas, “Rag-fusion: a new take on retrieval-augmented generation,”arXiv, 2024. [Online]. Available: https://arxiv.org/abs/2402. 03367

work page 2024

[15] [15]

Self-rag: Learning to retrieve, generate, and critique through self-reflection,

A. Asai, Z. Wu, Y . Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” inInternational Conference on Learning Representations (ICLR),

work page

[16] [16]

Available: https://proceedings.iclr.cc/paper files/paper/ 2024/file/25f7be9694d7b32d5cc670927b8091e1-Paper-Conference.pdf

[Online]. Available: https://proceedings.iclr.cc/paper files/paper/ 2024/file/25f7be9694d7b32d5cc670927b8091e1-Paper-Conference.pdf

work page 2024

[17] [17]

Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents,

Y . Liu, S. K. Lo, Q. Lu, L. Zhu, D. Zhao, X. Xu, S. Harrer, and J. Whittle, “Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents,”Journal of Systems and Software, vol. 220, p. 112278, 2025. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0164121224003224

work page 2025

[18] [18]

Agentarceval: An architecture evaluation method for foundation model based agents,

Q. Lu, D. Zhao, Y . Liu, H. Zhang, L. Zhu, X. Xu, A. Shi, T. Tan, and R. Kazman, “Agentarceval: An architecture evaluation method for foundation model based agents,”arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2510.21031

work page arXiv 2025

[19] [19]

Mixture-of-agents enhances large language model capabilities,

J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou, “Mixture-of-agents enhances large language model capabilities,” in International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https://proceedings.iclr.cc/paper files/paper/2025/ file/5434be94e82c54327bb9dcaf7fca52b6-Paper-Conference.pdf

[20]

Gradientsys: A multi-agent llm scheduler with react orchestration,

X. Song, H. Wang, Y. Chen et al., “Gradientsys: A multi-agent llm scheduler with react orchestration,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2507.06520

[21]

MI9 – agent intelligence protocol: Runtime governance for agentic AI systems,

C. L. Wang, T. Singhal, A. Kelkar, and J. Tuo, “MI9 – agent intelligence protocol: Runtime governance for agentic AI systems,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2508.03858

[22]

X-WebAgentBench: A multilingual interactive web benchmark for evaluating global agentic system,

P. Wang, R. Tao, Q. Chen, M. Hu, and L. Qin, “X-WebAgentBench: A multilingual interactive web benchmark for evaluating global agentic system,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 19 320–19 335. [Online]. Available: https://aclanthology.org/2025.f...

[23]

Locobench-agent: An interactive benchmark for LLM agents in long-context software engineering,

J. Qiu, Z. Liu, Z. Liu, R. Murthy, J. Zhang, H. Chen, S. Wang, M. Zhu, L. Yang, J. Tan, R. Ram, A. Prabhakar, T. Awalgaonkar, Z. Chen, Z. Cen, C. Qian, S. Heinecke, W. Yao, S. Savarese, C. Xiong, and H. Wang, “Locobench-agent: An interactive benchmark for LLM agents in long-context software engineering,” arXiv, 2025. [Online]. Available: https://arxiv.org/...

[24]

RA-ISF: Learning to answer and understand from retrieval augmentation via iterative self-feedback,

Y. Liu, X. Peng, X. Zhang, W. Liu, J. Yin, J. Cao, and T. Du, “RA-ISF: Learning to answer and understand from retrieval augmentation via iterative self-feedback,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 4730–4749

[25]

Auto-RAG: Autonomous retrieval-augmented generation for large language models,

T. Yu, S. Zhang, and Y. Feng, “Auto-RAG: Autonomous retrieval-augmented generation for large language models,” 2024

[26]

Crafting the path: Structured query rewriting for robust information retrieval,

S. Mackie, D. Liu, and S. Culpepper, “Crafting the path: Structured query rewriting for robust information retrieval,” arXiv:2407.12529, 2024

[27]

A test collection for ad-hoc dataset retrieval,

M. P. Kato, H. Ohshima, Y. Liu, and H. Chen, “A test collection for ad-hoc dataset retrieval,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, 2021, pp. 2450–2456

[28]

ACORDAR 2.0: The largest test collection for ad hoc dataset retrieval,

M. Risch, N. Reusch, A. Schneiberg, and P. Müller, “ACORDAR 2.0: The largest test collection for ad hoc dataset retrieval,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2024

[29]

ReAct: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

[30]

Reciprocal rank fusion outperforms condorcet and individual rank learning methods,

G. V. Cormack, C. L. A. Clarke, and S. Büttcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009, pp. 758–759

[31]

An analysis of fusion functions for hybrid retrieval,

S. Bruch, S. Gai, and A. Ingber, “An analysis of fusion functions for hybrid retrieval,” ACM Transactions on Information Systems, vol. 42, no. 1, pp. 1–35, 2023

[32]

Autoddg: Automated dataset description generation using large language models,

H. Zhang, Y. Liu, A. Santos, W.-L. A. Hung, and J. Freire, “Autoddg: Automated dataset description generation using large language models,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2502.01050

[33]

Improving table retrieval with question generation from partial tables,

H.-P. Liang, C.-W. Chang, and Y.-C. Fan, “Improving table retrieval with question generation from partial tables,” in Proceedings of the 4th Table Representation Learning Workshop, 2025, pp. 217–228

[34]

Precise zero-shot dense retrieval without relevance labels,

W. Gao et al., “Precise zero-shot dense retrieval without relevance labels,” arXiv:2212.10496, 2022

[35]

When single-agent with skills replace multi-agent systems and when they fail,

X. Li et al., “When single-agent with skills replace multi-agent systems and when they fail,” arXiv, 2026. [Online]. Available: https://arxiv.org/abs/2601.04748

[36]

Cumulated gain-based evaluation of IR techniques,

K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of IR techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002

[37]

TARGET: A benchmark for table retrieval for generative tasks,

Y. Zhang et al., “TARGET: A benchmark for table retrieval for generative tasks,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
