Bringing Agentic Search to Earth Observation Data Discovery

Chugang Yi; Haizhao Yang; Minghan Yu; Yixin Wen; Youran Sun

arxiv: 2607.02387 · v1 · pith:3T3WRBHSnew · submitted 2026-07-02 · 💻 cs.IR · cs.LG

Minghan Yu, Youran Sun, Chugang Yi, Yixin Wen, Haizhao Yang

Pith reviewed 2026-07-03 06:41 UTC · model grok-4.3

keywords agentic search · earth observation · knowledge graph · information retrieval · NASA datasets · LLM reranking · benchmark · score fusion

The pith

Fusing a fine-tuned neural scorer with BM25 raises Recall@10 and MRR for Earth observation data retrieval by over 5x, and zero-shot LLM reranking adds a further 28% MRR on a 200-query subset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an agentic search service that accepts natural-language queries and returns relevant NASA Earth observation datasets and tools drawn from the NASA EO-KG. A benchmark of 47k query-dataset pairs supports training a neural scorer whose fusion with BM25 raises both Recall@10 and MRR by more than five times over simple baselines. Adding a zero-shot agentic reranking stage that uses LLM reasoning without further training then increases MRR by an additional 28 percent on a stratified subset of 200 queries, which the authors read as evidence that the two retrieval approaches are complementary.
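For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how Recall@10 and MRR are conventionally computed. The function names and the toy dataset IDs are illustrative assumptions; the paper's own evaluation code is not shown on this page.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> Dict[str, float]:
    """Average Recall@k and MRR over every query that has relevance labels."""
    n = len(qrels)
    recall = sum(recall_at_k(runs.get(q, []), rel, k) for q, rel in qrels.items()) / n
    mrr = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in qrels.items()) / n
    return {"recall@k": recall, "mrr": mrr}


# Toy data with hypothetical dataset IDs, not drawn from NASA-EO-Bench.
qrels = {"q1": {"d3"}, "q2": {"d7"}}
runs = {"q1": ["d1", "d3", "d5"], "q2": ["d7", "d2", "d9"]}
metrics = evaluate(runs, qrels)
print(metrics)
```

A "28 percent lift in MRR" is a relative change, so it means the reranked MRR is 1.28 times the MRR of the pipeline before reranking.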

Core claim

The central claim is that the latent value of knowledge graphs for geoscience data discovery can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph the authors derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs including 21k task-based queries. A neural scorer fine-tuned on this benchmark beats cosine and BM25 baselines; score fusion with BM25 raises R@10 and MRR by over 5x; and a zero-shot agentic reranking stage lifts MRR by 28 percent on a stratified N=200 subset.
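The abstract does not say which fusion function is used, so the sketch below is only one common option: a convex combination of min-max normalized scores. The weight `alpha`, the helper names, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so the two scorers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(neural: Dict[str, float], bm25: Dict[str, float], alpha: float = 0.5) -> Dict[str, float]:
    """Convex combination: alpha * neural + (1 - alpha) * BM25.

    A document missing from one scorer's candidate list contributes 0 for that scorer.
    """
    n, b = min_max(neural), min_max(bm25)
    return {
        doc_id: alpha * n.get(doc_id, 0.0) + (1 - alpha) * b.get(doc_id, 0.0)
        for doc_id in set(n) | set(b)
    }


# Hypothetical scores for one query; IDs and values are made up.
neural = {"d1": 0.9, "d2": 0.4, "d3": 0.1}
bm25 = {"d2": 12.0, "d3": 7.0, "d4": 2.0}
fused = fuse(neural, bm25, alpha=0.5)
ranking = sorted(fused, key=lambda d: (-fused[d], d))
print(ranking)
```

In this toy case `d2` overtakes `d1` because it scores reasonably under both scorers, which is the kind of complementarity that fusion is meant to exploit.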

What carries the argument

The hybrid retrieval pipeline that fuses a fine-tuned neural scorer with BM25 and then applies zero-shot LLM agentic reranking on top of the NASA Earth Observation Knowledge Graph.
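The shape of such a two-stage pipeline can be sketched as follows. This is a structural illustration only: the LLM is replaced by a hypothetical keyword-overlap stub so the code runs offline, and the cutoff `k`, the `Reranker` type, and the toy descriptions are assumptions rather than details from the paper.

```python
from typing import Callable, Dict, List, Sequence

# A reranker takes the query and the candidate IDs and returns them reordered.
Reranker = Callable[[str, Sequence[str]], List[str]]


def rerank_top_k(query: str, candidates: List[str], reranker: Reranker, k: int = 20) -> List[str]:
    """Reorder only the top-k first-stage candidates; keep the tail untouched."""
    head, tail = candidates[:k], candidates[k:]
    allowed, seen = set(head), set()
    cleaned: List[str] = []
    # Guard against a reranker that invents, repeats, or drops IDs.
    for doc_id in reranker(query, head):
        if doc_id in allowed and doc_id not in seen:
            cleaned.append(doc_id)
            seen.add(doc_id)
    cleaned.extend(d for d in head if d not in seen)
    return cleaned + tail


# Hypothetical dataset descriptions standing in for knowledge-graph metadata.
DESCRIPTIONS: Dict[str, str] = {
    "d1": "monthly sea surface temperature",
    "d2": "land cover classification",
    "d3": "daily global precipitation",
}


def keyword_overlap_reranker(query: str, doc_ids: Sequence[str]) -> List[str]:
    """Stand-in for an LLM call: rank by word overlap with the query."""
    terms = set(query.lower().split())

    def overlap(doc_id: str) -> int:
        return len(terms & set(DESCRIPTIONS.get(doc_id, "").lower().split()))

    return sorted(doc_ids, key=overlap, reverse=True)


first_stage = ["d1", "d2", "d3", "d4"]
final = rerank_top_k("daily precipitation over land", first_stage, keyword_overlap_reranker, k=3)
print(final)
```

Restricting the reranker to a short head of the list keeps the second stage cheap, and validating its output means a malformed response cannot remove candidates that the first stage found.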

If this is right

The deployed public service can directly help domain experts locate matching datasets and tools from natural language research questions.
Synthetic query pairs generated from the knowledge graph enable effective supervised training for retrieval in this domain.
LLM reasoning in the reranking stage adds measurable value beyond what supervised methods alone achieve.
The fusion gains are reported on NASA-EO-Bench; the reranking gain is reported on a stratified N=200 subset.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same hybrid supervised-plus-zero-shot pattern could be tested on knowledge graphs from other scientific fields that face similar discovery problems.
Real-world usage logs from the public service could serve as a natural next test set to check how well synthetic queries generalize.
The observed complementarity suggests that future systems may routinely combine trained scorers with lightweight agentic stages rather than relying on either approach alone.

Load-bearing premise

The NASA EO-KG faithfully captures all relevant dataset-tool relationships and the 47k synthetic query pairs are representative of real user information needs.

What would settle it

Running the full pipeline on real queries gathered from geoscience researchers and checking whether the reported gains in Recall@10 and MRR persist.

Figures

Figures reproduced from arXiv: 2607.02387 by Chugang Yi, Haizhao Yang, Minghan Yu, Yixin Wen, Youran Sun.

Original abstract

NASA and its data centers hold thousands of geoscience datasets and tools like Worldview, Giovanni, the Science Discovery Engine, and Harmony. Finding the right one is hard even for domain experts. We present an agentic search system, deployed as a public service for the geoscience community, that takes a natural-language research query and returns the matching datasets and tools. We demonstrate that, in the era of large language models, the latent value of knowledge graphs (KGs) can be substantially amplified through agentic search. From the NASA Earth Observation Knowledge Graph (NASA EO-KG) we derive NASA-EO-Bench, an open benchmark of 47k query-dataset pairs (21k task-based queries). A neural scorer fine-tuned on NASA-EO-Bench beats cosine and BM25 baselines. Further combining it with BM25 via score fusion raises both Recall@10 (R@10) and MRR by over 5x. On top of this supervised pipeline, we add a zero-shot agentic reranking stage that, without any additional training, lifts MRR by 28% on a stratified N=200 subset, showing that LLM reasoning is complementary to supervised retrieval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper ships a new EO-specific benchmark from the NASA KG and shows fusion plus zero-shot LLM reranking lifts performance on the synthetic queries it generated.

The core thing to know is that this work turns the NASA Earth Observation Knowledge Graph into an open benchmark of 47k query-dataset pairs and then demonstrates a retrieval pipeline that first fuses a fine-tuned neural scorer with BM25 for large gains, then layers on zero-shot agentic reranking for an additional 28% MRR lift on a 200-example slice.

What stands out is the practical angle: they actually deployed the system as a public service for geoscientists and released the benchmark. The complementarity result between the supervised stage and the LLM reranker is cleanly shown on their data, and the 5x jump from fusion is the kind of concrete number that makes the work usable for others building similar tools.

The soft spot is exactly the one the stress test flags. All queries come from the same KG that powers the retriever, so the measured improvements sit inside a closed synthetic distribution. Real user questions in geoscience often involve ambiguity, multi-hop needs, or relationships the KG does not capture, and nothing in the abstract indicates an external human-authored test set or cross-domain check. The 28% figure also rests on a small N=200 subset with no error bars or protocol details supplied.

This is for IR researchers who want a ready benchmark in a scientific domain and for people building data-discovery tools for NASA-style archives. A reader working on retrieval for specialized corpora would find the fusion numbers and the benchmark itself worth looking at.

I would send it to peer review. The open benchmark and deployed system give it enough substance to justify referee time, even if the generalization claims need more external validation.

Referee Report

2 major / 1 minor

Summary. The paper presents an agentic search system for NASA Earth Observation datasets and tools. From the NASA EO-KG it derives the open NASA-EO-Bench benchmark of 47k synthetic query-dataset pairs (including 21k task-based queries). A neural scorer fine-tuned on the benchmark, when combined with BM25 via score fusion, raises R@10 and MRR by over 5x relative to cosine and BM25 baselines. A zero-shot agentic reranking stage then lifts MRR by an additional 28% on a stratified N=200 subset, with the overall system deployed as a public service.

Significance. If the reported gains prove robust, the work illustrates how KGs can be leveraged at scale through LLM-based agentic reranking for a deployed retrieval service in geoscience. The release of NASA-EO-Bench as an open benchmark is a concrete positive contribution that could support further research on EO data discovery.

major comments (2)

[Abstract] All quantitative claims (5x lift from neural+BM25 fusion; 28% MRR gain from zero-shot agentic reranking) are measured exclusively on NASA-EO-Bench, which is generated from the same NASA EO-KG that supplies the retrieval index and relationships. No external validation set, human-authored query collection, or out-of-distribution test is referenced, so it is unclear whether the measured improvements transfer to real user queries whose ambiguity, multi-hop structure, or tool/dataset relationships may differ from the KG-derived distribution.
[Abstract] The reported results supply no error bars, confidence intervals, statistical significance tests, or full experimental protocol (train/test split details, baseline re-implementations, hyper-parameter search, or verification that the N=200 subset is representative). This absence makes it impossible to assess the reliability or reproducibility of the 5x and 28% figures.
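One way the missing significance test could look is a paired randomization (sign-flip) test on per-query reciprocal ranks, since both systems are scored on the same queries. The sketch below is an assumption about a reasonable protocol, not something the paper reports; the function name and the toy inputs are hypothetical.

```python
import random
from typing import Sequence


def paired_permutation_test(a: Sequence[float], b: Sequence[float],
                            n_resamples: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for the mean per-query difference between systems a and b.

    Under the null hypothesis the two systems are exchangeable, so the sign of
    each per-query difference can be flipped at random.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / n) >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_resamples + 1)


# Toy reciprocal ranks for 20 queries; the values are made up.
with_rerank = [1.0] * 20
without_rerank = [0.5] * 20
p_value = paired_permutation_test(with_rerank, without_rerank)
print(p_value)
```

With only 200 queries, a test of this kind would indicate whether a 28% relative MRR gain is distinguishable from query-to-query noise.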

minor comments (1)

[Abstract] The distinction between the full 47k pairs and the 21k task-based queries is stated but not used to qualify any of the reported metrics; clarifying which subset drives the fusion and reranking results would improve interpretability.
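The per-subset breakdown requested here amounts to grouping per-query scores by query type before averaging. A minimal sketch follows; the group labels "task" and "other" and the toy values are hypothetical, since the page does not describe how the benchmark labels its queries.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mrr_by_group(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average reciprocal rank separately for each query group."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for group, rr in records:
        buckets[group].append(rr)
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


# (query type, reciprocal rank) pairs; labels and numbers are made up.
records = [("task", 1.0), ("task", 0.5), ("other", 0.25)]
breakdown = mrr_by_group(records)
print(breakdown)
```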

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment point by point below, with plans for revision where the concerns are valid.

Referee: [Abstract] All quantitative claims (5x lift from neural+BM25 fusion; 28% MRR gain from zero-shot agentic reranking) are measured exclusively on NASA-EO-Bench, which is generated from the same NASA EO-KG that supplies the retrieval index and relationships. No external validation set, human-authored query collection, or out-of-distribution test is referenced, so it is unclear whether the measured improvements transfer to real user queries whose ambiguity, multi-hop structure, or tool/dataset relationships may differ from the KG-derived distribution.

Authors: We acknowledge that NASA-EO-Bench is derived from the NASA EO-KG, which encodes the authoritative dataset-tool relationships. This design enables scalable generation of 47k pairs with verifiable ground truth that would otherwise require extensive human annotation. The KG reflects real NASA-curated relationships, allowing the benchmark to test recovery of those relationships from natural-language queries. However, we agree that the absence of an external or human-authored validation set is a limitation for assessing generalization to real user queries with potentially different ambiguity or multi-hop structures. In the revision we will expand the discussion and limitations sections to explicitly address this, including potential distribution shifts, and we will outline plans for future collection of a small human-authored test set. We will also update the abstract to note that all reported results are on the KG-derived benchmark.

Revision: partial

Referee: [Abstract] The reported results supply no error bars, confidence intervals, statistical significance tests, or full experimental protocol (train/test split details, baseline re-implementations, hyper-parameter search, or verification that the N=200 subset is representative). This absence makes it impossible to assess the reliability or reproducibility of the 5x and 28% figures.

Authors: We agree that the abstract and results presentation would be strengthened by including these details. In the revised manuscript we will add error bars (via bootstrapping or multiple runs where available), confidence intervals, and statistical significance tests for the key metrics. We will also expand the experimental section with the full protocol: train/test split details, baseline re-implementation notes, hyper-parameter search procedure, and a description of how the stratified N=200 subset was constructed along with evidence of its representativeness relative to the full benchmark.

Revision: yes
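The bootstrapped error bars mentioned in the rebuttal could be computed along these lines. This is a sketch of a standard percentile bootstrap over queries, not the authors' procedure; the resample count, the seed, and the synthetic reciprocal ranks are assumptions.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-query scores."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample queries with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Synthetic reciprocal ranks for 200 queries, generated only for illustration.
rng = random.Random(42)
reciprocal_ranks = [1.0 / rng.randint(1, 10) for _ in range(200)]
low, high = bootstrap_ci(reciprocal_ranks)
print(f"MRR 95% CI: [{low:.3f}, {high:.3f}]")
```

Reporting an interval like this for the N=200 subset would show directly how much of the 28% gain could be attributed to sampling variation.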

Circularity Check

0 steps flagged

No circularity; empirical claims on newly constructed benchmark with no derivations or self-referential reductions.

The paper contains no equations, derivations, or mathematical claims. All performance numbers (5x gains from fusion, 28% MRR lift from agentic reranking) are direct empirical measurements on NASA-EO-Bench, a benchmark explicitly derived from the NASA EO-KG. This is standard construction of a synthetic evaluation set followed by train/test reporting; it does not match any enumerated circularity pattern such as self-definitional relations, fitted inputs renamed as predictions, or load-bearing self-citations. The benchmark and system share a common data source, but that is a generalization concern, not a reduction of the reported results to their own inputs by construction. No steps qualify for flagging.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the system is presented as an engineering application of existing retrieval and LLM components.

