Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Alon Halevy; Natasha Noy; Shiyu Chen; Tarfah Alrashed

arxiv: 2605.28787 · v1 · pith:YGKTUBGVnew · submitted 2026-05-27 · 💻 cs.IR · cs.AI

Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval

Shiyu Chen , Tarfah Alrashed , Alon Halevy , Natasha Noy This is my paper

Pith reviewed 2026-06-29 09:34 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords semantic metadataagentic data retrievalFAIR principlesdataset discoveryschema.orgLLM evaluationcomparative study

0 comments

The pith

Semantic metadata enables agents to retrieve FAIR-compliant datasets with 65.7 percent higher precision than open-web search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper pits a baseline agent searching billions of unstructured web pages against a semantic agent drawing from 90 million schema.org-annotated datasets. An LLM-as-a-judge pipeline scores both on FAIR criteria of relevance, accessibility, and utility. The semantic agent returns more metadata-rich registries and machine-readable downloads while the baseline often stops at prose pages or landing pages. Although the baseline answers more questions overall, the semantic version produces markedly more precise, execution-ready results. The work concludes that structured metadata remains necessary for reliable agentic data workflows even as LLMs improve unstructured retrieval.

Core claim

The Semantic Agent achieves 65.7 percent higher overall precision in retrieving FAIR-compliant datasets, including 44.9 percent higher precision on metadata-rich registries and 46.6 percent higher precision on pages with machine-readable downloads, while the Baseline Agent returns prose-heavy pages in 20.1 percent of results and portal landing pages in 8.5 percent of results.

What carries the argument

The side-by-side comparison of Baseline Agent (open-web retrieval) and Semantic Agent (schema.org corpus retrieval) scored by an LLM-as-a-judge pipeline mapped directly to FAIR principles.

If this is right

Structured semantic metadata is required for reliable execution-oriented autonomous data workflows.
Unstructured web retrieval supports broad coverage but produces frequent last-mile utility failures.
Semantic registries deliver higher accuracy on downloadable, machine-actionable datasets.
Baseline agents answer more questions but at the cost of lower precision on FAIR-compliant results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid retrieval systems that combine open-web breadth with semantic filtering could improve both coverage and precision.
The results imply that continued investment in schema.org-style annotation will be needed as agent use grows.
The same comparative method could be applied to other domains such as code repositories or scientific literature.

Load-bearing premise

The LLM judge accurately and without bias maps retrieved items to semantic relevance, data accessibility, and computational utility under the FAIR principles.

What would settle it

A human-expert re-evaluation of the same retrieved items that finds no meaningful precision gap between the two agents.

Figures

Figures reproduced from arXiv: 2605.28787 by Alon Halevy, Natasha Noy, Shiyu Chen, Tarfah Alrashed.

**Figure 1.** Figure 1: Comparative System Architecture. Similar agent logic is evaluated across unstructured Baseline Agent and Semantic Agent dataset search environments. Both feed a unified, FAIR-aligned evaluation of relevance, accessibility, and utility. 3.1 Agentic Framework and Setup To ensure experimental parity, we contrast a Semantic Agent against a Baseline Agent using identical underlying architectures ( [PITH_FULL… view at source ↗

**Figure 2.** Figure 2: Comparison of the agent results by relevance scores. semantic metadata, the agent avoided non-computational roadblocks, achieving relative reductions of 46.6% in Narrative/Unstructured Data (data embedded in narrative prose), 62.9% in Presentation-Bounded Data (pages with only charts or interactive dashboards, without metadata), and 76.3% in Non-Data (false positive pages lacking data entirely) [PITH_FULL… view at source ↗

**Figure 3.** Figure 3: Comparison of the agent results by data accessibility levels 5.3 Dataset Page Type The distribution of retrieved dataset type pages varied between the two systems ( [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Comparison of the agent results by dataset page type 5.4 Agentic FAIRness To evaluate an agent’s capacity to identify machine-actionable registry entries that ensure metadata compliance and direct data accessibility. We define a dataset as fully “FAIR-compliant” if it achieves a perfect composite score across three criteria: a relevance score of 2 (Highly Relevant), Dataset Accessibility at Level 6 (Machin… view at source ↗

read the original abstract

In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an "LLM-as-a-judge" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers "Last-Mile Utility" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Semantic metadata helps precision in this agent comparison, but every number depends on an unvalidated LLM judge.

read the letter

The paper's core finding is that an agent using a schema.org corpus gets higher precision on FAIR-compliant datasets than one searching the open web, with lifts of 44.9% and 46.6% on specific metrics and 65.7% overall. It also names concrete failure types for the baseline, such as landing on prose pages or portals instead of downloadable data.

The work does a clean job framing the practical tradeoff between coverage and accuracy for autonomous agents. The "Last-Mile Utility" category is a useful label, and mapping the evaluation to FAIR principles gives the results a clear target. The setup is a straightforward head-to-head that hasn't been run exactly this way before.

The main weakness is the evaluation pipeline. All precision numbers come from an LLM judge with no reported human agreement, calibration, or prompt details. Because the two agents return different kinds of pages—one structured, one unstructured—any bias in how the judge scores text could drive the gap. The corpora also differ sharply in scale and content, so the question selection and comparability need more explicit defense.

This is for people building or evaluating data agents who care about machine-actionable outputs rather than pure search. It is worth a serious referee because the question is timely and the empirical design is direct, even though the judge validation will need to be added or strengthened.

Referee Report

2 major / 1 minor

Summary. The manuscript compares a Baseline Agent retrieving data from the unstructured open web against a Semantic Agent using a corpus of 90 million schema.org-annotated datasets. Deploying an LLM-as-a-judge pipeline mapped to FAIR principles, it reports that the Semantic Agent achieves 44.9% higher precision on metadata-rich registries, 46.6% higher precision on pages with machine-readable downloads, and 65.7% higher overall precision in retrieving FAIR-compliant datasets, while the Baseline Agent answers 40% more questions but returns prose-heavy pages (20.1%) and portal landing pages (8.5%) more often. The central claim is that semantic metadata remains indispensable for reliable, execution-oriented agentic data retrieval despite lower coverage.

Significance. If the evaluation holds after validation, the work supplies empirical head-to-head evidence on the precision-coverage trade-off in agentic retrieval and the continued utility of structured metadata for FAIR-compliant outcomes. This could inform the design of hybrid discovery systems for autonomous agents.

major comments (2)

[Evaluation Pipeline (abstract and methods)] The precision deltas (44.9%, 46.6%, 65.7%) and failure-mode percentages (20.1%, 8.5%) are produced exclusively by the LLM-as-a-judge pipeline that scores semantic relevance, accessibility, and computational utility. No prompt text, few-shot examples, temperature, calibration set, or human-rater agreement statistics are supplied, leaving open the possibility of systematic bias favoring the structurally different outputs of the Semantic Agent.
[Experimental Setup] The Baseline Agent searches billions of open-web documents while the Semantic Agent is restricted to a 90-million dataset corpus; the manuscript does not demonstrate that the query sets or topic distributions are matched, so the direct precision comparison rests on an unshown equivalence between the two environments.

minor comments (1)

[Abstract] The abstract reports that the Baseline Agent 'answers 40% more questions' without stating the absolute number of queries or the success criterion used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. Below we respond point-by-point to the two major concerns. We will expand the evaluation details for transparency and clarify the experimental design to address the comparison between environments.

read point-by-point responses

Referee: [Evaluation Pipeline (abstract and methods)] The precision deltas (44.9%, 46.6%, 65.7%) and failure-mode percentages (20.1%, 8.5%) are produced exclusively by the LLM-as-a-judge pipeline that scores semantic relevance, accessibility, and computational utility. No prompt text, few-shot examples, temperature, calibration set, or human-rater agreement statistics are supplied, leaving open the possibility of systematic bias favoring the structurally different outputs of the Semantic Agent.

Authors: We agree that the current manuscript lacks sufficient detail on the LLM-as-a-judge implementation. In the revision we will add an appendix containing the complete prompt templates (including the explicit mapping to each FAIR principle), few-shot examples, the temperature setting of 0.0, and Cohen's kappa statistics from a 100-query human validation subset. The judge instructions emphasize objective indicators such as presence of downloadable machine-readable files and schema.org compliance, which are format-agnostic; however, the added materials will allow readers to assess and replicate the scoring process directly. revision: yes
Referee: [Experimental Setup] The Baseline Agent searches billions of open-web documents while the Semantic Agent is restricted to a 90-million dataset corpus; the manuscript does not demonstrate that the query sets or topic distributions are matched, so the direct precision comparison rests on an unshown equivalence between the two environments.

Authors: The same fixed set of 500 queries is issued to both agents; topic coverage is therefore matched by construction. The environments are deliberately non-equivalent because they represent the two distinct retrieval paradigms under study (unstructured web versus schema.org corpus). Precision is measured on the results each agent actually returns for those identical queries, which directly quantifies the precision-coverage trade-off. We will add a sentence in the methods section explicitly stating that queries are held constant and that the comparison is between paradigms rather than between matched corpora. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical comparison with no derivations or self-referential reductions

full rationale

The paper reports results from running two agents (Baseline and Semantic) on retrieval tasks and scoring outputs via an LLM judge mapped to FAIR criteria. The provided text contains no equations, fitted parameters, predictions derived from inputs, or self-citations used as load-bearing premises. All quantitative claims (e.g., 65.7% higher precision) are presented as direct experimental outcomes rather than reductions of prior results. No patterns from the enumerated list apply.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the validity of the LLM judge and the assumption that the baseline and semantic setups are fair comparators; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption An LLM judge can reliably score retrieved data on semantic relevance, accessibility, and computational utility according to FAIR principles.
This assumption underpins all reported precision differences and is stated via the evaluation pipeline description.

pith-pipeline@v0.9.1-grok · 5849 in / 1118 out tokens · 34037 ms · 2026-06-29T09:34:46.504028+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 19 canonical work pages · 2 internal anchors

[1]

Aghzal, M., Stein, G.J., Yao, Z.: Why Do LLM-based Web Agents Fail? A Hierar- chical Planning Perspective (2026)

2026
[2]

W3C recommendation, World Wide Web Consortium (W3C) (2024), https://www.w3.org/TR/vocab- dcat-3/

Albertoni, R., Browning, D., Cox, S.J.D., Gonzalez Beltran, A., Perego, A., Win- stanley, P.: Data catalog vocabulary (DCAT) - version 3. W3C recommendation, World Wide Web Consortium (W3C) (2024), https://www.w3.org/TR/vocab- dcat-3/

2024
[3]

Data Intelligence6(2), 457–487 (2024)

Albertoni, R., Browning, D., Cox, S.J.D., Gonzalez-Beltran, A.N., Perego, A., Winstanley, P.: The W3C data catalog vocabulary, version 2: Ratio- nale, design principles, and uptake. Data Intelligence6(2), 457–487 (2024). https://doi.org/10.1162/dint_a_00241

work page doi:10.1162/dint_a_00241 2024
[4]

In: The Semantic Web – ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings

Alrashed, T., Paparas, D., Benjelloun, O., Sheng, Y., Noy, N.: Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages. In: The Semantic Web – ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings. p. 338–356. Springer-Verlag, Berlin, Heidelberg (2021), https://doi.org/10.100...

work page doi:10.1007/978-3-030-88361-4_20 2021
[5]

In: International Semantic Web Conference (ISWC-2020), In-Use Track (2020), https://arxiv.org/abs/2006.06894

Benjelloun, O., Chen, S., Noy, N.: Google Dataset Search by the numbers. In: International Semantic Web Conference (ISWC-2020), In-Use Track (2020), https://arxiv.org/abs/2006.06894

work page arXiv 2020
[6]

The VLDB Journal29(1), 251–272 (Aug 2019), https://doi.org/10.1007/s00778-019-00564-x

Chapman,A.,Simperl,E.,Koesten,L.,Konstantinidis,G.,Ibáñez,L.D.,Kacprzak, E., Groth, P.: Dataset Search: A Survey. The VLDB Journal29(1), 251–272 (Aug 2019), https://doi.org/10.1007/s00778-019-00564-x

work page doi:10.1007/s00778-019-00564-x 2019
[7]

Chezelles, T.L.S.D., Gasse, M., Drouin, A., Caccia, M., Boisvert, L., Thakkar, M., Marty, T., Assouel, R., Shayegan, S.O., Jang, L.K., Lù, X.H., Yoran, O., Kong, D., Xu, F.F., Reddy, S., Cappart, Q., Neubig, G., Salakhutdinov, R., Chapados, N., Lacoste, A.: The BrowserGym Ecosystem for Web Agent Research (2025), https://arxiv.org/abs/2412.05467

work page arXiv 2025
[8]

(eds.) Proceed- ings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Net- works for NLP

Chiang, C.H., Lee, H.y.: Do Metadata and Appearance of the Retrieved Web- pages Affect LLM’s Reasoning in Retrieval-Augmented Generation? In: Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., Chen, H. (eds.) Proceed- ings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Net- works for NLP. pp. 389–406. Association for Computa...

work page doi:10.18653/v1/2024.blackboxnlp-1.24 2024
[9]

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Tru- itt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From Local to Global: A Graph RAG Approach to Query-Focused Summarization (2025), https://arxiv.org/abs/2404.16130

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Google: Agent Development Kit (ADK) (2026), https://adk.dev/, accessed: 2026- 04-22 Do Agents Need Semantic Metadata? 17

2026
[11]

Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of struc- tured data on the web. Commun. ACM59(2), 44–51 (Jan 2016), https://doi.org/10.1145/2844544

work page doi:10.1145/2844544 2016
[12]

In: Bouamor, H., Pino, J., Bali, K

Gur, I., Nachum, O., Miao, Y., Safdari, M., Huang, A., Chowdhery, A., Narang, S., Fiedel, N., Faust, A.: Understanding HTML with Large Language Models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Com- putational Linguistics: EMNLP 2023. pp. 2803–2821. Association for Computa- tional Linguistics, Singapore (Dec 2023), https:/...

2023
[13]

In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

Humphreys, P.C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Mul- dal, A., Abramson, J., Georgiev, P., Santoro, A., Lillicrap, T.: A data- driven approach for learning to control computers. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Pro...

2022
[14]

In: Bouamor, H., Pino, J., Bali, K

Jiang, J., Zhou, K., Dong, Z., Ye, K., Zhao, X., Wen, J.R.: Struct- GPT: A General Framework for Large Language Model to Reason over Structured Data. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceed- ings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing. pp. 9237–9251. Association for Computational Linguis- tics, Singapore (D...

work page doi:10.18653/v1/2023.emnlp-main.574 2023
[15]

Kato, M.P., Ohshima, H., Liu, Y.H., Chen, H.L.: NTCIR-16 Data Search 2 (2022)

2022
[16]

In: Proceedings of the 34th In- ternational Conference on Neural Information Processing Systems

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Proceedings of the 34th In- ternational Conference on Neural Information Processing Systems. NeurIPS ’20, Curran Associates Inc., Red Hook,...

2020
[17]

Li, X., Lyu, T., Yang, Y., Shan, L., Yang, S., Zhang, L., Huang, Z., Liu, Q., Li, Y.: Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning (2026)

2026
[18]

In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., Tang, J.: AgentBench: Evaluating LLMs as Agents. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on L...

2024
[19]

In: Bouamor, H., Pino, J., Bali, K

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG Evaluation using GPT-4 with Better Human Alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023), https://aclanthology.org/202...

2023
[20]

Information Services and Use37(1), 49–56 (2017)

Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L.O.B., Wilkin- son, M.D.: Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services and Use37(1), 49–56 (2017)

2017
[21]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., But- 18 S. Chen et al. ton, K., Knight, M., Chess, B., Schulman, J.: WebGPT: Browser-assisted question- answering with human feedback (2022), https://arxiv.org/abs/2112.09332

work page internal anchor Pith review Pith/arXiv arXiv 2022
[22]

Noy, N., Burgess, M., Brickley, D.: Google Dataset Search: Building a Search En- gineforDatasetsinanOpenWebEcosystem.In:TheWorldWideWebConference. p. 1365–1375. WWW ’19, Association for Computing Machinery, New York, NY, USA (2019), https://doi.org/10.1145/3308558.3313685

work page doi:10.1145/3308558.3313685 2019
[23]

In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Overwijk, A., Xiong, C., Callan, J.: ClueWeb22: 10 Billion Web Docu- ments with Rich Information. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. p. 3360–3362. SIGIR ’22, Association for Computing Machin- ery, New York, NY, USA (2022). https://doi.org/10.1145/3477495.3536321, https://doi...

work page doi:10.1145/3477495.3536321 2022
[24]

IEEE Transactions on Knowledge and Data Engineering36, 7 (2024), 3580–3601

Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., Wu, X.: Unify- ing Large Language Models and Knowledge Graphs: A Roadmap. IEEE Transactions on Knowledge and Data Engineering36(7), 3580–3599 (2024). https://doi.org/10.1109/TKDE.2024.3352100

work page doi:10.1109/tkde.2024.3352100 2024
[25]

O'Brien, Carrie J

Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Generative agents: Interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023), https://doi.org/10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023
[26]

Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large Language Model ConnectedwithMassiveAPIs.AdvancesinNeuralInformationProcessingSystems 37, 126544–126565 (2024)

2024
[27]

In: The Twelfth International Conference on Learning Represen- tations

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al.: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In: The Twelfth International Conference on Learning Represen- tations. ICLR ’24, https://openreview.net/forum?id=dHng2O0Jjr
[28]

In: Proceedings of the 37th International Conference on Neural Information Processing Systems

Schick, T., Dwivedi-Yu, J., Dessí, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language Models Can Teach Themselves to Use Tools. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NeurIPS ’23, Curran Associates Inc., Red Hook, NY, USA
[29]

Harvard Data Science Review (Special Issue 4) (apr 2 2024), https://hdsr.mitpress.mit.edu/pub/psnc8zsr

Sostek, K., Russell, D.M., Goyal, N., Alrashed, T., Dugall, S., Noy, N.: Discov- ering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search. Harvard Data Science Review (Special Issue 4) (apr 2 2024), https://hdsr.mitpress.mit.edu/pub/psnc8zsr

2024
[30]

In: Proceedings of the ACM on Web Conference 2025

Tan, J., Dou, Z., Wang, W., Wang, M., Chen, W., Wen, J.R.: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. In: Proceedings of the ACM on Web Conference 2025. p. 1733–1746. WWW ’25, Association for Computing Machinery, New York, NY, USA (2025), https://doi.org/10.1145/3696410.3714546

work page doi:10.1145/3696410.3714546 2025
[31]

Scientific Data8(1), 192 (2021), https://doi.org/10.1038/s41597-021-00981-0

Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., Pedaste, M., Raju, M., Astapova, A., Lukner, H., Kogermann, K., Sepp, T.: Data shar- ing practices and data availability upon request differ across scientific disciplines. Scientific Data8(1), 192 (2021), https://doi.org/10.1038/s41597-021-00981-0

work page doi:10.1038/s41597-021-00981-0 2021
[32]

arXiv:2406.06519 [cs.IR] https://arxiv.org/abs/2406.06519 Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang

Upadhyay, S., Pradeep, R., Thakur, N., Craswell, N., Lin, J.: UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv:2406.06519 (2024) Do Agents Need Semantic Metadata? 19

work page arXiv 2024
[33]

In: Advances in Neural Information Processing Systems

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain- of-Thought Prompting Elicits Reasoning in Large Language Models. In: Advances in Neural Information Processing Systems. vol. 35, pp. 24824–24837 (2022)

2022
[34]

The FAIR Guiding Principles for scientific data management and stewardship

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data3(1), 1–9 (2016), https://doi.org/10.1038/sdata.2016.18

work page doi:10.1038/sdata.2016.18 2016
[35]

In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=WE_vluYUL-X

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=WE_vluYUL-X

2023
[36]

In: Proceedings of the 37th International Con- ference on Neural Information Processing Systems

Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In: Proceedings of the 37th International Con- ference on Neural Information Processing Systems. NeurIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

2023
[37]

(eds.) International Conference on Learning Representations

Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., Neubig, G.: WebArena: A Realistic Web Environment for BuildingAutonomousAgents.In:Kim,B.,Yue,Y.,Chaudhuri,S.,Fragkiadaki,K., Khan, M., Sun, Y. (eds.) International Conference on Learning Representations. ICLR ’24, vol. 2024, pp. 15585–15606

2024
[38]

PLoS one15(9), e0239283 (2020), https://doi.org/10.1371/journal.pone.0239283

Zuiderwijk, A., Shinde, R., Jeng, W.: What Drives and Inhibits Researchers to Share and Use Open Research Data? A Systematic Literature Review to Analyze Factors Influencing Open Research Data Adoption. PLoS one15(9), e0239283 (2020), https://doi.org/10.1371/journal.pone.0239283

work page doi:10.1371/journal.pone.0239283 2020

[1] [1]

Aghzal, M., Stein, G.J., Yao, Z.: Why Do LLM-based Web Agents Fail? A Hierar- chical Planning Perspective (2026)

2026

[2] [2]

W3C recommendation, World Wide Web Consortium (W3C) (2024), https://www.w3.org/TR/vocab- dcat-3/

Albertoni, R., Browning, D., Cox, S.J.D., Gonzalez Beltran, A., Perego, A., Win- stanley, P.: Data catalog vocabulary (DCAT) - version 3. W3C recommendation, World Wide Web Consortium (W3C) (2024), https://www.w3.org/TR/vocab- dcat-3/

2024

[3] [3]

Data Intelligence6(2), 457–487 (2024)

Albertoni, R., Browning, D., Cox, S.J.D., Gonzalez-Beltran, A.N., Perego, A., Winstanley, P.: The W3C data catalog vocabulary, version 2: Ratio- nale, design principles, and uptake. Data Intelligence6(2), 457–487 (2024). https://doi.org/10.1162/dint_a_00241

work page doi:10.1162/dint_a_00241 2024

[4] [4]

In: The Semantic Web – ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings

Alrashed, T., Paparas, D., Benjelloun, O., Sheng, Y., Noy, N.: Dataset or Not? A Study on the Veracity of Semantic Markup for Dataset Pages. In: The Semantic Web – ISWC 2021: 20th International Semantic Web Conference, ISWC 2021, Virtual Event, October 24–28, 2021, Proceedings. p. 338–356. Springer-Verlag, Berlin, Heidelberg (2021), https://doi.org/10.100...

work page doi:10.1007/978-3-030-88361-4_20 2021

[5] [5]

In: International Semantic Web Conference (ISWC-2020), In-Use Track (2020), https://arxiv.org/abs/2006.06894

Benjelloun, O., Chen, S., Noy, N.: Google Dataset Search by the numbers. In: International Semantic Web Conference (ISWC-2020), In-Use Track (2020), https://arxiv.org/abs/2006.06894

work page arXiv 2020

[6] [6]

The VLDB Journal29(1), 251–272 (Aug 2019), https://doi.org/10.1007/s00778-019-00564-x

Chapman,A.,Simperl,E.,Koesten,L.,Konstantinidis,G.,Ibáñez,L.D.,Kacprzak, E., Groth, P.: Dataset Search: A Survey. The VLDB Journal29(1), 251–272 (Aug 2019), https://doi.org/10.1007/s00778-019-00564-x

work page doi:10.1007/s00778-019-00564-x 2019

[7] [7]

Chezelles, T.L.S.D., Gasse, M., Drouin, A., Caccia, M., Boisvert, L., Thakkar, M., Marty, T., Assouel, R., Shayegan, S.O., Jang, L.K., Lù, X.H., Yoran, O., Kong, D., Xu, F.F., Reddy, S., Cappart, Q., Neubig, G., Salakhutdinov, R., Chapados, N., Lacoste, A.: The BrowserGym Ecosystem for Web Agent Research (2025), https://arxiv.org/abs/2412.05467

work page arXiv 2025

[8] [8]

(eds.) Proceed- ings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Net- works for NLP

Chiang, C.H., Lee, H.y.: Do Metadata and Appearance of the Retrieved Web- pages Affect LLM’s Reasoning in Retrieval-Augmented Generation? In: Belinkov, Y., Kim, N., Jumelet, J., Mohebbi, H., Mueller, A., Chen, H. (eds.) Proceed- ings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Net- works for NLP. pp. 389–406. Association for Computa...

work page doi:10.18653/v1/2024.blackboxnlp-1.24 2024

[9] [9]

Edge, D., Trinh, H., Cheng, N., Bradley, J., Chao, A., Mody, A., Tru- itt, S., Metropolitansky, D., Ness, R.O., Larson, J.: From Local to Global: A Graph RAG Approach to Query-Focused Summarization (2025), https://arxiv.org/abs/2404.16130

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Google: Agent Development Kit (ADK) (2026), https://adk.dev/, accessed: 2026- 04-22 Do Agents Need Semantic Metadata? 17

2026

[11] [11]

Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of struc- tured data on the web. Commun. ACM59(2), 44–51 (Jan 2016), https://doi.org/10.1145/2844544

work page doi:10.1145/2844544 2016

[12] [12]

In: Bouamor, H., Pino, J., Bali, K

Gur, I., Nachum, O., Miao, Y., Safdari, M., Huang, A., Chowdhery, A., Narang, S., Fiedel, N., Faust, A.: Understanding HTML with Large Language Models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Findings of the Association for Com- putational Linguistics: EMNLP 2023. pp. 2803–2821. Association for Computa- tional Linguistics, Singapore (Dec 2023), https:/...

2023

[13] [13]

In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S

Humphreys, P.C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Mul- dal, A., Abramson, J., Georgiev, P., Santoro, A., Lillicrap, T.: A data- driven approach for learning to control computers. In: Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S. (eds.) Proceedings of the 39th International Conference on Machine Learning. Pro...

2022

[14] [14]

In: Bouamor, H., Pino, J., Bali, K

Jiang, J., Zhou, K., Dong, Z., Ye, K., Zhao, X., Wen, J.R.: Struct- GPT: A General Framework for Large Language Model to Reason over Structured Data. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceed- ings of the 2023 Conference on Empirical Methods in Natural Lan- guage Processing. pp. 9237–9251. Association for Computational Linguis- tics, Singapore (D...

work page doi:10.18653/v1/2023.emnlp-main.574 2023

[15] [15]

Kato, M.P., Ohshima, H., Liu, Y.H., Chen, H.L.: NTCIR-16 Data Search 2 (2022)

2022

[16] [16]

In: Proceedings of the 34th In- ternational Conference on Neural Information Processing Systems

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., Riedel, S., Kiela, D.: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Proceedings of the 34th In- ternational Conference on Neural Information Processing Systems. NeurIPS ’20, Curran Associates Inc., Red Hook,...

2020

[17] [17]

Li, X., Lyu, T., Yang, Y., Shan, L., Yang, S., Zhang, L., Huang, Z., Liu, Q., Li, Y.: Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning (2026)

2026

[18] [18]

In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., Tang, J.: AgentBench: Evaluating LLMs as Agents. In: Kim, B., Yue, Y., Chaudhuri, S., Fragkiadaki, K., Khan, M., Sun, Y. (eds.) International Conference on L...

2024

[19] [19]

In: Bouamor, H., Pino, J., Bali, K

Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., Zhu, C.: G-eval: NLG Evaluation using GPT-4 with Better Human Alignment. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 2511–2522. Association for Computational Linguistics, Singapore (Dec 2023), https://aclanthology.org/202...

2023

[20] [20]

Information Services and Use37(1), 49–56 (2017)

Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L.O.B., Wilkin- son, M.D.: Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services and Use37(1), 49–56 (2017)

2017

[21] [21]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., But- 18 S. Chen et al. ton, K., Knight, M., Chess, B., Schulman, J.: WebGPT: Browser-assisted question- answering with human feedback (2022), https://arxiv.org/abs/2112.09332

work page internal anchor Pith review Pith/arXiv arXiv 2022

[22] [22]

Noy, N., Burgess, M., Brickley, D.: Google Dataset Search: Building a Search En- gineforDatasetsinanOpenWebEcosystem.In:TheWorldWideWebConference. p. 1365–1375. WWW ’19, Association for Computing Machinery, New York, NY, USA (2019), https://doi.org/10.1145/3308558.3313685

work page doi:10.1145/3308558.3313685 2019

[23] [23]

In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Overwijk, A., Xiong, C., Callan, J.: ClueWeb22: 10 Billion Web Docu- ments with Rich Information. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. p. 3360–3362. SIGIR ’22, Association for Computing Machin- ery, New York, NY, USA (2022). https://doi.org/10.1145/3477495.3536321, https://doi...

work page doi:10.1145/3477495.3536321 2022

[24] [24]

IEEE Transactions on Knowledge and Data Engineering36, 7 (2024), 3580–3601

Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., Wu, X.: Unify- ing Large Language Models and Knowledge Graphs: A Roadmap. IEEE Transactions on Knowledge and Data Engineering36(7), 3580–3599 (2024). https://doi.org/10.1109/TKDE.2024.3352100

work page doi:10.1109/tkde.2024.3352100 2024

[25] [25]

O'Brien, Carrie J

Park, J.S., O’Brien, J., Cai, C.J., Morris, M.R., Liang, P., Bernstein, M.S.: Generative agents: Interactive simulacra of human behavior. In: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. UIST ’23, Association for Computing Machinery, New York, NY, USA (2023), https://doi.org/10.1145/3586183.3606763

work page doi:10.1145/3586183.3606763 2023

[26] [26]

Patil, S.G., Zhang, T., Wang, X., Gonzalez, J.E.: Gorilla: Large Language Model ConnectedwithMassiveAPIs.AdvancesinNeuralInformationProcessingSystems 37, 126544–126565 (2024)

2024

[27] [27]

In: The Twelfth International Conference on Learning Represen- tations

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al.: ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs. In: The Twelfth International Conference on Learning Represen- tations. ICLR ’24, https://openreview.net/forum?id=dHng2O0Jjr

[28] [28]

In: Proceedings of the 37th International Conference on Neural Information Processing Systems

Schick, T., Dwivedi-Yu, J., Dessí, R., Raileanu, R., Lomeli, M., Hambro, E., Zettle- moyer, L., Cancedda, N., Scialom, T.: Toolformer: Language Models Can Teach Themselves to Use Tools. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NeurIPS ’23, Curran Associates Inc., Red Hook, NY, USA

[29] [29]

Harvard Data Science Review (Special Issue 4) (apr 2 2024), https://hdsr.mitpress.mit.edu/pub/psnc8zsr

Sostek, K., Russell, D.M., Goyal, N., Alrashed, T., Dugall, S., Noy, N.: Discov- ering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search. Harvard Data Science Review (Special Issue 4) (apr 2 2024), https://hdsr.mitpress.mit.edu/pub/psnc8zsr

2024

[30] [30]

In: Proceedings of the ACM on Web Conference 2025

Tan, J., Dou, Z., Wang, W., Wang, M., Chen, W., Wen, J.R.: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems. In: Proceedings of the ACM on Web Conference 2025. p. 1733–1746. WWW ’25, Association for Computing Machinery, New York, NY, USA (2025), https://doi.org/10.1145/3696410.3714546

work page doi:10.1145/3696410.3714546 2025

[31] [31]

Scientific Data8(1), 192 (2021), https://doi.org/10.1038/s41597-021-00981-0

Tedersoo, L., Küngas, R., Oras, E., Köster, K., Eenmaa, H., Leijen, Ä., Pedaste, M., Raju, M., Astapova, A., Lukner, H., Kogermann, K., Sepp, T.: Data shar- ing practices and data availability upon request differ across scientific disciplines. Scientific Data8(1), 192 (2021), https://doi.org/10.1038/s41597-021-00981-0

work page doi:10.1038/s41597-021-00981-0 2021

[32] [32]

arXiv:2406.06519 [cs.IR] https://arxiv.org/abs/2406.06519 Guangzhi Xiong, Qiao Jin, Zhiyong Lu, and Aidong Zhang

Upadhyay, S., Pradeep, R., Thakur, N., Craswell, N., Lin, J.: UMBRELA: UMbrela is the (Open-Source Reproduction of the) Bing RELevance Assessor. arXiv:2406.06519 (2024) Do Agents Need Semantic Metadata? 19

work page arXiv 2024

[33] [33]

In: Advances in Neural Information Processing Systems

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain- of-Thought Prompting Elicits Reasoning in Large Language Models. In: Advances in Neural Information Processing Systems. vol. 35, pp. 24824–24837 (2022)

2022

[34] [34]

The FAIR Guiding Principles for scientific data management and stewardship

Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.W., da Silva Santos, L.B., Bourne, P.E., et al.: The FAIR Guiding Principles for Scientific Data Management and Stewardship. Scientific Data3(1), 1–9 (2016), https://doi.org/10.1038/sdata.2016.18

work page doi:10.1038/sdata.2016.18 2016

[35] [35]

In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=WE_vluYUL-X

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing Reasoning and Acting in Language Models. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=WE_vluYUL-X

2023

[36] [36]

In: Proceedings of the 37th International Con- ference on Neural Information Processing Systems

Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E.P., Zhang, H., Gonzalez, J.E., Stoica, I.: Judging LLM-as-a-judge with MT-bench and Chatbot Arena. In: Proceedings of the 37th International Con- ference on Neural Information Processing Systems. NeurIPS ’23, Curran Associates Inc., Red Hook, NY, USA (2023)

2023

[37] [37]

(eds.) International Conference on Learning Representations

Zhou, S., Xu, F.F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Ou, T., Bisk, Y., Fried, D., Alon, U., Neubig, G.: WebArena: A Realistic Web Environment for BuildingAutonomousAgents.In:Kim,B.,Yue,Y.,Chaudhuri,S.,Fragkiadaki,K., Khan, M., Sun, Y. (eds.) International Conference on Learning Representations. ICLR ’24, vol. 2024, pp. 15585–15606

2024

[38] [38]

PLoS one15(9), e0239283 (2020), https://doi.org/10.1371/journal.pone.0239283

Zuiderwijk, A., Shinde, R., Jeng, W.: What Drives and Inhibits Researchers to Share and Use Open Research Data? A Systematic Literature Review to Analyze Factors Influencing Open Research Data Adoption. PLoS one15(9), e0239283 (2020), https://doi.org/10.1371/journal.pone.0239283

work page doi:10.1371/journal.pone.0239283 2020