
arxiv: 2507.03674 · v3 · pith:J7KYL46Unew · submitted 2025-07-04 · 💻 cs.CL · cs.AI

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Pith reviewed 2026-05-22 12:34 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords structured information extraction · ontology-guided extraction · agentic refinement · human-in-the-loop validation · named entity recognition · scientific literature mining · provenance transparency · large language models

The pith

StructSense integrates ontology guidance with agentic self-refinement and human validation to extract structured information from scientific literature across multiple tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StructSense, a modular framework for extracting structured data from scientific text in domains where general-purpose language models typically underperform. It combines predefined ontologies that supply domain knowledge, agents that review and refine their own outputs, and selective human review to correct errors. The authors test the framework on three tasks of increasing semantic complexity: extracting assessment instruments from documents (91 to 100 percent accuracy), extracting metadata and resources from papers (86 to 93 percent overall), and recognizing named entities in neuroscience literature (58 to 75 percent label accuracy). On two biomedical NER benchmarks the system reaches at least 90 percent relaxed recall against the gold annotations while also extracting additional entities that the gold annotations do not cover. The central idea is that this mix of symbolic knowledge, self-correction, and limited human input can produce reliable extractions that stay grounded in the original sources without heavy task-specific redesign.

Core claim

We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks the system achieves at least 90 percent relaxed recall while extracting 1,000–3,600 additional entities

What carries the argument

StructSense, a modular task-agnostic framework that combines ontology-guided symbolic knowledge, agentic self-evaluative refinement, and selective human-in-the-loop validation to perform domain-aware structured extraction while preserving source grounding.

If this is right

  • The framework generalizes across tasks of increasing semantic complexity while keeping source grounding and provenance transparency.
  • It reaches at least 90 percent relaxed recall on biomedical named entity benchmarks and identifies thousands of additional entities beyond existing gold annotations.
  • The local concept mapping service reaches Hits@1 of 62 to 82 percent under strict matching and 68 to 86 percent under semantic matching.
  • The approach maintains consistent performance on schema-based extraction, metadata pulling, and entity recognition without task-specific retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This combination of symbolic rules and self-correction could reduce reliance on large labeled datasets for new scientific fields.
  • The provenance tracking feature may support verification steps in large collaborative knowledge bases built from literature.
  • Extending the human-in-the-loop component to active learning loops could further lower the amount of manual review needed over time.
  • Testing the same pipeline on non-biomedical literature such as physics or materials science would reveal how far the task-agnostic property holds.

Load-bearing premise

That the combination of ontology guidance, agentic self-refinement, and selective human validation will produce strong accuracy and generalization without overfitting to the tested domains or demanding excessive human effort.

What would settle it

Apply the framework unchanged to a fourth domain with different terminology and measure whether accuracy and recall remain within the same ranges reported for the original three domains.

Figures

Figures reproduced from arXiv: 2507.03674 by Dorota Jarecka, Lydia Ng, Patrick Ray, Puja Trivedi, Saif Haobsh, Satrajit S. Ghosh, Tek Raj Chhetri, Yibei Chen.

Figure 1: Architecture of STRUCTSENSE.
Figure 2: The correction mode interface of AIProofBuddy, activated in response to negative user feedback (thumbs …)
Figure 3: Task 3 entity pool analysis (with HIL).
Figure 4: Task 3 entity pool analysis (without HIL).
Figure 5: Task 3 ontology coverage (with HIL).
Figure 6: Task 3 ontology coverage (without HIL).
Figure 7: Task 3 performance comparison between with-HIL and without-HIL.
Figure 8: Comparison of HIL vs. non-HIL performance.
Figure 9: Cost-speed trade-offs across tasks, models, and HIL settings.
Original abstract

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5–85.8% strict recall while extracting 1,000–3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62–82% under strict matching and 68–86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.
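The abstract reports Hits@1 for the concept mapping service under two matching regimes. As a rough illustration of what such a number measures, the sketch below computes Hits@1 as the fraction of mentions whose top-ranked candidate is accepted. The function name `hits_at_1`, the `equivalents` mapping, and the `ONT:` identifiers are all hypothetical. Treating "semantic matching" as "exact match or a listed equivalent concept" is an assumption made here for illustration, not the paper's stated definition.

```python
from typing import Dict, List, Optional, Set


def hits_at_1(
    ranked: Dict[str, List[str]],
    gold: Dict[str, str],
    equivalents: Optional[Dict[str, Set[str]]] = None,
) -> float:
    """Fraction of gold mentions whose top-ranked candidate is accepted.

    ranked      -- mention -> candidate concept IDs, best first
    gold        -- mention -> reference concept ID
    equivalents -- optional: reference ID -> IDs also counted as correct
                   (an assumed stand-in for "semantic" matching)
    """
    if not gold:
        return 0.0
    equivalents = equivalents or {}
    hits = 0
    for mention, gold_id in gold.items():
        candidates = ranked.get(mention, [])
        if not candidates:
            continue  # no candidate returned counts as a miss
        accepted = {gold_id} | equivalents.get(gold_id, set())
        if candidates[0] in accepted:
            hits += 1
    return hits / len(gold)


# Illustrative data only; these identifiers are made up.
ranked = {
    "hippocampus": ["ONT:0001", "ONT:0002"],
    "pyramidal cell": ["ONT:0010"],
    "theta rhythm": ["ONT:0021", "ONT:0020"],
}
gold = {
    "hippocampus": "ONT:0001",
    "pyramidal cell": "ONT:0011",
    "theta rhythm": "ONT:0020",
}
equivalents = {"ONT:0011": {"ONT:0010"}}

strict_score = hits_at_1(ranked, gold)
semantic_score = hits_at_1(ranked, gold, equivalents)
print(f"strict Hits@1:   {strict_score:.2f}")
print(f"semantic Hits@1: {semantic_score:.2f}")
```

Under this reading the semantic score can never fall below the strict score, which is consistent with the ordering of the two ranges in the abstract.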

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces StructSense, a modular, task-agnostic, open-source framework for structured information extraction from scientific literature. It integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation. The framework is evaluated on three tasks of increasing complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition from neuroscience literature (58-75% label accuracy across 8,882 entities). Additional results on biomedical NER benchmarks (NCBI Disease and S800) report >=90% relaxed recall, 62.5-85.8% strict recall, and extraction of 1,000-3,600 additional entities beyond gold annotations, with local concept mapping achieving Hits@1 of 62-82% (strict) and 68-86% (semantic).

Significance. If the reported accuracies and generalization hold after verification of protocols and ablations, the work offers a practical open-source contribution to domain-aware extraction that combines symbolic guidance with agentic refinement and selective human oversight. The emphasis on source grounding, provenance transparency, and evaluation across tasks of varying semantic complexity addresses real limitations of LLMs in specialized scientific domains; the open-source release and human-in-the-loop component are explicit strengths for reproducibility and adoption.

major comments (2)
  1. [Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.
  2. [NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.
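To make the relaxed versus strict distinction in the second major comment concrete, here is a minimal sketch. It assumes that strict matching requires identical span boundaries and label, and that relaxed matching accepts any character overlap with the same label. The page does not state the paper's actual matching criteria, so both definitions, the `Span` type, and the sample offsets are assumptions for illustration only.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share a label and at least one character."""
    return a.label == b.label and a.start < b.end and b.start < a.end


def strict_recall(gold: List[Span], pred: List[Span]) -> float:
    """Share of gold spans reproduced with exact boundaries and label."""
    if not gold:
        return 0.0
    pred_set = set(pred)
    return sum(1 for g in gold if g in pred_set) / len(gold)


def relaxed_recall(gold: List[Span], pred: List[Span]) -> float:
    """Share of gold spans overlapped by some same-label prediction."""
    if not gold:
        return 0.0
    return sum(1 for g in gold if any(overlaps(g, p) for p in pred)) / len(gold)


def additional_entities(gold: List[Span], pred: List[Span]) -> List[Span]:
    """Predictions that overlap no gold span; these need manual review."""
    return [p for p in pred if not any(overlaps(g, p) for g in gold)]


# Illustrative offsets only, not taken from any benchmark.
gold = [Span(0, 12, "Disease"), Span(30, 42, "Disease"), Span(50, 58, "Disease")]
pred = [Span(0, 12, "Disease"), Span(33, 42, "Disease"), Span(70, 80, "Disease")]

strict = strict_recall(gold, pred)
relaxed = relaxed_recall(gold, pred)
extra = additional_entities(gold, pred)
print(f"strict recall:  {strict:.2f}")
print(f"relaxed recall: {relaxed:.2f}")
print(f"additional entities: {len(extra)}")
```

In this toy case the second prediction has a shifted left boundary, so it counts toward relaxed recall but not strict recall. The third prediction matches no gold span at all, so recall alone cannot say whether it is a genuine missed annotation or an over-extraction, which is the gap the referee asks the authors to close with an error analysis.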
minor comments (2)
  1. [Abstract and Results] Clarify notation for accuracy ranges (e.g., whether 91-100% reflects per-task variation or confidence intervals) and ensure all metrics are defined consistently between the abstract and main text.
  2. [Framework description] Provide more detail on the modular architecture diagram or pseudocode to make the integration of symbolic ontology guidance with agentic self-refinement reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the major comments point by point below, committing to specific revisions that will improve the verifiability of our results without overstating the current manuscript.

point-by-point responses
  1. Referee: [Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.

    Authors: We agree that the current manuscript lacks sufficient detail on these aspects to fully support the generalization claims. In the revised version, we will expand the Evaluation section with a precise description of the evaluation protocol (including annotation guidelines, inter-annotator agreement where applicable, and exact metrics computation). We will add baseline comparisons against zero-shot LLM prompting and rule-based extractors. Ablation studies will isolate the contributions of ontology guidance, agentic refinement, and human-in-the-loop components. Statistical significance (e.g., McNemar tests or bootstrap confidence intervals) will be reported for key accuracy figures. Human effort will be quantified via average validation time per document and total annotator hours. These additions will directly address concerns about overfitting and unmeasured costs. revision: yes

  2. Referee: [NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.

    Authors: We accept that validation of the additional entities and discussion of the recall gap are currently insufficient. The revised manuscript will include a new error analysis subsection detailing the validation criteria (e.g., expert review of a random sample of 200 additional entities per benchmark, with categorization into true novel entities, synonyms, or errors) and quantitative results from that review. We will also add a paragraph discussing the relaxed-strict recall gap, noting that relaxed recall reflects the framework's semantic flexibility for task-agnostic use while strict recall ensures high precision; this duality supports generalization across domains without requiring task-specific retraining. revision: yes
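The first response proposes bootstrap confidence intervals for key accuracy figures. A minimal sketch of a percentile bootstrap over per-item correctness follows. The function name, the resample count, the fixed seed, and the sample outcomes are assumptions chosen for illustration; nothing here reflects the authors' actual evaluation code or data.

```python
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_ci(
    correct: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    correct -- one entry per evaluated item: 1 if correct, 0 otherwise
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one evaluated item")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    stats = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


# Made-up outcomes: 75 correct labels out of 100 items.
outcomes = [1] * 75 + [0] * 25
acc, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy {acc:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
```

Reporting an interval of this kind alongside each range would also address the first minor comment, since it separates variation across models or tasks from sampling uncertainty within a single evaluation.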

Circularity Check

0 steps flagged

No significant circularity; empirical framework evaluation

full rationale

The paper introduces StructSense as a modular framework and reports empirical accuracies on three tasks (schema extraction, metadata extraction, NER) plus two benchmarks, with no equations, first-principles derivations, or predictions that reduce to fitted inputs. Claims rest on experimental results, ontology integration, and human-in-the-loop validation rather than self-referential definitions or self-citation chains that force outcomes. The work is self-contained as benchmarking; no load-bearing step equates a result to its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard domain assumptions in LLM-based information extraction rather than new axioms or invented entities; no free parameters or novel postulated objects are mentioned.

axioms (2)
  • domain assumption: Ontology-guided symbolic knowledge can reliably steer LLM extraction in specialized scientific domains.
    Invoked in the abstract description of the framework's core integration for domain-aware extraction.
  • domain assumption: Agentic self-evaluative refinement improves output quality without introducing new errors.
    Stated as part of the modular design that enables robust extraction.


