STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

Dorota Jarecka; Lydia Ng; Patrick Ray; Puja Trivedi; Saif Haobsh; Satrajit S. Ghosh; Tek Raj Chhetri; Yibei Chen

arxiv: 2507.03674 · v3 · pith:J7KYL46Unew · submitted 2025-07-04 · 💻 cs.CL · cs.AI

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

Pith reviewed 2026-05-22 12:34 UTC · model grok-4.3

keywords structured information extraction · ontology-guided extraction · agentic refinement · human-in-the-loop validation · named entity recognition · scientific literature mining · provenance transparency · large language models

The pith

StructSense integrates ontology guidance with agentic self-refinement and human validation to extract structured information from scientific literature across multiple tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StructSense as a modular framework for extracting structured data from scientific texts in areas where standard language models typically underperform. It combines predefined ontologies that supply domain knowledge, agents that review and improve their own outputs, and selective human review to correct errors. The authors test this setup on three tasks of increasing semantic complexity: extracting assessment instruments from documents, extracting metadata and resources from papers, and recognizing named entities in neuroscience writing. Reported accuracy ranges from 58 percent to 100 percent depending on the task. On two biomedical benchmarks the system reaches at least 90 percent relaxed recall against the gold annotations while also finding entities the gold annotations do not contain. The central idea is that this mix of symbolic knowledge, self-correction, and limited human input can produce reliable extractions that stay grounded in the original sources and work without heavy task-specific redesign.

Core claim

We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks the system achieves at least 90 percent relaxed recall while extracting 1,000–3,600 additional entities beyond gold annotations.

What carries the argument

StructSense, a modular task-agnostic framework that combines ontology-guided symbolic knowledge, agentic self-evaluative refinement, and selective human-in-the-loop validation to perform domain-aware structured extraction while preserving source grounding.
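
The three-part combination described above can be pictured as a small control loop. The following is a minimal sketch under stated assumptions, not the StructSense implementation: the `Extraction` record, the `run_pipeline` function, the toy extractor, critic, and reviewer, the label names, and the 0.7 review threshold are all hypothetical and introduced here only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Extraction:
    text: str          # surface string copied from the document
    label: str         # ontology class assigned to it
    start: int         # character offsets, kept so the item stays source-grounded
    end: int
    confidence: float  # extractor's own score in [0, 1]


def run_pipeline(document, allowed_labels, extract, critique, ask_human,
                 review_threshold=0.7, max_rounds=2):
    # 1. Ontology guidance: keep only labels the ontology defines.
    items = [e for e in extract(document) if e.label in allowed_labels]
    # 2. Self-refinement: re-run the critic until it stops changing the output.
    for _ in range(max_rounds):
        revised = critique(document, items)
        if revised == items:
            break
        items = revised
    # 3. Grounding check, then selective human review of low-confidence items.
    accepted = []
    for e in items:
        if document[e.start:e.end] != e.text:
            continue  # not traceable to the source text, so drop it
        if e.confidence < review_threshold:
            e = ask_human(e)  # returns a corrected item, or None to reject
            if e is None:
                continue
        accepted.append(e)
    return accepted


# Toy stand-ins for the LLM extractor, the critic agent, and the reviewer.
doc = "Neurons in the hippocampus fire during recall."


def span(text, label, confidence):
    start = doc.index(text)
    return Extraction(text, label, start, start + len(text), confidence)


def toy_extract(document):
    return [
        span("hippocampus", "ANATOMICAL_REGION", 0.9),
        span("Neurons", "CELL_TYPE", 0.5),
        span("recall", "NOT_IN_ONTOLOGY", 0.9),
    ]


def toy_critique(document, items):
    return items  # a real critic would revise labels or drop weak items


def toy_human(item):
    return replace(item, confidence=1.0)  # reviewer approves the item


result = run_pipeline(doc, {"ANATOMICAL_REGION", "CELL_TYPE"},
                      toy_extract, toy_critique, toy_human)
```

In this toy run the off-ontology item is filtered out, the high-confidence item passes straight through, and only the low-confidence item reaches the reviewer, which is the sense in which human validation is selective.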

If this is right

The framework generalizes across tasks of increasing semantic complexity while keeping source grounding and provenance transparency.
It reaches at least 90 percent relaxed recall on biomedical named entity benchmarks and identifies thousands of additional entities beyond existing gold annotations.
The local concept mapping service reaches Hits@1 of 62 to 82 percent under strict matching and 68 to 86 percent under semantic matching.
The approach covers schema-based extraction, metadata extraction, and entity recognition without task-specific retraining, although accuracy varies from 58 to 100 percent across them.
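
For readers unfamiliar with the Hits@1 figure quoted in the list above, here is a small sketch of how a Hits@k score is commonly computed. It assumes each query has a ranked list of candidate concept IDs and a set of acceptable IDs. Modelling "strict" matching as a single acceptable ID and "semantic" matching as a widened set is an assumption made for illustration; this page does not give the paper's exact definitions, and the `ONT:` identifiers are made up.

```python
def hits_at_k(ranked_candidates, acceptable_ids, k=1):
    """Fraction of queries with an acceptable concept ID in the top-k candidates."""
    if len(ranked_candidates) != len(acceptable_ids):
        raise ValueError("one set of acceptable IDs is needed per query")
    if not ranked_candidates:
        return 0.0
    hits = sum(
        1
        for candidates, acceptable in zip(ranked_candidates, acceptable_ids)
        if any(c in acceptable for c in candidates[:k])
    )
    return hits / len(ranked_candidates)


# Three queries, each with candidates ranked best-first (hypothetical IDs).
ranked = [["ONT:1", "ONT:2"], ["ONT:5", "ONT:3"], ["ONT:9"]]

# Strict: only the exact gold ID counts.
strict_gold = [{"ONT:1"}, {"ONT:3"}, {"ONT:7"}]
# Semantic (assumed): an equivalent concept also counts for the second query.
semantic_gold = [{"ONT:1"}, {"ONT:3", "ONT:5"}, {"ONT:7"}]

strict_hits1 = hits_at_k(ranked, strict_gold, k=1)
semantic_hits1 = hits_at_k(ranked, semantic_gold, k=1)
strict_hits2 = hits_at_k(ranked, strict_gold, k=2)
```

Widening the acceptable set can only raise the score, which matches the direction of the reported gap between strict and semantic matching.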

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This combination of symbolic rules and self-correction could reduce reliance on large labeled datasets for new scientific fields.
The provenance tracking feature may support verification steps in large collaborative knowledge bases built from literature.
Extending the human-in-the-loop component to active learning loops could further lower the amount of manual review needed over time.
Testing the same pipeline on non-biomedical literature such as physics or materials science would reveal how far the task-agnostic property holds.

Load-bearing premise

That the combination of ontology guidance, agentic self-refinement, and selective human validation will produce strong accuracy and generalization without overfitting to the tested domains or demanding excessive human effort.

What would settle it

Apply the framework unchanged to a fourth domain with different terminology and measure whether accuracy and recall remain within the same ranges reported for the original three domains.

Figures

Figures reproduced from arXiv: 2507.03674 by Dorota Jarecka, Lydia Ng, Patrick Ray, Puja Trivedi, Saif Haobsh, Satrajit S. Ghosh, Tek Raj Chhetri, Yibei Chen.

**Figure 1.** Architecture of STRUCTSENSE. … and align with FAIR principles, they lack the adaptability of ML/DL; conventional ML/DL models provide learning capabilities but often require large annotated datasets. Hybrid approaches have emerged to combine these strengths but still lack the robustness of LLMs. We observe an increasing use of LLMs in IE because their capability has facilitated end-to-end extraction. Howe…

**Figure 2.** The correction mode interface of AIProofBuddy, activated in response to negative user feedback (thumbs…

**Figure 3.** Task 3 entity pool analysis (with HIL). Entity Detection and Ontology Mapping: Claude achieved the highest entity detection rates across both settings, identifying 50 of 59 (see…

**Figure 4.** Task 3 entity pool analysis (without HIL).

**Figure 5.** Task 3 ontology coverage (with HIL).

**Figure 6.** Task 3 ontology coverage (without HIL). … types). DeepSeek’s index decreased from 2.34 (11 types) to 2.12 (10 types). GPT-4o-mini maintained the lowest diversity, with 1.89 (7 types) under HIL and 1.76 (6 types) without. Claude’s label distribution spanned technical concepts such as COMPUTATIONAL_COMPONENT, EXPERIMENTAL_CONDITION, and MODEL_COMPONENT. DeepSeek extracted a narrower but balanced set, with incre…

**Figure 7.** Task 3 performance comparison between with-HIL and without-HIL.

**Figure 8.** Comparison of HIL vs. non-HIL performance.

**Figure 9.** Cost-speed trade-offs across tasks, models, and HIL settings.

Original abstract

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5–85.8% strict recall while extracting 1,000–3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62–82% under strict matching and 68–86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.
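
The abstract reports strict recall, relaxed recall, and a count of additional entities without defining them on this page. The sketch below shows one common convention, offered as an assumption rather than the paper's protocol: strict recall requires identical span boundaries and type, relaxed recall accepts any overlapping span of the same type, and an "additional" entity is a prediction that overlaps no gold span. The function names and the toy spans are hypothetical.

```python
def overlaps(a, b):
    """True when two (start, end, type) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def recall(gold, pred, relaxed):
    """Share of gold entities recovered under strict or relaxed matching."""
    if not gold:
        return 0.0
    pred_set = set(pred)
    found = 0
    for g in gold:
        if relaxed:
            # Same type and any character overlap counts as a match.
            matched = any(p[2] == g[2] and overlaps(p, g) for p in pred)
        else:
            # Boundaries and type must be identical.
            matched = g in pred_set
        found += 1 if matched else 0
    return found / len(gold)


def additional_entities(gold, pred):
    """Predictions that overlap no gold span at all."""
    return [p for p in pred if not any(overlaps(p, g) for g in gold)]


# Toy spans as (start, end, type); offsets are illustrative only.
gold = [(0, 14, "Disease"), (30, 42, "Disease"), (60, 70, "Disease")]
pred = [(0, 14, "Disease"), (28, 42, "Disease"), (80, 95, "Disease")]

strict_recall = recall(gold, pred, relaxed=False)
relaxed_recall = recall(gold, pred, relaxed=True)
extras = additional_entities(gold, pred)
```

Under this convention the second prediction counts only toward relaxed recall because its left boundary differs, and the third prediction lands in the additional pool. Whether such extras are valid entities or over-extraction cannot be decided from the counts alone.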

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

StructSense combines ontology guidance, agentic refinement, and human checks into one open framework and reports concrete numbers on extraction tasks, but the added value of each piece still needs clearer proof.

The main point is that StructSense brings together ontology guidance, agentic self-refinement, and selective human validation in an open-source system for extracting structured information from scientific literature, with reported accuracies that look promising on the tasks tested. The paper does a good job laying out evaluations on three tasks that ramp up in difficulty: high accuracy on schema-based extraction of assessment instruments, solid numbers on metadata and resource extraction, and NER on neuroscience literature, where label accuracy is 58 to 75 percent. On the two biomedical NER benchmarks, relaxed recall reaches at least 90 percent. Extracting extra entities beyond the gold standard while keeping provenance is a practical feature. The modular, task-agnostic design is what the authors highlight as the advance over prior piecemeal approaches.

The softer area is the lack of clear ablations showing why the full combination is needed. The abstract gives overall figures, but the full paper should say more about whether the agent loops or human checks move the needle, or whether most of the gain comes from the ontology and base LLM. The variation in performance across tasks suggests the framework may not be equally robust everywhere, and the human effort cost isn't broken out, which matters for judging whether this scales without too much manual work. The additional entities extracted could be useful new information or noise, so an error analysis would help.

This is aimed at researchers doing meta-analysis or building domain knowledge bases from papers, especially in biomedical fields. Anyone needing a starting point for automated extraction with some human oversight will find the benchmarks and the framework worth considering.

I think it should go to peer review. The results are concrete enough, and the open-source release makes it worth referees' time to check the details and suggest improvements.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces StructSense, a modular, task-agnostic, open-source framework for structured information extraction from scientific literature. It integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation. The framework is evaluated on three tasks of increasing complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition from neuroscience literature (58-75% label accuracy across 8,882 entities). Additional results on biomedical NER benchmarks (NCBI Disease and S800) report >=90% relaxed recall, 62.5-85.8% strict recall, and extraction of 1,000-3,600 additional entities beyond gold annotations, with local concept mapping achieving Hits@1 of 62-82% (strict) and 68-86% (semantic).

Significance. If the reported accuracies and generalization hold after verification of protocols and ablations, the work offers a practical open-source contribution to domain-aware extraction that combines symbolic guidance with agentic refinement and selective human oversight. The emphasis on source grounding, provenance transparency, and evaluation across tasks of varying semantic complexity addresses real limitations of LLMs in specialized scientific domains; the open-source release and human-in-the-loop component are explicit strengths for reproducibility and adoption.

major comments (2)

[Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.
[NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.

minor comments (2)

[Abstract and Results] Clarify notation for accuracy ranges (e.g., whether 91-100% reflects per-task variation or confidence intervals) and ensure all metrics are defined consistently between the abstract and main text.
[Framework description] Provide more detail on the modular architecture diagram or pseudocode to make the integration of symbolic ontology guidance with agentic self-refinement reproducible.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address the major comments point by point below, committing to specific revisions that will improve the verifiability of our results without overstating the current manuscript.

Referee: [Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.

Authors: We agree that the current manuscript lacks sufficient detail on these aspects to fully support the generalization claims. In the revised version, we will expand the Evaluation section with a precise description of the evaluation protocol (including annotation guidelines, inter-annotator agreement where applicable, and exact metrics computation). We will add baseline comparisons against zero-shot LLM prompting and rule-based extractors. Ablation studies will isolate the contributions of ontology guidance, agentic refinement, and human-in-the-loop components. Statistical significance (e.g., McNemar tests or bootstrap confidence intervals) will be reported for key accuracy figures. Human effort will be quantified via average validation time per document and total annotator hours. These additions will directly address concerns about overfitting and unmeasured costs.
Revision: yes
Referee: [NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.

Authors: We accept that validation of the additional entities and discussion of the recall gap are currently insufficient. The revised manuscript will include a new error analysis subsection detailing the validation criteria (e.g., expert review of a random sample of 200 additional entities per benchmark, with categorization into true novel entities, synonyms, or errors) and quantitative results from that review. We will also add a paragraph discussing the relaxed-strict recall gap, noting that relaxed recall credits partial or approximate matches, which reflects the framework's semantic flexibility for task-agnostic use, while strict recall counts only exact matches to the gold annotations. We will discuss what the size of this gap implies for generalization across domains without task-specific retraining.
Revision: yes
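
The rebuttal mentions bootstrap confidence intervals for key accuracy figures. As a minimal sketch of what such a report could involve, the following computes a percentile bootstrap interval over per-item correctness flags using only the Python standard library. The function name, the resample count, the fixed seed, and the toy data are assumptions for illustration and do not come from the paper.

```python
import random


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy over 0/1 correctness flags."""
    if not correct:
        raise ValueError("need at least one evaluated item")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(correct)
    resampled = []
    for _ in range(n_resamples):
        # Resample the evaluated items with replacement, same size as original.
        sample = rng.choices(correct, k=n)
        resampled.append(sum(sample) / n)
    resampled.sort()
    lower = resampled[int((alpha / 2) * n_resamples)]
    upper = resampled[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


# Toy data: 75 correct labels out of 100 evaluated entities.
flags = [1] * 75 + [0] * 25
point, lower, upper = bootstrap_accuracy_ci(flags)
```

An interval like this makes a range such as 58 to 75 percent easier to interpret, because it separates variation across models or settings from sampling noise within a single setting.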

Circularity Check

0 steps flagged

No significant circularity; empirical framework evaluation

The paper introduces StructSense as a modular framework and reports empirical accuracies on three tasks (schema extraction, metadata extraction, NER) plus two benchmarks, with no equations, first-principles derivations, or predictions that reduce to fitted inputs. Claims rest on experimental results, ontology integration, and human-in-the-loop validation rather than self-referential definitions or self-citation chains that force outcomes. The work is self-contained as benchmarking; no load-bearing step equates a result to its own construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based solely on the abstract, the central claim rests on standard domain assumptions in LLM-based information extraction rather than new axioms or invented entities; no free parameters or novel postulated objects are mentioned.

axioms (2)

domain assumption: Ontology-guided symbolic knowledge can reliably steer LLM extraction in specialized scientific domains.
Invoked in the abstract description of the framework's core integration for domain-aware extraction.
domain assumption: Agentic self-evaluative refinement improves output quality without introducing new errors.
Stated as part of the modular design that enables robust extraction.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages

[1]

Information extraction from scientific articles: a survey

Zara Nasar, Syed Waqar Jaffry, and Muhammad Kamran Malik. Information extraction from scientific articles: a survey. Scientometrics, 117(3):1931–1990, Dec 2018

work page 1931
[2]

Publication output by region, country, or economy and by scientific field, 2023

National Science Board. Publication output by region, country, or economy and by scientific field, 2023. Science & Engineering Indicators 2023, NSB-2023-33

work page 2023
[3]

Scientific literature: Information overload

Esther Landhuis. Scientific literature: Information overload. Nature, 535(7612):457–458, Jul 2016

work page 2016
[4]

Scientific discourse tagging for evidence extraction, 2021

Xiangci Li, Gully Burns, and Nanyun Peng. Scientific discourse tagging for evidence extraction, 2021

work page 2021
[5]

Scholarly knowledge graphs through structuring scholarly communication: a review

Shilpa Verma, Rajesh Bhatia, Sandeep Harit, and Sanjay Batish. Scholarly knowledge graphs through structuring scholarly communication: a review. Complex & Intelligent Systems, 9(1):1059–1095, Feb 2023

work page 2023
[6]

Rosen, Gerbrand Ceder, Kristin A

John Dagdelen, Alexander Dunn, Sanghoon Lee, Nicholas Walker, Andrew S. Rosen, Gerbrand Ceder, Kristin A. Persson, and Anubhav Jain. Structured information extraction from scientific text with large language models. Nature Communications, 15(1):1418, Feb 2024

work page 2024
[7]

Large language models for generative information extraction: a survey

Derong Xu, Wei Chen, Wenjun Peng, Chao Zhang, Tong Xu, Xiangyu Zhao, Xian Wu, Yefeng Zheng, Yang Wang, and Enhong Chen. Large language models for generative information extraction: a survey. Frontiers of Computer Science, 18(6):186357, Nov 2024

work page 2024
[8]

Halchenko, Dorota Jarecka, Puja Trivedi, Satrajit S

Tek Raj Chhetri, Yaroslav O. Halchenko, Dorota Jarecka, Puja Trivedi, Satrajit S. Ghosh, Patrick Ray, and Lydia Ng. Bridging the scientific knowledge gap and reproducibility: A survey of provenance, assertion and evidence ontologies. In Companion Proceedings of the ACM Web Conference 2025, WWW Companion ’25, page 5, New York, NY , USA, 2025. Association f...

work page 2025
[9]

Challenges and advances in information extraction from scientific literature: a review

Zhi Hong, Logan Ward, Kyle Chard, Ben Blaiszik, and Ian Foster. Challenges and advances in information extraction from scientific literature: a review. JOM, 73(11):3383–3400, Nov 2021

work page 2021
[10]

Barbara Nebe, Sascha Spors, and Frank Krüger

Max Schröder, Susanne Staehlke, Paul Groth, J. Barbara Nebe, Sascha Spors, and Frank Krüger. Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation. Journal of Biomedical Semantics, 13(1):4, Jan 2022. 18 STRUCTSENSE: A TASK-AGNOSTIC AGENTIC FRAMEWORK for STRUCTURED INFORMATION EXTRACTION with HUMAN-I...

work page 2022
[11]

Large language models are few-shot clinical information extractors, 2022

Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. Large language models are few-shot clinical information extractors, 2022

work page 2022
[12]

Data extraction from polymer literature using large language models

Sonakshi Gupta, Akhlak Mahmood, Pranav Shetty, Aishat Adeboye, and Rampi Ramprasad. Data extraction from polymer literature using large language models. Communications Materials, 5(1):269, Dec 2024

work page 2024
[13]

Gpt-re: In-context learning for relation extraction using large language models, 2023

Zhen Wan, Fei Cheng, Zhuoyuan Mao, Qianying Liu, Haiyue Song, Jiwei Li, and Sadao Kurohashi. Gpt-re: In-context learning for relation extraction using large language models, 2023

work page 2023
[14]

Concept, pages 3–10

Dieter Fensel. Concept, pages 3–10. Springer Berlin Heidelberg, Berlin, Heidelberg, 2004

work page 2004
[15]

Special supplement issue on quality assurance and enrichment of biological and biomedical ontologies and terminologies

Licong Cui and Ankur Agrawal. Special supplement issue on quality assurance and enrichment of biological and biomedical ontologies and terminologies. BMC Medical Informatics and Decision Making, 23(1):302, Aug 2024

work page 2024
[16]

Biobert: a pre- trained biomedical language representation model for biomedical text mining

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Chan Ho So, and Jaewoo Kang. Biobert: a pre- trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020

work page 2020
[17]

Domain-specific language model pretraining for biomedical natural language processing

Yu Gu, Robert Tinn, Hao Cheng, Jason Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2022

work page 2022
[18]

& Cohan, A

Iz Beltagy, Kyle Lo, and Arman Cohan. Scibert: A pretrained language model for scientific text. arXiv preprint arXiv:1903.10676, 2019

work page arXiv 1903
[19]

Biomistral: Accurate biomedical named entity recognition with domain-specific fine-tuning

Yifan Peng, Thomas Smith, and Wei Zhang. Biomistral: Accurate biomedical named entity recognition with domain-specific fine-tuning. Nature Communications, 15(1):1121, 2024

work page 2024
[20]

Biocreative v cdr task corpus: a resource for chemical disease relation extraction

Jin-Dong Li, Yanan Sun, Rodney J Johnson, Dan Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan P Davis, Carolyn J Mattingly, Thomas C Wiegers, and Zhiyong Lu. Biocreative v cdr task corpus: a resource for chemical disease relation extraction. Database, 2016, 2016

work page 2016
[21]

Overview of bionlp’09 shared task on event extraction

Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, and Jun’ichi Tsujii. Overview of bionlp’09 shared task on event extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task, pages 1–9, 2009

work page 2009
[22]

Introduction to the bio-entity recognition task at jnlpba

Nigel Collier and Jin-Dong Kim. Introduction to the bio-entity recognition task at jnlpba. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 70–75, 2004

work page 2004
[23]

A survey on biomedical named entity recognition

Lei Hou, Juanzi Zhang, Zhiyuan Liu, Yankai Song, Xianpei Han, and Maosong Sun. A survey on biomedical named entity recognition. Frontiers in Cell and Developmental Biology, 8:673, 2020

work page 2020
[24]

Distributional semantics resources for biomedical text processing

Sampo Pyysalo, Filip Ginter, Hans Moen, Tapio Salakoski, and Sophia Ananiadou. Distributional semantics resources for biomedical text processing. Proceedings of LBM, 2013:39–44, 2013

work page 2013
[25]

Neural named entity recognition for scientific text: A survey and outlook

Kyunghyun Cho, Suchin Gururangan, Kyle Lo, and Noah A Smith. Neural named entity recognition for scientific text: A survey and outlook. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4565–4576, 2021

work page 2021
[26]

Gohar Zaman, Hairulnizam Mahdin, Khalid Hussain, Atta-Ur-Rahman, Jemal Abawajy, and Salama A. Mostafa. An ontological framework for information extraction from diverse scientific sources.IEEE Access, 9:42111–42124, 2021

work page 2021
[27]

A hybrid ontology-based information extraction system

Fernando Gutierrez, Dejing Dou, Stephen Fickas, Daya Wimalasuriya, and Hui Zong. A hybrid ontology-based information extraction system. Journal of Information Science, 42(6):798–820, 2016

work page 2016
[28]

Ontology-based semi-supervised conditional random fields for automated information extraction from bridge inspection reports

Kaijian Liu and Nora El-Gohary. Ontology-based semi-supervised conditional random fields for automated information extraction from bridge inspection reports. Automation in Construction, 81:313–327, 2017

work page 2017
[29]

An efficient framework for metadata extraction over scholarly documents using ensemble cnn and bilstm technique

P Raghavendra Nayaka and Rajeev Ranjan. An efficient framework for metadata extraction over scholarly documents using ensemble cnn and bilstm technique. In 2023 2nd International Conference for Innovation in Technology (INOCON), pages 1–9, 2023

work page 2023
[30]

Data acquisition and information extraction for scientific knowledge base building

Piotr Andruszkiewicz and Henryk Rybinski. Data acquisition and information extraction for scientific knowledge base building. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC) , pages 256–259, 2018

work page 2018
[31]

Trie: End-to-end text reading and information extraction for document understanding

Peng Zhang, Yunlu Xu, Zhanzhan Cheng, Shiliang Pu, Jing Lu, Liang Qiao, Yi Niu, and Fei Wu. Trie: End-to-end text reading and information extraction for document understanding. InProceedings of the 28th ACM International Conference on Multimedia, MM ’20, page 1413–1422, New York, NY , USA, 2020. Association for Computing Machinery. 19 STRUCTSENSE: A TASK-...

work page 2020
[32]

Pradeep Dasigi, Gully A. P. C. Burns, Eduard Hovy, and Anita de Waard. Experiment segmentation in scientific discourse as clause-level structured prediction using recurrent neural networks, 2017

work page 2017
[33]

Hierarchical neural networks for sequential sentence classification in medical scientific abstracts, 2018

Di Jin and Peter Szolovits. Hierarchical neural networks for sequential sentence classification in medical scientific abstracts, 2018

work page 2018
[34]

Claimdistiller: Scientific claim extraction with supervised contrastive learning

Xin Wei, Md Reshad Ul Hoque, Jian Wu, and Jiang Li. Claimdistiller: Scientific claim extraction with supervised contrastive learning. In CEUR Workshop Proceedings: EEKE-All2023: Extraction and Evaluation of Knowledge Entities from Scientific Documents (EEKE2023) and AI + Informetrics (All2023): Proceedings of Joint Workshop of the 4th Extraction and Evalu...

work page 2023
[35]

Argumentation mining in scientific literature for sustainable development

Aris Fergadis, Dimitris Pappas, Antonia Karamolegkou, and Haris Papageorgiou. Argumentation mining in scientific literature for sustainable development. In Khalid Al-Khatib, Yufang Hou, and Manfred Stede, editors, Proceedings of the 8th Workshop on Argument Mining , pages 100–111, Punta Cana, Dominican Republic, November 2021. Association for Computationa...

work page 2021
[36]

REBEL: Relation extraction by end-to-end language generation

Pere-Lluís Huguet Cabot and Roberto Navigli. REBEL: Relation extraction by end-to-end language generation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2370–2381, Punta Cana, Dominican Republic, November 2021. Association for Computationa...

work page 2021
[37]

Deep learning models for spatial relation extraction in text

Kehan Wu, Xueying Zhang, Yulong Dang, and Peng Ye and. Deep learning models for spatial relation extraction in text. Geo-spatial Information Science, 26(1):58–70, 2023

work page 2023
[38]

Do llms really adapt to domains? an ontology learning perspective

Huu Tan Mai, Cuong Xuan Chu, and Heiko Paulheim. Do llms really adapt to domains? an ontology learning perspective. In Gianluca Demartini, Katja Hose, Maribel Acosta, Matteo Palmonari, Gong Cheng, Hala Skaf-Molli, Nicolas Ferranti, Daniel Hernández, and Aidan Hogan, editors, The Semantic Web – ISWC 2024, pages 126–143, Cham, 2025. Springer Nature Switzerland

work page 2024
[39]

Efficient knowledge infusion via KG-LLM alignment

Zhouyu Jiang, Ling Zhong, Mengshu Sun, Jun Xu, Rui Sun, Hui Cai, Shuhan Luo, and Zhiqiang Zhang. Efficient knowledge infusion via KG-LLM alignment. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024 , pages 2986–2999, Bangkok, Thailand, August 2024. Association for Computational L...

work page 2024
[40]

InfuserKI: Enhancing large language models with knowledge graphs via infuser-guided knowledge integration

Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, and Haifeng Chen. InfuserKI: Enhancing large language models with knowledge graphs via infuser-guided knowledge integration. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Findings of the Association for Computational Linguis- tics: EMNLP 2024, pages 3675–3688, Miami, F...

work page 2024
[41]

Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A

Tianbao Xie, Chen Henry Wu, Peng Shi, Ruiqi Zhong, Torsten Scholak, Michihiro Yasunaga, Chien-Sheng Wu, Ming Zhong, Pengcheng Yin, Sida I. Wang, Victor Zhong, Bailin Wang, Chengzu Li, Connor Boyle, Ansong Ni, Ziyu Yao, Dragomir Radev, Caiming Xiong, Lingpeng Kong, Rui Zhang, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. UnifiedSKG: Unifying and multi-taski...

work page 2022
[42]

Rah! rec- sys–assistant–human: A human-centered recommendation framework with llm agents

Yubo Shu, Haonan Zhang, Hansu Gu, Peng Zhang, Tun Lu, Dongsheng Li, and Ning Gu. Rah! rec- sys–assistant–human: A human-centered recommendation framework with llm agents. IEEE Transactions on Computational Social Systems, 11(5):6759–6770, 2024

[43] Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge, 2024

[44] Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...

[45] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Sy...

[46] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2), January 2025

[47] Zongxia Li, Lorena Calvo-Bartolomé, Alexander Hoyle, Paiheng Xu, Alden Dima, Juan Francisco Fung, and Jordan Boyd-Graber. Large language models struggle to describe the haystack without human help: Human-in-the-loop evaluation of llms, 2025

[48] Maryam Amirizaniani, Jihan Yao, Adrian Lavergne, Elizabeth Snell Okada, Aman Chadha, Tanya Roosta, and Chirag Shah. Llmauditor: A framework for auditing large language models using human-in-the-loop, 2024

[49] Y. Chen, D. Jarecka, S. A. Abraham, R. Gau, E. Ng, D. M. Low, I. Bevers, A. Johnson, A. Keshavan, A. Klein, J. Clucas, Z. Rosli, S. M. Hodge, J. Linkersdörfer, H. Bartsch, S. Das, D. Fair, D. Kennedy, and S. S. Ghosh. Standardizing survey data collection to enhance reproducibility: An evaluation of reproschema. Journal of Medical Internet Research, 2025....

[50] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016

[51] Munmun De Choudhury, Scott Counts, and Mary Czerwinski. Find me the right content! diversity-based sampling of social media spaces for topic-centric search. Proceedings of the International AAAI Conference on Web and Social Media, 5(1):129–136, Aug. 2021.

