STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking
Pith reviewed 2026-05-22 12:34 UTC · model grok-4.3
The pith
StructSense integrates ontology guidance with agentic self-refinement and human validation to extract structured information from scientific literature across multiple tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks the system achieves at least 90 percent relaxed recall while extracting 1,000–3,600 additional entities beyond gold annotations.
What carries the argument
StructSense, a modular task-agnostic framework that combines ontology-guided symbolic knowledge, agentic self-evaluative refinement, and selective human-in-the-loop validation to perform domain-aware structured extraction while preserving source grounding.
If this is right
- The framework generalizes across tasks of increasing semantic complexity while keeping source grounding and provenance transparency.
- It reaches at least 90 percent relaxed recall on biomedical named entity benchmarks and identifies thousands of additional entities beyond existing gold annotations.
- The local concept mapping service achieves Hits@1 of 62 to 82 percent under strict matching and 68 to 86 percent under semantic matching.
- The approach covers schema-based extraction, metadata extraction, and entity recognition without task-specific retraining, although reported accuracy falls from 91 to 100 percent on the simplest task to 58 to 75 percent on neuroscience NER.
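The concept mapping percentages in the list above are reported in the abstract as Hits@1. As a minimal sketch of how such a score is commonly computed, the Python below counts a mention as a hit when the top-ranked candidate equals the gold concept identifier. The function name `hits_at_1`, the dictionary layout, and the placeholder identifiers are assumptions made for illustration and are not taken from the paper. Only strict (exact identifier) matching is shown, because the paper's semantic matching criterion is not described on this page.

```python
from typing import Dict, List


def hits_at_1(ranked: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Fraction of gold mentions whose top-ranked candidate is the gold concept ID."""
    if not gold:
        raise ValueError("gold mapping is empty")
    hits = 0
    for mention, gold_id in gold.items():
        candidates = ranked.get(mention, [])
        # A mention with no candidates counts as a miss.
        if candidates and candidates[0] == gold_id:
            hits += 1
    return hits / len(gold)


# Hypothetical data: the identifiers are placeholders, not real ontology IDs.
ranked = {
    "hippocampus": ["ID:001", "ID:002"],
    "dentate gyrus": ["ID:010", "ID:011"],
    "CA1": ["ID:021", "ID:020"],  # gold concept is ranked second, so a miss
    "amygdala": [],               # mapper returned nothing, so a miss
}
gold = {
    "hippocampus": "ID:001",
    "dentate gyrus": "ID:010",
    "CA1": "ID:020",
    "amygdala": "ID:030",
}

score = hits_at_1(ranked, gold)
print(f"Hits@1 = {score:.2f}")
```

Under this convention a correct concept that appears second in the ranking still counts as a miss, which is one reason Hits@1 can sit well below recall-style figures.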
Where Pith is reading between the lines
- This combination of symbolic rules and self-correction could reduce reliance on large labeled datasets for new scientific fields.
- The provenance tracking feature may support verification steps in large collaborative knowledge bases built from literature.
- Extending the human-in-the-loop component to active learning loops could further lower the amount of manual review needed over time.
- Testing the same pipeline on non-biomedical literature such as physics or materials science would reveal how far the task-agnostic property holds.
Load-bearing premise
That the combination of ontology guidance, agentic self-refinement, and selective human validation will produce strong accuracy and generalization without overfitting to the tested domains or demanding excessive human effort.
What would settle it
Apply the framework unchanged to a fourth domain with different terminology and measure whether accuracy and recall remain within the same ranges reported for the original three domains.
Original abstract
Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91–100% accuracy), metadata and resource extraction from scientific papers (86–93% overall), and named entity recognition (NER) from neuroscience literature (58–75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5–85.8% strict recall while extracting 1,000–3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62–82% under strict matching and 68–86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces StructSense, a modular, task-agnostic, open-source framework for structured information extraction from scientific literature. It integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation. The framework is evaluated on three tasks of increasing complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition from neuroscience literature (58-75% label accuracy across 8,882 entities). Additional results on biomedical NER benchmarks (NCBI Disease and S800) report >=90% relaxed recall, 62.5-85.8% strict recall, and extraction of 1,000-3,600 additional entities beyond gold annotations, with local concept mapping achieving Hits@1 of 62-82% (strict) and 68-86% (semantic).
Significance. If the reported accuracies and generalization hold after verification of protocols and ablations, the work offers a practical open-source contribution to domain-aware extraction that combines symbolic guidance with agentic refinement and selective human oversight. The emphasis on source grounding, provenance transparency, and evaluation across tasks of varying semantic complexity addresses real limitations of LLMs in specialized scientific domains; the open-source release and human-in-the-loop component are explicit strengths for reproducibility and adoption.
Major comments (2)
- [Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.
- [NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.
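The gap between relaxed and strict recall raised in the second major comment can be illustrated with a small sketch. It assumes one common convention: a strict match requires identical character offsets and label, while a relaxed match requires the same label and any character overlap. The paper's actual matching rules are not given on this page, so these definitions, the `Span` type, and the toy offsets are hypothetical.

```python
from typing import Callable, List, Tuple

# (start offset, end offset exclusive, label); an assumed representation.
Span = Tuple[int, int, str]


def strict_match(pred: Span, gold: Span) -> bool:
    """Exact agreement on both boundaries and the label."""
    return pred == gold


def relaxed_match(pred: Span, gold: Span) -> bool:
    """Same label and at least one overlapping character."""
    return pred[2] == gold[2] and pred[0] < gold[1] and gold[0] < pred[1]


def recall(preds: List[Span], golds: List[Span],
           match: Callable[[Span, Span], bool]) -> float:
    """Share of gold spans matched by at least one prediction."""
    if not golds:
        return 0.0
    found = sum(1 for g in golds if any(match(p, g) for p in preds))
    return found / len(golds)


def extra_predictions(preds: List[Span], golds: List[Span]) -> List[Span]:
    """Predictions that overlap no gold span; candidates for manual review."""
    return [p for p in preds if not any(relaxed_match(p, g) for g in golds)]


# Toy example with made-up offsets.
golds = [(0, 10, "Disease"), (20, 35, "Disease"),
         (50, 58, "Disease"), (70, 80, "Disease")]
preds = [(0, 10, "Disease"),   # exact match
         (18, 35, "Disease"),  # boundary starts two characters early
         (50, 55, "Disease"),  # truncated span
         (90, 99, "Disease")]  # no gold counterpart

print("strict recall :", recall(preds, golds, strict_match))
print("relaxed recall:", recall(preds, golds, relaxed_match))
print("extra entities:", extra_predictions(preds, golds))
```

In this toy case the two boundary-shifted predictions count toward relaxed recall but not strict recall, and the unmatched prediction lands in the "additional entities" bucket. That bucket is the quantity the referee asks to be validated, since recall alone cannot tell genuine new entities from over-extraction.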
Minor comments (2)
- [Abstract and Results] Clarify notation for accuracy ranges (e.g., whether 91-100% reflects per-task variation or confidence intervals) and ensure all metrics are defined consistently between the abstract and main text.
- [Framework description] Provide more detail on the modular architecture diagram or pseudocode to make the integration of symbolic ontology guidance with agentic self-refinement reproducible.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for major revision. We address the major comments point by point below, committing to specific revisions that will improve the verifiability of our results without overstating the current manuscript.
- Referee: [Evaluation section / Results] The central generalization claim rests on the reported accuracy numbers, yet the evaluation protocol, baseline comparisons, statistical significance, and ablation studies for the ontology, agentic, and human-in-loop components are not verifiable from the provided details. This leaves the premise that the combination produces robust performance without domain-specific overfitting or unmeasured human cost unconfirmed.
  Authors: We agree that the current manuscript lacks sufficient detail on these aspects to fully support the generalization claims. In the revised version, we will expand the Evaluation section with a precise description of the evaluation protocol (including annotation guidelines, inter-annotator agreement where applicable, and exact metrics computation). We will add baseline comparisons against zero-shot LLM prompting and rule-based extractors. Ablation studies will isolate the contributions of ontology guidance, agentic refinement, and human-in-the-loop components. Statistical significance (e.g., McNemar tests or bootstrap confidence intervals) will be reported for key accuracy figures. Human effort will be quantified via average validation time per document and total annotator hours. These additions will directly address concerns about overfitting and unmeasured costs.
  Revision: yes
- Referee: [NER benchmark results] On the biomedical NER benchmarks, the extraction of 1,000-3,600 additional entities beyond gold annotations requires explicit validation criteria and error analysis to substantiate that these are genuine improvements rather than over-extraction; the gap between relaxed (>=90%) and strict (62.5-85.8%) recall also needs discussion of its implications for the task-agnostic claim.
  Authors: We accept that validation of the additional entities and discussion of the recall gap are currently insufficient. The revised manuscript will include a new error analysis subsection detailing the validation criteria (e.g., expert review of a random sample of 200 additional entities per benchmark, with categorization into true novel entities, synonyms, or errors) and quantitative results from that review. We will also add a paragraph discussing the relaxed-strict recall gap, noting that relaxed recall reflects the framework's semantic flexibility for task-agnostic use while strict recall reflects exact agreement with the gold annotations; this duality supports generalization across domains without requiring task-specific retraining.
  Revision: yes
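The rebuttal proposes bootstrap confidence intervals for key accuracy figures. A minimal sketch of a percentile bootstrap follows, assuming per-item correctness is recorded as 0 or 1. The function name `bootstrap_accuracy_ci`, the resample count, the fixed seed, and the toy outcomes are illustrative assumptions, not the authors' protocol.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(correct: List[int], n_resamples: int = 2000,
                          alpha: float = 0.05,
                          seed: int = 0) -> Tuple[float, float, float]:
    """Return (point estimate, lower bound, upper bound) for accuracy.

    Uses the percentile bootstrap: resample items with replacement,
    recompute accuracy each time, and read off the empirical quantiles.
    """
    if not correct:
        raise ValueError("no outcomes supplied")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(correct)
    point = sum(correct) / n
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)  # sampling with replacement
        stats.append(sum(sample) / n)
    stats.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return point, stats[lo_idx], stats[hi_idx]


# Toy outcomes: 150 correct and 50 incorrect extractions (made-up numbers).
outcomes = [1] * 150 + [0] * 50
point, lower, upper = bootstrap_accuracy_ci(outcomes)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Reporting an interval of this kind alongside each range would also answer the minor comment asking whether a figure such as 91-100% reflects per-task variation or a confidence interval.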
Circularity Check
No significant circularity; empirical framework evaluation
The paper introduces StructSense as a modular framework and reports empirical accuracies on three tasks (schema extraction, metadata extraction, NER) plus two benchmarks, with no equations, first-principles derivations, or predictions that reduce to fitted inputs. Claims rest on experimental results, ontology integration, and human-in-the-loop validation rather than self-referential definitions or self-citation chains that force outcomes. The work is self-contained as a benchmarking study; no load-bearing step equates a result to its own construction.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Ontology-guided symbolic knowledge can reliably steer LLM extraction in specialized scientific domains.
- Domain assumption: Agentic self-evaluative refinement improves output quality without introducing new errors.