CuratorKIT : Data Curation and Synthetic Data Generation for LLM Post-Training

Karun Sharma; Pratinav Seth; Soham Bhattacharjee; Vinay Kumar Sankarapu

arxiv: 2606.21631 · v1 · pith:CNXCDZVSnew · submitted 2026-06-19 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-26 14:18 UTC · model grok-4.3

keywords: data curation, synthetic data generation, LLM post-training, provenance tracking, quality filtering, open-source library, Python pipeline, auditable data

The pith

CuratorKIT unifies the full data curation and synthetic generation lifecycle for LLM post-training into one configurable pipeline with provenance tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CuratorKIT as an open-source Python library that combines ingestion, deduplication, synthetic data generation, quality filtering, and export into a single configurable pipeline. Existing tools handle these steps separately, which makes it hard to audit decisions or trace why samples are rejected. The library adds source readers with schema detection, hygiene checks for PII and toxicity, eight generation tasks, three quality gates with hallucination verification, and structured provenance records for every sample. A sympathetic reader would care because this setup aims to deliver reproducible and auditable data preparation at scale for LLM post-training. If the integration works as described, practitioners could maintain complete chains of custody without silent discards or fragmented tooling.

Core claim

CuratorKIT is an open-source Python library that covers the full lifecycle in a single configurable pipeline, composed of six source format readers and automatic schema detection, a pre-generation data hygiene layer, eight LLM-powered generation tasks, three complementary quality gates with provenance-exact hallucination verification, structured adaptive recovery, and five training-ready export formats, while recording every pipeline decision in an append-only per-sample provenance chain and attaching structured failure reasons to rejected samples.

What carries the argument

CuratorKIT, the open-source Python library that integrates source readers, hygiene, generation, quality gates, and provenance tracking into one pipeline with append-only per-sample records.

If this is right

Pipelines gain auditability because every decision is recorded in an append-only per-sample provenance chain.
Rejected samples include structured failure reasons instead of silent discards.
The library supports 100+ LLM providers via LiteLLM and offers both Python API and YAML CLI access.
Export formats are directly compatible with TRL, Unsloth, and AlignTune training frameworks.
Quality gates include provenance-exact hallucination verification and structured adaptive recovery.
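
The first two points above can be illustrated with a minimal sketch. It is not CuratorKIT's API: the `Event` and `Sample` classes, the `min_length_gate` function, and the reason fields (`code`, `observed`, `required`) are all hypothetical names chosen for illustration. The sketch only shows the general idea of recording every decision on a sample and attaching a structured reason when a gate rejects it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One pipeline decision about one sample (hypothetical schema)."""
    stage: str
    decision: str                  # "pass" or "reject"
    reason: Optional[dict] = None  # structured failure reason when rejected


@dataclass
class Sample:
    text: str
    _chain: list = field(default_factory=list)

    def record(self, event: Event) -> None:
        # Append-only by convention: events are added, never edited or removed.
        self._chain.append(event)

    @property
    def provenance(self) -> tuple:
        # Hand out an immutable copy so callers cannot rewrite history.
        return tuple(self._chain)

    @property
    def rejected(self) -> bool:
        return any(e.decision == "reject" for e in self._chain)


def min_length_gate(sample: Sample, min_chars: int = 20) -> None:
    """Toy quality gate: reject short samples and say why."""
    n = len(sample.text)
    if n < min_chars:
        sample.record(Event("min_length_gate", "reject",
                            {"code": "too_short", "observed": n, "required": min_chars}))
    else:
        sample.record(Event("min_length_gate", "pass"))


samples = [Sample("ok"), Sample("A sufficiently long training example.")]
for s in samples:
    s.record(Event("ingest", "pass"))
    min_length_gate(s)

kept = [s for s in samples if not s.rejected]
dropped = [s for s in samples if s.rejected]
for s in dropped:
    print(s.text, s.provenance[-1].reason)
```

Rejected samples stay in `dropped` with their full history rather than disappearing, which is the behaviour the abstract contrasts with silent discards.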

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Centralizing these stages could reduce context switching between separate tools and lower the risk of inconsistent filtering rules across a project.
The provenance mechanism might support downstream compliance audits that require tracing individual training samples back to their origin and rejection criteria.
Practitioners could test whether adding more generation tasks or custom quality metrics extends the same pipeline structure without breaking existing provenance guarantees.

Load-bearing premise

The integration of stages, provenance tracking, and quality gates can be realized in practice without introducing new fragmentation, bugs, or unhandled edge cases that undermine the claimed auditability.

What would settle it

Execute the library pipeline on a held-out dataset and verify whether every rejected sample carries a structured failure reason and whether the provenance chain for accepted samples remains complete, append-only, and queryable without gaps.
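
A check of this kind could be sketched as follows. The record layout (`status`, `failure_reason`, `provenance`) is an assumption, not the library's export schema, and hash chaining is a technique swapped in here to make "append-only" testable; the paper does not say that CuratorKIT uses it.

```python
import hashlib
import json


def link_hash(prev_hash: str, event: dict) -> str:
    """Hash an event together with the previous link's hash."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("utf-8") + payload).hexdigest()


def append_event(chain: list, event: dict) -> None:
    """Add a new link whose hash depends on everything before it."""
    prev = chain[-1]["hash"] if chain else ""
    chain.append({"event": event, "hash": link_hash(prev, event)})


def audit_sample(record: dict) -> list:
    """Return the problems found; an empty list means the record passes."""
    problems = []
    chain = record.get("provenance", [])
    if not chain:
        problems.append("empty provenance chain")
    prev = ""
    for i, link in enumerate(chain):
        if link["hash"] != link_hash(prev, link["event"]):
            problems.append(f"chain broken at index {i}")
            break
        prev = link["hash"]
    if record.get("status") == "rejected":
        reason = record.get("failure_reason")
        if not isinstance(reason, dict) or not reason.get("code"):
            problems.append("rejected without structured failure reason")
    return problems


# Hypothetical records: one clean, one edited after the fact, one silent rejection.
good = {"status": "accepted", "provenance": []}
append_event(good["provenance"], {"stage": "ingest", "decision": "pass"})
append_event(good["provenance"], {"stage": "quality_gate", "decision": "pass"})

tampered = {"status": "accepted", "provenance": [dict(link) for link in good["provenance"]]}
tampered["provenance"][0]["event"] = {"stage": "ingest", "decision": "reject"}

silent = {"status": "rejected", "provenance": [dict(link) for link in good["provenance"]]}

for name, rec in [("good", good), ("tampered", tampered), ("silent", silent)]:
    print(name, audit_sample(rec))
```

A chain like this detects edits and mid-chain deletions, but not truncation at the tail unless the final hash is also stored somewhere outside the record.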

Original abstract

Data curation is a critical part of post-training pipelines for large language models, yet existing tools often treat ingestion, deduplication, synthetic generation, and quality filtering as separate stages. This fragmentation makes it difficult to audit pipeline decisions or understand why individual samples are rejected. CuratorKIT is an open-source Python library that covers this full lifecycle in a single configurable pipeline. The framework is composed of six source format readers and automatic schema detection, a pre-generation data hygiene layer for credentials, PII, and toxic content, eight LLM-powered generation tasks, three complementary quality gates with provenance-exact hallucination verification, structured adaptive recovery, and five training-ready export formats compatible with TRL, Unsloth, and AlignTune. Every pipeline decision is recorded in an append-only per-sample provenance chain, and rejected samples carry structured failure reasons rather than being silently discarded. CuratorKIT supports 100+ LLM providers through LiteLLM, exposes both a Python API and a YAML-driven CLI, and is designed for practitioners who need reproducible, auditable data pipelines at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CuratorKIT bundles standard curation steps into one library with provenance tracking, but the paper is only a feature description with no evaluations or comparisons.

CuratorKIT is a Python library that puts ingestion, cleaning, synthetic generation, quality filtering, and export into one configurable pipeline for LLM post-training, with every decision logged in an append-only per-sample chain.

It does a reasonable job spelling out the components: six format readers with auto schema detection, hygiene checks for PII and toxicity, eight LLM generation tasks, three quality gates that include hallucination verification, adaptive recovery, and five export formats for TRL, Unsloth, and similar. The LiteLLM integration for 100+ providers and the YAML CLI plus Python API are practical touches. The emphasis on structured failure reasons instead of silent drops is a clear improvement for auditability.

The paper does not introduce new methods or algorithms. It packages existing pieces and makes the case that fragmentation is a problem worth solving through integration. That framing is straightforward and matches real engineering pain points.

The soft spot is the complete lack of evidence. No benchmarks, no usage examples, no comparison to existing tools, and no test of whether the provenance actually helps users or whether the integration introduces its own bugs. The claims rest entirely on the feature list. If the released code works as described, that is the test, but the manuscript itself supplies none of it.

This is for practitioners who build LLM data pipelines and want something off-the-shelf with tracking built in. A reader who needs to set up reproducible curation might save time with the unified design. It shows clear engagement with the workflow stages but is not advancing new ideas.

I would send it to peer review in a tools or systems track so the implementation details and any added examples can be checked.

Referee Report

0 major / 2 minor

Summary. The manuscript describes CuratorKIT, an open-source Python library that unifies data ingestion via six format readers with automatic schema detection, pre-generation hygiene for PII/toxicity/credentials, eight LLM-powered generation tasks, three quality gates including provenance-exact hallucination checks, structured recovery, and five export formats compatible with TRL/Unsloth/AlignTune, all within a single YAML-configurable pipeline that records append-only per-sample provenance chains and supports 100+ providers via LiteLLM.

Significance. If the implementation matches the description, the library could reduce fragmentation in LLM post-training data workflows by providing integrated auditability and reproducibility features that existing separate-stage tools lack. The emphasis on provenance tracking and explicit failure reasons for rejected samples is a practical strength for practitioners.

minor comments (2)

[Abstract] The claim of covering the 'full lifecycle in a single configurable pipeline' would be strengthened by including at least one concrete YAML configuration example or pseudocode snippet showing how readers, generation tasks, and quality gates are composed.
The manuscript provides no usage statistics, example runtimes, or qualitative case studies on real datasets, which is acceptable for a tool-description paper but leaves the practical scalability claims unillustrated.
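
To make the first comment concrete, here is a rough sketch of what a config-driven composition of stages might look like. Nothing in it comes from the manuscript or the library: the stage names (`read_rows`, `dedup_exact`, `min_words_gate`), the `CONFIG` layout, and the record fields are hypothetical, and a plain Python dict stands in for a YAML file.

```python
from typing import Callable, Dict, List

# Hypothetical stage registry; the names are illustrative, not CuratorKIT's.
STAGES: Dict[str, Callable[[List[dict], dict], List[dict]]] = {}


def stage(name: str):
    """Register a function under a name that a config file can refer to."""
    def register(fn):
        STAGES[name] = fn
        return fn
    return register


@stage("read_rows")
def read_rows(samples: List[dict], params: dict) -> List[dict]:
    # Stand-in for a source reader: wrap raw strings as sample records.
    return [{"text": t, "status": "ok", "log": ["read_rows"]} for t in params["rows"]]


@stage("dedup_exact")
def dedup_exact(samples: List[dict], params: dict) -> List[dict]:
    seen = set()
    for s in samples:
        if s["status"] != "ok":
            continue
        key = s["text"].strip().lower()
        if key in seen:
            s["status"] = "rejected"
            s["failure_reason"] = {"code": "exact_duplicate"}
        else:
            seen.add(key)
        s["log"].append("dedup_exact")
    return samples


@stage("min_words_gate")
def min_words_gate(samples: List[dict], params: dict) -> List[dict]:
    for s in samples:
        if s["status"] != "ok":
            continue
        n = len(s["text"].split())
        if n < params["min_words"]:
            s["status"] = "rejected"
            s["failure_reason"] = {"code": "too_few_words", "observed": n}
        s["log"].append("min_words_gate")
    return samples


# A dict standing in for a YAML file: an ordered list of stages with parameters.
CONFIG = {"pipeline": [
    {"stage": "read_rows",
     "params": {"rows": ["What is data curation?", "what is data curation?  ", "Too short"]}},
    {"stage": "dedup_exact", "params": {}},
    {"stage": "min_words_gate", "params": {"min_words": 3}},
]}


def run(config: dict) -> List[dict]:
    samples: List[dict] = []
    for step in config["pipeline"]:
        samples = STAGES[step["stage"]](samples, step.get("params", {}))
    return samples


for s in run(CONFIG):
    print(s["status"], s.get("failure_reason"), s["log"])
```

Keeping rejected records in the list, rather than filtering them out between stages, is what lets each later stage skip them while the failure reason and log survive to the end of the run.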

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of the CuratorKIT library, recognition of its practical strengths in provenance tracking and auditability, and recommendation for minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper describes CuratorKIT, an open-source Python library for data curation and synthetic data generation in LLM post-training pipelines. It outlines components such as source readers, hygiene layers, generation tasks, quality gates, provenance tracking, and export formats without presenting any mathematical derivations, equations, predictions, fitted parameters, or formal results. No load-bearing steps exist that reduce by construction to self-definitions, self-citations, or renamed inputs, as the central claim is simply the implementation and release of the described tool rather than a derived theoretical outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software tool description with no mathematical model, empirical claims, or derivations; no free parameters, axioms, or invented entities are required or present.

pith-pipeline@v0.9.1-grok · 5729 in / 984 out tokens · 25296 ms · 2026-06-26T14:18:59.941801+00:00 · methodology

Reference graph

Works this paper leans on

29 extracted references

[1]

Distilabel: An ai feedback (aif) framework for building datasets with and for llms

Álvaro Bartolomé Del Canto, Gabriel Martín Blázquez, Agustín Piqueres Lajarín, and Daniel Vila Suero. Distilabel: An ai feedback (aif) framework for building datasets with and for llms. https://github.com/argilla-io/distilabel, 2024

2024
[2]

Agentinstruct: Toward generative teaching with agentic flows, 2024

Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, and Ahmed Awadallah. Agentinstruct: Toward generative teaching with agentic flows, 2024

2024
[3]

Datatrove: large scale data processing, 2024

Guilherme Penedo, Hynek Kydlíček, Alessandro Cappelli, Mario Sasko, and Thomas Wolf. Datatrove: large scale data processing, 2024.

2024
[4]

Nemo data designer: A framework for generating synthetic data from scratch or based on your own seed data

NVIDIA The NeMo Data Designer Team. Nemo data designer: A framework for generating synthetic data from scratch or based on your own seed data. https://github.com/NVIDIA-NeMo/DataDesigner, 2025. GitHub Repository

2025
[5]

LiteLLM: Call all llm apis using the openai format, 2024

BerriAI. LiteLLM: Call all llm apis using the openai format, 2024

2024
[6]

Provenance-grounded gating and adaptive recovery in synthetic post-training data curation, 2026

Soham Bhattacharjee, Karun Sharma, Vinay Kumar Sankarapu, and Pratinav Seth. Provenance-grounded gating and adaptive recovery in synthetic post-training data curation, 2026

2026
[7]

TRL: Transformers Reinforcement Learning, 2020

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Reinforcement Learning, 2020

2020
[8]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-a...

2020
[9]

The faiss library, 2025

Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. The faiss library, 2025

2025
[10]

Detoxify

Laura Hanu and Unitary team. Detoxify. Github. https://github.com/unitaryai/detoxify, 2020

2020
[11]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

2023
[12]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

2025
[13]

Self-rewarding language models

Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

2024
[14]

Rlaif vs. rlhf: scaling reinforcement learning from human feedback with ai feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. Rlaif vs. rlhf: scaling reinforcement learning from human feedback with ai feedback. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

2024
[15]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

2023
[16]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024

2024
[17]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. WizardLM: Empowering large pre-trained language models to follow complex instructions. In The Twelfth International Conference on Learning Representations, 2024

2024
[18]

Self-consistency improves chain of thought reasoning in language models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2023

2023
[19]

On faithfulness and factuality in abstractive summarization

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Li...

2020
[20]

Increasing faithfulness in knowledge-grounded dialogue with controllable features

Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on N...

2021
[21]

Ultrafeedback: Boosting language models with scaled ai feedback, 2024

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. Ultrafeedback: Boosting language models with scaled ai feedback, 2024

2024
[22]

Making monolingual sentence embeddings multilingual using knowledge distillation

Nils Reimers and Iryna Gurevych. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, November 2020

2020
[23]

Cuad: An expert-annotated nlp dataset for legal contract review

Dan Hendrycks, Collin Burns, Anya Chen, and Spencer Ball. Cuad: An expert-annotated nlp dataset for legal contract review. NeurIPS, 2021

2021
[24]

PubMedQA: A dataset for biomedical research question answering

Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Pro...

2019
[25]

Association for Computational Linguistics
[26]

ROUGE: A package for automatic evaluation of summaries

Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

2004
[27]

Bertscore: Evaluating text generation with bert

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020

2020
[28]

Aligntune: Modular toolkit for post-training alignment of large language models, 2026

R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, and Vinay Kumar Sankarapu. Aligntune: Modular toolkit for post-training alignment of large language models, 2026

2026
[29]

Unsloth, 2023

Daniel Han, Michael Han, and Unsloth team. Unsloth, 2023.

2023
