Provenance-Grounded Gating and Adaptive Recovery in Synthetic Post-Training Data Curation

Karun Sharma; Pratinav Seth; Soham Bhattacharjee; Vinay Kumar Sankarapu

arxiv: 2606.11127 · v1 · pith:BWZNSL7Inew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 12:53 UTC · model grok-4.3

keywords: synthetic data curation · provenance grounding · faithfulness gating · adaptive recovery · hallucination detection · reward models · post-training data · LLM fine-tuning

The pith

Exact source provenance improves faithfulness gating for stronger judges, hallucination and reward gates reject largely disjoint sample populations, and adaptive recovery outperforms naive resampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether grounding filter decisions in the original source text that prompted each generation improves the accuracy of faithfulness checks on synthetic post-training data. It also examines whether combining hallucination detection with reward model scoring is necessary because the two reject different bad samples, and whether rejected samples can be recovered through diagnosis and targeted regeneration instead of being discarded. Experiments across gate setups, recovery methods, and generator scales use adversarially injected corpora to supply known failure labels. Results show provenance grounding helps stronger judges, both gate types are required, adaptive recovery raises yield and recall, and downstream fine-tuning quality depends mainly on generator scale with filtration as a secondary factor.

Core claim

Using adversarially injected corpora to supply ground-truth failure labels, the study shows that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

What carries the argument

Provenance-grounded gating for faithfulness assessment together with an adaptive recovery pipeline that diagnoses failures and performs targeted regeneration on rejected samples.
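
The two mechanisms named above can be illustrated with a minimal sketch. It is not the authors' implementation: every name here (Sample, hallucination_gate, reward_gate, diagnose, adaptive_recover, toy_regenerate) is hypothetical, verbatim substring matching stands in for an LLM judge that reads the exact source passage, and a word-count threshold stands in for a reward model score.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    source: str    # exact source passage that prompted the generation (its provenance)
    response: str  # generated text to be gated


def hallucination_gate(s: Sample) -> bool:
    """Toy faithfulness check: each sentence of the response must occur verbatim in the source."""
    claims = [c.strip().lower() for c in s.response.split(".") if c.strip()]
    return all(c in s.source.lower() for c in claims)


def reward_gate(s: Sample, min_words: int = 4) -> bool:
    """Toy quality check standing in for a reward model threshold (assumption)."""
    return len(s.response.split()) >= min_words


def diagnose(s: Sample) -> Optional[str]:
    """Return a failure label, or None when both gates pass."""
    if not hallucination_gate(s):
        return "unfaithful"
    if not reward_gate(s):
        return "low_quality"
    return None


def adaptive_recover(
    s: Sample,
    regenerate: Callable[[Sample, str], str],
    max_rounds: int = 2,
) -> Optional[Sample]:
    """Diagnose a rejected sample, regenerate with the diagnosis, and re-gate."""
    current = s
    for _ in range(max_rounds):
        failure = diagnose(current)
        if failure is None:
            return current
        current = Sample(current.source, regenerate(current, failure))
    return current if diagnose(current) is None else None


def toy_regenerate(s: Sample, failure: str) -> str:
    """Hypothetical regenerator: fall back to the first sentence of the source."""
    return s.source.split(".")[0].strip() + "."


src = "The contract ends in 2030. Either party may terminate with notice."
unfaithful = Sample(src, "The contract ends in 2045.")
too_short = Sample(src, "Ends in 2030.")
recovered = adaptive_recover(unfaithful, toy_regenerate)
```

The two toy samples fail different gates, which is the situation the paper describes as complementary rejection. Naive resampling would correspond to a regenerate callable that ignores the failure argument.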

If this is right

Hallucination and reward gates must both be used because they reject largely disjoint populations of samples.
An adaptive recovery pipeline that diagnoses failures and performs targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling.
Downstream fine-tuning quality is driven primarily by generator scale rather than by details of filtration or recovery.
Filtration and recovery conditions contribute meaningfully but secondarily to final model performance.
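
One way to make "largely disjoint" and "injection recall" concrete is sketched below, under assumptions: the sample ids are made up, and Jaccard overlap is one reasonable disjointness measure, not necessarily the one the paper reports. The helper names jaccard and injection_recall are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap of two rejection sets: 0.0 means fully disjoint, 1.0 means identical."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def injection_recall(rejected: set, injected: set) -> float:
    """Fraction of known injected failures that a gate (or gate combination) rejects."""
    return len(rejected & injected) / len(injected) if injected else 0.0


# Made-up sample ids, for illustration only.
halluc_rejected = {1, 2, 3, 4}
reward_rejected = {4, 5, 6}
injected = {1, 2, 3, 5, 6, 7}

overlap = jaccard(halluc_rejected, reward_rejected)
recall_halluc = injection_recall(halluc_rejected, injected)
recall_reward = injection_recall(reward_rejected, injected)
recall_both = injection_recall(halluc_rejected | reward_rejected, injected)
```

When the overlap is low, the union of the two rejection sets recalls more injected failures than either gate alone, which is the sense in which both gates are needed.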

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If real-world synthetic failures without adversarial injection follow different patterns, the measured gains from provenance grounding and adaptive recovery may shrink in practice.
Because generator scale dominates outcomes, scaling the data generator could deliver larger gains than further refinements to gating or recovery.
The disjoint rejection finding suggests combining multiple independent gate types may be a general strategy worth testing in other data curation pipelines.

Load-bearing premise

Adversarially injected corpora provide reliable ground-truth failure labels that accurately represent the distribution of failures arising in ordinary synthetic generation, where no controlled injection takes place.

What would settle it

A direct comparison finding that failure patterns and rejection distributions in naturally generated synthetic data differ substantially from those produced by adversarial injection would show the ground-truth labels do not generalize.

Original abstract

Synthetic post-training pipelines commonly filter generated samples with reward models or holistic LLM judges, yet two practices remain rarely examined together: whether the filtering signal is grounded in the source evidence that induced each generation, and whether rejected samples can be systematically recovered rather than permanently discarded. We present a controlled study of both questions across gate configurations, recovery strategies, and generator scales, using adversarially injected corpora to provide ground-truth failure labels. We find that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations making both necessary, and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration achieves higher yield, recovery rate, and injection recall than naive resampling. Downstream fine-tuning quality is driven primarily by generator scale, with filtration and recovery conditions contributing meaningfully but secondarily.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows provenance grounding helps gating and that hallucination/reward gates catch disjoint errors, with adaptive recovery beating resampling, but the adversarial injection labels are the part that needs checking.

The core findings are that exact source provenance strengthens faithfulness gating on stronger judges, that hallucination and reward gates reject largely separate sets of samples, and that an adaptive recovery step with failure diagnosis and targeted regeneration lifts yield, recovery rate, and recall over plain resampling. Downstream fine-tuning quality tracks generator scale first, with the filtering and recovery choices mattering but less so.

What is new is the joint test of provenance grounding and systematic recovery in one controlled setup, plus the specific claim that the two gate types are complementary rather than redundant. The injected-corpus design gives external labels for measuring recall, which lets the authors quantify the recovery gains directly.

The work is straightforward on the empirical side: it varies gate configurations, recovery strategies, and generator scales, and reports the downstream effect. That is useful for anyone running synthetic post-training pipelines.

The main soft spot is the ground-truth source. The claims rest on adversarially injected failures serving as representative labels. The abstract gives no sign that the injection procedure was checked against the error distribution that appears in ordinary generation, or that different injection strengths or types were ablated. If the injected failures are easier to spot or cluster differently, the disjoint-rejection result and the reported recovery advantage become tied to this particular label source. That is the point a referee should press.
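
The check this paragraph asks for could look like the following sketch, which compares error-category distributions between injected and naturally occurring failures using total variation distance. The category names and counts are invented for illustration, the helper names are hypothetical, and the paper does not describe any such comparison.

```python
from collections import Counter
from typing import Dict, Iterable


def category_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Normalize a list of error-category labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one label")
    return {k: v / total for k, v in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint support."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented labels: injected failures versus failures annotated in ordinary generations.
injected_labels = ["entity_swap"] * 6 + ["numeric_error"] * 3 + ["unsupported_claim"] * 1
organic_labels = ["entity_swap"] * 2 + ["numeric_error"] * 2 + ["unsupported_claim"] * 6

tv = total_variation(
    category_distribution(injected_labels),
    category_distribution(organic_labels),
)
```

A distance near zero would support treating injected labels as representative; a large distance would suggest the disjointness and recovery results are tied to the injection procedure.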

This is for readers who curate synthetic data for LLM post-training and want concrete comparisons on gating and recovery. It is not a broad theoretical advance, but the questions are practical and the setup is controlled enough to be worth referee time. I would send it to review.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a controlled study of provenance-grounded gating and adaptive recovery in synthetic post-training data curation across gate configurations, recovery strategies, and generator scales. Using adversarially injected corpora to supply ground-truth failure labels, it claims that exact source provenance improves faithfulness gating for stronger judges, that hallucination and reward gates reject largely disjoint sample populations (making both necessary), and that an adaptive recovery pipeline combining failure diagnosis with targeted regeneration outperforms naive resampling on yield, recovery rate, and injection recall. Downstream fine-tuning quality is reported as driven primarily by generator scale, with filtration and recovery conditions contributing secondarily.

Significance. If the results hold under representative failure distributions, the work supplies quantitative evidence that grounding filters in source provenance and recovering rejected samples can improve curation efficiency without sacrificing quality. The controlled injection setup enables direct measurement of gate complementarity and recovery gains, which could guide practical improvements in synthetic data pipelines for LLM post-training.

major comments (2)

[Abstract] The claims that hallucination and reward gates reject largely disjoint populations and that the adaptive recovery pipeline outperforms resampling rest on the injected corpora supplying representative ground-truth labels. No validation is described (e.g., comparison of error-category distributions or ablation on injection strength/type) showing that the injected failure modes match those arising under normal generation; this distributional match is load-bearing for the disjointness and recovery-rate conclusions.
[Abstract, results paragraph] The statement that downstream fine-tuning quality is driven primarily by generator scale while filtration/recovery contribute secondarily lacks any reported effect sizes, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the secondary contributions are practically meaningful or merely statistically detectable.
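
As a sketch of the reporting the second comment requests, the snippet below computes a standardized effect size (Cohen's d with pooled variance) and a percentile bootstrap confidence interval for a difference in mean downstream scores. The scores are invented, the function names are hypothetical, and the choice of these two statistics is an assumption, not something the paper or the referee specifies.

```python
import random
import statistics
from typing import List, Tuple


def cohens_d(a: List[float], b: List[float]) -> float:
    """Standardized mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def bootstrap_ci(
    a: List[float],
    b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        ra = [rng.choice(a) for _ in a]  # resample each group with replacement
        rb = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(ra) - statistics.mean(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Invented downstream scores per training run, for illustration only.
large_generator = [0.71, 0.73, 0.70, 0.74, 0.72]
small_generator = [0.61, 0.63, 0.60, 0.64, 0.62]

d = cohens_d(large_generator, small_generator)
ci_low, ci_high = bootstrap_ci(large_generator, small_generator)
```

Running the same two functions on the generator-scale contrast and on each filtration or recovery contrast would let a reader compare "primary" and "secondary" effects on a common scale.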

minor comments (1)

[Abstract] Terms such as 'exact source provenance,' 'faithfulness gating,' and 'adaptive recovery pipeline' are introduced without concise operational definitions, which would help readers interpret the gate configurations and recovery strategies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these focused comments on the abstract. Both points identify areas where the current presentation could be strengthened with additional justification or quantitative support. We address each below and indicate planned revisions.

Referee: [Abstract] The claims that hallucination and reward gates reject largely disjoint populations and that the adaptive recovery pipeline outperforms resampling rest on the injected corpora supplying representative ground-truth labels. No validation is described (e.g., comparison of error-category distributions or ablation on injection strength/type) showing that the injected failure modes match those arising under normal generation; this distributional match is load-bearing for the disjointness and recovery-rate conclusions.

Authors: The adversarial injection was introduced specifically to create verifiable ground-truth failure labels that permit direct measurement of gate complementarity and recovery efficacy, which cannot be obtained from unlabeled natural generations. We did not include distributional validation or ablation on injection parameters because the experimental design prioritizes internal validity of the controlled mechanism study over claims of ecological equivalence. We agree, however, that this assumption is load-bearing for generalizing the disjoint-population and recovery-rate results. In revision we will add an explicit limitations subsection that (a) states the scope of the injection-based evaluation and (b) notes that future work should compare injected versus organic error distributions. Revision: yes.

Referee: [Abstract, results paragraph] The statement that downstream fine-tuning quality is driven primarily by generator scale while filtration/recovery contribute secondarily lacks any reported effect sizes, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the secondary contributions are practically meaningful or merely statistically detectable.

Authors: The primary-driver claim rests on the relative magnitude of performance deltas observed when sweeping generator scale versus filtration/recovery conditions. We concur that effect sizes, confidence intervals, and formal statistical tests are needed to substantiate the “primarily” versus “secondarily” distinction and to allow readers to judge practical significance. In the revised manuscript we will report standardized effect sizes and appropriate statistical comparisons for the key downstream fine-tuning contrasts. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical measurements only

The paper reports findings from a controlled empirical study that measures gating performance, population disjointness, and recovery metrics on adversarially injected corpora treated as external ground-truth labels. No equations, parameter fits, or derivations are presented that reduce to the inputs by construction. Claims rest on direct experimental observations rather than self-referential definitions, fitted quantities renamed as predictions, or load-bearing self-citations. The derivation chain is therefore self-contained as a set of measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that injected failures serve as valid ground truth and that the controlled study conditions generalize. No free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: adversarially injected corpora provide ground-truth failure labels.
Used to evaluate gating accuracy and recovery effectiveness across configurations.

