arxiv: 2606.32002 · v1 · pith:WOF3X5NInew · submitted 2026-06-30 · 💻 cs.AI · cs.LG

Self-Study Reconsidered: The Hidden Fragility of Learning from Self-Generated QA

Pith reviewed 2026-07-01 05:05 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords synthetic QA · self-generated supervision · question generation bias · instruction compliance · language model training · distillation · fine-tuning · training data artifacts

The pith

Generating synthetic QA pairs for language model training is not a neutral step: question selection concentrates on salient text, and answer generation tends to follow directives embedded in the source.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that creating synthetic question-answer pairs by having a model generate questions about a document and answer them from the same text is not neutral preprocessing. This generation step acts as an implicit policy that both chooses which evidence enters the training signal and determines the form of the answers. Question selection saturates early on salient spans, converges across prompts, and can be hijacked by local artifacts such as markup. Answer generation tends to obey instruction-like passages in the text, with compliance rates depending on passage intent and surface form rather than strictness. These failure modes can be reduced by tying each question to a fixed target and filtering instruction-like spans before answering, without altering the downstream training loop.
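One way to read "tying each question to a fixed target" is to pick the target span before the generator is called, so that coverage follows a schedule rather than whatever the model finds salient. The sketch below is an assumption about how that could look, not the paper's implementation: `split_into_targets`, `assign_targets`, the prompt wording, and the fixed character window are all hypothetical choices made for brevity.

```python
from typing import Iterator


def split_into_targets(document: str, target_chars: int = 200) -> list[tuple[int, int]]:
    """Cut the document into consecutive, non-overlapping character spans."""
    return [
        (start, min(start + target_chars, len(document)))
        for start in range(0, len(document), target_chars)
    ]


def assign_targets(document: str, n_questions: int, target_chars: int = 200) -> Iterator[dict]:
    """Yield one generation request per question, cycling through targets in order."""
    targets = split_into_targets(document, target_chars)
    if not targets:
        return
    for i in range(n_questions):
        start, end = targets[i % len(targets)]
        yield {
            "target_span": (start, end),
            # Hypothetical prompt: the generator only sees the assigned span.
            "prompt": (
                "Write one question that can be answered only from the "
                f"following passage:\n{document[start:end]}"
            ),
        }
```

Because targets are visited round-robin, every span receives either the same number of questions or one more than its neighbours, whatever the generator would have preferred to ask about.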

Core claim

The generation step in self-generated QA supervision is an implicit policy that both selects which evidence becomes training signal and decides how that evidence is answered. When choosing what to ask, generators do not scan a document uniformly: coverage saturates early and concentrates on salient spans, diverse prompts converge on the same regions, and what looks question-worthy is driven by local presentation, allowing artifacts such as poorly cleaned markup to hijack question generation across model families and scales. When answering, the model that produces the supervision tends to obey instruction-like passages embedded in the text; this compliance depends on the intent and surface fo

What carries the argument

The implicit policy enacted during QA generation, which performs non-uniform evidence selection and determines answer compliance with embedded instructions.

If this is right

  • Question generation concentrates on salient spans rather than scanning documents uniformly, with coverage saturating early.
  • Diverse prompts converge on the same regions, and local presentation artifacts such as markup can hijack generation across scales.
  • Answering compliance depends on the intent and surface form of embedded passages rather than their strictness.
  • Compliance is worst under task conflict, and larger models comply more often.
  • Tying questions to fixed targets reduces biased selection, and filtering instruction-like spans lowers mean injection compliance from 88 percent to 13 percent while retaining nearly all clean text.
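The filtering step in the last bullet can be illustrated with a small sanitizer that drops instruction-like sentences and reports how much text survives. This is a minimal sketch under stated assumptions: the cue patterns and the function name `filter_instruction_like_spans` are hypothetical, and a regex heuristic stands in for whatever detector the paper actually uses.

```python
import re

# Hypothetical cue patterns; a production filter would more likely be a trained classifier.
INSTRUCTION_CUES = re.compile(
    r"\b(ignore (all |any )?(previous|prior|above) instructions"
    r"|disregard the (question|task)"
    r"|you must (now )?(respond|answer|reply)"
    r"|instead of answering)\b",
    re.IGNORECASE,
)


def filter_instruction_like_spans(chunk: str) -> tuple[str, float]:
    """Drop sentences that match an instruction cue.

    Returns the filtered chunk and the fraction of characters retained.
    """
    text = chunk.strip()
    # Split after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences if s and not INSTRUCTION_CUES.search(s)]
    filtered = " ".join(kept)
    retained = len(filtered) / len(text) if text else 1.0
    return filtered, retained
```

Returning the retained fraction alongside the text makes it possible to check both halves of the claim at once: compliance should fall, and clean chunks should pass through almost unchanged.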

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the biases persist across domains, they may systematically limit what knowledge is transferred in distillation and compression pipelines that rely on self-generated data.
  • Document cleaning pipelines could incorporate removal of instruction-like spans as a standard preprocessing step before any QA generation.
  • The same selection and compliance mechanisms might appear in other self-supervised generation tasks that create their own training signals from raw text.
  • Testing the mitigations on multi-document collections or retrieval-augmented settings would check whether the reductions in bias hold when evidence spans multiple sources.

Load-bearing premise

The observed selection biases, instruction compliance rates, and effectiveness of the proposed mitigations generalize beyond the specific models, document collections, and evaluation setups used in the experiments.

What would settle it

An experiment that measures whether tying questions to fixed targets produces uniform coverage across all document spans rather than early saturation on salient ones, or whether filtering instruction-like spans before answering reduces mean compliance below 13 percent on a held-out set of documents containing such passages.
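For the compliance half of such an experiment, the headline number has to be computed the same way before and after filtering. The sketch below shows one plausible aggregation, assuming that per-sample outcomes are first averaged within each (injection type, model) cell and the cells are then averaged without weighting; the record layout and the name `mean_compliance` are hypothetical, and the paper's exact protocol may differ.

```python
from collections import defaultdict
from statistics import mean


def mean_compliance(records: list[dict]) -> float:
    """Average per-(injection_type, model) compliance rates, in percent.

    Each record is assumed to look like
    {"injection_type": str, "model": str, "complied": bool}.
    """
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for r in records:
        cells[(r["injection_type"], r["model"])].append(r["complied"])
    if not cells:
        raise ValueError("no records to average")
    # Rate per cell first, then an unweighted mean, so that cells with
    # more samples do not dominate the headline number.
    return 100.0 * mean(mean(map(float, v)) for v in cells.values())
```

Pooling all samples instead would give a different figure whenever cells have unequal sizes, which is why the aggregation rule needs to be fixed before comparing against the reported 13 percent.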

Figures

Figures reproduced from arXiv: 2606.32002 by Aleksandr Beznosikov, Alexey Kadeishvili, Denis Shveykin, Ekaterina Alimaskina, Gleb Molodtsov, Igor Shalygin.

Figure 1
Cumulative evidence coverage over generated interactions. Coverage grows rapidly at first and then saturates across all corpora and model sizes, indicating diminishing returns from additional generated interactions. (Plot text extracted with this caption: panels Cartridges, LongHealth, QASPER; categories Creative, Question, Structuring, Summarization, Use case; axis "Document share (%)"; legend uncovered, 1×, 2×, 3×, 4×, 5+×.)
Figure 2
Coverage depth within individual prompt seeds. Each bar shows the fraction of document text that remains uncovered or is used as answer support 1, 2, 3, 4, or 5+ times within the same prompt seed. Observation 2: evidence coverage is uneven and repetitive. Saturation is not only caused by aggregating different prompt seeds. Even within a single prompt type, question generation allocates supervision unevenly…
Figure 3
Exact HTML-like diagnostic artifact inserted into documents.
Figure 4
Injection hit rate under uniformly distributed HTML-like artifacts. Rows correspond to generator models, columns to prompt seeds, and panels to corpora. Higher values mean that generated interactions are more often grounded in the injected artifact rather than in the original document content. Importantly, the effect is not eliminated by model scale or family. Qwen, Gemma, and Llama generators all select t…
Figure 5
Injection compliance (%; higher indicates more thorough diversion from the requested task) for the four injection axes (Appendix E), with both models observing the modified chunk. Bars are means over five prompt seeds. … and only Qwen3-1.7B shows substantial resistance (62%). A single instruction-like passage is often enough to redirect the answering model and the supervision it produces. Strictness (S1). Co…
Figure 6
Mean compliance by defense method, averaged over 17 injection types and six models; lower is better. No defense repeats undefended means from Section 5. We test whether upstream sanitization can reduce how often the models follow instruction-like passages embedded in the chunk. Sanitization maps the raw chunk to a filtered version (instruction-like spans removed) before either the question-generating or …
Figure 7
Share of judged questions classified as grounded or hallucinated by prompt seed; judge failures are excluded. … fact-dense tables, where most atomic facts are cell-local and there is little narrative structure, section hierarchy, or redundant prose for generic seeds to anchor on. We run the unchanged self-study protocol (Appendix B) on one synthetic table: 60 rows × 10 columns (600 cells), serialized as colum…
Figure 8
Injection compliance by prompt seed for the four injection axes (S1–S4, clockwise from top-left). Each axis panel contains five seed-specific subplots; the model color key matches …
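The coverage-depth breakdown described in the Figure 2 caption (share of text uncovered, or used as answer support 1, 2, 3, 4, or 5+ times) can be computed from support spans with a difference array. The following is an illustrative sketch, not the authors' code: `coverage_depth`, the half-open character-offset convention, and the bucket labels are assumptions.

```python
from collections import Counter


def coverage_depth(
    doc_len: int, support_spans: list[tuple[int, int]], cap: int = 5
) -> dict[str, float]:
    """Share of document characters used as support 0, 1, ..., cap+ times.

    Spans are half-open character offsets [start, end) into the document.
    """
    if doc_len <= 0:
        raise ValueError("doc_len must be positive")
    # Difference array: +1 where a span starts, -1 where it ends.
    delta = [0] * (doc_len + 1)
    for start, end in support_spans:
        start, end = max(0, start), min(doc_len, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    counts: Counter = Counter()
    depth = 0
    for i in range(doc_len):
        depth += delta[i]  # running sum gives the depth at character i
        counts[min(depth, cap)] += 1
    labels = ["uncovered"] + [f"{k}x" for k in range(1, cap)] + [f"{cap}+x"]
    return {label: counts[k] / doc_len for k, label in enumerate(labels)}
```

A distribution with a large "uncovered" share next to a large "5+x" share is the uneven, repetitive pattern the caption describes; uniform coverage would put most of the mass in a single bucket.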
Original abstract

Language models are increasingly taught from synthetic question-answer (QA) supervision: a model generates questions about a document, answers them from the same text, and the resulting pairs are used to fine-tune, distill, or compress knowledge into another model. We show that this generation step is not neutral preprocessing. It is an implicit policy that both selects which evidence becomes training signal and decides how that evidence is answered, and it is fragile at both stages. When choosing what to ask, generators do not scan a document uniformly. Coverage saturates early and concentrates on salient spans, diverse prompts converge on the same regions, and what looks question-worthy is driven by local presentation. As a result, salient artifacts such as poorly cleaned markup can hijack question generation across model families and scales. When answering, the model that produces the supervision tends to obey instruction-like passages embedded in the text. This compliance depends on the intent and surface form of the passage rather than its strictness, and is worst under task conflict, where larger models comply more often. These failure modes arise from choices made during QA generation, so they can be reduced without changing the training loop. Tying each question to a fixed target reduces biased selection, and filtering instruction-like spans before answering lowers mean injection compliance from 88% to 13% in our evaluation while retaining nearly all clean text.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that generating synthetic QA pairs from documents for LM training is not neutral preprocessing but an implicit policy that biases both evidence selection (early saturation on salient spans, convergence across prompts, artifact hijacking) and answering (compliance with instruction-like passages, worse under task conflict and for larger models). It supports this with experiments quantifying effects such as mean injection compliance dropping from 88% to 13% after span filtering, and proposes mitigations (fixed-target tying, span filtering) that reduce these issues while retaining most clean text.

Significance. If the empirical patterns hold beyond the tested regimes, the work identifies a practically important source of fragility in synthetic supervision pipelines used for fine-tuning, distillation, and knowledge compression. It provides concrete, actionable mitigations that operate at the generation stage without altering the downstream training loop. The empirical focus on existing generation procedures is a strength, though the manuscript contains no machine-checked proofs, parameter-free derivations, or falsifiable predictions.

major comments (2)
  1. [Abstract, results] Abstract and results sections: the central quantitative claims (e.g., compliance dropping from 88% to 13%, retention of nearly all clean text) are presented without the full experimental details, data exclusion rules, baseline comparisons, or exact protocols for measuring injection compliance and span filtering. This directly affects assessment of whether post-hoc choices influence the reported fragility and mitigation efficacy.
  2. [Introduction, experiments] The claim that the generation step is inherently fragile (rather than fragile under the evaluated conditions) rests on the untested assumption that selection biases, compliance rates, and mitigation success generalize beyond the specific model families, document collections, and evaluation setups used. No cross-regime experiments or sensitivity analyses are reported to support this extrapolation.
minor comments (1)
  1. Notation for compliance rates and filtering thresholds should be defined more explicitly when first introduced to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the opportunity to clarify our work. We address each major comment below with point-by-point responses, indicating planned revisions where appropriate.

  1. Referee: [Abstract, results] Abstract and results sections: the central quantitative claims (e.g., compliance dropping from 88% to 13%, retention of nearly all clean text) are presented without the full experimental details, data exclusion rules, baseline comparisons, or exact protocols for measuring injection compliance and span filtering. This directly affects assessment of whether post-hoc choices influence the reported fragility and mitigation efficacy.

    Authors: We agree that the abstract and main results present the quantitative findings in summarized form. The complete experimental protocols—including the specific model families and scales tested, document collections, precise definition and measurement of injection compliance (rate of following embedded instruction-like passages), span filtering criteria, data exclusion rules, and baseline comparisons—are provided in the Methods section and Appendix. To improve accessibility, we will expand the abstract with a brief note on the evaluation regime and insert a concise protocol summary table or paragraph in the Results section. This revision will not change the reported numbers or conclusions. revision: yes

  2. Referee: [Introduction, experiments] The claim that the generation step is inherently fragile (rather than fragile under the evaluated conditions) rests on the untested assumption that selection biases, compliance rates, and mitigation success generalize beyond the specific model families, document collections, and evaluation setups used. No cross-regime experiments or sensitivity analyses are reported to support this extrapolation.

    Authors: The manuscript frames the observed fragility as an empirical finding within the tested regimes, with all quantitative results explicitly qualified as 'in our evaluation.' We do not claim parameter-free universality. The patterns (early saturation, prompt convergence, artifact hijacking, and instruction compliance) were consistent across the model families and document sets examined. We acknowledge the absence of broad cross-regime sensitivity analyses. We will revise the Introduction and add a Limitations section to explicitly bound the claims to the evaluated conditions and note that further validation across additional regimes would be valuable. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential fits

The paper reports experimental measurements of selection biases, compliance rates, and mitigation effects in QA generation across models and documents. No equations, fitted parameters, or derivations are present that could reduce reported outcomes to quantities defined by the paper's own inputs. Claims rest on direct observation rather than any self-definitional, fitted-prediction, or self-citation chain. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical study of an existing training technique and introduces no new mathematical axioms, free parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5805 in / 1088 out tokens · 54718 ms · 2026-07-01T05:05:36.925803+00:00 · methodology

