Asking Back: Interaction-Layer Antidistillation Watermarks

Amir Ghasemian; Fengchen Liu; Guang Yang; Homa Hosseinmardi; Ninareh Mehrabi; Zhong Wang

arxiv: 2605.16462 · v1 · pith:ZZOJVSXSnew · submitted 2026-05-15 · 💻 cs.CR · cs.AI

Asking Back: Interaction-Layer Antidistillation Watermarks

Guang Yang , Amir Ghasemian , Fengchen Liu , Zhong Wang , Ninareh Mehrabi , Homa Hosseinmardi This is my paper

Pith reviewed 2026-05-20 18:03 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords antidistillation watermarkinteraction layersystem promptbehavioral markerLoRA distillationblack-box detectionLLM judgeparaphrasing robustness

0 comments

The pith

System prompts can embed behavioral markers in LLM responses that transfer to distilled student models and stay detectable through black-box queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether moving watermark signals into the interaction layer—via intermittent system-prompt instructions that trigger follow-up questions or declarative restatements—lets a defender catch unauthorized distillation. An oblivious fine-tuner inherits these habits, and the defender later probes the student with targeted queries scored by a validated LLM judge. Experiments across 63 LoRA students from a Llama-3.3-70B teacher show the markers survive at 45–89 % relative fidelity depending on the student family, even after non-adaptive paraphrasing, while user-study results indicate negligible change in perceived quality. If the method holds, it adds a complementary defense that does not require control over tokens or logits.

Core claim

By wrapping the teacher with a system prompt that occasionally elicits an explicit follow-up question, a low-frequency variant, or a declarative restatement, the defender induces a behavioral marker that an oblivious distiller inherits during LoRA fine-tuning; the same marker is then recovered in black-box interactions using an LLM-as-judge whose agreement with humans reaches Cohen’s kappa of 0.84 on strong-rubric and 0.78 on style-rubric labels. Across 35,343 judged samples the markers transfer at 88.9 % (Gemma), 80.9 % (OLMo) and 45.2 % (Qwen) relative fidelity; under DIPPER paraphrasing they retain 21–112 % of the teacher-self ceiling; low-density declarative variants exceed per-family un

What carries the argument

interaction-layer antidistillation watermark: a system-prompt-induced behavioral marker (follow-up question, variant phrasing, or restatement) that is inherited by the student and later audited via black-box queries

If this is right

Behavioral markers transfer at relative fidelities between 45 % and 89 % across three student families when the teacher is Llama-3.3-70B-Instruct.
Under non-adaptive paraphrasing the student-relative retention ranges from 21 % to 112 % of the teacher-self ceiling, with one family preserving the signal above the teacher itself.
Low-density (≈20 %) explicit and implicit declarative variants still exceed per-family baseline rates.
All marker variants shift average Likert ratings by no more than 0.22 points in a pre-registered N=20 user study.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The interaction layer could be stacked with token- or trace-level watermarks to raise the cost of any single-layer removal attack.
Adaptive attackers who know the marker vocabulary might still be forced to degrade output quality or increase query cost to suppress the signal.
The same prompt-induced behavior could serve as a lightweight provenance signal for non-distillation misuse such as repeated API scraping.
Extending the approach to multi-turn conversations would test whether the marker persists across longer interaction histories.

Load-bearing premise

The behavioral markers triggered by the system prompt are reliably copied into the student during ordinary LoRA fine-tuning and can be identified accurately by the LLM judge without large numbers of false positives or negatives.

What would settle it

A distilled student model that, after training on the watermarked teacher outputs, produces follow-up questions or declarative restatements at the same low rate as an identical model trained on ordinary data.

Figures

Figures reproduced from arXiv: 2605.16462 by Amir Ghasemian, Fengchen Liu, Guang Yang, Homa Hosseinmardi, Ninareh Mehrabi, Zhong Wang.

**Figure 2.** Figure 2: Detection rates across the 3×7 matrix. Top row: Llama-3.3-70B-Instruct teacher reference (point estimate, n=3,009); rows below: three student families under each of seven training conditions, mean over three seeds; column groups: strong / soft / style-control rubrics. The OLMo soft cell (32.0%) exceeds the teacher’s soft cell (17.8%). Gemma-3-1B-pt OLMo-2-0425-1B Qwen3.5-0.8B 0 20 40 60 80 100 Detection ra… view at source ↗

**Figure 4.** Figure 4: H5 in-lab study, N=20: mean Likert per condition. Bars = mean, error bars = SD, dots = participants, dashed line = baseline mean. 5 Results 5.1 H1, H2: Behavioral watermarks transfer across families [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Left: dose-response of student detection vs. teacher-side density ρ; monotone for every family, with OLMo’s 20%-density student exceeding the teacher’s own 20%-density rate (amplification). Right: stealth–robustness–detection trade-off. Each point is a (family, watermark) configuration; circles strong, triangles soft; shaded region is the operating zone with adequate detection and non-trivial paraphrase … view at source ↗

**Figure 6.** Figure 6: Cohen’s κ between the LLM judge and each of three independent human annotators (A, B, C), and against the 3-rater majority, per rubric (n=100 each). Error bars on the majority bar are the bootstrap 95% CI (B=1,000). The dashed purple line is the inter-annotator Fleiss’ κ, indicating how consistently the rubric is applied across humans. J Judge–human κ validation Protocol. We validate the gpt-oss-120b judge… view at source ↗

**Figure 7.** Figure 7: Judge × 3-rater majority confusion matrices, per rubric. Off-diagonal counts are roughly symmetric on both rubrics, indicating residual disagreement is noise rather than systematic over- or under-firing. STRONG STYLE Rubric baseline baseline_up soft soft_up strong strong_up style_control Condition κ=1.00 n=18 κ=0.65 n=34 κ=0.78 n=18 κ=0.76 n=34 κ=0.88 n=16 — κ=0.88 n=16 — κ=0.62 n=16 — κ=0.88 n=16 — — κ=0.… view at source ↗

**Figure 8.** Figure 8: Per (condition, rubric) Cohen’s κ, judge vs. 3-rater majority. Cells are colored on the Landis– Koch scale. Highlights: baseline×STRONG κ=1.00 and style_control×STYLE κ=0.94 (the two most diagnostic cells: “no marker, no marker” and “marker inserted by teacher, marker detected”); strong ×STRONG κ=0.62 where the judge under-detected 3 markers humans saw, making our reported transfer rates conservative rathe… view at source ↗

**Figure 9.** Figure 9: Per (family, rubric) Cohen’s κ, judge vs. 3-rater majority. Qwen is the strongest family on STRONG (κ=0.94) and the weakest on STYLE (κ=0.64); we attribute the latter to Qwen’s natural prose style overlapping more heavily with the STYLE rubric’s advisory phrasings, and footnote this when reporting per-family STYLE rates in §5. STRONG (natural yes ≈ 22.3%) STYLE (natural yes ≈ 3.7%) 0.0 0.2 0.4 0.6 0.8 1.0 … view at source ↗

**Figure 10.** Figure 10: Stratified κ (left bar in each pair) and prevalence-reweighted κ at the natural production yes-rate (right bar in each pair). Reweighting recovers the κ that would be observed on the unstratified production distribution. both numbers and treat 0.40 as the operative “natural-distribution κ” for STYLE in our discussion of F-Style in §5.4. Summary. The headline empirical claims of this paper (H1, H2, H3, H4… view at source ↗

**Figure 11.** Figure 11: H5 study user interface. (a) Welcome / onboarding, where the researcher hands the partici [PITH_FULL_IMAGE:figures/full_fig_p028_11.png] view at source ↗

**Figure 12.** Figure 12: Within-subject paired comparisons. For each marker condition, each line connects one [PITH_FULL_IMAGE:figures/full_fig_p029_12.png] view at source ↗

**Figure 13.** Figure 13: Latin-square coverage: mean Likert per (condition [PITH_FULL_IMAGE:figures/full_fig_p029_13.png] view at source ↗

**Figure 14.** Figure 14: Per-participant trajectory across the five ordinal sessions. Condition labels are annotated [PITH_FULL_IMAGE:figures/full_fig_p029_14.png] view at source ↗

**Figure 15.** Figure 15: GENOTRACE per-sequence audit z-statistic on 200 fresh E. coli prompts. Vanilla 100M (left) is a well-calibrated null (mean z≈0); the δ=0 student (middle) sits only 0.28σ above zero; the δ=2 student (right) is shifted by ∼7.95σ and crosses the z= 2.33 threshold on 95.5% of sequences. Calibration would in any case be family-aware; the practical implication is contingent on whether this single-family pattern… view at source ↗

**Figure 16.** Figure 16: GENOTRACE multi-sequence aggregation on 200 held-out E. coli prompts. Left: the aggregate z-statistic for the δ=2 student grows like √ n, while both nulls (vanilla 100M and the δ=0 adversarial-mimic student) remain within ±2.33. Right: TPR at FPR = 1% versus the δ=0-student null rises from 45% at n=1 to 99.95% at n=10 and saturates at 100% for n≥20; the single-sequence operating point is therefore a sampl… view at source ↗

**Figure 17.** Figure 17: GENOTRACE robustness to base-level mutation. Detection rate and mean per-sequence z as a function of the uniform random base-substitution rate applied to every δ=2 student output before re-auditing. Error bars are 95% bootstrap CIs over 2,000 resamples of the 200 audited sequences; both curves decay smoothly with mutation rate, consistent with the inherited bias being spread across many tokens rather than… view at source ↗

read the original abstract

Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens -- biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker -- an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via black-box queries with a human-validated LLM-as-judge (Cohen's kappa = 0.84/0.78 on strong/style rubrics). Across 63 LoRA-distilled students under a Llama-3.3-70B-Instruct teacher (35,343 judged samples), behavioral watermarks transfer at 88.9% (Gemma) / 80.9% (OLMo) / 45.2% (Qwen) relative fidelity (H1, H2). Under non-adaptive DIPPER paraphrasing, robustness decomposes into a teacher-self ceiling (about 66.4%) and student-relative retention of 21-112%, with OLMo preserving the watermark above the teacher itself (H3, F-Amp). Low-density (about 20%) explicit and implicit declarative variants transfer above per-family baseline (H4, F-Style). An N=20 in-lab study (pre-registered Latin-square) shows all marker variants within 0.22 Likert step of baseline; TOST, Friedman, and Bonferroni-Wilcoxon support H5. The interaction layer is a viable design locus for antidistillation watermarking, complementary to token-, model-, and reasoning-trace-layer defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The core idea of interaction-layer behavioral watermarks is new and worth testing, but the inheritance during distillation and judge reliability are the parts that still need tighter evidence.

read the letter

Hi, the main thing to take from this paper is that they move antidistillation watermarks up from token biasing or post-generation fixes into the model's conversation behavior. A system prompt makes the teacher occasionally produce follow-up questions, low-frequency variants, or declarative restatements, and the claim is that an oblivious LoRA student inherits enough of that pattern for the defender to spot it later with black-box queries and an LLM judge. That is a genuine shift from the prior work they cite, and the abstract gives concrete transfer numbers across three student families plus a pre-registered usability check that the markers do not hurt user experience much. The breakdown of robustness under DIPPER into a teacher-self ceiling and student-relative retention is also a useful way to report the results. The scale of the judged samples and the reported Cohen's kappa on the judge add some weight to the measurements. The soft spots sit right where the stress-test note flags them. Transfer is strong for Gemma and OLMo but drops to 45 percent relative fidelity for Qwen, which already shows the method is sensitive to model family. It is not yet clear whether the inherited behaviors are specific to the watermark prompt or just generic style the student picks up anyway, and the paper would be stronger with explicit false-positive rates for the judge on unmarked models from the same families. Adaptive attackers who know the prompt pattern are not tested here either. This is aimed at people who run or secure commercial LLM APIs and need practical ways to detect extraction. Anyone working on model IP protection would get value from the new layer and the baseline numbers, even if the results are uneven. I would send it for peer review. The idea is distinct enough and the initial data is solid enough that referees should see it and push on the inheritance and detection details.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes interaction-layer antidistillation watermarks for LLMs. A system prompt intermittently induces behavioral markers (explicit follow-up questions, low-frequency variants, or declarative restatements) in the teacher's outputs. An oblivious distiller inherits these markers during LoRA fine-tuning, which the defender then detects via black-box queries using a human-validated LLM-as-judge (Cohen's kappa 0.84/0.78). Experiments with a Llama-3.3-70B-Instruct teacher and 63 LoRA students (35,343 judged samples) report relative transfer fidelities of 88.9% (Gemma), 80.9% (OLMo), and 45.2% (Qwen). Robustness under non-adaptive DIPPER paraphrasing decomposes into a teacher-self ceiling of ~66% with student-relative retention of 21-112%; low-density (~20%) variants transfer above per-family baselines. A pre-registered N=20 user study (Latin-square) with TOST, Friedman, and Bonferroni-Wilcoxon tests confirms usability within 0.22 Likert steps of baseline.

Significance. If the inheritance and detection assumptions hold, the work establishes the interaction layer as a viable complementary defense locus to token-, model-, and reasoning-trace-layer techniques. The large-scale empirical measurements, model-family transfer variation, robustness decomposition, and pre-registered statistical support from the user study provide concrete, falsifiable evidence for behavioral-marker transfer. The scale (35k samples) and pre-registration are notable strengths that would make the contribution substantial for practical antidistillation auditing if the judge false-positive rates and inheritance specificity are further secured.

major comments (3)

[§4 (H1, H2)] §4 (H1, H2): The reported relative fidelities (88.9% Gemma, 80.9% OLMo, 45.2% Qwen) support transfer but the large model-family dependence leaves open whether the markers reflect specific watermark inheritance or generic style mimicry; a control arm with non-watermarked system prompts on the same teacher-student pairs would isolate the effect and is load-bearing for the antidistillation claim.
[Robustness section (H3, F-Amp)] Robustness section (H3, F-Amp): The decomposition into teacher-self ceiling (~66.4%) and student-relative retention (21-112%) is useful, yet the 21% floor under DIPPER for some students indicates that detection may fall below practical thresholds in non-adaptive paraphrasing scenarios; clarifying the minimum retention needed for reliable auditing would strengthen the central claim.
[Judge validation] Judge validation (Cohen's kappa 0.84/0.78): While inter-rater agreement on strong/style rubrics is reported, the false-positive rate on unmarked models from the same families (baseline Gemma/OLMo/Qwen without the interaction prompt) is not quantified; this is load-bearing because non-negligible false positives would undermine black-box audit reliability.

minor comments (2)

[Abstract] Abstract: The low-density variants are described as 'about 20%'; reporting the exact density and sampling schedule used in the 35k-sample experiments would improve precision and reproducibility.
[User study (H5)] User study (H5): The N=20 pre-registered Latin-square design is a strength, but the exact prompt templates, judge rubrics, and full statistical outputs (including effect sizes) should be included in an appendix to support independent verification.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each of the major comments below and have revised the manuscript accordingly to strengthen the empirical support for our claims.

read point-by-point responses

Referee: [§4 (H1, H2)] The reported relative fidelities (88.9% Gemma, 80.9% OLMo, 45.2% Qwen) support transfer but the large model-family dependence leaves open whether the markers reflect specific watermark inheritance or generic style mimicry; a control arm with non-watermarked system prompts on the same teacher-student pairs would isolate the effect and is load-bearing for the antidistillation claim.

Authors: We agree that the model-family variation warrants careful interpretation to distinguish watermark inheritance from generic style mimicry. The manuscript already reports per-family baselines in the context of H4 and F-Style, where low-density variants transfer above these baselines, providing evidence against pure generic mimicry. To directly address the suggested control, we will include an additional experiment in the revised §4 using non-watermarked system prompts on matched teacher-student pairs. This will allow us to quantify the incremental effect of the interaction-layer prompt beyond any baseline style transfer. revision: yes
Referee: [Robustness section (H3, F-Amp)] The decomposition into teacher-self ceiling (~66.4%) and student-relative retention (21-112%) is useful, yet the 21% floor under DIPPER for some students indicates that detection may fall below practical thresholds in non-adaptive paraphrasing scenarios; clarifying the minimum retention needed for reliable auditing would strengthen the central claim.

Authors: Thank you for highlighting the practical implications of the lower retention rates. The 21% floor is indeed observed for certain student models under DIPPER paraphrasing. We will add a clarification in the revised robustness section on the minimum retention needed for reliable auditing, discussing how it depends on the number of audit queries and the variance observed in our dataset to ensure statistical power. revision: yes
Referee: [Judge validation] While inter-rater agreement on strong/style rubrics is reported, the false-positive rate on unmarked models from the same families (baseline Gemma/OLMo/Qwen without the interaction prompt) is not quantified; this is load-bearing because non-negligible false positives would undermine black-box audit reliability.

Authors: We recognize the importance of quantifying false-positive rates for the black-box audit's reliability. The reported Cohen's kappa values (0.84/0.78) reflect agreement on the rubrics used by the LLM-as-judge. We agree this is load-bearing and will extend the judge validation in the revised manuscript to include explicit false-positive rate measurements on unmarked baseline models from the same families. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical derivation chain

full rationale

The paper's claims rest entirely on empirical measurements: observed transfer rates of behavioral markers (explicit follow-up questions, low-frequency variants, declarative restatements) from a Llama-3.3-70B teacher into 63 LoRA students across model families, retention under DIPPER paraphrasing, and usability scores from an N=20 pre-registered Latin-square study. These quantities are obtained via black-box queries and human-validated LLM-as-judge annotations whose inter-rater agreement (Cohen's kappa 0.84/0.78) is reported as a standard validation step rather than a fitted input. No equations, self-definitional constructions, or load-bearing self-citations appear in the provided text; the central result is the measured model-dependent fidelity variation (88.9 % Gemma to 45.2 % Qwen) and statistical support (TOST, Friedman, Bonferroni-Wilcoxon), which are externally falsifiable observations rather than reductions to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on empirical transfer of behavioral patterns through distillation and reliable detection by an LLM judge, with no free parameters explicitly fitted to the target result, no new mathematical axioms, and no invented physical or theoretical entities.

axioms (2)

domain assumption Behavioral patterns induced by system prompts are preserved during LoRA-based knowledge distillation from teacher to student models.
Invoked implicitly when claiming that an oblivious distiller inherits the marker and when reporting transfer rates across model families.
domain assumption An LLM-as-judge can reliably detect the presence of interaction markers with agreement levels comparable to human raters (Cohen's kappa 0.84/0.78).
Stated in the abstract as the auditing mechanism and used to support all detection claims.

pith-pipeline@v0.9.0 · 5920 in / 1641 out tokens · 83538 ms · 2026-05-20T18:03:55.372960+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker—an explicit follow-up question, a low-frequency variant, or a declarative restatement.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 15 internal anchors

[1]

My AI safety lecture for UT Effective Altruism, 2022

Scott Aaronson. My AI safety lecture for UT Effective Altruism, 2022. URL https:// scottaaronson.blog/?p=6823. Public lecture (Nov. 14, 2022) and accompanying blog write-up (Nov. 28, 2022); widely cited as the originator of the cryptographic-watermark proposal for LLMs

work page 2022
[2]

DITTO: A spoofing attack framework on watermarked LLMs via knowledge distillation

Hyeseon An, Shinwoo Park, Suyeon Woo, and Yo-Sub Han. DITTO: A spoofing attack framework on watermarked LLMs via knowledge distillation. InProceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

work page 2026
[3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv. org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Undetectable watermarks for language models

Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. InConference on Learning Theory (COLT), 2024. URL https://arxiv.org/abs/2306. 09194. 10

work page 2024
[5]

Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026. doi: 10.1038/s41586-026-10319-8. URL https://www.nature.com/articles/s41586-026-10319-8

work page doi:10.1038/s41586-026-10319-8 2026
[6]

Qwen3.5 family chat-template thinking-mode issues, 2025

Community bug reports. Qwen3.5 family chat-template thinking-mode issues, 2025. URL https://github.com/vllm-project/vllm/issues/35574. Multiple independent repro- ductions across inference backends that the enable_thinking=false chat-template flag does not, in practice, disable chain-of-thought generation in the Qwen3.5 architecture family. Representative...

work page 2025
[7]

PART: Information-preserving reformulation of reasoning traces for antidistillation.arXiv preprint arXiv:2510.11545, 2025

Jiayu Ding, Lei Cui, Li Dong, Nanning Zheng, and Furu Wei. PART: Information-preserving reformulation of reasoning traces for antidistillation.arXiv preprint arXiv:2510.11545, 2025. URLhttps://arxiv.org/abs/2510.11545

work page arXiv 2025
[8]

Gemma 3 Technical Report

Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. URL https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

MiniLLM: On-Policy Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2306.08543

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. InNeurIPS Deep Learning and Representation Learning Workshop, 2015. URL https: //arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015
[12]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InInternational Conference on Learning Representations (ICLR), 2020. URLhttps://arxiv.org/abs/1904.09751

work page internal anchor Pith review Pith/arXiv arXiv 2020
[13]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv. org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022. URLhttps://arxiv.org/abs/2207.05221

work page internal anchor Pith review Pith/arXiv arXiv 2022
[15]

A watermark for large language models, 2024

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. InInternational Conference on Machine Learning (ICML), 2023. URLhttps://arxiv.org/abs/2301.10226

work page arXiv 2023
[16]

OpenAssistant conversations – democratizing large language model alignment.arXiv preprint arXiv:2304.07327, 2023

Andreas Koepf, Yannic Kilcher, Laura von Rueden, Dmitrii Rybin, Xiaozhe Xu, Iryna Gurevych, et al. OpenAssistant conversations – democratizing large language model alignment.arXiv preprint arXiv:2304.07327, 2023. URLhttps://arxiv.org/abs/2304.07327

work page arXiv 2023
[17]

Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense, 2023

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphras- ing evades detectors of AI-generated text, but retrieval is an effective defense. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/ 2303.13408

work page arXiv 2023
[18]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977

work page 1977
[19]

DOGe: Defensive output generation for LLM protection against knowledge distillation.arXiv preprint arXiv:2505.19504, 2025

Pingzhi Li, Zhen Tan, Mohan Zhang, Huaizhi Qu, Huan Liu, and Tianlong Chen. DOGe: Defensive output generation for LLM protection against knowledge distillation.arXiv preprint arXiv:2505.19504, 2025. URLhttps://arxiv.org/abs/2505.19504. 11

work page arXiv 2025
[21]

URLhttps://arxiv.org/abs/2602.15143

work page internal anchor Pith review Pith/arXiv arXiv
[22]

An empirical analysis of memorization in fine-tuned autoregressive language models

Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David Evans, and Taylor Berg- Kirkpatrick. An empirical analysis of memorization in fine-tuned autoregressive language models. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2022. URLhttps://arxiv.org/abs/2205.10770

work page arXiv 2022
[23]

2 OLMo 2 Furious

OLMo Team, Allen Institute for AI. 2 OLMo 2 furious.arXiv preprint arXiv:2501.00656, 2024. URLhttps://arxiv.org/abs/2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. URL https://arxiv.org/abs/2508.10925. Open-weights OSS model release accompanying the gpt-oss-120b checkpoint

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, and Philip S. Yu. Can LLM watermarks robustly prevent unauthorized knowledge distillation? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. URLhttps://aclanthology.org/2025.acl-long.648/

work page 2025
[26]

Qwen3.5 base: A family of pre-trained language models, 2025

Qwen Team. Qwen3.5 base: A family of pre-trained language models, 2025. URL https: //huggingface.co/Qwen/Qwen3.5-0.8B-Base. Hugging Face model card

work page 2025
[27]

Radioactive data: Tracing through training

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: Tracing through training. InInternational Conference on Machine Learning (ICML), 2020. URLhttps://arxiv.org/abs/2002.00937

work page arXiv 2020
[28]

Can AI-generated text be reliably detected?Transactions on Machine Learning Research (TMLR), 2024

Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can AI-generated text be reliably detected?Transactions on Machine Learning Research (TMLR), 2024. URLhttps://openreview.net/forum?id=NvSwR4IvLO

work page 2024
[29]

Watermarking makes language models radioactive

Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, and Teddy Furon. Watermarking makes language models radioactive. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URLhttps://arxiv.org/abs/2402.14904

work page arXiv 2024
[30]

Zico Kolter

Yash Savani, Asher Trockman, Zhili Feng, Yixuan Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, and J. Zico Kolter. Antidistillation sampling. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id= Vo2UHqMu8t

work page 2025
[31]

Membership inference attacks against machine learning models

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. InIEEE Symposium on Security and Privacy (S&P),

work page
[32]

URLhttps://arxiv.org/abs/1610.05820

work page internal anchor Pith review Pith/arXiv arXiv
[34]

URLhttps://arxiv.org/abs/2502.12150

work page arXiv
[35]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023. URLhttps://github.com/tatsu-lab/stanford_alpaca. Stanford CRFM release; project page and code only, no arXiv preprint

work page 2023
[36]

OpenHermes-2.5: An open instruction dataset, 2023

Teknium. OpenHermes-2.5: An open instruction dataset, 2023. URL https://huggingface. co/datasets/teknium/OpenHermes-2.5. Hugging Face dataset

work page 2023
[37]

Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. OpenMathInstruct-2: Accelerating AI for math with massive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024. URL https://arxiv.org/abs/2410.01560

work page arXiv 2024
[38]

Who taught you that? tracing teachers in model distillation

Somin Wadhwa, Chantal Shaib, Silvio Amir, and Byron C Wallace. Who taught you that? tracing teachers in model distillation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3307–3315, 2025. 12

work page 2025
[39]

Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120, 2023

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empow- ering code generation with OSS-Instruct. InInternational Conference on Machine Learning (ICML), 2024. URLhttps://arxiv.org/abs/2312.02120

work page arXiv 2024
[40]

Neural text generation with unlikelihood training

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. InInternational Conference on Learning Representations (ICLR), 2020. URLhttps://arxiv.org/abs/1908.04319

work page arXiv 2020
[41]

Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. InInternational Conference on Learning Representations (ICLR), 2024. URL https:// openreview.net/forum?id=gjeQKFxFpZ

work page 2024
[42]

Instructional fingerprinting of large language models

Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. URLhttps://aclanthology.org/2024.naacl-long.180/

work page 2024
[43]

Antidistillation Fingerprinting

Yixuan Xu et al. Antidistillation fingerprinting.arXiv preprint arXiv:2602.03812, 2026. URL https://arxiv.org/abs/2602.03812

work page internal anchor Pith review Pith/arXiv arXiv 2026
[44]

Qwen3 Technical Report

An Yang et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[45]

Michael J. Q. Zhang, W. Bradley Knox, and Eunsol Choi. Modeling future conversation turns to teach LLMs to ask clarifying questions. InInternational Conference on Learning Representations (ICLR), 2025. URLhttps://openreview.net/forum?id=cwuSAR7EKd

work page 2025
[46]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URLhttps://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

the watermark is brittle

Zhihan Zhou, Robert Riley, Satria Kautsar, Weimin Wu, Rob Egan, Steven Hofmeyr, Shira Goldhaber-Gordon, Mutian Yu, Harrison Ho, Fengchen Liu, Feng Chen, Rachael Morgan-Kiss, Lizhen Shi, Han Liu, and Zhong Wang. GenomeOcean: An efficient genome foundation model trained on large-scale metagenomic assemblies.bioRxiv, 2025. doi: 10.1101/2025.01.30. 635558. UR...

work page doi:10.1101/2025.01.30 2025
[48]

This confirms that the radioactivity property of [28] extends to a non-text generative modality

Inheritance through distillation.The watermarked ( δ=2) student is shifted by ∼7.95σ above the vanilla null and crosses z >2.33 on 95.5% of sequences (Figure 15). This confirms that the radioactivity property of [28] extends to a non-text generative modality

work page
[49]

Multi-sequence aggregation scales as √n.Aggregating n independent sequences lifts TPR at FPR =1% from 45% at n=1 to ≥99.9% at n≥10 (Figure 16), consistent with the closed-form predictionz (n) agg = √n¯z

work page
[50]

Truncation filter

Robustness to mild base-level mutation.Replacing a fraction of the bases in every δ=2 output with uniform substitutions from {A, C, G, T} and re-auditing, 78.5% of sequences still cross threshold at 5% mutation rate (Figure 17), indicating that the inherited bias is not concentrated in a single short motif. A more detailed treatment, including additional ...

work page

[1] [1]

My AI safety lecture for UT Effective Altruism, 2022

Scott Aaronson. My AI safety lecture for UT Effective Altruism, 2022. URL https:// scottaaronson.blog/?p=6823. Public lecture (Nov. 14, 2022) and accompanying blog write-up (Nov. 28, 2022); widely cited as the originator of the cryptographic-watermark proposal for LLMs

work page 2022

[2] [2]

DITTO: A spoofing attack framework on watermarked LLMs via knowledge distillation

Hyeseon An, Shinwoo Park, Suyeon Woo, and Yo-Sub Han. DITTO: A spoofing attack framework on watermarked LLMs via knowledge distillation. InProceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2026

work page 2026

[3] [3]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv. org/abs/2108.07732

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Undetectable watermarks for language models

Miranda Christ, Sam Gunn, and Or Zamir. Undetectable watermarks for language models. InConference on Learning Theory (COLT), 2024. URL https://arxiv.org/abs/2306. 09194. 10

work page 2024

[5] [5]

Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026

Alex Cloud, Minh Le, James Chua, Jan Betley, Anna Sztyber-Betley, Sören Mindermann, Jacob Hilton, Samuel Marks, and Owain Evans. Language models transmit behavioural traits through hidden signals in data.Nature, 652:615–621, 2026. doi: 10.1038/s41586-026-10319-8. URL https://www.nature.com/articles/s41586-026-10319-8

work page doi:10.1038/s41586-026-10319-8 2026

[6] [6]

Qwen3.5 family chat-template thinking-mode issues, 2025

Community bug reports. Qwen3.5 family chat-template thinking-mode issues, 2025. URL https://github.com/vllm-project/vllm/issues/35574. Multiple independent repro- ductions across inference backends that the enable_thinking=false chat-template flag does not, in practice, disable chain-of-thought generation in the Qwen3.5 architecture family. Representative...

work page 2025

[7] [7]

PART: Information-preserving reformulation of reasoning traces for antidistillation.arXiv preprint arXiv:2510.11545, 2025

Jiayu Ding, Lei Cui, Li Dong, Nanning Zheng, and Furu Wei. PART: Information-preserving reformulation of reasoning traces for antidistillation.arXiv preprint arXiv:2510.11545, 2025. URLhttps://arxiv.org/abs/2510.11545

work page arXiv 2025

[8] [8]

Gemma 3 Technical Report

Gemma Team. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025. URL https://arxiv.org/abs/2503.19786

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

The Llama 3 Herd of Models

Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URLhttps://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

MiniLLM: On-Policy Distillation of Large Language Models

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. InInternational Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2306.08543

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. InNeurIPS Deep Learning and Representation Learning Workshop, 2015. URL https: //arxiv.org/abs/1503.02531

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

The Curious Case of Neural Text Degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. InInternational Conference on Learning Representations (ICLR), 2020. URLhttps://arxiv.org/abs/1904.09751

work page internal anchor Pith review Pith/arXiv arXiv 2020

[13] [13]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv. org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

Language Models (Mostly) Know What They Know

Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, et al. Language models (mostly) know what they know.arXiv preprint arXiv:2207.05221, 2022. URLhttps://arxiv.org/abs/2207.05221

work page internal anchor Pith review Pith/arXiv arXiv 2022

[15] [15]

A watermark for large language models, 2024

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. A watermark for large language models. InInternational Conference on Machine Learning (ICML), 2023. URLhttps://arxiv.org/abs/2301.10226

work page arXiv 2023

[16] [16]

OpenAssistant conversations – democratizing large language model alignment.arXiv preprint arXiv:2304.07327, 2023

Andreas Koepf, Yannic Kilcher, Laura von Rueden, Dmitrii Rybin, Xiaozhe Xu, Iryna Gurevych, et al. OpenAssistant conversations – democratizing large language model alignment.arXiv preprint arXiv:2304.07327, 2023. URLhttps://arxiv.org/abs/2304.07327

work page arXiv 2023

[17] [17]

Paraphrasing evades detectors of ai-generated text, but retrieval is an effective defense, 2023

Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer. Paraphras- ing evades detectors of AI-generated text, but retrieval is an effective defense. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URL https://arxiv.org/abs/ 2303.13408

work page arXiv 2023

[18] [18]

Richard Landis and Gary G

J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data.Biometrics, 33(1):159–174, 1977

work page 1977

[19] [19]

DOGe: Defensive output generation for LLM protection against knowledge distillation.arXiv preprint arXiv:2505.19504, 2025

Pingzhi Li, Zhen Tan, Mohan Zhang, Huaizhi Qu, Huan Liu, and Tianlong Chen. DOGe: Defensive output generation for LLM protection against knowledge distillation.arXiv preprint arXiv:2505.19504, 2025. URLhttps://arxiv.org/abs/2505.19504. 11

work page arXiv 2025

[20] [21]

URLhttps://arxiv.org/abs/2602.15143

work page internal anchor Pith review Pith/arXiv arXiv

[21] [22]

An empirical analysis of memorization in fine-tuned autoregressive language models

Fatemehsadat Mireshghallah, Archit Uniyal, Tianhao Wang, David Evans, and Taylor Berg- Kirkpatrick. An empirical analysis of memorization in fine-tuned autoregressive language models. InConference on Empirical Methods in Natural Language Processing (EMNLP), 2022. URLhttps://arxiv.org/abs/2205.10770

work page arXiv 2022

[22] [23]

2 OLMo 2 Furious

OLMo Team, Allen Institute for AI. 2 OLMo 2 furious.arXiv preprint arXiv:2501.00656, 2024. URLhttps://arxiv.org/abs/2501.00656

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [24]

gpt-oss-120b & gpt-oss-20b Model Card

OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025. URL https://arxiv.org/abs/2508.10925. Open-weights OSS model release accompanying the gpt-oss-120b checkpoint

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [25]

Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, and Philip S. Yu. Can LLM watermarks robustly prevent unauthorized knowledge distillation? InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. URLhttps://aclanthology.org/2025.acl-long.648/

work page 2025

[25] [26]

Qwen3.5 base: A family of pre-trained language models, 2025

Qwen Team. Qwen3.5 base: A family of pre-trained language models, 2025. URL https: //huggingface.co/Qwen/Qwen3.5-0.8B-Base. Hugging Face model card

work page 2025

[26] [27]

Radioactive data: Tracing through training

Alexandre Sablayrolles, Matthijs Douze, Cordelia Schmid, and Hervé Jégou. Radioactive data: Tracing through training. InInternational Conference on Machine Learning (ICML), 2020. URLhttps://arxiv.org/abs/2002.00937

work page arXiv 2020

[27] [28]

Can AI-generated text be reliably detected?Transactions on Machine Learning Research (TMLR), 2024

Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi. Can AI-generated text be reliably detected?Transactions on Machine Learning Research (TMLR), 2024. URLhttps://openreview.net/forum?id=NvSwR4IvLO

work page 2024

[28] [29]

Watermarking makes language models radioactive

Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, and Teddy Furon. Watermarking makes language models radioactive. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. URLhttps://arxiv.org/abs/2402.14904

work page arXiv 2024

[29] [30]

Zico Kolter

Yash Savani, Asher Trockman, Zhili Feng, Yixuan Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, and J. Zico Kolter. Antidistillation sampling. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id= Vo2UHqMu8t

work page 2025

[30] [31]

Membership inference attacks against machine learning models

Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. InIEEE Symposium on Security and Privacy (S&P),

work page

[31] [32]

URLhttps://arxiv.org/abs/1610.05820

work page internal anchor Pith review Pith/arXiv arXiv

[32] [34]

URLhttps://arxiv.org/abs/2502.12150

work page arXiv

[33] [35]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model, 2023. URLhttps://github.com/tatsu-lab/stanford_alpaca. Stanford CRFM release; project page and code only, no arXiv preprint

work page 2023

[34] [36]

OpenHermes-2.5: An open instruction dataset, 2023

Teknium. OpenHermes-2.5: An open instruction dataset, 2023. URL https://huggingface. co/datasets/teknium/OpenHermes-2.5. Hugging Face dataset

work page 2023

[35] [37]

Openmathinstruct-2: Accelerating ai for math with massive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, and Igor Gitman. OpenMathInstruct-2: Accelerating AI for math with massive open-source instruction data.arXiv preprint arXiv:2410.01560, 2024. URL https://arxiv.org/abs/2410.01560

work page arXiv 2024

[36] [38]

Who taught you that? tracing teachers in model distillation

Somin Wadhwa, Chantal Shaib, Silvio Amir, and Byron C Wallace. Who taught you that? tracing teachers in model distillation. InFindings of the Association for Computational Linguistics: ACL 2025, pages 3307–3315, 2025. 12

work page 2025

[37] [39]

Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120, 2023

Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empow- ering code generation with OSS-Instruct. InInternational Conference on Machine Learning (ICML), 2024. URLhttps://arxiv.org/abs/2312.02120

work page arXiv 2024

[38] [40]

Neural text generation with unlikelihood training

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. Neural text generation with unlikelihood training. InInternational Conference on Learning Representations (ICLR), 2020. URLhttps://arxiv.org/abs/1908.04319

work page arXiv 2020

[39] [41]

Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs

Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? an empirical evaluation of confidence elicitation in LLMs. InInternational Conference on Learning Representations (ICLR), 2024. URL https:// openreview.net/forum?id=gjeQKFxFpZ

work page 2024

[40] [42]

Instructional fingerprinting of large language models

Jiashu Xu, Fei Wang, Mingyu Derek Ma, Pang Wei Koh, Chaowei Xiao, and Muhao Chen. Instructional fingerprinting of large language models. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024. URLhttps://aclanthology.org/2024.naacl-long.180/

work page 2024

[41] [43]

Antidistillation Fingerprinting

Yixuan Xu et al. Antidistillation fingerprinting.arXiv preprint arXiv:2602.03812, 2026. URL https://arxiv.org/abs/2602.03812

work page internal anchor Pith review Pith/arXiv arXiv 2026

[42] [44]

Qwen3 Technical Report

An Yang et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [45]

Michael J. Q. Zhang, W. Bradley Knox, and Eunsol Choi. Modeling future conversation turns to teach LLMs to ask clarifying questions. InInternational Conference on Learning Representations (ICLR), 2025. URLhttps://openreview.net/forum?id=cwuSAR7EKd

work page 2025

[44] [46]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems (NeurIPS), 2023. URLhttps://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [47]

the watermark is brittle

Zhihan Zhou, Robert Riley, Satria Kautsar, Weimin Wu, Rob Egan, Steven Hofmeyr, Shira Goldhaber-Gordon, Mutian Yu, Harrison Ho, Fengchen Liu, Feng Chen, Rachael Morgan-Kiss, Lizhen Shi, Han Liu, and Zhong Wang. GenomeOcean: An efficient genome foundation model trained on large-scale metagenomic assemblies.bioRxiv, 2025. doi: 10.1101/2025.01.30. 635558. UR...

work page doi:10.1101/2025.01.30 2025

[46] [48]

This confirms that the radioactivity property of [28] extends to a non-text generative modality

Inheritance through distillation.The watermarked ( δ=2) student is shifted by ∼7.95σ above the vanilla null and crosses z >2.33 on 95.5% of sequences (Figure 15). This confirms that the radioactivity property of [28] extends to a non-text generative modality

work page

[47] [49]

Multi-sequence aggregation scales as √n.Aggregating n independent sequences lifts TPR at FPR =1% from 45% at n=1 to ≥99.9% at n≥10 (Figure 16), consistent with the closed-form predictionz (n) agg = √n¯z

work page

[48] [50]

Truncation filter

Robustness to mild base-level mutation.Replacing a fraction of the bases in every δ=2 output with uniform substitutions from {A, C, G, T} and re-auditing, 78.5% of sequences still cross threshold at 5% mutation rate (Figure 17), indicating that the inherited bias is not concentrated in a single short motif. A more detailed treatment, including additional ...

work page