Hey, That's My Data! Token-Only Dataset Inference in Large Language Models
Large Language Models (LLMs) rely on massive training datasets, often including proprietary data, which raises concerns about unauthorized usage and copyright infringement. Existing dataset inference methods typically require access to log probabilities or other internal signals, but many modern LLMs restrict such access, motivating token-only inference approaches. We propose CatShift, a token-only dataset inference framework based on catastrophic forgetting, where models overwrite prior knowledge when trained on new data. Fine-tuning an LLM on a subset of its training data induces larger output shifts than fine-tuning on unseen data. CatShift compares these shifts against those from a known non-member validation set to infer whether a dataset was included in training. Experiments on both open-source and API-based LLMs show that CatShift remains effective without logit access, enabling practical protection of proprietary datasets.
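The decision procedure described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: here the "output shift" is approximated as one minus the Jaccard overlap between tokens generated before and after fine-tuning (the paper may use a different token-level statistic), and membership is flagged when the suspect dataset's shift exceeds an empirical percentile of the shifts measured on known non-member validation sets.

```python
def token_shift(pre_tokens, post_tokens):
    """Shift score in [0, 1]: 1 - Jaccard overlap between the token
    sets generated before and after fine-tuning (illustrative proxy)."""
    a, b = set(pre_tokens), set(post_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def is_member(suspect_shift, nonmember_shifts, percentile=0.95):
    """Flag the suspect dataset as a training member if its shift
    exceeds the given empirical percentile of shifts observed when
    fine-tuning on known non-member validation sets."""
    sorted_shifts = sorted(nonmember_shifts)
    k = min(int(percentile * len(sorted_shifts)), len(sorted_shifts) - 1)
    return suspect_shift > sorted_shifts[k]
```

A member dataset, already seen in training, is expected to trigger catastrophic forgetting dynamics that shift outputs more than an unseen dataset of similar size, so a high `suspect_shift` relative to the non-member baseline is evidence of membership. Both function names and the percentile threshold are assumptions made for this sketch.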
Forward citations
Cited by 1 Pith paper
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.