In a 2.51B-token LLM clinical extraction corpus, only 10.9% is trainable-unique while 79.4% is redundant from copying and duplication; de-duplication improves downstream disease recognition at fixed token budget.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
How much of an LLM-generated clinical corpus is actually new? A production-scale measurement of content redundancy for provenance classification