LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.
Feder Cooper, Sanmi Koyejo, and Percy Liang
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 5roles
background 2polarities
background 2representative citing papers
A non-reversible hashing technique allows legal distribution of annotations for copyrighted texts by enabling alignment between user-owned copies and shared hashed data with high accuracy.
New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.
AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.
citing papers explorer
-
LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs
LLMs show high memorization capability under prefix attacks but low propensity under generic or dataset-specific prompts, with continual pre-training further reducing both.
-
Both Ends Count! Just How Good are LLM Agents at "Text-to-Big SQL"?
New Text-to-Big SQL metrics show that LLM agents must balance accuracy with cost and speed at scale, where GPT-4o trades some accuracy for up to 12x speedup and GPT-5.2 proves more cost-effective than Gemini 3 Pro on large inputs.
-
Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy
AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.
-
Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference
Merlin achieves byte-exact deduplication of text at up to 8.7 GB/s using SIMD-optimized hashing, reducing LLM context sizes by 13.9-71% with no data loss.