{"paper":{"title":"On the Fragility of Data Attribution When Learning Is Distributed","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A single participant can inflate its measured attribution value in distributed training while preserving global utility.","cross_cats":["cs.AI","cs.DC"],"primary_cat":"cs.LG","authors_text":"Bo Hui, Min-Te Sun, Wei-shinn Ku, Xian Gao","submitted_at":"2026-05-15T01:34:55Z","abstract_excerpt":"Data attribution has become an important component of pricing, auditing, and governance in machine learning pipelines, yet most attribution methods implicitly assume that attribution values faithfully reflect participants' contributions. We show that this assumption can fail: a single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility. Our attribution-first attack uses latent optimization to inject small synthetic batches that preserve utility while exploiting non-IID label coverage and evaluator sens"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"A single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Standard marginal-utility attribution evaluators remain sensitive to small synthetic batches that exploit non-IID label coverage in distributed settings.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A single adversary in distributed training inflates its attribution value via latent optimization on synthetic batches without degrading accuracy or triggering basic 
defenses.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A single participant can inflate its measured attribution value in distributed training while preserving global utility.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"499a70c9e777fb28bfa11e7b85c01147abd83b24c27129ebc6a93bc315d52145"},"source":{"id":"2605.15520","kind":"arxiv","version":1},"verdict":{"id":"8b1ccd11-b21d-4c1f-954c-9a939604f5e0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T15:11:22.551854Z","strongest_claim":"A single participant in a standard distributed training workflow can substantially inflate its measured attribution value while preserving global utility.","one_line_summary":"A single adversary in distributed training inflates its attribution value via latent optimization on synthetic batches without degrading accuracy or triggering basic defenses.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Standard marginal-utility attribution evaluators remain sensitive to small synthetic batches that exploit non-IID label coverage in distributed settings.","pith_extraction_headline":"A single participant can inflate its measured attribution value in distributed training while preserving global 
utility."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.15520/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T15:31:17.698670Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T15:20:35.041792Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T14:22:03.243928Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T14:21:54.047113Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"shingle_duplication","ran_at":"2026-05-19T13:49:41.843124Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T13:49:41.380369Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T13:33:22.628839Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"788a264bbdaff6a6ed9598a993708f418fdf45c77aed9e760ce7854c1ed28188"},"references":{"count":37,"sample":[{"doi":"","year":null,"title":"I., Cevher, V ., and Muehlebach, M","work_id":"379e9b71-cc84-4005-be37-c8bb22497030","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Shapley estimated explanation (shep): A fast post-hoc attribution method for interpreting intelligent fault diagnosis. arXiv preprint arXiv:2504.03773,","work_id":"865ea3ea-38a0-462d-88d0-395ae824e25f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Scaling laws for the value of individual data points in machine learning","work_id":"4c4b4bd0-83d8-4e72-b013-cf03dbf9a9f9","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Fair and efficient contribution valuation for vertical federated 
learning","work_id":"aa69b578-df04-41a9-abd5-ec7c20dfe1b9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"How to probe: Simple yet effective techniques for improving post-hoc explanations. arXiv preprint arXiv:2503.00641,","work_id":"b8345052-df6c-4b81-999e-7e52235c76f4","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":37,"snapshot_sha256":"158aec986a5c2e0d8dff39dc6a9fd1c9c894890796c18af0b9b4097b9e835519","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"bf9357b8c04a8809f8798a7d851948a446c962793bc1b621cab1d03e0ce99bc7"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}