{"paper":{"title":"Attention Sinks in Diffusion Transformers: A Causal Analysis","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Attention sinks in diffusion transformers can be removed without degrading text-image alignment.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Brian Summa, Fangzheng Wu","submitted_at":"2026-05-10T04:14:07Z","abstract_excerpt":"Attention sinks -- tokens that receive disproportionate attention mass -- are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion~3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageRewar"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at k=1; only under stronger interventions (k≥10) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. 
The perceptual shifts induced by suppression are nonetheless sink-specific -- ∼6× larger than equal-budget random masking","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the paired interventions on score and value paths causally isolate the contribution of attention sinks without introducing uncontrolled side effects on the diffusion trajectory, and that the chosen proxy metrics (CLIP-T, HPS-v2) faithfully measure semantic alignment independent of low-level perceptual style.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Attention sinks in diffusion transformers can be removed without degrading text-image alignment.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"5d9106a606de5b247459bd5c8c2047bf90bef75a06903fae46f3bbdf20fdf1b2"},"source":{"id":"2605.09313","kind":"arxiv","version":3},"verdict":{"id":"7e0ea0bb-4f3a-4756-ab87-19b767ad6588","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-13T05:55:24.296652Z","strongest_claim":"removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at k=1; only under stronger interventions (k≥10) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. 
The perceptual shifts induced by suppression are nonetheless sink-specific -- ∼6× larger than equal-budget random masking","one_line_summary":"Suppressing attention sinks in diffusion transformers does not degrade CLIP-T alignment at moderate levels but induces sink-specific perceptual shifts six times larger than equal-budget random masking.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the paired interventions on score and value paths causally isolate the contribution of attention sinks without introducing uncontrolled side effects on the diffusion trajectory, and that the chosen proxy metrics (CLIP-T, HPS-v2) faithfully measure semantic alignment independent of low-level perceptual style.","pith_extraction_headline":"Attention sinks in diffusion transformers can be removed without degrading text-image alignment."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.09313/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"claim_evidence","ran_at":"2026-05-20T08:02:04.573722Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T20:33:48.497602Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T13:31:17.482701Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T10:21:09.363475Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"c586d68a8334933d2be4fba0b372f4625af8d76e433820e9653d05f2318cabf1"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}