{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:NCTT23H2SLTJGYAMII3FZVBGCV","short_pith_number":"pith:NCTT23H2","schema_version":"1.0","canonical_sha256":"68a73d6cfa92e693600c42365cd426156bc5a66b75aacc328a6a6900cd2a6b84","source":{"kind":"arxiv","id":"2603.12277","version":5},"attestation_state":"computed","paper":{"title":"Prompt Injection as Role Confusion","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Language models fall for prompt injection because they judge text by its sound rather than its actual source.","cross_cats":["cs.AI","cs.CR"],"primary_cat":"cs.CL","authors_text":"Charles Ye, Dylan Hadfield-Menell, Jasmine Cui","submitted_at":"2026-02-22T18:43:34Z","abstract_excerpt":"LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive \"who is speaking,\" and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reaso"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2603.12277","kind":"arxiv","version":5},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2026-02-22T18:43:34Z","cross_cats_sorted":["cs.AI","cs.CR"],"title_canon_sha256":"b797b173eb31f7e2e55d66f702ce648ba2e7efc582da7d634a5eede42cfbb69b","abstract_canon_sha256":"1fd173ffe76f4c1fdacae62c0b259dcddab24491ce607557d4de3b39cc718967"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-06-01T01:02:36.889753Z","signature_b64":"AKXOMh2JF8MqncVJNGyMKyRRi0sAVC/0IIh8WW9cMIW5ICvxuOHU1H6NpWUaDiSSngCumGkiiTma1oCJ4PJOCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"68a73d6cfa92e693600c42365cd426156bc5a66b75aacc328a6a6900cd2a6b84","last_reissued_at":"2026-06-01T01:02:36.888342Z","signature_status":"signed_v1","first_computed_at":"2026-06-01T01:02:36.888342Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Prompt Injection as Role Confusion","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Language models fall for prompt injection because they judge text by its sound rather than its actual source.","cross_cats":["cs.AI","cs.CR"],"primary_cat":"cs.CL","authors_text":"Charles Ye, Dylan Hadfield-Menell, Jasmine Cui","submitted_at":"2026-02-22T18:43:34Z","abstract_excerpt":"LLMs see the world as a single stream of text, partitioned into roles like <user> or <tool>. We trace prompt injection to role confusion: models perceive the source of text from how it sounds, not its labeled role. A command hidden in a webpage hijacks an agent simply because it sounds like <user> text, despite its <tool> label. We design role probes to measure how LLMs internally perceive \"who is speaking,\" and find that injected text occupies the same representational space as the trusted role it imitates. We demonstrate this with CoT Forgery, a zero-shot attack that injects fabricated reaso"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from... the degree of role confusion strongly predicts attack success... introducing a unifying framework that reframes prompt injection not as an ad-hoc exploit but as a measurable consequence of how models represent role.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That role probes accurately measure internal role perception and that this perception causally drives the behavioral prompt injection success rather than merely correlating with it.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Language models confuse roles based on how text sounds rather than its true source, enabling measurable prompt injection attacks via role probes that predict success rates.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Language models fall for prompt injection because they judge text by its sound rather than its actual source.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"18e581a2991d76704e0791f72f54fde6657f23236688ea4afe2afaa6580e9072"},"source":{"id":"2603.12277","kind":"arxiv","version":5},"verdict":{"id":"c4bfed71-938c-4b47-a457-e8e91eefbd93","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T20:16:10.402848Z","strongest_claim":"We trace this failure to role confusion: models infer the source of text based on how it sounds, not where it actually comes from... the degree of role confusion strongly predicts attack success... introducing a unifying framework that reframes prompt injection not as an ad-hoc exploit but as a measurable consequence of how models represent role.","one_line_summary":"Language models confuse roles based on how text sounds rather than its true source, enabling measurable prompt injection attacks via role probes that predict success rates.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That role probes accurately measure internal role perception and that this perception causally drives the behavioral prompt injection success rather than merely correlating with it.","pith_extraction_headline":"Language models fall for prompt injection because they judge text by its sound rather than its actual source."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2603.12277/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"6364e14973ab922a7a5098016c123ce8502b8f8c3a180e42133c13123c176434"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2603.12277","created_at":"2026-06-01T01:02:36.888575+00:00"},{"alias_kind":"arxiv_version","alias_value":"2603.12277v5","created_at":"2026-06-01T01:02:36.888575+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2603.12277","created_at":"2026-06-01T01:02:36.888575+00:00"},{"alias_kind":"pith_short_12","alias_value":"NCTT23H2SLTJ","created_at":"2026-06-01T01:02:36.888575+00:00"},{"alias_kind":"pith_short_16","alias_value":"NCTT23H2SLTJGYAM","created_at":"2026-06-01T01:02:36.888575+00:00"},{"alias_kind":"pith_short_8","alias_value":"NCTT23H2","created_at":"2026-06-01T01:02:36.888575+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV","json":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV.json","graph_json":"https://pith.science/api/pith-number/NCTT23H2SLTJGYAMII3FZVBGCV/graph.json","events_json":"https://pith.science/api/pith-number/NCTT23H2SLTJGYAMII3FZVBGCV/events.json","paper":"https://pith.science/paper/NCTT23H2"},"agent_actions":{"view_html":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV","download_json":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV.json","view_paper":"https://pith.science/paper/NCTT23H2","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2603.12277&json=true","fetch_graph":"https://pith.science/api/pith-number/NCTT23H2SLTJGYAMII3FZVBGCV/graph.json","fetch_events":"https://pith.science/api/pith-number/NCTT23H2SLTJGYAMII3FZVBGCV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV/action/storage_attestation","attest_author":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV/action/author_attestation","sign_citation":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV/action/citation_signature","submit_replication":"https://pith.science/pith/NCTT23H2SLTJGYAMII3FZVBGCV/action/replication_record"}},"created_at":"2026-06-01T01:02:36.888575+00:00","updated_at":"2026-06-01T01:02:36.888575+00:00"}