{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:C5W5P6KVGEHNQOJOI7SI2L27Z2","short_pith_number":"pith:C5W5P6KV","schema_version":"1.0","canonical_sha256":"176dd7f955310ed8392e47e48d2f5fceb0214103eb77daf9ae84779d57b047bb","source":{"kind":"arxiv","id":"2111.09543","version":4},"attestation_state":"computed","paper":{"title":"DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DeBERTaV3 replaces masked language modeling with replaced token detection and introduces gradient-disentangled embedding sharing to raise accuracy on natural language understanding benchmarks.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Jianfeng Gao, Pengcheng He, Weizhu Chen","submitted_at":"2021-11-18T06:48:00Z","abstract_excerpt":"This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the \"tug-of-war\" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamic"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2111.09543","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2021-11-18T06:48:00Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"bbfea01373de9b0c2623a49541e84f761d6cab8f9bc1eb766acce36e5188d917","abstract_canon_sha256":"f35e93f6ffbc86590f62ab257c29e8323fbfd90db0896f8a2e22f75307f307a9"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.519522Z","signature_b64":"NvP0XxtDFkUQcBwuakGHvwkccM4Yb0zWxneIahS3EthmM3lRVJpVlUi5QcfdJ91K6dQexXfipyx0AGCSDeydCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"176dd7f955310ed8392e47e48d2f5fceb0214103eb77daf9ae84779d57b047bb","last_reissued_at":"2026-05-17T23:38:52.519089Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.519089Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing","license":"http://creativecommons.org/licenses/by/4.0/","headline":"DeBERTaV3 replaces masked language modeling with replaced token detection and introduces gradient-disentangled embedding sharing to raise accuracy on natural language understanding benchmarks.","cross_cats":["cs.LG"],"primary_cat":"cs.CL","authors_text":"Jianfeng Gao, Pengcheng He, Weizhu Chen","submitted_at":"2021-11-18T06:48:00Z","abstract_excerpt":"This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more sample-efficient pre-training task. Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance. This is because the training losses of the discriminator and the generator pull token embeddings in different directions, creating the \"tug-of-war\" dynamics. We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamic"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed gains come from the gradient-disentangled sharing rather than from other unstated differences in training schedule, data order, or hyper-parameters between the new runs and the cited DeBERTa/ELECTRA baselines.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"DeBERTaV3 improves DeBERTa by switching to replaced token detection pre-training and using gradient-disentangled embedding sharing, reaching 91.37% on GLUE and new SOTA on XNLI zero-shot.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"DeBERTaV3 replaces masked language modeling with replaced token detection and introduces gradient-disentangled embedding sharing to raise accuracy on natural language understanding benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0a8765f68352b0095408459e810ab2b5ffc2c46cb226296ff4c7027976b0144e"},"source":{"id":"2111.09543","kind":"arxiv","version":4},"verdict":{"id":"16f8b19b-50a4-4a5e-9eab-860559247bc0","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T12:43:34.120976Z","strongest_claim":"the DeBERTaV3 Large model achieves a 91.37% average score, which is 1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art (SOTA) among the models with a similar structure.","one_line_summary":"DeBERTaV3 improves DeBERTa by switching to replaced token detection pre-training and using gradient-disentangled embedding sharing, reaching 91.37% on GLUE and new SOTA on XNLI zero-shot.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed gains come from the gradient-disentangled sharing rather than from other unstated differences in training schedule, data order, or hyper-parameters between the new runs and the cited DeBERTa/ELECTRA baselines.","pith_extraction_headline":"DeBERTaV3 replaces masked language modeling with replaced token detection and introduces gradient-disentangled embedding sharing to raise accuracy on natural language understanding benchmarks."},"references":{"count":29,"sample":[{"doi":"","year":2005,"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","ref_index":1,"cited_arxiv_id":"2005.14165","is_internal_anchor":true},{"doi":"","year":2017,"title":"Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation","work_id":"b884e540-64cb-4ba2-8bc3-ed507116ef2c","ref_index":2,"cited_arxiv_id":"1708.00055","is_internal_anchor":true},{"doi":"","year":null,"title":"Xlm-e: Cross-lingual language model pre-training via electra","work_id":"83df32e3-23a3-45d6-9309-83e4754d6e56","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"Xnli: Evaluating cross-lingual sentence representations","work_id":"86645862-6fcc-432e-9362-8e78c10b2759","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Bert: Pre-training of deep bidirectional transformers for language understanding","work_id":"693b70ad-3022-4615-938e-7752341ec181","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":29,"snapshot_sha256":"172f439b9a71b3472d6222cb44401148cebbf1f3bd095dc20ffb75f65564a805","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a8e9a39262aa431ab0f61a93a5641b8064b9fae81cfb5ed3bada3b28c0de223e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2111.09543","created_at":"2026-05-17T23:38:52.519159+00:00"},{"alias_kind":"arxiv_version","alias_value":"2111.09543v4","created_at":"2026-05-17T23:38:52.519159+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2111.09543","created_at":"2026-05-17T23:38:52.519159+00:00"},{"alias_kind":"pith_short_12","alias_value":"C5W5P6KVGEHN","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"C5W5P6KVGEHNQOJO","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"C5W5P6KV","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":41,"internal_anchor_count":41,"sample":[{"citing_arxiv_id":"2508.12043","citing_title":"Talk Less, Fly Lighter: Autonomous Semantic Compression for UAV Swarm Communication via LLMs","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2402.19088","citing_title":"Survey in Characterizing Semantic Change","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23728","citing_title":"GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2502.14912","citing_title":"Semantic Embeddings of Chemical Elements for Enhanced Materials Inference and Discovery","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2504.10166","citing_title":"Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2504.05902","citing_title":"Defending against Backdoor Attacks via Module Switching","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13293","citing_title":"Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20713","citing_title":"SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00155","citing_title":"Wasserstein Distributionally Robust Regret Optimization for Reinforcement Learning from Human Feedback","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18871","citing_title":"Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17528","citing_title":"CasualSynth: Generating Structurally Sound Synthetic Data","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06223","citing_title":"ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15343","citing_title":"Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11118","citing_title":"A Cascaded Generative Approach for e-Commerce Recommendations","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2506.13743","citing_title":"LTRR: Learning To Rank Retrievers for LLMs","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2510.13293","citing_title":"Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2511.06091","citing_title":"Characterizing AI Manipulation Risks in Brazilian YouTube Climate Discourse","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2110.08193","citing_title":"BBQ: A Hand-Built Bias Benchmark for Question Answering","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2603.02709","citing_title":"Sensory-Aware Sequential Recommendation via Review-Distilled Representations","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14169","citing_title":"BOOKMARKS: Efficient Active Storyline Memory for Role-playing","ref_index":90,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11118","citing_title":"A Cascaded Generative Approach for e-Commerce Recommendations","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2303.10512","citing_title":"AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27861","citing_title":"TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27723","citing_title":"Optimized Deferral for Imbalanced Settings","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06231","citing_title":"YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling","ref_index":40,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2","json":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2.json","graph_json":"https://pith.science/api/pith-number/C5W5P6KVGEHNQOJOI7SI2L27Z2/graph.json","events_json":"https://pith.science/api/pith-number/C5W5P6KVGEHNQOJOI7SI2L27Z2/events.json","paper":"https://pith.science/paper/C5W5P6KV"},"agent_actions":{"view_html":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2","download_json":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2.json","view_paper":"https://pith.science/paper/C5W5P6KV","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2111.09543&json=true","fetch_graph":"https://pith.science/api/pith-number/C5W5P6KVGEHNQOJOI7SI2L27Z2/graph.json","fetch_events":"https://pith.science/api/pith-number/C5W5P6KVGEHNQOJOI7SI2L27Z2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2/action/storage_attestation","attest_author":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2/action/author_attestation","sign_citation":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2/action/citation_signature","submit_replication":"https://pith.science/pith/C5W5P6KVGEHNQOJOI7SI2L27Z2/action/replication_record"}},"created_at":"2026-05-17T23:38:52.519159+00:00","updated_at":"2026-05-17T23:38:52.519159+00:00"}