{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2020:SRNS5FXN35VG3OA55F6GX4BVX5","short_pith_number":"pith:SRNS5FXN","schema_version":"1.0","canonical_sha256":"945b2e96eddf6a6db81de97c6bf035bf5017e48b6718f5217e1b05104d4ea43c","source":{"kind":"arxiv","id":"2003.10555","version":1},"attestation_state":"computed","paper":{"title":"ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"ELECTRA pre-trains text encoders as discriminators that detect replaced tokens, producing stronger contextual representations than BERT with the same model size, data, and compute.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Christopher D. Manning, Kevin Clark, Minh-Thang Luong, Quoc V. Le","submitted_at":"2020-03-23T21:17:42Z","abstract_excerpt":"Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a m"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2003.10555","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2020-03-23T21:17:42Z","cross_cats_sorted":[],"title_canon_sha256":"2c4d718974014f01447a4ae698420439e4c1e639b3db756ab33c4e51fee81152","abstract_canon_sha256":"ee9a436437f5b3b17c4c329508a03483153ebbdb3e8d04a29b25d6a7ad806e7c"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.197311Z","signature_b64":"TERtOfKymfPGEDczEW+HgpcffToP4VswczT9psPj30Qu9i/hz7m6IRHvkf50xAmjc0XCNiH8dfUp9Ij9IL/AAg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"945b2e96eddf6a6db81de97c6bf035bf5017e48b6718f5217e1b05104d4ea43c","last_reissued_at":"2026-05-17T23:38:48.196704Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.196704Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"ELECTRA pre-trains text encoders as discriminators that detect replaced tokens, producing stronger contextual representations than BERT with the same model size, data, and compute.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Christopher D. Manning, Kevin Clark, Minh-Thang Luong, Quoc V. Le","submitted_at":"2020-03-23T21:17:42Z","abstract_excerpt":"Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a m"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the replaced-token detection objective produces transferable contextual representations superior to those from masked language modeling when model size, data, and compute are held fixed.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ELECTRA replaces masked language modeling with replaced token detection, yielding contextual representations that outperform BERT at equal compute and match larger models like RoBERTa with far less compute.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ELECTRA pre-trains text encoders as discriminators that detect replaced tokens, producing stronger contextual representations than BERT with the same model size, data, and compute.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3cc8f1b2416205746285a883c24cc451aa32a4c3a5d66dbfcd1cb4a146cc2af7"},"source":{"id":"2003.10555","kind":"arxiv","version":1},"verdict":{"id":"b9d9fd5e-2cb7-49e7-9afc-f2e4d564e628","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T10:22:50.482051Z","strongest_claim":"the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute","one_line_summary":"ELECTRA replaces masked language modeling with replaced token detection, yielding contextual representations that outperform BERT at equal compute and match larger models like RoBERTa with far less compute.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the replaced-token detection objective produces transferable contextual representations superior to those from masked language modeling when model size, data, and compute are held fixed.","pith_extraction_headline":"ELECTRA pre-trains text encoders as discriminators that detect replaced tokens, producing stronger contextual representations than BERT with the same model size, data, and compute."},"references":{"count":17,"sample":[{"doi":"","year":null,"title":"arXiv preprint arXiv:1811.02549 , year=","work_id":"851f20c8-1e5a-4d79-8bac-46b38d0a46aa","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"10 Published as a conference paper at ICLR 2020 Daniel M","work_id":"969a10a2-c713-40b3-b33b-8d7788afee97","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1909,"title":"TinyBERT: Distilling BERT for natural language understanding.arXiv preprint arXiv:1909.10351","work_id":"40d9fbb8-3c66-44bf-955c-5b5560f1e2f8","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1907,"title":"SpanBERT: Improving pre-training by representing and predicting spans","work_id":"1d90c45c-05d0-44dc-909b-2b6e2a406c24","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1909,"title":"ALBERT: A Lite BERT for Self-supervised Learning of Language Representations","work_id":"aedf7950-7c35-4e28-a32d-bec290f51669","ref_index":5,"cited_arxiv_id":"1909.11942","is_internal_anchor":true}],"resolved_work":17,"snapshot_sha256":"4461250c60d659f6c3c202767f933f81fca4df6a10ed43f03fd711861748fce2","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2f7f114414d9ddf586384dbb7ad0ee3ec5210b4b4bede11639a161d6d2bcf1de"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2003.10555","created_at":"2026-05-17T23:38:48.196796+00:00"},{"alias_kind":"arxiv_version","alias_value":"2003.10555v1","created_at":"2026-05-17T23:38:48.196796+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2003.10555","created_at":"2026-05-17T23:38:48.196796+00:00"},{"alias_kind":"pith_short_12","alias_value":"SRNS5FXN35VG","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"SRNS5FXN35VG3OA5","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"SRNS5FXN","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":33,"internal_anchor_count":33,"sample":[{"citing_arxiv_id":"2210.15304","citing_title":"Explaining the Explainers in Graph Neural Networks: a Comparative Study","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2211.16327","citing_title":"On the Power of Foundation Models","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2410.23657","citing_title":"Secret Leak Detection in Software Issue Reports using LLMs: A Comprehensive Evaluation","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2411.10636","citing_title":"Mitigating Extrinsic Gender Bias for Bangla Classification Tasks","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2502.00414","citing_title":"Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2502.05075","citing_title":"Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2503.22693","citing_title":"Bridging Language Models and Financial Analysis","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21869","citing_title":"Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16359","citing_title":"LLM4Log: A Systematic Review of Large Language Model-based Log Analysis","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20674","citing_title":"Modular Multimodal Classification Without Fine-Tuning: A Simple Compositional Approach","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18552","citing_title":"Protein Fold Classification at Scale: Benchmarking and Pretraining","ref_index":44,"is_internal_anchor":true},{"citing_arxiv_id":"2507.15066","citing_title":"Time-RA: Towards Time Series Reasoning for Anomaly Diagnosis with LLM Feedback","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2509.11443","citing_title":"A Transformer-Based Cross-Platform Analysis of Public Discourse on the 15-Minute City Paradigm","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2008.02217","citing_title":"Hopfield Networks is All You Need","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2309.05922","citing_title":"A Survey of Hallucination in Large Foundation Models","ref_index":107,"is_internal_anchor":true},{"citing_arxiv_id":"2309.16671","citing_title":"Demystifying CLIP Data","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16359","citing_title":"LLM4Log: A Systematic Review of Large Language Model-based Log Analysis","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14352","citing_title":"Ideology Prediction of German Political Texts","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2603.29813","citing_title":"Compiling Code LLMs into Lightweight Executables","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2303.15389","citing_title":"EVA-CLIP: Improved Training Techniques for CLIP at Scale","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"1910.10683","citing_title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24940","citing_title":"ADE: Adaptive Dictionary Embeddings -- Scaling Multi-Anchor Representations to Large Language Models","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23342","citing_title":"Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04901","citing_title":"On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":46,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5","json":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5.json","graph_json":"https://pith.science/api/pith-number/SRNS5FXN35VG3OA55F6GX4BVX5/graph.json","events_json":"https://pith.science/api/pith-number/SRNS5FXN35VG3OA55F6GX4BVX5/events.json","paper":"https://pith.science/paper/SRNS5FXN"},"agent_actions":{"view_html":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5","download_json":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5.json","view_paper":"https://pith.science/paper/SRNS5FXN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2003.10555&json=true","fetch_graph":"https://pith.science/api/pith-number/SRNS5FXN35VG3OA55F6GX4BVX5/graph.json","fetch_events":"https://pith.science/api/pith-number/SRNS5FXN35VG3OA55F6GX4BVX5/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5/action/timestamp_anchor","attest_storage":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5/action/storage_attestation","attest_author":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5/action/author_attestation","sign_citation":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5/action/citation_signature","submit_replication":"https://pith.science/pith/SRNS5FXN35VG3OA55F6GX4BVX5/action/replication_record"}},"created_at":"2026-05-17T23:38:48.196796+00:00","updated_at":"2026-05-17T23:38:48.196796+00:00"}