{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:ND2NF75GUDY6JST6IAF4VIN2KV","short_pith_number":"pith:ND2NF75G","schema_version":"1.0","canonical_sha256":"68f4d2ffa6a0f1e4ca7e400bcaa1ba556b7033f8df5509364d01ad4917a1d67a","source":{"kind":"arxiv","id":"2402.00838","version":4},"attestation_state":"computed","paper":{"title":"OLMo: Accelerating the Science of Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"OLMo is a competitive open language model released with its full training data, training code, and evaluation code to enable scientific study.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Aakanksha Naik, Abhilasha Ravichander, Akshita Bhagia, Ananya Harsh Jha, Arman Cohan, Crystal Nam, David Atkinson, Dirk Groeneveld, Dustin Schwenk, Emma Strubell, Hamish Ivison, Hannaneh Hajishirzi, Ian Magnusson, Iz Beltagy, Jack Hessel, Jacob Morrison, Jennifer Dumas, Jesse Dodge, Khyathi Raghavi Chandu, Kyle Lo, Kyle Richardson, Luca Soldaini, Luke Zettlemoyer, Matthew E. Peters, Mitchell Wortsman, Nathan Lambert, Niklas Muennighoff, Nishant Subramani, Noah A. Smith, Oyvind Tafjord, Pete Walsh, Pradeep Dasigi, Rodney Kinney, Russell Authur, Saurabh Shah, Shane Arora, Tushar Khot, Valentina Pyatkin, William Merrill, Will Smith, Yanai Elazar, Yizhong Wang, Yuling Gu","submitted_at":"2024-02-01T18:28:55Z","abstract_excerpt":"Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Op"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2402.00838","kind":"arxiv","version":4},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2024-02-01T18:28:55Z","cross_cats_sorted":[],"title_canon_sha256":"d1edfe6bb041c3fcb3b826f1a0d3bb001721e5ea66f746a6fe7c393057a4ba82","abstract_canon_sha256":"d4d53a4b872203f217ea0bfe53384663009baa5cf1fe61d21c475db395032681"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.314233Z","signature_b64":"RouIsjsPMan7iIboNwCno0iReikyPmJq8yb1imC0R4w/pJy2JLvTs9sq9UyM0VO4u/TNNTngjNAUwj0MgiIcCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"68f4d2ffa6a0f1e4ca7e400bcaa1ba556b7033f8df5509364d01ad4917a1d67a","last_reissued_at":"2026-05-17T23:38:46.313699Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.313699Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"OLMo: Accelerating the Science of Language Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"OLMo is a competitive open language model released with its full training data, training code, and evaluation code to enable scientific study.","cross_cats":[],"primary_cat":"cs.CL","authors_text":"Aakanksha Naik, Abhilasha Ravichander, Akshita Bhagia, Ananya Harsh Jha, Arman Cohan, Crystal Nam, David Atkinson, Dirk Groeneveld, Dustin Schwenk, Emma Strubell, Hamish Ivison, Hannaneh Hajishirzi, Ian Magnusson, Iz Beltagy, Jack Hessel, Jacob Morrison, Jennifer Dumas, Jesse Dodge, Khyathi Raghavi Chandu, Kyle Lo, Kyle Richardson, Luca Soldaini, Luke Zettlemoyer, Matthew E. Peters, Mitchell Wortsman, Nathan Lambert, Niklas Muennighoff, Nishant Subramani, Noah A. Smith, Oyvind Tafjord, Pete Walsh, Pradeep Dasigi, Rodney Kinney, Russell Authur, Saurabh Shah, Shane Arora, Tushar Khot, Valentina Pyatkin, William Merrill, Will Smith, Yanai Elazar, Yizhong Wang, Yuling Gu","submitted_at":"2024-02-01T18:28:55Z","abstract_excerpt":"Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Op"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the released OLMo is sufficiently competitive with closed models and that the research community will actively use the openness for rigorous scientific study rather than just inference.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"OLMo is a competitive open language model released with its full training data, training code, and evaluation code to enable scientific study.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3c765d3213fb7bc2d59c01b9f788a5fdb86c64a40ed4ca11d1ec5d3bf0b8bffb"},"source":{"id":"2402.00838","kind":"arxiv","version":4},"verdict":{"id":"847cde5a-ec48-47f0-9f9e-28dc9a13c3ab","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T22:57:12.208514Z","strongest_claim":"we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code.","one_line_summary":"OLMo delivers a fully open competitive language model with training data, code, and evaluations to enable community-driven scientific research on LMs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the released OLMo is sufficiently competitive with closed models and that the research community will actively use the openness for rigorous scientific study rather than just inference.","pith_extraction_headline":"OLMo is a competitive open language model released with its full training data, training code, and evaluation code to enable scientific study."},"references":{"count":12,"sample":[{"doi":"","year":2022,"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","ref_index":1,"cited_arxiv_id":"1607.06450","is_internal_anchor":true},{"doi":"","year":2016,"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","ref_index":2,"cited_arxiv_id":"2005.14165","is_internal_anchor":true},{"doi":"","year":1996,"title":"Sidney Greenbaum and Gerald Nelson","work_id":"c0308e11-2f83-46fe-ac86-c7882b2d7c60","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"arXiv preprint arXiv:2312.10253","work_id":"f562b393-1e7c-4a73-971b-6ba9398a3229","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","ref_index":5,"cited_arxiv_id":"2401.04088","is_internal_anchor":true}],"resolved_work":12,"snapshot_sha256":"41e0801a8a2f30c8543eecfc0b89bb2492fa1a414aaf48e8396f034e7dd6b921","internal_anchors":5},"formal_canon":{"evidence_count":1,"snapshot_sha256":"a3336b2bb76efcd20f7848600b07ab1b176e1f6647605ebf27c1bbf83a8fb10f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2402.00838","created_at":"2026-05-17T23:38:46.313794+00:00"},{"alias_kind":"arxiv_version","alias_value":"2402.00838v4","created_at":"2026-05-17T23:38:46.313794+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2402.00838","created_at":"2026-05-17T23:38:46.313794+00:00"},{"alias_kind":"pith_short_12","alias_value":"ND2NF75GUDY6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"ND2NF75GUDY6JST6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"ND2NF75G","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":31,"internal_anchor_count":31,"sample":[{"citing_arxiv_id":"2503.08223","citing_title":"Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices","ref_index":150,"is_internal_anchor":true},{"citing_arxiv_id":"2503.22760","citing_title":"Malicious and Unintentional Disclosure Risks in Large Language Models for Code Generation","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13989","citing_title":"VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2404.07143","citing_title":"Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13989","citing_title":"VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12824","citing_title":"Mechanism Plausibility in Generative Agent-Based Modeling","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2412.13663","citing_title":"Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference","ref_index":140,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15865","citing_title":"From Text to DSL: Evaluating Grammar-Based Model Generation Using Open LLMs","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19568","citing_title":"m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2506.20941","citing_title":"Revisiting the Past: Data Unlearning with Model State History","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06056","citing_title":"Data Compressibility Quantifies LLM Memorization","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2509.10546","citing_title":"Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain","ref_index":38,"is_internal_anchor":true},{"citing_arxiv_id":"2406.11794","citing_title":"DataComp-LM: In search of the next generation of training sets for language models","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2601.14053","citing_title":"LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2602.10995","citing_title":"A Human-Centric Framework for Data Attribution in Large Language Models","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2403.17297","citing_title":"InternLM2 Technical Report","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2405.07987","citing_title":"The Platonic Representation Hypothesis","ref_index":242,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13989","citing_title":"VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12824","citing_title":"Mechanism Plausibility in Generative Agent-Based Modeling","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2502.02737","citing_title":"SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model","ref_index":174,"is_internal_anchor":true},{"citing_arxiv_id":"2402.19173","citing_title":"StarCoder 2 and The Stack v2: The Next Generation","ref_index":206,"is_internal_anchor":true},{"citing_arxiv_id":"2502.05171","citing_title":"Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10504","citing_title":"Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00195","citing_title":"Diversity in Large Language Models under Supervised Fine-Tuning","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05365","citing_title":"ZAYA1-8B Technical Report","ref_index":15,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV","json":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV.json","graph_json":"https://pith.science/api/pith-number/ND2NF75GUDY6JST6IAF4VIN2KV/graph.json","events_json":"https://pith.science/api/pith-number/ND2NF75GUDY6JST6IAF4VIN2KV/events.json","paper":"https://pith.science/paper/ND2NF75G"},"agent_actions":{"view_html":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV","download_json":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV.json","view_paper":"https://pith.science/paper/ND2NF75G","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2402.00838&json=true","fetch_graph":"https://pith.science/api/pith-number/ND2NF75GUDY6JST6IAF4VIN2KV/graph.json","fetch_events":"https://pith.science/api/pith-number/ND2NF75GUDY6JST6IAF4VIN2KV/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV/action/timestamp_anchor","attest_storage":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV/action/storage_attestation","attest_author":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV/action/author_attestation","sign_citation":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV/action/citation_signature","submit_replication":"https://pith.science/pith/ND2NF75GUDY6JST6IAF4VIN2KV/action/replication_record"}},"created_at":"2026-05-17T23:38:46.313794+00:00","updated_at":"2026-05-17T23:38:46.313794+00:00"}