{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:FEB6QFFTRDQKX5RSDAAB5MODZL","short_pith_number":"pith:FEB6QFFT","schema_version":"1.0","canonical_sha256":"2903e814b388e0abf63218001eb1c3cadd13fc958cdc3344c85f333878871b2d","source":{"kind":"arxiv","id":"2504.21798","version":2},"attestation_state":"computed","paper":{"title":"SWE-smith: Scaling Data for Software Engineering Agents","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"SWE-smith automatically synthesizes 50k task instances from 128 Python repositories to train an open-source agent that resolves 40.2 percent of SWE-bench Verified issues.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Alexander Wettig, Binyuan Hui, Carlos E. Jimenez, Diyi Yang, John Yang, Kabir Khandpur, Kilian Lieret, Ludwig Schmidt, Ofir Press, Yanzhe Zhang","submitted_at":"2025-04-30T16:56:06Z","abstract_excerpt":"Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at sc"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.21798","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.SE","submitted_at":"2025-04-30T16:56:06Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"74e623afc4099bc2c0a46366219227b6101d99254dc432c65c1d7065bdfed02f","abstract_canon_sha256":"60c9c44a3ac8bee3645aa18c38296122a0252a61c6f42dcfb7d6f74ddfe56a75"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:52.792007Z","signature_b64":"AGH5B9mc/4CcQdVtj+phEsoLoVmWrEoTwpbnqieHDPLutqmECJdbHsW4OYAqhk7y/9qKkx2TbQC5lXoQroxpBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"2903e814b388e0abf63218001eb1c3cadd13fc958cdc3344c85f333878871b2d","last_reissued_at":"2026-05-17T23:38:52.791429Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:52.791429Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SWE-smith: Scaling Data for Software Engineering Agents","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"SWE-smith automatically synthesizes 50k task instances from 128 Python repositories to train an open-source agent that resolves 40.2 percent of SWE-bench Verified issues.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Alexander Wettig, Binyuan Hui, Carlos E. Jimenez, Diyi Yang, John Yang, Kabir Khandpur, Kilian Lieret, Ludwig Schmidt, Ofir Press, Yanzhe Zhang","submitted_at":"2025-04-30T16:56:06Z","abstract_excerpt":"Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at sc"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The automatically synthesized task instances that break tests are of sufficient quality, diversity, and realism to train models that generalize to real software engineering tasks, without requiring extensive human validation or filtering.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SWE-smith scales software engineering training data to 50k instances across 128 repositories, enabling SWE-agent-LM-32B to achieve 40.2% Pass@1 on SWE-bench Verified, state of the art among open-source models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SWE-smith automatically synthesizes 50k task instances from 128 Python repositories to train an open-source agent that resolves 40.2 percent of SWE-bench Verified issues.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3a5b743b32f898c00aafd23b74343fcb9d4df8142e7c2a063dd79c048bd6ae83"},"source":{"id":"2504.21798","kind":"arxiv","version":2},"verdict":{"id":"f16725fd-9f69-4fc6-8868-8d93d3507e6c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T10:16:25.188487Z","strongest_claim":"Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models.","one_line_summary":"SWE-smith scales software engineering training data to 50k instances across 128 repositories, enabling SWE-agent-LM-32B to achieve 40.2% Pass@1 on SWE-bench Verified, state of the art among open-source models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The automatically synthesized task instances that break tests are of sufficient quality, diversity, and realism to train models that generalize to real software engineering tasks, without requiring extensive human validation or filtering.","pith_extraction_headline":"SWE-smith automatically synthesizes 50k task instances from 128 Python repositories to train an open-source agent that resolves 40.2 percent of SWE-bench Verified issues."},"references":{"count":32,"sample":[{"doi":"","year":2024,"title":"Huatong Song, Lisheng Huang, Shuang Sun, Jinhao Jiang, Ran Le, Daixuan Cheng, Guoxin Chen, Yiwen Hu, Zongchao Chen, Wayne Xin Zhao, and 1 oth- ers","work_id":"17189f19-7774-4b97-ab44-9966bf5d6d48","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments","work_id":"793d9419-734d-45fe-9f51-d4c5a3a57cf8","ref_index":2,"cited_arxiv_id":"2404.07972","is_internal_anchor":true},{"doi":"","year":null,"title":"Occasionally, the README.md file may also contain installation instructions","work_id":"58d4d9c4-9e78-488f-99b7-d81cb063f85f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"pip install -e","work_id":"2068f316-7eba-42fd-8a18-c3d6c69098c6","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"You can usually find tests in a tests/ or test/ directory","work_id":"b236117d-9753-48f0-bd1e-5f499c7c32ca","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"2bd0ee4ccc404cecfed9e0e5f9f9c7ca5106986dd6e0770773bea07b851f38a6","internal_anchors":1},"formal_canon":{"evidence_count":2,"snapshot_sha256":"76ae2518a190d1b2d9ffb5d4245c21bbc3618400b384f7070c2390b818b797a9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.21798","created_at":"2026-05-17T23:38:52.791523+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.21798v2","created_at":"2026-05-17T23:38:52.791523+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.21798","created_at":"2026-05-17T23:38:52.791523+00:00"},{"alias_kind":"pith_short_12","alias_value":"FEB6QFFTRDQK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"FEB6QFFTRDQKX5RS","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"FEB6QFFT","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":29,"internal_anchor_count":29,"sample":[{"citing_arxiv_id":"2605.22175","citing_title":"SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?","ref_index":98,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21996","citing_title":"From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22526","citing_title":"\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2512.18552","citing_title":"Toward Training Superintelligent Software Agents through Self-Play SWE-RL","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02262","citing_title":"OmniCode: A Benchmark for Evaluating Software Engineering Agents","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06534","citing_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20833","citing_title":"MemGym: a Long-Horizon Memory Environment for LLM Agents","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18747","citing_title":"Code as Agent Harness","ref_index":138,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15222","citing_title":"PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16335","citing_title":"Beyond Verifiable Rewards: Rubric-Based GRM for Reinforced Fine-Tuning SWE Agents","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14445","citing_title":"FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12913","citing_title":"Revisiting DAgger in the Era of LLM-Agents","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13139","citing_title":"SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03044","citing_title":"JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2601.02780","citing_title":"MiMo-V2-Flash Technical Report","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09879","citing_title":"M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03546","citing_title":"ProgramBench: Can Language Models Rebuild Programs From Scratch?","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04018","citing_title":"Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24212","citing_title":"Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06534","citing_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18543","citing_title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12108","citing_title":"LLM-Based Automated Diagnosis Of Integration Test Failures At Google","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11270","citing_title":"Evaluating LLM Agents on Automated Software Analysis Tasks","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2602.15763","citing_title":"GLM-5: from Vibe Coding to Agentic Engineering","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07769","citing_title":"Coding Agents Don't Know When to Act","ref_index":11,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL","json":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL.json","graph_json":"https://pith.science/api/pith-number/FEB6QFFTRDQKX5RSDAAB5MODZL/graph.json","events_json":"https://pith.science/api/pith-number/FEB6QFFTRDQKX5RSDAAB5MODZL/events.json","paper":"https://pith.science/paper/FEB6QFFT"},"agent_actions":{"view_html":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL","download_json":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL.json","view_paper":"https://pith.science/paper/FEB6QFFT","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.21798&json=true","fetch_graph":"https://pith.science/api/pith-number/FEB6QFFTRDQKX5RSDAAB5MODZL/graph.json","fetch_events":"https://pith.science/api/pith-number/FEB6QFFTRDQKX5RSDAAB5MODZL/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL/action/timestamp_anchor","attest_storage":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL/action/storage_attestation","attest_author":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL/action/author_attestation","sign_citation":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL/action/citation_signature","submit_replication":"https://pith.science/pith/FEB6QFFTRDQKX5RSDAAB5MODZL/action/replication_record"}},"created_at":"2026-05-17T23:38:52.791523+00:00","updated_at":"2026-05-17T23:38:52.791523+00:00"}