{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:LPM2GGKOMAOVP3AIND2NNJ45OD","short_pith_number":"pith:LPM2GGKO","schema_version":"1.0","canonical_sha256":"5bd9a3194e601d57ec0868f4d6a79d70dc64d6e781232b82ed790948529fe591","source":{"kind":"arxiv","id":"2504.02605","version":1},"attestation_state":"computed","paper":{"title":"Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multi-SWE-bench supplies 1632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Aoyan Li, Daoguang Zan, Hanwu Chen, Jing Su, Kai Shen, Liangqiang Chen, Liang Xiang, Linhao Zhang, Lu Chen, Qi Liu, Rui Long, Shulin Xin, Siyao Liu, Tianyu Liu, Wei Liu, Xiaojian Zhong, Yongsheng Xiao, Yuyu Zhang, Zhirong Huang","submitted_at":"2025-04-03T14:06:17Z","abstract_excerpt":"The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the be"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2504.02605","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.SE","submitted_at":"2025-04-03T14:06:17Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"786fcc79ffc89a2a8b47161e0f7428763a97880976273b83de4674713cf59455","abstract_canon_sha256":"808b6ee12521cfa65d940dbff573db470cef0b046828badf3561d86e29929d47"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:48.786621Z","signature_b64":"/NRAVBxxuWJfjgadU+gf6nIi1G5l2srmFJ/YZFibfojIQ3DtTh1HxcvpRHNgss/n+tmpRUdrEtwLt0qJcjH0Cw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5bd9a3194e601d57ec0868f4d6a79d70dc64d6e781232b82ed790948529fe591","last_reissued_at":"2026-05-17T23:38:48.785976Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:48.785976Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Multi-SWE-bench supplies 1632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Aoyan Li, Daoguang Zan, Hanwu Chen, Jing Su, Kai Shen, Liangqiang Chen, Liang Xiang, Linhao Zhang, Lu Chen, Qi Liu, Rui Long, Shulin Xin, Siyao Liu, Tianyu Liu, Wei Liu, Xiaojian Zhong, Yongsheng Xiao, Yuyu Zhang, Zhirong Huang","submitted_at":"2025-04-03T14:06:17Z","abstract_excerpt":"The task of issue resolving is to modify a codebase to generate a patch that addresses a given issue. However, existing benchmarks, such as SWE-bench, focus almost exclusively on Python, making them insufficient for evaluating Large Language Models (LLMs) across diverse software ecosystems. To address this, we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the be"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The 68 expert annotators' curation from 2,456 candidates to 1,632 instances produces an unbiased, high-quality, and representative set that accurately reflects real-world issue-resolving difficulty across languages.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Multi-SWE-bench supplies 1632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"697bee0fa03bded8910dcaed581e4bf94113ae96302e6e955140d5431d109140"},"source":{"id":"2504.02605","kind":"arxiv","version":1},"verdict":{"id":"703ddaeb-a5e4-4ef7-8a3d-01251ec238e1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T06:45:01.500998Z","strongest_claim":"we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It includes a total of 1,632 high-quality instances, which were carefully annotated from 2,456 candidates by 68 expert annotators, ensuring that the benchmark can provide an accurate and reliable evaluation.","one_line_summary":"Multi-SWE-bench provides 1,632 high-quality issue-resolving instances across Java, TypeScript, JavaScript, Go, Rust, C, and C++ for evaluating LLMs on codebase modifications.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The 68 expert annotators' curation from 2,456 candidates to 1,632 instances produces an unbiased, high-quality, and representative set that accurately reflects real-world issue-resolving difficulty across languages.","pith_extraction_headline":"Multi-SWE-bench supplies 1632 expert-curated issue-resolving tasks across seven languages to test LLMs beyond Python-only benchmarks."},"references":{"count":23,"sample":[{"doi":"","year":2007,"title":"R. Abreu, P . Zoeteweij, and A. J. Van Gemund. On the accuracy of spectrum-based fault localization. In Testing: Academic and industrial conference practice and research techniques- MUTATION (TAICP AR","work_id":"7898c2e3-7589-4501-9f75-48e2a670f389","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2013,"title":"M. Allamanis and C. Sutton. Mining source code repositories at massive scale using language modeling. In 2013 10th working conference on mining software repositories (MSR), pages 207–216. IEEE,","work_id":"58a4b45d-f7e2-4aaf-99c4-5a9ed742cec7","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Multi-lingual evaluation of code generation models","work_id":"1283984e-956f-4584-8b1a-ae121c31626c","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","ref_index":4,"cited_arxiv_id":"2108.07732","is_internal_anchor":true},{"doi":"","year":null,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":5,"cited_arxiv_id":"2107.03374","is_internal_anchor":true}],"resolved_work":23,"snapshot_sha256":"a193678b530303929f0674249e7897ebf393c36c0349496b37a4bb5af76243a7","internal_anchors":7},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.02605","created_at":"2026-05-17T23:38:48.786076+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.02605v1","created_at":"2026-05-17T23:38:48.786076+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.02605","created_at":"2026-05-17T23:38:48.786076+00:00"},{"alias_kind":"pith_short_12","alias_value":"LPM2GGKOMAOV","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"LPM2GGKOMAOVP3AI","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"LPM2GGKO","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2605.22175","citing_title":"SWE-Mutation: Can LLMs Generate Reliable Test Suites in Software Engineering?","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21996","citing_title":"From Patches to Trajectories: Privileged Process Supervision for Software-Engineering Agents","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22526","citing_title":"\"Refactoring Runaway\": Understanding and Mitigating Tangled Refactorings in Coding Agents for Issue Resolution","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17444","citing_title":"MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair","ref_index":52,"is_internal_anchor":true},{"citing_arxiv_id":"2509.09505","citing_title":"Combating the Memory Walls: Optimization Pathways for Long-Context Agentic LLM Inference","ref_index":75,"is_internal_anchor":true},{"citing_arxiv_id":"2512.18470","citing_title":"SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18571","citing_title":"Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16322","citing_title":"Steerable Instruction Following Coding Data Synthesis with Actor-Parametric Schema Co-Evolution","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2603.20633","citing_title":"Seed1.8 Model Card: Towards Generalized Real-World Agency","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2504.19678","citing_title":"From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review","ref_index":129,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09423","citing_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","ref_index":102,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12913","citing_title":"Revisiting DAgger in the Era of LLM-Agents","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12925","citing_title":"AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13139","citing_title":"SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2509.02544","citing_title":"UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08366","citing_title":"SWE Atlas: Benchmarking Coding Agents Beyond Issue Resolution","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09423","citing_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","ref_index":102,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19742","citing_title":"PlayCoder: Making LLM-Generated GUI Code Playable","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2604.12162","citing_title":"AlphaEval: Evaluating Agents in Production","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07276","citing_title":"Signal Reshaping for GRPO in Weak-Feedback Agentic Code Repair","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2507.20534","citing_title":"Kimi K2: Open Agentic Intelligence","ref_index":91,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14709","citing_title":"HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16625","citing_title":"AdaExplore: Failure-Driven Adaptation and Diversity-Preserving Search for Efficient Kernel Generation","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18292","citing_title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","ref_index":126,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20398","citing_title":"WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning","ref_index":52,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD","json":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD.json","graph_json":"https://pith.science/api/pith-number/LPM2GGKOMAOVP3AIND2NNJ45OD/graph.json","events_json":"https://pith.science/api/pith-number/LPM2GGKOMAOVP3AIND2NNJ45OD/events.json","paper":"https://pith.science/paper/LPM2GGKO"},"agent_actions":{"view_html":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD","download_json":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD.json","view_paper":"https://pith.science/paper/LPM2GGKO","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.02605&json=true","fetch_graph":"https://pith.science/api/pith-number/LPM2GGKOMAOVP3AIND2NNJ45OD/graph.json","fetch_events":"https://pith.science/api/pith-number/LPM2GGKOMAOVP3AIND2NNJ45OD/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD/action/timestamp_anchor","attest_storage":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD/action/storage_attestation","attest_author":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD/action/author_attestation","sign_citation":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD/action/citation_signature","submit_replication":"https://pith.science/pith/LPM2GGKOMAOVP3AIND2NNJ45OD/action/replication_record"}},"created_at":"2026-05-17T23:38:48.786076+00:00","updated_at":"2026-05-17T23:38:48.786076+00:00"}