{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2026:GNDFWN72KKEFXVRCBNX2INK4V6","short_pith_number":"pith:GNDFWN72","schema_version":"1.0","canonical_sha256":"33465b37fa52885bd6220b6fa4355cafbc9d85e7634acae73e21ca62f9d45c2f","source":{"kind":"arxiv","id":"2605.17242","version":1},"attestation_state":"computed","paper":{"title":"From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements","cross_cats":[],"primary_cat":"cs.SE","authors_text":"Jiakai Xu, Jingyu Xiao, Michael R Lyu, Tingshuo Liang, Yintong Huo, Yuxuan Wan","submitted_at":"2026-05-17T03:48:41Z","abstract_excerpt":"Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals -- steps that current agents cannot perform without human mediation.\n  We present TDDev, a framework that automates this closed loop through three sta"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2605.17242","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","primary_cat":"cs.SE","submitted_at":"2026-05-17T03:48:41Z","cross_cats_sorted":[],"title_canon_sha256":"83c9de78f150d1c5f3f28e16a196ccf37f6a0df0b0a147a8ab48f5d0c47d32dd","abstract_canon_sha256":"ec43e51d781993157f004d2f4a81905a76800d6adf4c5cb18e48d22478f539b2"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-20T00:03:47.135289Z","signature_b64":"fgZBEPfVDaHYVtZ/UZA8+Ajf7IjMHSemob8NtoQivs19Knaol5jP7V0hZMKbnYU0dHKhmG8eAG5BjYkXvd/aCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"33465b37fa52885bd6220b6fa4355cafbc9d85e7634acae73e21ca62f9d45c2f","last_reissued_at":"2026-05-20T00:03:47.134497Z","signature_status":"signed_v1","first_computed_at":"2026-05-20T00:03:47.134497Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements","license":"http://creativecommons.org/licenses/by-nc-sa/4.0/","headline":"TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements","cross_cats":[],"primary_cat":"cs.SE","authors_text":"Jiakai Xu, Jingyu Xiao, Michael R Lyu, Tingshuo Liang, Yintong Huo, Yuxuan Wan","submitted_at":"2026-05-17T03:48:41Z","abstract_excerpt":"Coding agents can generate web applications from natural-language descriptions, yet a recent benchmark study shows that generated applications fail to meet functional requirements in over 70% of cases. The core difficulty is that web correctness cannot be assessed from source files or terminal output: the application must be deployed, exercised through simulated browser interactions, and failures must be translated into actionable repair signals -- steps that current agents cannot perform without human mediation.\n  We present TDDev, a framework that automates this closed loop through three sta"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"TDD infrastructure consistently improves generation quality by 34--48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that browser-based interaction simulation can reliably detect functional failures and translate them into structured repair reports that the coding agent can act on without human mediation, as this is presented as the core difficulty that current agents cannot perform.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"75941f3790549ea8d1c6f5b954b225741e25037969dba5e769bca64097b14668"},"source":{"id":"2605.17242","kind":"arxiv","version":1},"verdict":{"id":"ffe1d17a-0fe1-4519-857a-484dd7f6b450","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T23:17:10.038194Z","strongest_claim":"TDD infrastructure consistently improves generation quality by 34--48 percentage points over a no-TDD baseline. The central finding is that the optimal protocol depends on the model's generation style: models that build applications holistically benefit most from agentic enforcement, while models that extend code conservatively benefit from incremental enforcement.","one_line_summary":"TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that browser-based interaction simulation can reliably detect functional failures and translate them into structured repair reports that the coding agent can act on without human mediation, as this is presented as the core difficulty that current agents cannot perform.","pith_extraction_headline":"TDDev automates test-driven development so coding agents can generate functional full-stack web apps from requirements"},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.17242/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_title_agreement","ran_at":"2026-05-19T23:31:20.324652Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T23:31:15.153184Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T22:01:57.881104Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T21:33:23.795439Z","status":"skipped","version":"1.0.0","findings_count":0}],"snapshot_sha256":"eb6ab445c92655fd3787dfeaeda972ac75e4d87677c2af8d9bfb9d84a76fbe7e"},"references":{"count":59,"sample":[{"doi":"","year":2023,"title":"UI/Application Exerciser Monkey","work_id":"a1fb0888-c7c2-4728-b009-521f63774eaa","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"17+ Surprising WordPress Statistics You Should Not Miss [2024].WPDe- veloper(2024)","work_id":"f2c7d810-d80e-4ea8-8203-d433140c3915","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"How Many Websites Are There in 2024? (13 Latest Statistics).TechJury (2024)","work_id":"733aaf66-f2ee-4ae5-bb50-beb22b394b2f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Nadia Alshahwan, Jubin Chheda, Anastasia Finogenova, Beliz Gokkaya, Mark Harman, Inna Harper, Alexandru Marginean, Shubho Sengupta, and Eddy Wang","work_id":"0f3f5338-5f8f-4a35-823a-5b3cf30dc692","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3663529.3663839","year":null,"title":"InCompanion Proceedings of the ACM International Conference on Foundations of Software Engineering (FSE Companion)","work_id":"d49e6e29-0455-4018-ae88-5d75c1665b32","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":59,"snapshot_sha256":"9d0da99b3cd5e05d5009b8ac52ad8a9bb39c5898b39763618ecf16a0e70bc8dc","internal_anchors":2},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2605.17242","created_at":"2026-05-20T00:03:47.134625+00:00"},{"alias_kind":"arxiv_version","alias_value":"2605.17242v1","created_at":"2026-05-20T00:03:47.134625+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2605.17242","created_at":"2026-05-20T00:03:47.134625+00:00"},{"alias_kind":"pith_short_12","alias_value":"GNDFWN72KKEF","created_at":"2026-05-20T00:03:47.134625+00:00"},{"alias_kind":"pith_short_16","alias_value":"GNDFWN72KKEFXVRC","created_at":"2026-05-20T00:03:47.134625+00:00"},{"alias_kind":"pith_short_8","alias_value":"GNDFWN72","created_at":"2026-05-20T00:03:47.134625+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":0,"internal_anchor_count":0,"sample":[]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6","json":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6.json","graph_json":"https://pith.science/api/pith-number/GNDFWN72KKEFXVRCBNX2INK4V6/graph.json","events_json":"https://pith.science/api/pith-number/GNDFWN72KKEFXVRCBNX2INK4V6/events.json","paper":"https://pith.science/paper/GNDFWN72"},"agent_actions":{"view_html":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6","download_json":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6.json","view_paper":"https://pith.science/paper/GNDFWN72","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2605.17242&json=true","fetch_graph":"https://pith.science/api/pith-number/GNDFWN72KKEFXVRCBNX2INK4V6/graph.json","fetch_events":"https://pith.science/api/pith-number/GNDFWN72KKEFXVRCBNX2INK4V6/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6/action/timestamp_anchor","attest_storage":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6/action/storage_attestation","attest_author":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6/action/author_attestation","sign_citation":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6/action/citation_signature","submit_replication":"https://pith.science/pith/GNDFWN72KKEFXVRCBNX2INK4V6/action/replication_record"}},"created_at":"2026-05-20T00:03:47.134625+00:00","updated_at":"2026-05-20T00:03:47.134625+00:00"}