{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:JQV66JLA35IBZFBTFZ3NRYLFR7","short_pith_number":"pith:JQV66JLA","schema_version":"1.0","canonical_sha256":"4c2bef2560df501c94332e76d8e1658fe77d63b751508f4118a5d4e623f63c80","source":{"kind":"arxiv","id":"2407.10362","version":3},"attestation_state":"computed","paper":{"title":"LAB-Bench: Measuring Capabilities of Language Models for Biology Research","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Andrew D. White, Jon M. Laurent, Joseph D. Janizek, Manvitha Ponnapati, Michaela M. Hinks, Michael J. Hammerling, Michael Ruzo, Samuel G. Rodriques, Siddharth Narayanan","submitted_at":"2024-07-14T23:52:25Z","abstract_excerpt":"There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad datas"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.10362","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.AI","submitted_at":"2024-07-14T23:52:25Z","cross_cats_sorted":[],"title_canon_sha256":"e1e688186ac8a564ee4148b596d36e6270602308f20fb0f5c063dad5750372a3","abstract_canon_sha256":"93987a7bf6ec82cff30bd36782bcb0930d5cc6ddbca93afde4947e0547ac096e"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.379680Z","signature_b64":"vSStX4E5lseUKogt/ljCrX7Cmo4lgO1e38iggLBFRrRia494dbrWWoOPwXI+q+n4uIZ1H7eA8HdQ0+lV1iEPCw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"4c2bef2560df501c94332e76d8e1658fe77d63b751508f4118a5d4e623f63c80","last_reissued_at":"2026-05-17T23:38:47.379162Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.379162Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"LAB-Bench: Measuring Capabilities of Language Models for Biology Research","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Andrew D. White, Jon M. Laurent, Joseph D. Janizek, Manvitha Ponnapati, Michaela M. Hinks, Michael J. Hammerling, Michael Ruzo, Samuel G. Rodriques, Siddharth Narayanan","submitted_at":"2024-07-14T23:52:25Z","abstract_excerpt":"There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad datas"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"An AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The multiple-choice questions in LAB-Bench accurately reflect the practical capabilities required for real-world biology research tasks, rather than testing only surface-level pattern matching.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"299834ed872332ccc83d21cf388a00309e84b2662a7a614255cf124078ab9fa0"},"source":{"id":"2407.10362","kind":"arxiv","version":3},"verdict":{"id":"e0fc9f87-3847-4836-8602-697bc90dfc2d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T15:54:22.695332Z","strongest_claim":"An AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning.","one_line_summary":"LAB-Bench provides over 2,400 multiple-choice questions to measure LLM performance on real biology research tasks like literature recall, figure reading, database access, and sequence manipulation, with initial results compared against human expert biologists.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The multiple-choice questions in LAB-Bench accurately reflect the practical capabilities required for real-world biology research tasks, rather than testing only surface-level pattern matching.","pith_extraction_headline":"LAB-Bench introduces over 2,400 questions to test AI on practical biology research tasks such as literature search and sequence manipulation."},"references":{"count":59,"sample":[{"doi":"","year":2015,"title":"Joanna S Amberger, Carol A Bocchini, François Schiettecatte, Alan F Scott, and Ada Hamosh. Omim. org: Online mendelian inheritance in man (omim®), an online catalog of human genes and genetic disorder","work_id":"d9644263-9bba-42f7-96b2-73411b093b8f","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Introducing the next generation of claude, March 2024","work_id":"089ab74c-c20b-448c-b7e1-b1e3eeb8ea71","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Introducing the next generation of claude, March 2024","work_id":"45581889-1f17-4b18-a18d-dec141414b05","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","work_id":"47b597a2-a355-4305-b1e4-80666b394ccd","ref_index":4,"cited_arxiv_id":"2405.14782","is_internal_anchor":true},{"doi":"10.1038/s41586-023-06792-0","year":2023,"title":"Autonomous chemical research with large language models","work_id":"e15cebd6-c137-47c6-975e-41b70ed20de9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":59,"snapshot_sha256":"8165fd4d148049a6ebd835e6a2358a84e0f26c5931754e99119fb4b53c67db66","internal_anchors":3},"formal_canon":{"evidence_count":3,"snapshot_sha256":"64d42f3f44977608c9c52a1aa465f4896288a6370c0800e0b3abf6b2293516f9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.10362","created_at":"2026-05-17T23:38:47.379254+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.10362v3","created_at":"2026-05-17T23:38:47.379254+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.10362","created_at":"2026-05-17T23:38:47.379254+00:00"},{"alias_kind":"pith_short_12","alias_value":"JQV66JLA35IB","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"JQV66JLA35IBZFBT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"JQV66JLA","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":26,"internal_anchor_count":26,"sample":[{"citing_arxiv_id":"2605.17373","citing_title":"FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18661","citing_title":"AI for Auto-Research: Roadmap & User Guide","ref_index":96,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15766","citing_title":"BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15341","citing_title":"LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2506.06414","citing_title":"Benchmarking Misuse Mitigation Against Covert Adversaries","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2506.22598","citing_title":"RExBench: Can coding agents autonomously implement AI research extensions?","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2507.06261","citing_title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2507.11810","citing_title":"Evolving Roles of LLMs in Scientific Innovation: Assistant, Collaborator, Scientist, and Evaluator","ref_index":82,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21800","citing_title":"BioAgent Bench: An AI Agent Evaluation Suite for Bioinformatics","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2506.11763","citing_title":"DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09554","citing_title":"LABBench2: An Improved Benchmark for AI Systems Performing Biology Research","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13950","citing_title":"Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03361","citing_title":"The limits of bio-molecular modeling with large language models : a cross-scale evaluation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02934","citing_title":"PolyReal: A Benchmark for Real-World Polymer Science Workflows","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03121","citing_title":"An Independent Safety Evaluation of Kimi K2.5","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10125","citing_title":"Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10125","citing_title":"Useful for Exploration, Risky for Precision: Evaluating AI Tools in Academic Research","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24966","citing_title":"Risk Reporting for Developers' Internal AI Model Use","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06177","citing_title":"BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.01640","citing_title":"Prescriptive Scaling Laws for Data Constrained Training","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00267","citing_title":"Jailbroken Frontier Models Retain Their Capabilities","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18805","citing_title":"AI scientists produce results without reasoning scientifically","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10718","citing_title":"SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06111","citing_title":"AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2501.14249","citing_title":"Humanity's Last Exam","ref_index":32,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7","json":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7.json","graph_json":"https://pith.science/api/pith-number/JQV66JLA35IBZFBTFZ3NRYLFR7/graph.json","events_json":"https://pith.science/api/pith-number/JQV66JLA35IBZFBTFZ3NRYLFR7/events.json","paper":"https://pith.science/paper/JQV66JLA"},"agent_actions":{"view_html":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7","download_json":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7.json","view_paper":"https://pith.science/paper/JQV66JLA","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.10362&json=true","fetch_graph":"https://pith.science/api/pith-number/JQV66JLA35IBZFBTFZ3NRYLFR7/graph.json","fetch_events":"https://pith.science/api/pith-number/JQV66JLA35IBZFBTFZ3NRYLFR7/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7/action/timestamp_anchor","attest_storage":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7/action/storage_attestation","attest_author":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7/action/author_attestation","sign_citation":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7/action/citation_signature","submit_replication":"https://pith.science/pith/JQV66JLA35IBZFBTFZ3NRYLFR7/action/replication_record"}},"created_at":"2026-05-17T23:38:47.379254+00:00","updated_at":"2026-05-17T23:38:47.379254+00:00"}