{"total":12,"items":[{"citing_arxiv_id":"2606.02258","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research","primary_cat":"cs.CE","submitted_at":"2026-06-01T13:45:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28371","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-27T12:11:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes agentic framework-based reproduction with a slot-binding interface to turn 16 PHM papers into standardized, assumption-aware benchmark implementations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27210","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation","primary_cat":"quant-ph","submitted_at":"2026-05-26T16:02:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Adapts QuantumKatas to Qiskit yielding a 350-task benchmark across 26 categories and evaluates 16 LLMs in 39,200 runs, reporting performance gaps and prompting 
effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20520","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Open-World Evaluations for Measuring Frontier AI Capabilities","primary_cat":"cs.AI","submitted_at":"2026-05-19T21:42:32+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18661","ref_index":204,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"structured, and critique-aware figure construction. For result plots and data visualization, MatPlotAgent [235] uses VLM-based visual feedback to improve data visualization quality, while PlotGen [53] and PlotCraft [252] study chart generation across diverse plot types and task difficulties. CoDA [31] explores multi-agent collaboration for visualization, and ChartGPT [204] decomposes chart generation into sequential reasoning steps for handling abstract natural-language inputs. 
More recent systems broaden the scope of generation and evaluation: SciFig [74] introduces rubric-based evaluation for pipeline figures, VisCoder [139] studies code-based visualization generation at scale, DiagramAgent [218] targets multiple diagram categories with specialized agents, and SciFlow-Bench [255] evaluates"},{"citing_arxiv_id":"2605.18630","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T16:34:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13950","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23106","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific 
Workflows","primary_cat":"cs.SE","submitted_at":"2026-04-25T02:01:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.00149","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations","primary_cat":"physics.comp-ph","submitted_at":"2026-03-31T18:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":", Guo, X., Zhou, J., Inafuku, D., Xue, C., Gao, L., Yang, Z., Hein, Y., Kahn, Y., Zhou, K., Luo, D., Wilson, J.D., Reilly, J.T., Bandak, D., Press, O., Yang, L., Wang, X., Tong, H., Chia, N., Huerta, E., Peng, H.: Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark (2025). 
https://arxiv.org/abs/2509.26574 [31] Tian, M., Gao, L., Zhang, S.D., Chen, X., Fan, C., Guo, X., Haas, R., Ji, P., Krongchon, K., Li, Y., Liu, S., Luo, D., Ma, Y., Tong, H., Trinh, K., Tian, C., Wang, Z., Wu, B., Xiong, Y., Yin, S., Zhu, M., Lieret, K., Lu, Y., Liu, G., Du, Y., Tao, T., Press, O., Callan, J., Huerta, E.A., Peng, H.: Scicode: A research coding benchmark curated by scientists."},{"citing_arxiv_id":"2603.23964","ref_index":178,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments","primary_cat":"cs.AI","submitted_at":"2026-03-25T05:56:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"tends beyond software engineering; it has also fostered a certain ecosystem of reinforcement learning environments within the field of scientific programming (SciCode).[177]. • Healthcare and biology: Reinforcement learning has begun to penetrate complex medical question answering and diagnosis. Some typical examples include: Medical Reasoning (MedQA, JAMA Clinical, etc.). [178]. 
With the emergence of chain-of-thoughts (CoTs), reinforcement learning has been applied to complex, expert-level medical and biological reasoning (MedQA-USMLE, MedXpertQA, KEGG PATHWAY, EHR-based Clinical Reasoning).[179, 180, 181, 182] • Physical Sciences: Notable examples include controlling plasma in nuclear fusion tokamak reactors [176] and"},{"citing_arxiv_id":"2510.08804","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding","primary_cat":"cs.CL","submitted_at":"2025-10-09T20:35:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15134","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-21T05:39:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}