{"total":18,"items":[{"citing_arxiv_id":"2606.29630","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SFBench: The SciFy Scientific Feasibility Benchmark","primary_cat":"cs.AI","submitted_at":"2026-06-28T22:27:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SFBench provides 197 expert-created materials science claims with feasibility scores and explanations to evaluate AI systems on scientific feasibility assessment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18648","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark","primary_cat":"physics.comp-ph","submitted_at":"2026-06-17T03:32:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13669","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Agents-K1: Towards Agent-native Knowledge Orchestration","primary_cat":"cs.AI","submitted_at":"2026-06-11T17:58:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agents-K1 is an end-to-end pipeline with a multimodal parser, 4B GRPO-trained extractor, and agent CLI that builds scientific knowledge graphs from full papers and was run on 2.46 million documents to produce 
Scholar-KG.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11926","ref_index":157,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Toward Generalist Autonomous Research via Hypothesis-Tree Refinement","primary_cat":"cs.CL","submitted_at":"2026-06-10T10:57:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02859","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions","primary_cat":"cs.CL","submitted_at":"2026-06-01T20:21:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An economy of agents using auctions and wealth accumulation produces emergent multi-step reasoning that outperforms monolithic baselines on five agentic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01145","ref_index":285,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches","primary_cat":"cs.AI","submitted_at":"2026-05-31T10:27:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A survey of RLM use in 28 disciplines reveals uneven adoption and introduces a maturity assessment framework showing larger gaps when limited to public 
resources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22681","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Forecasting Scientific Progress with Artificial Intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-21T16:23:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18661","ref_index":209,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI for Auto-Research: Roadmap & User Guide","primary_cat":"cs.AI","submitted_at":"2026-05-18T17:08:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"also raises a central question for this stage: whether apparent novelty corresponds to executable and impactful research. Subsequent work has explored three ways to strengthen direct generation. First, iterative refinement uses feedback loops to improve idea specificity and reduce shallow novelty. 
ResearchAgent [10] incorporates academic graph feedback to refine generated ideas, SciMON [209] iteratively compares candidate ideas against prior work to mitigate the tendency of direct LLM prompting toward shallow contributions, and Chain of Ideas [102] organizes literature into progressive reasoning chains that outperform simple prompting baselines. Second, learned quality signals introduce explicit scoring or optimization objectives. Spark [168] combines"},{"citing_arxiv_id":"2605.16217","ref_index":41,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Argus: Evidence Assembly for Scalable Deep Research Agents","primary_cat":"cs.CL","submitted_at":"2026-05-15T17:29:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13950","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:00:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13301","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Achieving Gold-Medal-Level Olympiad Reasoning via Simple and 
Unified Scaling","primary_cat":"cs.AI","submitted_at":"2026-05-13T10:13:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06326","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"tation is both context-inefficient and error-prone, especially for large-scale search and multi-case exploration [22]. 6.4 Generalization Analysis Cross-domain transfer. Although trained only on math data, the learned interleaved reasoning pattern transfers to different domains and tasks. As shown in Table 9, across diverse benchmarks, including FrontierScience [38], which requires scientific computation, LiveCodeBench [16], which evaluates code generation ability, and the knowledge-intensive GPQA-Diamond [30], our models consistently achieve non-trivial improvements over the base models, with gains of up to 14.5%. Cross-model transfer. Our recipe is not specific to a single model family. 
We apply the recipe to"},{"citing_arxiv_id":"2605.05873","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency","primary_cat":"stat.ML","submitted_at":"2026-05-07T08:41:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03185","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AI and the Research-Education Environment of Physics","primary_cat":"physics.ed-ph","submitted_at":"2026-05-04T22:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":1.0,"formal_verification":"none","one_line_summary":"A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01489","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-02T15:26:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SciResearcher is a new agentic data-construction framework that trains an 8B model via supervised fine-tuning and reinforcement learning to reach 19.46% on HLE-Bio/Chem-Gold and 13-15% gains on related biology and literature 
benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"SciResearcher-8B-SFT 12.75 31.52 47.67 SciResearcher-8B-RL 19.46 35.87 49.42 -pass@3 31.54 51.09 60.47 Table 2: Training data composition. # Tasks indicates number of QA pairs and # Steps indicates number of step-level messages for training. Dataset # Tasks # Steps SciResearcherQA-Concept 371 2,872 SciResearcherQA-Compute 104 951 TRQA-Literature [48] 172 932 SciBench [40] 80 350 Total 727 5,105 Data and Training The composition of the training data is summarized in Table 2. In addition to the two data types generated by SciResearcher, we introduce two auxiliary data sources to improve distributional balance. First, we incorporate a small subset of SciBench [40] as a source of relatively simple scientific reason-"},{"citing_arxiv_id":"2604.17406","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale","primary_cat":"cs.AI","submitted_at":"2026-04-19T12:26:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09836","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"COMPOSITE-Stem","primary_cat":"cs.AI","submitted_at":"2026-04-10T19:08:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based 
grading.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09554","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LABBench2: An Improved Benchmark for AI Systems Performing Biology Research","primary_cat":"cs.AI","submitted_at":"2026-02-04T18:50:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}