hub Canonical reference

Towards an AI co-scientist

Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic · 2025 · cs.AI · arXiv 2502.18864

Canonical reference. 88% of citing Pith papers cite this work as background.

76 Pith papers citing it

Background 88% of classified citations

open full Pith review browse 76 citing papers arXiv PDF

abstract

Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 22 baseline 1 other 1

citation-polarity summary

background 21 unclear 2 baseline 1

claims ledger

abstract Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accele

co-cited works

representative citing papers

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies

cs.SI · 2026-05-09 · unverdicted · novelty 7.0

Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0 · 3 refs

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

End-to-end autonomous scientific discovery on a real optical platform

cs.AI · 2026-04-29 · unverdicted · novelty 7.0

An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

cs.AI · 2026-04-24 · unverdicted · novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than single-pass models like LigandMPNN.

VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

cs.MA · 2026-04-13 · unverdicted · novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

Kosmos: An AI Scientist for Autonomous Discovery

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.

AlphaEvolve: A coding agent for scientific and algorithmic discovery

cs.AI · 2025-06-16 · unverdicted · novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

cs.SE · 2025-05-27 · unverdicted · novelty 7.0

Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

cs.HC · 2024-09-23 · unverdicted · novelty 7.0

Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.

Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.

MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

cs.LG · 2026-05-15 · conditional · novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin

LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.

Unlocking LLM Creativity in Science through Analogical Reasoning

cs.AI · 2026-05-11 · conditional · novelty 6.0

Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new numerical techniques for physics-informed problems.

Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator accuracy threshold.

Common-agency Games for Multi-Objective Test-Time Alignment

cs.GT · 2026-05-08 · unverdicted · novelty 6.0

CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.

citing papers explorer

Showing 50 of 76 citing papers.

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints cs.AI · 2026-05-18 · unverdicted · none · ref 18 · internal anchor
Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery cs.AI · 2026-04-28 · accept · none · ref 3 · internal anchor
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations physics.chem-ph · 2026-04-03 · conditional · none · ref 13 · internal anchor
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
Evaluating Large Language Models in Scientific Discovery cs.AI · 2025-12-17 · unverdicted · none · ref 20 · internal anchor
The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.
Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 8 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 41 · internal anchor
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies cs.SI · 2026-05-09 · unverdicted · none · ref 17 · internal anchor
Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 42 · internal anchor
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 16 · 3 links · internal anchor
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
End-to-end autonomous scientific discovery on a real optical platform cs.AI · 2026-04-29 · unverdicted · none · ref 9 · internal anchor
An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond cs.AI · 2026-04-24 · unverdicted · none · ref 116 · internal anchor
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.
RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design cs.LG · 2026-04-19 · unverdicted · none · ref 5 · internal anchor
RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than single-pass models like LigandMPNN.
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems cs.MA · 2026-04-13 · unverdicted · none · ref 22 · internal anchor
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
Kosmos: An AI Scientist for Autonomous Discovery cs.AI · 2025-11-04 · unverdicted · none · ref 3 · internal anchor
Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.
AlphaEvolve: A coding agent for scientific and algorithmic discovery cs.AI · 2025-06-16 · unverdicted · none · ref 34 · internal anchor
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.
Code Researcher: Deep Research Agent for Large Systems Code and Commit History cs.SE · 2025-05-27 · unverdicted · none · ref 14 · internal anchor
Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.
Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation cs.HC · 2024-09-23 · unverdicted · none · ref 22 · internal anchor
Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.
Hack-Verifiable Environments: Towards Evaluating Reward Hacking at Scale cs.LG · 2026-05-20 · unverdicted · none · ref 5 · internal anchor
Presents Hack-Verifiable TextArena, a benchmark that embeds verifiable reward hacking opportunities into environments to enable deterministic measurement of exploitation by language models.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility cs.LG · 2026-05-15 · conditional · none · ref 12 · internal anchor
MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological failures, and that the cheapest system outperforms the most expensive by a wide margin
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design cs.LG · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
Unlocking LLM Creativity in Science through Analogical Reasoning cs.AI · 2026-05-11 · conditional · none · ref 19 · internal anchor
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
GRAFT-ATHENA: Self-Improving Agentic Teams for Autonomous Discovery and Evolutionary Numerical Algorithms cs.LG · 2026-05-11 · unverdicted · none · ref 1 · internal anchor
GRAFT-ATHENA projects combinatorial method choices into factored trees that embed as fingerprints in a metric space, enabling an agentic system to accumulate experience across domains and autonomously discover new numerical techniques for physics-informed problems.
Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions cs.LG · 2026-05-11 · unverdicted · none · ref 172 · internal anchor
SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator accuracy threshold.
Common-agency Games for Multi-Objective Test-Time Alignment cs.GT · 2026-05-08 · unverdicted · none · ref 83 · internal anchor
CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution cs.LG · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery cs.AI · 2026-05-07 · unverdicted · none · ref 11 · internal anchor
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
Hypothesis generation and updating in large language models cs.LG · 2026-05-07 · unverdicted · none · ref 8 · internal anchor
LLMs exhibit Bayesian-like hypothesis updating with strong-sampling bias and an evaluation-generation gap but generalize poorly outside observed data.
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments cs.HC · 2026-04-30 · unverdicted · none · ref 9 · internal anchor
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, novelty, and insight than generic LLMs.
Human Cognition in Machines: A Unified Perspective of World Models cs.RO · 2026-04-17 · unverdicted · none · ref 56 · internal anchor
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and proposes Epistemic World Models as a new category for scientific discovery agents.
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research cs.LG · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration cs.AI · 2026-04-15 · unverdicted · none · ref 12 · internal anchor
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 33 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
AIRA_2: Overcoming Bottlenecks in AI Research Agents cs.AI · 2026-03-27 · conditional · none · ref 7 · internal anchor
AIRA₂ improves AI research agents via asynchronous multi-GPU workers, hidden consistent evaluation, and interactive ReAct agents, reaching 81.5-83.1% percentile rank on MLE-bench-30 and exceeding human SOTA on 6 of 20 AIRS-Bench tasks.
Bridging the Experimental Last Mile: Digitizing Laboratory Know-How for Safe AI-Assisted Support cs.HC · 2026-03-16 · conditional · none · ref 9 · internal anchor
A video-plus-RAG human-in-the-loop system digitizes site-specific laboratory know-how and supplies safe, grounded guidance for experiments such as powder X-ray diffraction.
"When to Hand Off, When to Work Together": Expanding Human-Agent Co-Creative Collaboration through Concurrent Interaction cs.HC · 2026-03-02 · unverdicted · none · ref 23 · internal anchor
Concurrent human-agent interactions occur in 31.8% of turns and follow five action patterns explained by six triggers and four enabling factors, enabled by a context-aware design probe called CLEO.
The 2025 AI Agent Index: Documenting Technical and Safety Features of Deployed Agentic AI Systems cs.CY · 2026-02-19 · accept · none · ref 56 · internal anchor
The 2025 AI Agent Index catalogs technical and safety details for 30 deployed AI agents and finds low developer transparency on safety, evaluations, and societal impacts.
Glia: A Human-Inspired AI for Automated Systems Design and Optimization cs.AI · 2025-10-31 · unverdicted · none · ref 16 · internal anchor
Glia deploys a multi-agent LLM workflow with reasoning, experimentation, and analysis agents to generate interpretable algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters, reaching human-expert performance levels.
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics cs.AI · 2025-10-14 · unverdicted · none · ref 29 · internal anchor
Ax-Prover is a tool-using multi-agent LLM system that matches state-of-the-art provers on public math benchmarks and outperforms them on new abstract-algebra and quantum-theory benchmarks while also assisting an expert with a cryptography proof.
An AI system to help scientists write expert-level empirical software cs.AI · 2025-09-08 · unverdicted · none · ref 36 · 2 links · internal anchor
ERA combines LLMs and tree search to produce expert-level empirical software that outperforms top human methods on single-cell analysis leaderboards and CDC COVID-19 forecasts.
InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling cs.CL · 2025-08-12 · unverdicted · none · ref 14 · internal anchor
InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.
General Agentic Planning Through Simulative Reasoning with World Models cs.AI · 2025-07-31 · conditional · none · ref 12 · internal anchor
SiRA uses LLM world models for simulative reasoning to achieve up to 124% higher task completion and 32.2% navigation success versus reactive baselines in web environments.
RExBench: Can coding agents autonomously implement AI research extensions? cs.CL · 2025-06-27 · unverdicted · none · ref 12 · internal anchor
RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.
XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration cs.CL · 2025-05-16 · conditional · none · ref 2 · internal anchor
XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.
Sibyl-AutoResearch: Autonomous Research Needs Self-Evolving Trial-and-Error Harnesses, Not Paper Generators cs.MA · 2026-05-21 · unverdicted · none · ref 12 · internal anchor
Sibyl-AutoResearch introduces self-evolving trial-and-error harnesses with auditable conversion units that link trial signals to updated research behaviors and harness repairs in autonomous systems.
Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks cs.AI · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
A multi-agent harness autonomously generates functional single-page VIS apps with linked views for scientific data tasks using coordinated skills for analysis, planning, implementation, and evaluation.
Pramana: A Protocol-Layer Treatment of Claim Verification in Autonomous Agent Networks cs.CR · 2026-05-19 · unverdicted · none · ref 42 · internal anchor
Pramana defines a typed ClaimAttestation protocol with four variants and verify operations, specifies its lifecycle in TLA+, model-checks it with TLC, and provides a tested Python implementation for auditable agent claims.
GEAR: Genetic AutoResearch for Agentic Code Evolution cs.NE · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
GEAR applies genetic algorithms to maintain and evolve multiple research states in autonomous code agents, outperforming single-path baselines by continuing to discover improvements over extended runs.
From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model hep-ex · 2026-05-04 · unverdicted · none · ref 19 · internal anchor
HEP-CoPilot is a new multi-agent retrieval framework that retrieves, reconstructs, and compares experimental limits from HEP literature and HEPData to support interpretation of beyond-Standard-Model searches.
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment cs.CL · 2026-04-27 · unverdicted · none · ref 5 · 2 links · internal anchor
TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources and interactive user refinement.
pAI/MSc: ML Theory Research with Humans on the Loop cs.AI · 2026-04-22 · unverdicted · none · ref 69 · internal anchor
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript draft in ML theory.

Towards an AI co-scientist

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer