Aviary: training language agents on challenging scientific tasks

Albert Bou; Andrew D. White; Geemi Wellawatte; James D. Braza; Jon Laurent; Manu Ponnapati; Ori Kabeli; Ryan-Rhys Griffiths; Sam Cox; Samuel G. Rodriques

arxiv: 2412.21154 · v1 · pith:FW7WKZDGnew · submitted 2024-12-30 · 💻 cs.AI · cs.CL · cs.LG


Siddharth Narayanan, James D. Braza, Ryan-Rhys Griffiths, Manu Ponnapati, Albert Bou, Jon Laurent, Ori Kabeli, Geemi Wellawatte, Sam Cox, Samuel G. Rodriques, Andrew D. White


classification 💻 cs.AI cs.CL cs.LG

keywords agents, language, tasks, environments, scientific, aviary, challenging, cycles



Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, and tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
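
The abstract's framing of an agent as a policy acting in a language-grounded, partially observable decision process can be illustrated with a small toy loop. The sketch below is not Aviary's API: `CounterEnv`, `scripted_policy`, and `rollout` are hypothetical names invented here, and the scripted policy stands in for a language model. It only shows the cycle of text observations and text actions, with the environment's internal state (such as its step counter) hidden from the policy.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical toy example; none of these names come from the Aviary library.
Policy = Callable[[list[str]], str]


@dataclass
class CounterEnv:
    """Toy text environment: reach `target` with 'add N' actions, then 'submit'."""
    target: int
    max_steps: int = 10
    value: int = 0
    steps: int = 0

    def reset(self) -> str:
        """Start an episode and return the first text observation."""
        self.value = 0
        self.steps = 0
        return f"Current value is 0. Reach {self.target} with 'add N', then 'submit'."

    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply a text action and return (observation, reward, done)."""
        self.steps += 1
        parts = action.split()
        if parts == ["submit"]:
            reward = 1.0 if self.value == self.target else 0.0
            return f"Submitted {self.value}.", reward, True
        obs = f"Unrecognized action: {action!r}."
        if len(parts) == 2 and parts[0] == "add":
            try:
                self.value += int(parts[1])
                obs = f"Current value is {self.value}."
            except ValueError:
                pass  # malformed number: keep the 'unrecognized' observation
        return obs, 0.0, self.steps >= self.max_steps


def scripted_policy(target: int) -> Policy:
    """Stand-in for an LLM: reads the observation history and emits a text action."""
    def act(history: list[str]) -> str:
        current = 0
        for obs in reversed(history):
            match = re.search(r"Current value is (-?\d+)", obs)
            if match:
                current = int(match.group(1))
                break
        return "submit" if current == target else f"add {target - current}"
    return act


def rollout(env: CounterEnv, policy: Policy) -> float:
    """Run one episode of the observe-act cycle and return the total reward."""
    history = [env.reset()]
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(history))
        history.append(obs)
        total += reward
    return total
```

For instance, `rollout(CounterEnv(target=7), scripted_policy(7))` runs two steps (`add 7`, then `submit`) and earns the full reward, while a policy that never submits is cut off after `max_steps` with zero reward.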

This paper has not been read by Pith yet.


Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 conditional novelty 7.0

Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 conditional novelty 7.0

Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.

Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI 2026-06 unverdicted novelty 6.0

SIGA is a coding-agent adapter using retrieval, procedural memory, and validation gates that raises success rate on GEOS from 0.720 to 0.789 while cutting variance 16x and matching expert quality in minutes instead of hours.

Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 unverdicted novelty 6.0

An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
cs.AI 2026-02 unverdicted novelty 6.0

LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
cs.CL 2025-09 unverdicted novelty 6.0

CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters
cs.AI 2026-06 unverdicted novelty 5.0

SIGA adapters let off-the-shelf coding agents produce complete, valid configurations for multiphysics simulators like GEOS in minutes rather than hours, with self-evolution further improving performance on held-out cases.

URSA: The Universal Research and Scientific Agent
cs.AI 2025-06 unverdicted novelty 4.0

URSA is a modular agent ecosystem that uses LLMs and scientific tools to accelerate research tasks of varying complexity.