
arxiv: 2412.21154 · v1 · pith:FW7WKZDGnew · submitted 2024-12-30 · 💻 cs.AI · cs.CL · cs.LG

Aviary: training language agents on challenging scientific tasks

classification 💻 cs.AI cs.CL cs.LG
keywords agents language tasks environments scientific aviary challenging cycles

Solving complex real-world tasks requires cycles of actions and observations. This is particularly true in science, where tasks require many cycles of analysis, tool use, and experimentation. Language agents are promising for automating intellectual tasks in science because they can interact with tools via natural language or code. Yet their flexibility creates conceptual and practical challenges for software implementations, since agents may comprise non-standard components such as internal reasoning, planning, tool usage, as well as the inherent stochasticity of temperature-sampled language models. Here, we introduce Aviary, an extensible gymnasium for language agents. We formalize agents as policies solving language-grounded partially observable Markov decision processes, which we term language decision processes. We then implement five environments, including three challenging scientific environments: (1) manipulating DNA constructs for molecular cloning, (2) answering research questions by accessing scientific literature, and (3) engineering protein stability. These environments were selected for their focus on multi-step reasoning and their relevance to contemporary biology research. Finally, with online training and scaling inference-time compute, we show that language agents backed by open-source, non-frontier LLMs can match and exceed both frontier LLM agents and human experts on multiple tasks at up to 100x lower inference cost.
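The abstract's framing of an agent as a policy acting in a partially observable, language-grounded decision process can be illustrated with a small sketch. The code below is not Aviary's API: `CountdownEnv`, `greedy_policy`, and `rollout` are hypothetical names invented for illustration, and the toy task stands in for the scientific environments. It only shows the cycle of actions and observations, with hidden state the agent never sees directly.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CountdownEnv:
    """Hypothetical toy environment (not part of Aviary).

    The hidden state is an integer counter. The agent only receives
    text observations and must submit once the counter hits the target.
    """
    target: int = 3
    max_steps: int = 10
    _hidden: int = field(default=0, init=False)
    _steps: int = field(default=0, init=False)

    def reset(self) -> str:
        self._hidden = 0
        self._steps = 0
        return "Counter reset. Reply 'increment' or 'submit'."

    def step(self, action: str) -> Tuple[str, float, bool]:
        """Apply a text action; return (observation, reward, done)."""
        self._steps += 1
        out_of_time = self._steps >= self.max_steps
        if action == "submit":
            reward = 1.0 if self._hidden == self.target else 0.0
            return "episode over", reward, True
        if action == "increment":
            self._hidden += 1
            if self._hidden == self.target:
                return "at target", 0.0, out_of_time
            side = "below" if self._hidden < self.target else "above"
            return f"{side} target", 0.0, out_of_time
        return "unrecognized action", 0.0, out_of_time


# A policy maps the observation history to the next text action.
Policy = Callable[[List[str]], str]


def greedy_policy(history: List[str]) -> str:
    """Submit as soon as the latest observation says we are at target."""
    return "submit" if history[-1] == "at target" else "increment"


def rollout(env: CountdownEnv, policy: Policy) -> Tuple[float, List[Tuple[str, str]]]:
    """Run one episode; return total reward and (action, observation) pairs."""
    history = [env.reset()]
    trajectory: List[Tuple[str, str]] = []
    total, done = 0.0, False
    while not done:
        action = policy(history)
        obs, reward, done = env.step(action)
        history.append(obs)
        trajectory.append((action, obs))
        total += reward
    return total, trajectory


if __name__ == "__main__":
    total_reward, steps = rollout(CountdownEnv(target=3), greedy_policy)
    print(total_reward, steps)
```

In this sketch the policy is deterministic; a language-model agent would instead sample its action text, which is the source of the stochasticity the abstract mentions.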



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  2. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.

  3. Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters

    cs.AI 2026-06 unverdicted novelty 6.0

    SIGA is a coding-agent adapter using retrieval, procedural memory, and validation gates that raises success rate on GEOS from 0.720 to 0.789 while cutting variance 16x and matching expert quality in minutes instead of hours.

  4. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 unverdicted novelty 6.0

    An LLM entity-tagging pipeline plus multi-agent system extracts ~6.3M nuanced records from 22.5M PubMed papers across six tasks with lower measured error than existing curated databases.

  5. LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

    cs.AI 2026-02 unverdicted novelty 6.0

    LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

  6. CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics

    cs.CL 2025-09 unverdicted novelty 6.0

    CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.

  7. Auto-Configuring Scientific Simulators with Lightweight Coding-Agent Adapters

    cs.AI 2026-06 unverdicted novelty 5.0

    SIGA adapters let off-the-shelf coding agents produce complete, valid configurations for multiphysics simulators like GEOS in minutes rather than hours, with self-evolution further improving performance on held-out cases.

  8. URSA: The Universal Research and Scientific Agent

    cs.AI 2025-06 unverdicted novelty 4.0

    URSA is a modular agent ecosystem that uses LLMs and scientific tools to accelerate research tasks of varying complexity.