pith. machine review for the scientific record.

arxiv: 2604.12387 · v1 · submitted 2026-04-14 · 🧬 q-bio.GN

Recognition: unknown

oxo-call: Documentation-grounded Skill Augmentation for Accurate Bioinformatics Command-line Generation with Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:15 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords bioinformatics command-line tools · large language models · documentation grounding · skill augmentation · genomic analysis · command generation · reproducible workflows · natural language interfaces

The pith

oxo-call generates accurate bioinformatics command lines from natural language by grounding large language models in complete tool documentation and expert skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops oxo-call to address the challenge that bioinformatics tools have varied and complex command-line syntax, which hinders productive research. It does so through documentation-first grounding, which supplies the model with the full, version-specific help text of each tool, and curated skill augmentation, which primes it with expert concepts, common pitfalls, and worked examples. A reader would care because this combination could let researchers describe analysis tasks in plain language and receive reliable, executable commands, alongside features for workflows, extensibility, and privacy-preserving local use.

Core claim

oxo-call translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. The system ships more than 150 built-in skills covering 44 analytical categories in a single binary, logs every command with provenance for reproducibility, and includes a DAG-based workflow engine plus support for user-defined skills and local inference.
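The two strategies above amount to prompt construction: concatenate the tool's complete, version-specific help text with a curated skill before the user's task. A minimal sketch of that assembly follows; the `Skill` structure, `build_prompt` function, and the sample help text are illustrative assumptions, not oxo-call's actual API or log format.

```python
# Hypothetical sketch of documentation-first grounding plus skill augmentation:
# the full help text and one curated skill are placed in the prompt ahead of
# the natural-language task. Names here are illustrative, not oxo-call's API.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str      # e.g. "alignment"
    concepts: str  # domain-expert background
    pitfalls: str  # common mistakes to warn the model about
    examples: str  # worked command examples

def build_prompt(task: str, help_text: str, skill: Skill) -> str:
    """Assemble a documentation-grounded, skill-augmented prompt."""
    return (
        f"## Tool documentation (version-specific)\n{help_text}\n\n"
        f"## Skill: {skill.name}\n"
        f"Concepts: {skill.concepts}\n"
        f"Pitfalls: {skill.pitfalls}\n"
        f"Examples: {skill.examples}\n\n"
        f"## Task\n{task}\n"
        "Respond with a single executable command line."
    )

skill = Skill("alignment",
              "minimap2 presets tune alignment for long-read platforms",
              "do not omit the preset flag",
              "minimap2 -ax map-ont ref.fa reads.fq")
prompt = build_prompt("align nanopore reads to ref.fa",
                      "Usage: minimap2 [options] <target.fa> <query.fq>", skill)
```

The point of the sketch is that grounding is done at the prompt level: the model never has to recall syntax from training data, only transcribe it from the supplied documentation.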

What carries the argument

The dual mechanism of documentation-first grounding and curated skill augmentation that supplies LLMs with version-specific tool documentation and domain expertise to produce correct commands.

Load-bearing premise

That providing an LLM with the complete version-specific help text plus curated expert skills will reliably generate accurate and executable bioinformatics commands without hallucinations or version-specific errors.

What would settle it

A controlled test in which natural language queries for specific bioinformatics tasks are input to oxo-call, and the output commands are checked for exact matches to the documented tool syntax, parameters, and options for the intended versions.
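One concrete form such a test could take is a flag-level check: tokenize each generated command and verify every flag against the documented option set for the intended tool version. This is a minimal sketch under assumed inputs; the option sets shown are illustrative, not taken from any real samtools release.

```python
# A minimal sketch of the controlled test proposed above: extract the flags
# from a generated command and report any that do not appear in the tool's
# documented options. Option sets below are hypothetical examples.
import shlex

def undocumented_flags(command: str, documented_options: set[str]) -> list[str]:
    """Return flags in `command` that are absent from the documented option set."""
    tokens = shlex.split(command)
    flags = [t for t in tokens[1:] if t.startswith("-")]
    return [f for f in flags if f not in documented_options]

# Hypothetical documented options for one tool version
docs = {"-b", "-o", "-@", "--threads"}

assert undocumented_flags("samtools view -b -o out.bam in.sam", docs) == []
assert undocumented_flags("samtools view -b --fast out.bam", docs) == ["--fast"]
```

A full evaluation would also need held-out task descriptions, multiple tool versions, and baselines (e.g. the same model without grounding), but even this simple check would turn the accuracy claim into a measurable success rate.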

Original abstract

Command-line bioinformatics tools remain essential for genomic analysis, yet their diversity in syntax and parameterization presents a persistent barrier to productive research. We present oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model (LLM) with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. oxo-call (v0.10) ships >150 built-in skills covering 44 analytical categories, from variant calling and genome assembly to single-cell transcriptomics, compiled into a single, statically linked binary. Every generated command is logged with provenance metadata to support reproducible research. oxo-call also provides a DAG-based workflow engine, extensibility through user-defined and community skills via the Model Context Protocol, and support for local LLM inference to address data-privacy requirements. oxo-call is freely available for academic use at https://traitome.github.io/oxo-call/.
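The provenance logging the abstract describes could take a form like the sketch below: each generated command recorded with enough metadata to reproduce it, plus a content hash for later auditing. The record schema and function names are assumptions for illustration, not oxo-call's actual log format.

```python
# Sketch of per-command provenance logging: the task, the generated command,
# the tool version, and the model are recorded together with a content hash
# so a later audit can detect tampering. Schema is hypothetical.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(task: str, command: str, tool_version: str, model: str) -> dict:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,                  # the natural-language request
        "command": command,            # the generated invocation
        "tool_version": tool_version,  # version the documentation was fetched for
        "model": model,                # which LLM produced the command
    }
    payload = {k: entry[k] for k in ("task", "command", "tool_version", "model")}
    entry["sha256"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return entry

rec = provenance_record("align reads",
                        "minimap2 -ax map-ont ref.fa reads.fq",
                        "minimap2 2.26", "local-llama")
```

Hashing only the reproducibility-relevant fields (not the timestamp) means two runs that generate the same command for the same task and tool version produce the same digest, which is convenient for deduplication across a workflow log.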

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into bioinformatics tool invocations. It relies on two strategies: documentation-first grounding, which supplies the LLM with complete version-specific help text for each tool, and curated skill augmentation, which provides domain-expert concepts, common pitfalls, and worked examples. The system ships with over 150 built-in skills spanning 44 analytical categories, includes a DAG-based workflow engine, supports user-defined and community skills via the Model Context Protocol, enables local LLM inference for privacy, and logs all commands with provenance metadata. The tool is released as a statically linked binary for academic use.

Significance. If the accuracy of generated commands can be demonstrated, oxo-call would address a practical barrier in bioinformatics by helping researchers correctly invoke diverse CLI tools without deep syntax knowledge. The combination of full documentation grounding and pre-curated expert skills targets known LLM failure modes such as hallucinations and version mismatches, while features like provenance logging and local inference support reproducibility and data-privacy requirements common in the field.

major comments (1)
  1. Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical value of oxo-call in addressing command-line barriers in bioinformatics. We address the single major comment below.

Point-by-point responses
  1. Referee: [—] Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.

    Authors: We agree that the manuscript provides no quantitative evidence, such as success rates, error analyses, held-out test sets, baseline comparisons, or failure-mode evaluations, to support the claims of accuracy or the elimination of hallucinations and version mismatches. The paper is structured as a tool-description manuscript focused on the system architecture, the documentation-first grounding approach, the curated skill set spanning 44 categories, the DAG workflow engine, provenance logging, and extensibility features. These elements are presented as design choices intended to mitigate known LLM limitations in command generation, but we acknowledge that their effectiveness is not empirically validated here. We will revise the abstract to qualify or remove the unsubstantiated assertions of producing 'accurate' commands 'without hallucinations or version mismatches.' We will also add an explicit limitations section stating that generated commands should be verified by users and that systematic benchmarking against alternative prompting strategies remains future work. revision: yes

Circularity Check

0 steps flagged

No circularity: software architecture description with no derivations or fitted predictions

full rationale

The paper is a system description of the oxo-call tool, outlining its documentation-first grounding and skill augmentation strategies as design features. It contains no equations, no parameter fitting, no predictive claims, and no self-citations that serve as load-bearing premises for any result. All content is presented as implemented functionality and extensibility options without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering description of a software assistant rather than a theoretical model, so it introduces no free parameters, mathematical axioms, or postulated scientific entities.

pith-pipeline@v0.9.0 · 5520 in / 1143 out tokens · 48973 ms · 2026-05-10T14:15:02.208362+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages · 5 internal anchors

  1. [1]

    Bioconda: sustainable and comprehensive software distribution for the life sciences

    Grüning B, Dale R, Sjödin A, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15:475–6

  2. [2]

    Challenges and recommendations to improve the installability and archival stability of omics computational tools

    Mangul S, Mosqueiro T, Abdill RJ, et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 2019;17:e3000333

  3. [3]

    A survey of best practices for RNA-seq data analysis

    Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13

  4. [4]

    Salmon provides fast and bias-aware quantification of transcript expression

    Patro R, Duggal G, Love MI, et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417–9

  5. [5]

    Evaluating Large Language Models Trained on Code

    Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021

  6. [6]

    Code Llama: Open Foundation Models for Code

    Rozière B, Gehring J, Gloeckle F, et al. Code Llama: open foundation models for code. arXiv preprint arXiv:2308.12950. 2023

  7. [7]

    Survey of hallucination in natural language generation

    Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55:1–38

  8. [8]

    Ten simple rules for reproducible computational research

    Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013;9:e1003285

  9. [9]

    Enhancing reproducibility for computational methods

    Stodden V, McNutt M, Bailey DH, et al. Enhancing reproducibility for computational methods. Science. 2016;354:1240–1

  10. [10]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459–74

  11. [11]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997. 2023

  12. [12]

    GitHub Copilot

    GitHub. GitHub Copilot. https://docs.github.com/en/copilot. Accessed 2026

  13. [13]

    Amazon CodeWhisperer

    Amazon Web Services. Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/. Accessed 2026

  14. [14]

    A platform for the biomedical application of large language models

    Lobentanzer S, Feng S, Bruderer N, et al. A platform for the biomedical application of large language models. Nat Biotechnol. 2025;43:166–9

  15. [15]

    Opportunities and challenges for ChatGPT and large language models in biomedicine and health

    Tian S, Jin Q, Yeganova L, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. 2024;25:bbad493

  16. [16]

    Sustainable data analysis with Snakemake

    Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33

  17. [17]

    Nextflow enables reproducible computational workflows

    Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9

  18. [18]

    The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

    Afgan E, Baker D, Batut B, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46:W537–44

  19. [19]

    GPT-4 Technical Report

    Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023

  20. [20]

    The FAIR guiding principles for scientific data management and stewardship

    Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018

  21. [21]

    The Claude model family

    Anthropic. The Claude model family. https://docs.anthropic.com/en/docs/about-claude/models. Accessed 2026

  22. [22]

    Privacy and artificial intelligence: challenges for protecting health information in a new era

    Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22:122

  23. [23]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023

  24. [24]

    Software as a Medical Device (SaMD): clinical evaluation

    U.S. Food and Drug Administration. Software as a Medical Device (SaMD): clinical evaluation. FDA Guidance Document. 2017

  25. [25]

    Best practices for scientific computing

    Wilson G, Aruliah DA, Brown CT, et al. Best practices for scientific computing. PLoS Biol. 2014;12:e1001745

  26. [26]

    BioContainers: an open-source and community-driven framework for software standardization

    da Veiga Leprevost F, Grüning BA, Alves Aflitos S, et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017;33:2580–2

  27. [27]

    Recommendations for the packaging and containerizing of bioinformatics software

    Gruening B, Sallou O, Moreno P, et al. Recommendations for the packaging and containerizing of bioinformatics software. F1000Res. 2018;7:742

  28. [28]

    Minimap2: pairwise alignment for nucleotide sequences

    Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100

  29. [29]

    Scaling accurate genetic variant discovery to tens of thousands of samples

    Poplin R, Ruano-Rubio V, DePristo MA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017;201178

Figure legends

Figure 1. System design, feature scope, and skill coverage of oxo-call. (a) Four-stage pipeline: documentation resolution fetches and caches the target tool's complete help text; skill loading inje...