oxo-call: Documentation-grounded Skill Augmentation for Accurate Bioinformatics Command-line Generation with Large Language Models
Pith reviewed 2026-05-10 14:15 UTC · model grok-4.3
The pith
oxo-call generates accurate bioinformatics command lines from natural language by grounding large language models in complete tool documentation and expert skills.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
oxo-call translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. The system ships more than 150 built-in skills covering 44 analytical categories in a single binary, logs every command with provenance for reproducibility, and includes a DAG-based workflow engine plus support for user-defined skills and local inference.
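The two strategies can be pictured as a prompt-assembly step: fetch the installed tool's own help text, then combine it with the task and the curated skill notes. This is a minimal sketch, not oxo-call's actual (Rust) implementation; all function and section names here are illustrative assumptions.

```python
import subprocess

def fetch_help_text(tool: str) -> str:
    """Capture the installed tool's own --help output so the prompt reflects
    the exact version on this machine (documentation-first grounding).
    Some tools print help to stderr, so fall back to it."""
    result = subprocess.run([tool, "--help"], capture_output=True, text=True)
    return result.stdout or result.stderr

def build_prompt(task: str, tool: str, help_text: str, skills: list[str]) -> str:
    """Assemble one grounded prompt from the task, the version-specific help
    text, and curated skill notes (concepts, pitfalls, worked examples)."""
    return "\n\n".join([
        f"Task: {task}",
        f"Authoritative documentation for `{tool}`:\n{help_text}",
        "Curated expert skills:\n" + "\n".join(f"- {s}" for s in skills),
        "Emit exactly one shell command using only documented options.",
    ])
```

Grounding the model in the locally installed version's help text, rather than whatever documentation appeared in training data, is what targets version-mismatch errors.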
What carries the argument
The dual mechanism of documentation-first grounding and curated skill augmentation that supplies LLMs with version-specific tool documentation and domain expertise to produce correct commands.
Load-bearing premise
That providing an LLM with the complete version-specific help text plus curated expert skills will reliably generate accurate and executable bioinformatics commands without hallucinations or version-specific errors.
What would settle it
A controlled test in which natural language queries for specific bioinformatics tasks are input to oxo-call, and the output commands are checked for exact matches to the documented tool syntax, parameters, and options for the intended versions.
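One automatable piece of such a test is checking that every flag in a generated command actually appears in the target version's help text. The sketch below is a hypothetical evaluation helper, not part of oxo-call; the flag-extraction regex is a deliberate simplification.

```python
import re
import shlex

def documented_flags(help_text: str) -> set[str]:
    """Extract option flags (e.g. -t, --threads) mentioned in a tool's help text."""
    return set(re.findall(r"(?<![\w-])(--?[A-Za-z][\w-]*)", help_text))

def undocumented_flags(command: str, help_text: str) -> set[str]:
    """Flags used in a generated command that the help text never mentions —
    a crude proxy for hallucinated or version-mismatched options."""
    used = {tok for tok in shlex.split(command) if tok.startswith("-")}
    used = {tok.split("=")[0] for tok in used}  # normalise --opt=value forms
    return used - documented_flags(help_text)
```

A non-empty result flags a candidate hallucination; an empty result is necessary but not sufficient for correctness, since a command can use only documented flags and still be semantically wrong.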
Original abstract
Command-line bioinformatics tools remain essential for genomic analysis, yet their diversity in syntax and parameterization presents a persistent barrier to productive research. We present oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model (LLM) with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. oxo-call (v0.10) ships >150 built-in skills covering 44 analytical categories, from variant calling and genome assembly to single-cell transcriptomics, compiled into a single, statically linked binary. Every generated command is logged with provenance metadata to support reproducible research. oxo-call also provides a DAG-based workflow engine, extensibility through user-defined and community skills via the Model Context Protocol, and support for local LLM inference to address data-privacy requirements. oxo-call is freely available for academic use at https://traitome.github.io/oxo-call/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into bioinformatics tool invocations. It relies on two strategies: documentation-first grounding, which supplies the LLM with complete version-specific help text for each tool, and curated skill augmentation, which provides domain-expert concepts, common pitfalls, and worked examples. The system ships with over 150 built-in skills spanning 44 analytical categories, includes a DAG-based workflow engine, supports user-defined and community skills via the Model Context Protocol, enables local LLM inference for privacy, and logs all commands with provenance metadata. The tool is released as a statically linked binary for academic use.
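The manuscript does not detail the DAG engine's internals; the usual core of such an engine is a topological sort over step dependencies, sketched here with Kahn's algorithm and illustrative step names.

```python
from collections import deque

def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm: return an execution order in which every step runs
    only after all of its dependencies; raise if the graph has a cycle."""
    indegree = {step: len(d) for step, d in deps.items()}
    dependents: dict[str, list[str]] = {step: [] for step in deps}
    for step, d in deps.items():
        for dep in d:
            dependents[dep].append(step)
    ready = deque(sorted(s for s, n in indegree.items() if n == 0))
    order: list[str] = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("workflow contains a cycle")
    return order
```

Rejecting cyclic graphs up front is what distinguishes a DAG engine from a plain task list: a cycle means no valid execution order exists.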
Significance. If the accuracy of generated commands can be demonstrated, oxo-call would address a practical barrier in bioinformatics by helping researchers correctly invoke diverse CLI tools without deep syntax knowledge. The combination of full documentation grounding and pre-curated expert skills targets known LLM failure modes such as hallucinations and version mismatches, while features like provenance logging and local inference support reproducibility and data-privacy requirements common in the field.
Major comments (1)
- Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the potential practical value of oxo-call in addressing command-line barriers in bioinformatics. We address the single major comment below.
Point-by-point responses
- Referee: Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.
Authors: We agree that the manuscript provides no quantitative evidence, such as success rates, error analyses, held-out test sets, baseline comparisons, or failure-mode evaluations, to support the claims of accuracy or the elimination of hallucinations and version mismatches. The paper is structured as a tool-description manuscript focused on the system architecture, the documentation-first grounding approach, the curated skill set spanning 44 categories, the DAG workflow engine, provenance logging, and extensibility features. These elements are presented as design choices intended to mitigate known LLM limitations in command generation, but we acknowledge that their effectiveness is not empirically validated here. We will revise the abstract to qualify or remove the unsubstantiated assertions of producing 'accurate' commands 'without hallucinations or version mismatches.' We will also add an explicit limitations section stating that generated commands should be verified by users and that systematic benchmarking against alternative prompting strategies remains future work.
Revision: yes
Circularity Check
No circularity: software architecture description with no derivations or fitted predictions
Full rationale
The paper is a system description of the oxo-call tool, outlining its documentation-first grounding and skill augmentation strategies as design features. It contains no equations, no parameter fitting, no predictive claims, and no self-citations that serve as load-bearing premises for any result. All content is presented as implemented functionality and extensibility options without any reduction of outputs to inputs by construction.
Reference graph
Works this paper leans on
- [1] Grüning B, Dale R, Sjödin A, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15:475–6.
- [2] Mangul S, Mosqueiro T, Abdill RJ, et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 2019;17:e3000333.
- [3] Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13.
- [4] Patro R, Duggal G, Love MI, et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417–9.
- [5] Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021.
- [6] Rozière B, Gehring J, Gloeckle F, et al. Code Llama: open foundation models for code. arXiv preprint arXiv:2308.12950. 2023.
- [7] Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55:1–38.
- [8] Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013;9:e1003285.
- [9] Stodden V, McNutt M, Bailey DH, et al. Enhancing reproducibility for computational methods. Science. 2016;354:1240–1.
- [10] Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459–74.
- [11] Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997. 2023.
- [12] GitHub. GitHub Copilot. https://docs.github.com/en/copilot. Accessed 2026.
- [13] Amazon Web Services. Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/. Accessed 2026.
- [14] Lobentanzer S, Feng S, Bruderer N, et al. A platform for the biomedical application of large language models. Nat Biotechnol. 2025;43:166–9.
- [15] Tian S, Jin Q, Yeganova L, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. 2024;25:bbad493.
- [16] Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33.
- [17] Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9.
- [18] Afgan E, Baker D, Batut B, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46:W537–44.
- [19] Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023.
- [20] Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
- [21] Anthropic. The Claude model family. https://docs.anthropic.com/en/docs/about-claude/models. Accessed 2026.
- [22] Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22:122.
- [23] Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023.
- [24] U.S. Food and Drug Administration. Software as a Medical Device (SaMD): clinical evaluation. FDA Guidance Document. 2017.
- [25] Wilson G, Aruliah DA, Brown CT, et al. Best practices for scientific computing. PLoS Biol. 2014;12:e1001745.
- [26] da Veiga Leprevost F, Grüning BA, Alves Aflitos S, et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017;33:2580–2.
- [27] Gruening B, Sallou O, Moreno P, et al. Recommendations for the packaging and containerizing of bioinformatics software. F1000Res. 2018;7:742.
- [28] Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
- [29] Poplin R, Ruano-Rubio V, DePristo MA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017;201178.
Figure 1 (legend, truncated in source): System design, feature scope, and skill coverage of oxo-call. (a) Four-stage pipeline: documentation resolution fetches and caches the target tool's complete help text; skill loading …