pith. machine review for the scientific record.

arxiv: 2604.12387 · v1 · submitted 2026-04-14 · 🧬 q-bio.GN

Recognition: unknown

oxo-call: Documentation-grounded Skill Augmentation for Accurate Bioinformatics Command-line Generation with Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 14:15 UTC · model grok-4.3

classification 🧬 q-bio.GN
keywords bioinformatics command-line tools · large language models · documentation grounding · skill augmentation · genomic analysis · command generation · reproducible workflows · natural language interfaces

The pith

oxo-call generates accurate bioinformatics command lines from natural language by grounding large language models in complete tool documentation and expert skills.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops oxo-call to address the challenge that bioinformatics tools have varied and complex command-line syntax, which hinders productive research. It does so through documentation-first grounding, which supplies the model with the full, version-specific help text of each tool, and curated skill augmentation, which primes it with expert concepts, common pitfalls, and worked examples. A reader would care because this combination could let researchers describe analysis tasks in plain language and receive reliable, executable commands, alongside features for workflows, extensibility, and privacy-preserving local use.

Core claim

oxo-call translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. The system ships more than 150 built-in skills covering 44 analytical categories in a single binary, logs every command with provenance for reproducibility, and includes a DAG-based workflow engine plus support for user-defined skills and local inference.
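The two strategies above amount to prompt construction: concatenate the tool's complete, version-specific help text with a curated skill before the user's task. A minimal sketch of that assembly follows; the `Skill` structure, `build_prompt` function, and the sample help text are illustrative assumptions, not oxo-call's actual API or log format.

```python
# Hypothetical sketch of documentation-first grounding plus skill augmentation:
# the full help text and one curated skill are placed in the prompt ahead of
# the natural-language task. Names here are illustrative, not oxo-call's API.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str      # e.g. "alignment"
    concepts: str  # domain-expert background
    pitfalls: str  # common mistakes to warn the model about
    examples: str  # worked command examples

def build_prompt(task: str, help_text: str, skill: Skill) -> str:
    """Assemble a documentation-grounded, skill-augmented prompt."""
    return (
        f"## Tool documentation (version-specific)\n{help_text}\n\n"
        f"## Skill: {skill.name}\n"
        f"Concepts: {skill.concepts}\n"
        f"Pitfalls: {skill.pitfalls}\n"
        f"Examples: {skill.examples}\n\n"
        f"## Task\n{task}\n"
        "Respond with a single executable command line."
    )

skill = Skill("alignment",
              "minimap2 presets tune alignment for long-read platforms",
              "do not omit the preset flag",
              "minimap2 -ax map-ont ref.fa reads.fq")
prompt = build_prompt("align nanopore reads to ref.fa",
                      "Usage: minimap2 [options] <target.fa> <query.fq>", skill)
```

The point of the sketch is that grounding is done at the prompt level: the model never has to recall syntax from training data, only transcribe it from the supplied documentation.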

What carries the argument

The dual mechanism of documentation-first grounding and curated skill augmentation that supplies LLMs with version-specific tool documentation and domain expertise to produce correct commands.

Load-bearing premise

That providing an LLM with the complete version-specific help text plus curated expert skills will reliably generate accurate and executable bioinformatics commands without hallucinations or version-specific errors.

What would settle it

A controlled test in which natural language queries for specific bioinformatics tasks are input to oxo-call, and the output commands are checked for exact matches to the documented tool syntax, parameters, and options for the intended versions.
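One concrete form such a test could take is a flag-level check: tokenize each generated command and verify every flag against the documented option set for the intended tool version. This is a minimal sketch under assumed inputs; the option sets shown are illustrative, not taken from any real samtools release.

```python
# A minimal sketch of the controlled test proposed above: extract the flags
# from a generated command and report any that do not appear in the tool's
# documented options. Option sets below are hypothetical examples.
import shlex

def undocumented_flags(command: str, documented_options: set[str]) -> list[str]:
    """Return flags in `command` that are absent from the documented option set."""
    tokens = shlex.split(command)
    flags = [t for t in tokens[1:] if t.startswith("-")]
    return [f for f in flags if f not in documented_options]

# Hypothetical documented options for one tool version
docs = {"-b", "-o", "-@", "--threads"}

assert undocumented_flags("samtools view -b -o out.bam in.sam", docs) == []
assert undocumented_flags("samtools view -b --fast out.bam", docs) == ["--fast"]
```

A full evaluation would also need held-out task descriptions, multiple tool versions, and baselines (e.g. the same model without grounding), but even this simple check would turn the accuracy claim into a measurable success rate.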

Original abstract

Command-line bioinformatics tools remain essential for genomic analysis, yet their diversity in syntax and parameterization presents a persistent barrier to productive research. We present oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into accurate tool invocations through two complementary strategies: documentation-first grounding, which provides the large language model (LLM) with the complete, version-specific help text of each target tool, and curated skill augmentation, which primes the model with domain-expert concepts, common pitfalls, and worked examples. oxo-call (v0.10) ships >150 built-in skills covering 44 analytical categories, from variant calling and genome assembly to single-cell transcriptomics, compiled into a single, statically linked binary. Every generated command is logged with provenance metadata to support reproducible research. oxo-call also provides a DAG-based workflow engine, extensibility through user-defined and community skills via the Model Context Protocol, and support for local LLM inference to address data-privacy requirements. oxo-call is freely available for academic use at https://traitome.github.io/oxo-call/.
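The provenance logging the abstract describes could take a form like the sketch below: each generated command recorded with enough metadata to reproduce it, plus a content hash for later auditing. The record schema and function names are assumptions for illustration, not oxo-call's actual log format.

```python
# Sketch of per-command provenance logging: the task, the generated command,
# the tool version, and the model are recorded together with a content hash
# so a later audit can detect tampering. Schema is hypothetical.
import hashlib
import json
from datetime import datetime, timezone

def provenance_record(task: str, command: str, tool_version: str, model: str) -> dict:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,                  # the natural-language request
        "command": command,            # the generated invocation
        "tool_version": tool_version,  # version the documentation was fetched for
        "model": model,                # which LLM produced the command
    }
    payload = {k: entry[k] for k in ("task", "command", "tool_version", "model")}
    entry["sha256"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()
    ).hexdigest()
    return entry

rec = provenance_record("align reads",
                        "minimap2 -ax map-ont ref.fa reads.fq",
                        "minimap2 2.26", "local-llama")
```

Hashing only the reproducibility-relevant fields (not the timestamp) means two runs that generate the same command for the same task and tool version produce the same digest, which is convenient for deduplication across a workflow log.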

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents oxo-call, a Rust-based command-line assistant that translates natural-language task descriptions into bioinformatics tool invocations. It relies on two strategies: documentation-first grounding, which supplies the LLM with complete version-specific help text for each tool, and curated skill augmentation, which provides domain-expert concepts, common pitfalls, and worked examples. The system ships with over 150 built-in skills spanning 44 analytical categories, includes a DAG-based workflow engine, supports user-defined and community skills via the Model Context Protocol, enables local LLM inference for privacy, and logs all commands with provenance metadata. The tool is released as a statically linked binary for academic use.

Significance. If the accuracy of generated commands can be demonstrated, oxo-call would address a practical barrier in bioinformatics by helping researchers correctly invoke diverse CLI tools without deep syntax knowledge. The combination of full documentation grounding and pre-curated expert skills targets known LLM failure modes such as hallucinations and version mismatches, while features like provenance logging and local inference support reproducibility and data-privacy requirements common in the field.

major comments (1)
  1. Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical value of oxo-call in addressing command-line barriers in bioinformatics. We address the single major comment below.

Point-by-point responses
  1. Referee: [—] Abstract: The central claim that documentation-first grounding plus curated skill augmentation produces 'accurate' executable commands 'without hallucinations or version mismatches' is unsupported by any quantitative evidence. The manuscript describes the architecture and the >150 skills but reports no success rates, error analysis, held-out test sets, baseline comparisons to other LLM prompting methods, or failure-mode evaluation. This absence renders the effectiveness of the two strategies an untested assumption rather than a validated result.

    Authors: We agree that the manuscript provides no quantitative evidence, such as success rates, error analyses, held-out test sets, baseline comparisons, or failure-mode evaluations, to support the claims of accuracy or the elimination of hallucinations and version mismatches. The paper is structured as a tool-description manuscript focused on the system architecture, the documentation-first grounding approach, the curated skill set spanning 44 categories, the DAG workflow engine, provenance logging, and extensibility features. These elements are presented as design choices intended to mitigate known LLM limitations in command generation, but we acknowledge that their effectiveness is not empirically validated here. We will revise the abstract to qualify or remove the unsubstantiated assertions of producing 'accurate' commands 'without hallucinations or version mismatches.' We will also add an explicit limitations section stating that generated commands should be verified by users and that systematic benchmarking against alternative prompting strategies remains future work. revision: yes

Circularity Check

0 steps flagged

No circularity: software architecture description with no derivations or fitted predictions

full rationale

The paper is a system description of the oxo-call tool, outlining its documentation-first grounding and skill augmentation strategies as design features. It contains no equations, no parameter fitting, no predictive claims, and no self-citations that serve as load-bearing premises for any result. All content is presented as implemented functionality and extensibility options without any reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an engineering description of a software assistant rather than a theoretical model, so it introduces no free parameters, mathematical axioms, or postulated scientific entities.

pith-pipeline@v0.9.0 · 5520 in / 1143 out tokens · 48973 ms · 2026-05-10T14:15:02.208362+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 5 canonical work pages · 5 internal anchors

  1. [1]

    Bioconda: sustainable and comprehensive software distribution for the life sciences

    Grüning B, Dale R, Sjödin A, et al. Bioconda: sustainable and comprehensive software distribution for the life sciences. Nat Methods. 2018;15:475–6

  2. [2]

    Challenges and recommendations to improve the installability and archival stability of omics computational tools

    Mangul S, Mosqueiro T, Abdill RJ, et al. Challenges and recommendations to improve the installability and archival stability of omics computational tools. PLoS Biol. 2019;17:e3000333

  3. [3]

    A survey of best practices for RNA-seq data analysis

    Conesa A, Madrigal P, Tarazona S, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13

  4. [4]

    Salmon provides fast and bias-aware quantification of transcript expression

    Patro R, Duggal G, Love MI, et al. Salmon provides fast and bias-aware quantification of transcript expression. Nat Methods. 2017;14:417–9

  5. [5]

    Evaluating Large Language Models Trained on Code

    Chen M, Tworek J, Jun H, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. 2021

  6. [6]

    Code Llama: Open Foundation Models for Code

    Rozière B, Gehring J, Gloeckle F, et al. Code Llama: open foundation models for code. arXiv preprint arXiv:2308.12950. 2023

  7. [7]

    Survey of hallucination in natural language generation

    Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55:1–38

  8. [8]

    Ten simple rules for reproducible computational research

    Sandve GK, Nekrutenko A, Taylor J, Hovig E. Ten simple rules for reproducible computational research. PLoS Comput Biol. 2013;9:e1003285

  9. [9]

    Enhancing reproducibility for computational methods

    Stodden V, McNutt M, Bailey DH, et al. Enhancing reproducibility for computational methods. Science. 2016;354:1240–1

  10. [10]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Lewis P, Perez E, Piktus A, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020;33:9459–74

  11. [11]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Gao Y, Xiong Y, Gao X, et al. Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997. 2023

  12. [12]

    GitHub Copilot

    GitHub. GitHub Copilot. https://docs.github.com/en/copilot. Accessed 2026

  13. [13]

    Amazon CodeWhisperer

    Amazon Web Services. Amazon CodeWhisperer. https://aws.amazon.com/codewhisperer/. Accessed 2026

  14. [14]

    A platform for the biomedical application of large language models

    Lobentanzer S, Feng S, Bruderer N, et al. A platform for the biomedical application of large language models. Nat Biotechnol. 2025;43:166–9

  15. [15]

    Opportunities and challenges for ChatGPT and large language models in biomedicine and health

    Tian S, Jin Q, Yeganova L, et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform. 2024;25:bbad493

  16. [16]

    Sustainable data analysis with Snakemake

    Mölder F, Jablonski KP, Letcher B, et al. Sustainable data analysis with Snakemake. F1000Res. 2021;10:33

  17. [17]

    Nextflow enables reproducible computational workflows

    Di Tommaso P, Chatzou M, Floden EW, et al. Nextflow enables reproducible computational workflows. Nat Biotechnol. 2017;35:316–9

  18. [18]

    The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update

    Afgan E, Baker D, Batut B, et al. The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update. Nucleic Acids Res. 2018;46:W537–44

  19. [19]

    GPT-4 Technical Report

    Achiam J, Adler S, Agarwal S, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774. 2023

  20. [20]

    The FAIR guiding principles for scientific data management and stewardship

    Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data. 2016;3:160018

  21. [21]

    The Claude model family

    Anthropic. The Claude model family. https://docs.anthropic.com/en/docs/about-claude/models. Accessed 2026

  22. [22]

    Privacy and artificial intelligence: challenges for protecting health information in a new era

    Murdoch B. Privacy and artificial intelligence: challenges for protecting health information in a new era. BMC Med Ethics. 2021;22:122

  23. [23]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron H, Martin L, Stone K, et al. Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. 2023

  24. [24]

    Software as a Medical Device (SaMD): clinical evaluation

    U.S. Food and Drug Administration. Software as a Medical Device (SaMD): clinical evaluation. FDA Guidance Document. 2017

  25. [25]

    Best practices for scientific computing

    Wilson G, Aruliah DA, Brown CT, et al. Best practices for scientific computing. PLoS Biol. 2014;12:e1001745

  26. [26]

    BioContainers: an open-source and community-driven framework for software standardization

    da Veiga Leprevost F, Grüning BA, Alves Aflitos S, et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017;33:2580–2

  27. [27]

    Recommendations for the packaging and containerizing of bioinformatics software

    Gruening B, Sallou O, Moreno P, et al. Recommendations for the packaging and containerizing of bioinformatics software. F1000Res. 2018;7:742

  28. [28]

    Minimap2: pairwise alignment for nucleotide sequences

    Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100

  29. [29]

    Scaling accurate genetic variant discovery to tens of thousands of samples

    Poplin R, Ruano-Rubio V, DePristo MA, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv. 2017;201178

Figure legends

Figure 1. System design, feature scope, and skill coverage of oxo-call. (a) Four-stage pipeline: documentation resolution fetches and caches the target tool's complete help text; skill loading inje...