Progressive Multi-Agent Reasoning for Biological Perturbation Prediction

Chanyoung Park; Hyomin Kim; Jaechang Lim; Junhyeok Jeon; Sang-Yeon Hwang; Sungsoo Ahn; Woo Youn Kim; Yinhua Piao; Yunhak Oh

arxiv: 2602.07408 · v2 · submitted 2026-02-07 · 💻 cs.AI · cs.MA

Progressive Multi-Agent Reasoning for Biological Perturbation Prediction

Hyomin Kim , Sang-Yeon Hwang , Jaechang Lim , Yinhua Piao , Yunhak Oh , Woo Youn Kim , Chanyoung Park , Sungsoo Ahn

show 1 more author

Junhyeok Jeon

This is my paper

Pith reviewed 2026-05-16 06:43 UTC · model grok-4.3

classification 💻 cs.AI cs.MA

keywords multi-agent reasoninggene regulation predictionchemical perturbationsLINCSQA benchmarkbiological knowledge graphscausal structurebulk-cell experimentsdrug discovery AI

0 comments

The pith

A multi-agent system lets smaller models predict gene responses to chemical perturbations by using confident predictions to guide harder cases through shared causal structure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates LINCSQA, a benchmark focused on how genes change under chemical perturbations in bulk cells, an area central to drug discovery that prior work largely ignored. It introduces PBio-Agent, a multi-agent setup that sequences tasks by difficulty and refines knowledge iteratively so that reliable gene predictions supply context for uncertain ones. The central mechanism rests on the observation that genes hit by one perturbation tend to share underlying causal links. This lets the system outperform standard baselines on both LINCSQA and PerturbQA while allowing smaller language models to succeed without extra training.

Core claim

PBio-Agent integrates specialized agents that draw on biological knowledge graphs with a synthesis agent and coherence judges; its key step is difficulty-aware sequencing in which confidently predicted genes supply causal context for more difficult ones because all genes affected by the same perturbation share causal structure.

What carries the argument

The progressive multi-agent sequencing that routes confident gene predictions to contextualize harder ones on the basis of shared causal structure within a single perturbation.

If this is right

Smaller language models become viable for explaining complex bulk-cell perturbation responses without fine-tuning.
Drug-discovery pipelines can incorporate more accurate chemical-perturbation forecasts in bulk settings.
The same sequencing principle extends directly to other entangled causal-reasoning tasks in biology.
Multi-agent refinement reduces the need for large single-model scale on high-dimensional biological data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to predicting outcomes in other systems where inputs share latent causal factors, such as metabolic pathway modeling.
Adding live experimental feedback loops into the synthesis agent could close the gap between predicted and measured responses.
Evaluating the method across multiple cell lines would test whether the shared-causal-structure premise holds beyond the training distribution.

Load-bearing premise

Genes affected by the same perturbation share enough causal structure that predictions made confidently on some genes can reliably improve predictions on the rest.

What would settle it

A controlled test on LINCSQA in which the progressive sequencing step is removed so every gene is predicted independently, and overall accuracy shows no decline.

Figures

Figures reproduced from arXiv: 2602.07408 by Chanyoung Park, Hyomin Kim, Jaechang Lim, Junhyeok Jeon, Sang-Yeon Hwang, Sungsoo Ahn, Woo Youn Kim, Yinhua Piao, Yunhak Oh.

**Figure 1.** Figure 1: Overview of the LINCSQA benchmark construction. (i) Quality control: Filtering LINCS L1000 Level 5 signatures for high-quality compound treatments. (ii) (b) Tier Selection: Hierarchical pairing of compounds to cell lines using a two-tier strategy. Tier 1 (clinical consensus) requires strict clinical indication alignment, where the compound’s approved therapeutic use must match the cell line’s disease origi… view at source ↗

**Figure 2.** Figure 2: Overview of PBIO-AGENT. (a) Difficulty aware data sorting: We order data using a composite score derived from the product of two metrics. LLM self-consistency measures prediction stability over multiple trials. Biological relatedness of perturbation and gene is fetched from the STRING database. (b) Progressive reasoning: PBIO-AGENT processes genes from easy to hard to build iterative context. High confiden… view at source ↗

**Figure 3.** Figure 3: Agreement ratios and target (A375 cell line) rank comparison for BRAF inhibitors. Agreement ratios for vemurafenib (left) and dabrafenib (right) with target ranks (numbers above bars) showing A375’s ranking among six cell lines. Only PBio-Agent-8B consistently achieves rank 1 in A375 (BRAF V600E-mutant), while baseline models show higher agreement in wild-type cell lines, demonstrating PBio-Agent-8B’s abil… view at source ↗

**Figure 4.** Figure 4: Agreement ratios of PBIO-AGENT across KRAS G12C-mutants with varying drug sensitivity. H358 (sensitive), H2122 (intermediate), and SW1573 (resistant) cells were treated with ARS-1620 and evaluated at 4h, 24h, and 72h. Higher agreement in sensitive H358 reflects coherent KRAS inhibition response, while lower agreement in resistant SW1573 indicates bypass pathway activation that decouples transcriptional c… view at source ↗

read the original abstract

Predicting gene regulation responses to biological perturbations requires reasoning about underlying biological causalities. While large language models (LLMs) show promise for such tasks, they are often overwhelmed by the entangled nature of high-dimensional perturbation results. Moreover, recent works have primarily focused on genetic perturbations in single-cell experiments, leaving bulk-cell chemical perturbations, which is central to drug discovery, largely unexplored. Motivated by this, we present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations in bulk-cell environments. We further propose PBio-Agent, a multi-agent framework that integrates difficulty-aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases. The framework employs specialized agents enriched with biological knowledge graphs, while a synthesis agent integrates outputs and specialized judges ensure logical coherence. PBio-Agent outperforms existing baselines on both LINCSQA and PerturbQA, enabling even smaller models to predict and explain complex biological processes without additional training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper adds a benchmark for bulk chemical perturbations and a multi-agent framework, but the shared causal structure premise for its sequencing needs more backing.

read the letter

This paper introduces LINCSQA as a benchmark for predicting gene responses to chemical perturbations in bulk-cell settings, plus PBio-Agent, a multi-agent LLM system that sequences tasks from easy to hard genes based on the idea that they share causal structure. The new benchmark addresses a real gap, since most prior work has looked at genetic perturbations in single cells while drug discovery often deals with chemical changes in bulk. The framework uses agents specialized with biological knowledge graphs, a synthesis agent to combine outputs, and judges to check for logical issues. It also includes iterative knowledge refinement. The claim is that this setup lets even smaller models handle complex predictions and explanations without any additional training, and it shows better results than baselines on LINCSQA and PerturbQA. One thing that works is the practical framing around multi-agent collaboration for entangled biological data. Breaking it down by difficulty and using confident predictions to inform tougher cases is a reasonable way to manage high-dimensional outputs. The soft spot is around the central mechanism. The shared causal structure assumption drives the difficulty-aware sequencing, but the description does not include ablations that isolate the effect of the ordering or any analysis linking prediction difficulty to biological features like pathway membership. If those checks are missing, the outperformance could trace back to the knowledge graph enrichment or the judge components alone. The abstract also lacks quantitative details like specific metrics or dataset sizes, so the strength of the empirical support is difficult to assess from the summary alone. This is relevant for researchers in AI applied to biology and drug discovery who need benchmarks for perturbation effects. Readers working on multi-agent methods for scientific reasoning might find the agent specialization and refinement process useful to adapt. Given the new benchmark and the applied focus, the paper deserves a serious referee to examine the full experiments and any supporting analyses. I would recommend sending it to peer review rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces LINCSQA, a new benchmark for predicting gene regulation responses to chemical perturbations in bulk-cell settings, and proposes PBio-Agent, a multi-agent LLM framework. The framework uses difficulty-aware task sequencing grounded in the claim that genes affected by the same perturbation share causal structure, allowing confident early predictions to inform harder cases. Specialized agents are enriched with biological knowledge graphs, a synthesis agent integrates outputs, and judge agents enforce coherence. The authors report that PBio-Agent outperforms baselines on both LINCSQA and PerturbQA, enabling smaller models to handle complex predictions without fine-tuning.

Significance. If the progressive sequencing mechanism is shown to be responsible for gains rather than knowledge-graph enrichment or judging alone, the work would offer a practical route to interpretable, training-free biological reasoning with LLMs. This could be relevant for drug-discovery pipelines that rely on bulk perturbation data, where current single-cell-focused methods leave a gap.

major comments (3)

[Experiments / §4] The central claim (abstract and §3) that difficulty-aware sequencing works because 'genes affected by the same perturbation share causal structure' is load-bearing, yet the manuscript provides no ablation that isolates sequencing order from the rest of the multi-agent pipeline. A direct comparison to a non-sequential (e.g., parallel or random-order) multi-agent variant is required to establish that the progressive aspect, rather than knowledge-graph access or judging, drives the reported gains on LINCSQA and PerturbQA.
[Method / §3.2] No quantitative test of the shared-causality premise appears (e.g., correlation between per-gene prediction difficulty and biological pathway overlap, or between early confident predictions and downstream accuracy lift). Without such evidence, outperformance could be explained by the knowledge-graph or judge modules alone, undermining the key insight.
[Results / §4.2] Table 2 and Figure 4 report aggregate metrics but omit per-perturbation breakdowns, error bars, or statistical significance tests against baselines. Given that the benchmark is newly introduced, these details are necessary to assess whether the claimed improvements are robust.

minor comments (2)

[Notation / §3] Notation for agent roles and knowledge-graph integration is introduced in §3.1 but not consistently reused in the experimental section, making it difficult to map specific components to the reported ablations.
[Method / §3.3] The description of the synthesis agent (Eq. 3) does not specify how conflicting outputs from specialized agents are resolved when judge scores are tied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which has helped clarify the presentation of our core contributions. We address each major comment below and have revised the manuscript accordingly to provide stronger empirical support for the progressive sequencing mechanism and the shared-causality premise.

read point-by-point responses

Referee: [Experiments / §4] The central claim (abstract and §3) that difficulty-aware sequencing works because 'genes affected by the same perturbation share causal structure' is load-bearing, yet the manuscript provides no ablation that isolates sequencing order from the rest of the multi-agent pipeline. A direct comparison to a non-sequential (e.g., parallel or random-order) multi-agent variant is required to establish that the progressive aspect, rather than knowledge-graph access or judging, drives the reported gains on LINCSQA and PerturbQA.

Authors: We agree that an ablation isolating the progressive sequencing is essential. In the revised manuscript we have added a new ablation study (new Table 3 and §4.3) that compares the full PBio-Agent against (i) a parallel multi-agent variant and (ii) a random-order variant while keeping knowledge-graph enrichment and judge modules identical. The results show statistically significant gains attributable to difficulty-aware ordering on both LINCSQA and PerturbQA, confirming that sequencing contributes beyond the other components. revision: yes
Referee: [Method / §3.2] No quantitative test of the shared-causality premise appears (e.g., correlation between per-gene prediction difficulty and biological pathway overlap, or between early confident predictions and downstream accuracy lift). Without such evidence, outperformance could be explained by the knowledge-graph or judge modules alone, undermining the key insight.

Authors: We acknowledge the absence of direct quantitative validation. We have added new analyses in §3.2 and Appendix C: (1) Spearman correlations between per-gene difficulty scores and pathway overlap (KEGG/Reactome), and (2) accuracy lift as a function of the number of early confident predictions. Both analyses yield positive, statistically significant correlations, providing direct support for the shared-causality premise and showing that early predictions measurably improve downstream accuracy. revision: yes
Referee: [Results / §4.2] Table 2 and Figure 4 report aggregate metrics but omit per-perturbation breakdowns, error bars, or statistical significance tests against baselines. Given that the benchmark is newly introduced, these details are necessary to assess whether the claimed improvements are robust.

Authors: We agree that granular reporting is required for a new benchmark. The revised manuscript expands Table 2 with per-perturbation breakdowns, adds error bars (standard deviation across 5 runs) to Figure 4, and includes paired t-test p-values comparing PBio-Agent against all baselines. These additions demonstrate that the reported gains are consistent across perturbations and statistically significant. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework relies on external assumptions and benchmarks

full rationale

The paper states its key insight as an explicit assumption (genes affected by the same perturbation share causal structure) rather than deriving it from model outputs or self-referential definitions. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The multi-agent setup (specialized agents, synthesis, judges, knowledge graphs) is presented as an engineering framework whose performance is evaluated on external benchmarks (LINCSQA, PerturbQA), with no reduction of claims to inputs by construction. This is a standard non-circular case.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of shared causal structure among co-perturbed genes and the effectiveness of multi-agent collaboration with external knowledge graphs for refinement.

axioms (1)

domain assumption Genes affected by the same perturbation share causal structure
Key insight stated in abstract that enables confident predictions to help with challenging cases.

pith-pipeline@v0.9.0 · 5505 in / 1138 out tokens · 37737 ms · 2026-05-16T06:43:12.611541+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

Neema, N., Mukherjee, S., Shah, S., Ramakrishnan, G., and Venkatesh, G

Instruction-tuned 24B parameter language model, available at Hugging Face. Neema, N., Mukherjee, S., Shah, S., Ramakrishnan, G., and Venkatesh, G. From amateur to master: Infusing knowledge into llms via automated curriculum learning. arXiv preprint arXiv:2510.26336, 2025. Perfetto, L., Briganti, L., Calderone, A., Cerquone Per- petuini, A., Iannuccelli, ...

work page arXiv 2025
[2]

Focus on: Basal expression of target/perturb genes and key driver mutations (e.g., BRAF V600E)

work page
[3]

If the target gene is not expressed, it cannot be downregulated further

work page
[4]

USER PROMPT: Analyze context: Cell Line: {cell line}, Perturbation: {pertor moa}, Target Gene: {targetgene} Network agent You are a Systems Biology expert

Use ONLY biological facts related to the specific cell line. USER PROMPT: Analyze context: Cell Line: {cell line}, Perturbation: {pertor moa}, Target Gene: {targetgene} Network agent You are a Systems Biology expert. Trace the regulatory path from the perturbation target to the gene of interest. OUTPUT FORMAT (STRICT - JSON ONLY): { ”networkreasoning”: ”S...

work page
[5]

Trace paths: (PerturbationTarget) -(relationship)-¿ (Intermediate) -(relationship)-¿ (Target- Gene)

work page
[6]

Distinguish between ’Activity change’ and ’Expression change’

work page
[7]

Identify feedback loops or compensatory mechanisms

work page
[8]

Use biological knowledge graph’s pathway context if provided. USER PROMPT: Trace the network path: - Start Point (Perturbation Target):{perttarget} - End Point (Target Gene):{targetgene} Is there a known transcriptional or signaling link between these nodes? Mechanism agent You are a Molecular Pharmacologist. Define the immediate molecular consequence of ...

work page
[9]

Does the reasoning explicitly or implicitly copy the direction (up/down) from prior cases?

work page
[10]

- Using history direction as the primary or sole justification is NOT allowed

Is the final direction justified by perturbation-specific reasoning, or merely by similarity to previous genes? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Using history as contextual background is ALLOWED. - Using history direction as the primary or sole justification is NOT allowed. -...

work page
[11]

Consistent reference to the given cell line?

work page
[12]

Correct reference to the perturbation (gene or MoA)?

work page
[13]

Correct and consistent reference to the target gene?

work page
[14]

- Do NOT judge biological correctness or the final answer

Avoidance of unrelated cell lines, genes, or drugs? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Penalize ONLY explicit mismatches or hallucinated entities. - Do NOT judge biological correctness or the final answer. USER PROMPT: Inputs: Cell Line:{cellline}, Perturbation:{pertor moa}, Ta...

work page
[15]

Does the reasoning argue for upregulation while the answer says downregulated?

work page
[16]

Does the reasoning argue for downregulation while the answer says upregulated?

work page
[17]

- If ANY inconsistency is found, verdict MUST be ”problematic”

Is the final answer unsupported or contradicted by the reasoning? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Do NOT judge biological validity / grounding / history usage. - If ANY inconsistency is found, verdict MUST be ”problematic”. USER PROMPT: Canonical Reasoning:{canonicalreasonin...

work page

[1] [1]

Neema, N., Mukherjee, S., Shah, S., Ramakrishnan, G., and Venkatesh, G

Instruction-tuned 24B parameter language model, available at Hugging Face. Neema, N., Mukherjee, S., Shah, S., Ramakrishnan, G., and Venkatesh, G. From amateur to master: Infusing knowledge into llms via automated curriculum learning. arXiv preprint arXiv:2510.26336, 2025. Perfetto, L., Briganti, L., Calderone, A., Cerquone Per- petuini, A., Iannuccelli, ...

work page arXiv 2025

[2] [2]

Focus on: Basal expression of target/perturb genes and key driver mutations (e.g., BRAF V600E)

work page

[3] [3]

If the target gene is not expressed, it cannot be downregulated further

work page

[4] [4]

USER PROMPT: Analyze context: Cell Line: {cell line}, Perturbation: {pertor moa}, Target Gene: {targetgene} Network agent You are a Systems Biology expert

Use ONLY biological facts related to the specific cell line. USER PROMPT: Analyze context: Cell Line: {cell line}, Perturbation: {pertor moa}, Target Gene: {targetgene} Network agent You are a Systems Biology expert. Trace the regulatory path from the perturbation target to the gene of interest. OUTPUT FORMAT (STRICT - JSON ONLY): { ”networkreasoning”: ”S...

work page

[5] [5]

Trace paths: (PerturbationTarget) -(relationship)-¿ (Intermediate) -(relationship)-¿ (Target- Gene)

work page

[6] [6]

Distinguish between ’Activity change’ and ’Expression change’

work page

[7] [7]

Identify feedback loops or compensatory mechanisms

work page

[8] [8]

Use biological knowledge graph’s pathway context if provided. USER PROMPT: Trace the network path: - Start Point (Perturbation Target):{perttarget} - End Point (Target Gene):{targetgene} Is there a known transcriptional or signaling link between these nodes? Mechanism agent You are a Molecular Pharmacologist. Define the immediate molecular consequence of ...

work page

[9] [9]

Does the reasoning explicitly or implicitly copy the direction (up/down) from prior cases?

work page

[10] [10]

- Using history direction as the primary or sole justification is NOT allowed

Is the final direction justified by perturbation-specific reasoning, or merely by similarity to previous genes? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Using history as contextual background is ALLOWED. - Using history direction as the primary or sole justification is NOT allowed. -...

work page

[11] [11]

Consistent reference to the given cell line?

work page

[12] [12]

Correct reference to the perturbation (gene or MoA)?

work page

[13] [13]

Correct and consistent reference to the target gene?

work page

[14] [14]

- Do NOT judge biological correctness or the final answer

Avoidance of unrelated cell lines, genes, or drugs? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Penalize ONLY explicit mismatches or hallucinated entities. - Do NOT judge biological correctness or the final answer. USER PROMPT: Inputs: Cell Line:{cellline}, Perturbation:{pertor moa}, Ta...

work page

[15] [15]

Does the reasoning argue for upregulation while the answer says downregulated?

work page

[16] [16]

Does the reasoning argue for downregulation while the answer says upregulated?

work page

[17] [17]

- If ANY inconsistency is found, verdict MUST be ”problematic”

Is the final answer unsupported or contradicted by the reasoning? OUTPUT FORMAT (STRICT - JSON ONLY): { ”verdict”: ”problematic” or ”not-problematic”, ”feedback”: ”...” } RULES: - Do NOT judge biological validity / grounding / history usage. - If ANY inconsistency is found, verdict MUST be ”problematic”. USER PROMPT: Canonical Reasoning:{canonicalreasonin...

work page