
arxiv: 2606.18933 · v2 · pith:DIEWGJY5new · submitted 2026-06-17 · 💻 cs.LG · cs.IR · stat.ME

Zero-Shot Active Feature Acquisition via LLM-Elicitation

Pith reviewed 2026-06-26 21:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.IR · stat.ME
keywords active feature acquisition · zero-shot · large language models · Markov random field · maximum entropy · inflammatory bowel disease · top-k identification · discriminative statistics

The pith

Eliciting discriminative unary and pairwise statistics from an LLM and closing them under maximum entropy enables zero-shot active feature acquisition without labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for active feature acquisition that requires no labeled training data by using an LLM to provide only the statistics that distinguish classes. It asks the LLM for unary deviations and pairwise co-variations rather than full class-conditional distributions, then applies a maximum-entropy closure to build a usable probabilistic model for deciding which features to acquire next. This separation lets the LLM contribute its domain knowledge while the acquisition policy is handled separately. Evaluated on IBD patient data for both binary classification and top-k identification, the resulting policies outperform the LLM itself and prior methods, with the largest gains on the hardest patients.

Core claim

By restricting LLM queries to the sufficient statistics of a Markov random field and resolving the resulting gauge ambiguity via maximum-entropy closure, the framework produces acquisition policies that work in a zero-shot setting and outperform all existing methods on the most ambiguous cases in the IBD cohort.

What carries the argument

Maximum-entropy closure of LLM-elicited unary deviations and pairwise co-variations to construct an MRF for guiding sequential feature acquisition.
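
The closure described above can be written out for a toy case. The block below is a sketch under stated assumptions, not the paper's construction: it reads "unary deviations" and "pairwise co-variations" as contrasts in the natural parameters of a pairwise MRF over features x in {-1,+1}^d, and the symbols \theta, W, \delta, \Delta, a, B are introduced here for illustration only. The paper may instead define the elicited quantities as moment differences, and its closure may fix the gauge differently.

```latex
% Class-conditional pairwise MRF over x \in \{-1,+1\}^d, class y \in \{0,1\}
p(x \mid y) = \frac{1}{Z_y}
  \exp\Big( \sum_i \theta_i^{(y)} x_i + \sum_{i<j} W_{ij}^{(y)} x_i x_j \Big)

% Elicited discriminative statistics: contrasts between the two classes
\delta_i = \theta_i^{(1)} - \theta_i^{(0)}, \qquad
\Delta_{ij} = W_{ij}^{(1)} - W_{ij}^{(0)}

% Gauge freedom: a shift shared by both classes leaves every contrast unchanged
\theta_i^{(y)} \mapsto \theta_i^{(y)} + a_i, \qquad
W_{ij}^{(y)} \mapsto W_{ij}^{(y)} + B_{ij}

% One illustrative gauge fixing: zero shared baseline
\theta_i^{(1)} = -\theta_i^{(0)} = \tfrac{1}{2}\,\delta_i, \qquad
W_{ij}^{(1)} = -W_{ij}^{(0)} = \tfrac{1}{2}\,\Delta_{ij}
```

With a zero shared baseline, both class-conditionals reduce to the uniform distribution when all contrasts vanish, which is the sense in which this particular choice adds no structure beyond what was elicited.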

If this is right

  • The method applies to both binary classification and top-k identification tasks.
  • The LLM returns reliable discriminative statistics rather than per-class marginals.
  • The closed model outperforms the LLM on both real labels and the LLM's own extracted beliefs.
  • On the hardest patients the top-k policy markedly outperforms existing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar elicitation could apply to other domains with expensive labels such as medical imaging or legal document review.
  • The approach may extend to settings where the LLM is queried for higher-order interactions if the closure can be generalized.
  • Testing the framework on non-medical datasets would show whether the gains are specific to heterogeneous patient data or more general.

Load-bearing premise

The LLM returns only discriminative statistics and the maximum-entropy closure correctly resolves the gauge ambiguity without introducing bias that invalidates the acquisition policy.

What would settle it

Observing that the acquisition policy derived from the closed MRF performs no better than random selection or supervised baselines on the hardest IBD cases would falsify the central claim.

Original abstract

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on a large amount of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.
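
To make the sequential selection step in the abstract concrete, here is a minimal Python sketch of greedy acquisition on a tiny binary MRF. It is illustrative only: the function names (`class_conditional`, `next_feature`), the zero-baseline closure, and the use of expected posterior entropy as the acquisition score are assumptions made for this example, not details confirmed by the abstract. Everything is computed by brute-force enumeration, so it only works for a handful of features.

```python
import itertools
import math


def class_conditional(delta, Delta, sign):
    """Exact p(x | y) over x in {-1,+1}^d for one class.

    Assumption: zero-baseline closure, so class parameters are
    sign * delta / 2 and sign * Delta / 2 (sign=+1 for y=1, -1 for y=0).
    """
    d = len(delta)
    states = list(itertools.product([-1, 1], repeat=d))
    weights = []
    for x in states:
        energy = sum(sign * 0.5 * delta[i] * x[i] for i in range(d))
        energy += sum(sign * 0.5 * Delta[i][j] * x[i] * x[j]
                      for i in range(d) for j in range(i + 1, d))
        weights.append(math.exp(energy))
    z = sum(weights)
    return {x: w / z for x, w in zip(states, weights)}


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def posterior_and_evidence(models, prior, observed):
    """Return (P(y | observed), P(observed)), marginalizing unobserved features."""
    joint = []
    for y, px in enumerate(models):
        marginal = sum(p for x, p in px.items()
                       if all(x[i] == v for i, v in observed.items()))
        joint.append(prior[y] * marginal)
    total = sum(joint)
    return [j / total for j in joint], total


def next_feature(models, prior, observed, d):
    """Greedy step: pick the unobserved feature with the lowest expected
    posterior entropy over the class label."""
    _, evidence = posterior_and_evidence(models, prior, observed)
    best, best_h = None, float("inf")
    for i in range(d):
        if i in observed:
            continue
        h = 0.0
        for v in (-1, 1):
            post, ev_v = posterior_and_evidence(models, prior, {**observed, i: v})
            h += (ev_v / evidence) * entropy(post)  # weight by P(x_i = v | observed)
        if h < best_h:
            best, best_h = i, h
    return best


# Hypothetical contrasts: feature 0 separates the classes most strongly.
delta = [2.0, 0.5, 0.0]
Delta = [[0.0] * 3 for _ in range(3)]
models = [class_conditional(delta, Delta, -1), class_conditional(delta, Delta, +1)]
first = next_feature(models, [0.5, 0.5], {}, 3)
```

In this toy setup the policy never consults the LLM at decision time; it only uses the closed model, which mirrors the separation of "knowing" and "deciding" that the abstract argues for.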

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a zero-shot active feature acquisition framework that elicits unary deviations and pairwise co-variations from LLMs as the sufficient statistics of a Markov random field. Because the LLM provides only discriminative statistics, a maximum-entropy closure is used to resolve the resulting gauge ambiguity. The framework is evaluated on an Inflammatory Bowel Disease patient dataset for binary classification and top-k identification, where it outperforms existing methods and the LLM itself, most notably on the hardest patients.

Significance. This approach has the potential to enable active feature acquisition in domains with limited labeled data by leveraging LLM domain knowledge in a structured manner without requiring the LLM to perform sequential planning. The principled handling of incomplete statistics via max-ent closure is a key technical contribution. The empirical evaluation on a real clinical dataset provides evidence of practical utility, particularly if the gains on hard cases are robust.

major comments (2)
  1. [Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.
  2. [Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.
minor comments (2)
  1. [Abstract] Abstract: the description of the two settings (binary classification and top-k identification) would benefit from naming the primary performance metrics used in each.
  2. [Throughout] Notation: ensure consistent use of symbols for unary deviations and pairwise co-variations across the elicitation and closure steps.
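
The small-scale verification requested in major comment 2 can be prototyped on a toy model. The sketch below is not the paper's closure: it assumes independent features x in {-1,+1} with p(x | y) proportional to exp(theta_y * x), a zero-baseline gauge as the stand-in closure, and expected posterior entropy as the acquisition score. The function names and the numeric contrasts are hypothetical. It compares the first acquisition chosen under the closed model with the one chosen under a ground truth that has the same contrasts plus a shared baseline.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def binary_entropy(p):
    """Entropy in nats of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def expected_posterior_entropy(theta1, theta0, prior1=0.5):
    """Expected H(Y | X) for one feature with p(x | y) = sigmoid(2 * theta_y * x)."""
    h = 0.0
    for x in (-1, 1):
        l1 = sigmoid(2 * theta1 * x)  # P(x | y = 1)
        l0 = sigmoid(2 * theta0 * x)  # P(x | y = 0)
        px = prior1 * l1 + (1 - prior1) * l0
        h += px * binary_entropy(prior1 * l1 / px)
    return h


def first_pick(delta, baseline):
    """Index of the feature a greedy policy would acquire first.

    delta[i] is the class contrast; baseline[i] is a shift shared by both
    classes (the gauge degree of freedom). baseline = 0 is the assumed closure.
    """
    scores = [expected_posterior_entropy(a + d / 2, a - d / 2)
              for d, a in zip(delta, baseline)]
    return min(range(len(scores)), key=scores.__getitem__)


delta = [1.0, 1.2]                           # hypothetical elicited contrasts
closed = first_pick(delta, [0.0, 0.0])       # policy under the assumed closure
shifted = first_pick(delta, [0.0, 3.0])      # ground truth with a shared baseline
```

In this toy setting the two picks differ once feature 1 carries a large shared baseline, because that feature is then almost always "on" in both classes and carries little information. A verification appendix would therefore need to state which ground-truth models count as consistent with the elicited statistics.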

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We appreciate the recognition of the framework's potential and the technical contribution of the max-ent closure. We address each major comment below.

  1. Referee: [Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.

    Authors: We thank the referee for this observation. The comparison against real labels is fully independent of the LLM beliefs. To eliminate any risk of perceived circularity in the 'on its own extracted beliefs' results, we will revise the evaluation section to include quantitative alignment metrics (e.g., agreement rates and correlation between LLM-derived statistics and real labels) and explicitly discuss how these metrics inform the interpretation of the LLM-belief comparison. revision: yes

  2. Referee: [Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.

    Authors: We agree that an explicit verification would strengthen the methodological claim. The max-ent closure yields the unique maximum-entropy distribution consistent with the provided statistics and therefore introduces no additional bias beyond those constraints. In the revision we will add a small-scale synthetic verification (new appendix) demonstrating that, when the elicited statistics are consistent with a fully specified ground-truth MRF, the acquisition policy obtained after closure is identical to the policy obtained from the fully specified model. revision: yes
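
The alignment metrics promised in the first response (agreement rates and correlation) could be computed along the following lines. This is a minimal sketch with hypothetical helper names and toy inputs; the rebuttal does not specify which statistics are compared or how they are encoded.

```python
import math


def agreement_rate(llm_labels, true_labels):
    """Fraction of cases where the LLM-derived label equals the real label."""
    pairs = list(zip(llm_labels, true_labels))
    if not pairs:
        raise ValueError("need at least one pair")
    return sum(a == b for a, b in pairs) / len(pairs)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Toy inputs: LLM-belief labels versus real labels for four patients.
rate = agreement_rate([1, 0, 1, 1], [1, 0, 0, 1])
```

Reporting both numbers alongside the "own extracted beliefs" results would let a reader judge how far that comparison overlaps with the real-label comparison.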

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The central derivation elicits unary/pairwise statistics from the LLM, applies a maximum-entropy closure for gauge resolution, and evaluates the resulting acquisition policy on an external IBD patient cohort with real labels. This provides an independent benchmark. The secondary comparison to the LLM's own extracted beliefs does not reduce the policy derivation or main performance claims to a self-referential fit. No equations, self-citations, or fitted-input-as-prediction steps are visible that would collapse the claimed results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLM-elicited unary and pairwise statistics are sufficient to parameterize an MRF whose max-ent closure yields a useful acquisition policy; no free parameters are explicitly fitted in the abstract, but the closure itself acts as an implicit modeling choice.

axioms (2)
  • domain assumption: the LLM returns only discriminative (class-separating) statistics rather than full per-class marginals.
    Stated in the abstract as the reason classical AFA is precluded and max-ent closure is required.
  • domain assumption: the maximum-entropy distribution is the appropriate closure for the elicited statistics.
    Invoked to resolve gauge ambiguity without further justification visible in the abstract.

pith-pipeline@v0.9.1-grok · 5777 in / 1541 out tokens · 19154 ms · 2026-06-26T21:41:21.668710+00:00 · methodology


Elicitation procedure excerpts

Phenotype-directed unary elicitation. A clinical collaborator (physician) specifies the phenotype set {E1, . . . , EK} of interest. For each phenotype E, we query three independent LLMs with unary prompts asking for (i) whether a candidate gene is associated with E, and (ii) the direction of effect under E (up- or down-regulation). System prompt: unary elici...

Consensus-and-evidence filtering. The unary responses from the three LLMs are filtered in two passes. First, a cross-LLM consensus check retains only gene–phenotype associations agreed on by the LLMs with consistent direction. Second, a literature-evidence check requires each surviving association to be supported by curated literature sources of sufficient rel...

Pairwise enumeration on survivors. After filtering by the LLM's responses, we enumerated all gene pairs within each surviving set as the universe of pairwise queries. Each prompt was augmented with retrieval-grounded context: top-ranked articles from PubMed and bioRxiv, filtered for relevance to the gene, the phenotype, and the IBD context. The LLM was ins...

Extracted prompt fragment: "given that the expression of gene l is shifted in direction d, how does this modify the likelihood that gene j is also shifted in the same direction under phenotype E?"
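
The consensus check and the pairwise enumeration described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' pipeline: it treats "agreed on by the LLMs" as unanimity, represents each LLM's unary answers as a dict from gene to direction, enumerates unordered pairs, and omits the literature-evidence pass entirely. The gene names are placeholders.

```python
from itertools import combinations


def consensus_filter(responses):
    """Keep genes that every LLM reports with the same direction.

    responses: one dict per LLM, mapping gene name -> "up" or "down";
    a gene missing from a dict means that LLM reported no association.
    Assumption: consensus means unanimity across all LLMs.
    """
    if not responses:
        return {}
    kept = {}
    for gene, direction in responses[0].items():
        if all(r.get(gene) == direction for r in responses[1:]):
            kept[gene] = direction
    return kept


def pairwise_queries(survivors):
    """Enumerate all unordered gene pairs within a surviving set."""
    return list(combinations(sorted(survivors), 2))


# Placeholder answers from three hypothetical LLMs for one phenotype.
responses = [
    {"GENE_A": "up", "GENE_B": "down", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_B": "up", "GENE_C": "up"},
    {"GENE_A": "up", "GENE_C": "up"},
]
survivors = consensus_filter(responses)
pairs = pairwise_queries(survivors)
```

Filtering before enumeration keeps the number of pairwise prompts quadratic in the number of survivors rather than in the number of candidate genes. If the pairwise prompt is directional, as the extracted fragment suggests, ordered pairs (`itertools.permutations`) would be the closer match.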