Zero-Shot Active Feature Acquisition via LLM-Elicitation

Binyamin Perets; Natalie Mendelson; Shai Shen-Orr; Shie Mannor; Shiran Vainberg; Yehuda Chowers

arxiv: 2606.18933 · v2 · pith:DIEWGJY5new · submitted 2026-06-17 · 💻 cs.LG · cs.IR · stat.ME

Binyamin Perets, Natalie Mendelson, Shiran Vainberg, Yehuda Chowers, Shai Shen-Orr, Shie Mannor

Pith reviewed 2026-06-26 21:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.IR · stat.ME

keywords: active feature acquisition · zero-shot · large language models · Markov random field · maximum entropy · inflammatory bowel disease · top-k identification · discriminative statistics

The pith

Eliciting discriminative unary and pairwise statistics from an LLM and closing them under maximum entropy enables zero-shot active feature acquisition without labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework for active feature acquisition that requires no labeled training data by using an LLM to provide only the statistics that distinguish classes. It asks the LLM for unary deviations and pairwise co-variations rather than full class-conditional distributions, then applies a maximum-entropy closure to build a usable probabilistic model for deciding which features to acquire next. This separation lets the LLM contribute its domain knowledge while the acquisition policy is handled separately. Evaluated on IBD patient data for both binary classification and top-k identification, the resulting policies outperform the LLM itself and prior methods, with the largest gains on the hardest patients.

Core claim

By restricting LLM queries to the sufficient statistics of a Markov random field and resolving the resulting gauge ambiguity via maximum-entropy closure, the framework produces acquisition policies that work in a zero-shot setting and outperform all existing methods on the most ambiguous cases in the IBD cohort.

What carries the argument

Maximum-entropy closure of LLM-elicited unary deviations and pairwise co-variations to construct an MRF for guiding sequential feature acquisition.
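
To make the gauge ambiguity concrete, here is a hedged sketch in assumed notation. The symbols $h$, $J$, $\Delta h$, $\Delta J$, $b$, $B$ and the choice of $\pm 1$ features are illustrative assumptions, not the paper's own definitions, and reading the elicited "deviations" and "co-variations" as differences of natural parameters is likewise an assumption. For binary features, the maximum-entropy distribution matching given first and second moments is a pairwise MRF, and class-separating statistics fix only the difference between the two class-conditional models:

```latex
% Assumed notation: binary features x_i \in \{-1,+1\}, class label y \in \{0,1\}.
% Maximum-entropy model matching first and second moments within class y:
p(x \mid y) = \frac{1}{Z_y}\exp\Big(\sum_i h^{(y)}_i x_i + \sum_{i<j} J^{(y)}_{ij} x_i x_j\Big)

% Discriminative elicitation pins down only the class differences:
\Delta h_i = h^{(1)}_i - h^{(0)}_i, \qquad \Delta J_{ij} = J^{(1)}_{ij} - J^{(0)}_{ij}

% Gauge freedom: a shared baseline (b, B) shifts both classes and leaves the differences unchanged:
h^{(y)}_i \to h^{(y)}_i + b_i, \qquad J^{(y)}_{ij} \to J^{(y)}_{ij} + B_{ij}

% The log-likelihood ratio on a fully observed x shows where the baseline still enters:
\log\frac{p(x \mid 1)}{p(x \mid 0)} = \sum_i \Delta h_i x_i + \sum_{i<j} \Delta J_{ij} x_i x_j - \log\frac{Z_1}{Z_0}
```

The ratio $Z_1/Z_0$, and every marginal over features not yet observed, depends on the baseline $(b, B)$. Sequential acquisition needs those marginals, so some rule must pick a baseline; under this reading, that is the role the closure plays.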

If this is right

The method applies to both binary classification and top-k identification tasks.
The LLM returns reliable discriminative statistics rather than per-class marginals.
The closed model outperforms the LLM on both real labels and the LLM's own extracted beliefs.
On the hardest patients the top-k policy markedly outperforms existing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar elicitation could apply to other domains with expensive labels such as medical imaging or legal document review.
The approach may extend to settings where the LLM is queried for higher-order interactions if the closure can be generalized.
Testing the framework on non-medical datasets would show whether the gains are specific to heterogeneous patient data or more general.

Load-bearing premise

The LLM returns only discriminative statistics and the maximum-entropy closure correctly resolves the gauge ambiguity without introducing bias that invalidates the acquisition policy.

What would settle it

Observing that the acquisition policy derived from the closed MRF performs no better than random selection or supervised baselines on the hardest IBD cases would falsify the central claim.

Original abstract

Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amounts of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.
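
The pipeline the abstract describes (elicited differences, a closure, then sequential acquisition) can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: features are $\pm 1$, the closure is a zero shared baseline (class 1 gets $+\Delta/2$, class 0 gets $-\Delta/2$), inference is brute-force enumeration, and the policy is greedy expected entropy reduction. All function names are hypothetical.

```python
import itertools
import math

def class_conditional(h, J, n):
    """Brute-force pairwise MRF over x in {-1,+1}^n (only feasible for small n)."""
    states = list(itertools.product((-1, 1), repeat=n))
    weights = []
    for x in states:
        e = sum(h[i] * x[i] for i in range(n))
        e += sum(J.get((i, j), 0.0) * x[i] * x[j]
                 for i in range(n) for j in range(i + 1, n))
        weights.append(math.exp(e))
    z = sum(weights)
    return {x: w / z for x, w in zip(states, weights)}

def close(dh, dJ, n):
    """ASSUMED closure: zero shared baseline, so the classes get +delta/2 and -delta/2."""
    p1 = class_conditional([d / 2 for d in dh], {k: v / 2 for k, v in dJ.items()}, n)
    p0 = class_conditional([-d / 2 for d in dh], {k: -v / 2 for k, v in dJ.items()}, n)
    return p0, p1

def evidence(p, observed):
    """Marginal probability of the observed feature values under one class."""
    return sum(pr for x, pr in p.items()
               if all(x[i] == v for i, v in observed.items()))

def posterior(p0, p1, observed, prior=0.5):
    """P(y = 1 | observed features) by Bayes' rule."""
    a = prior * evidence(p1, observed)
    b = (1 - prior) * evidence(p0, observed)
    return a / (a + b)

def entropy(q):
    """Binary entropy in nats."""
    return 0.0 if q in (0.0, 1.0) else -(q * math.log(q) + (1 - q) * math.log(1 - q))

def next_feature(p0, p1, observed, n, prior=0.5):
    """Greedy step: unobserved feature with the largest expected entropy reduction."""
    q = posterior(p0, p1, observed, prior)
    best, best_gain = None, -1.0
    for i in range(n):
        if i in observed:
            continue
        expected = 0.0
        for v in (-1, 1):
            obs = {**observed, i: v}
            # Predictive probability of x_i = v given what has been observed so far.
            pv = (q * evidence(p1, obs) / evidence(p1, observed)
                  + (1 - q) * evidence(p0, obs) / evidence(p0, observed))
            expected += pv * entropy(posterior(p0, p1, obs, prior))
        gain = entropy(q) - expected
        if gain > best_gain:
            best, best_gain = i, gain
    return best, best_gain
```

With `p0, p1 = close([2.0, 0.2, 0.0], {(0, 1): 0.5}, 3)`, calling `next_feature(p0, p1, {}, 3)` selects feature 0, the one with the largest elicited deviation, while feature 2 (zero deviation, no couplings) carries no information about the class. Enumeration scales as $2^n$, so anything beyond a handful of features would need approximate inference.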

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core move—eliciting only discriminative MRF stats from an LLM then closing with max-ent for zero-shot AFA—looks workable and shows gains on hard IBD cases, though the evaluation setup needs closer scrutiny.

The new piece is the disciplined split: ask the LLM only for unary deviations and pairwise co-variations that distinguish classes, then use maximum-entropy closure to fix the gauge ambiguity that comes from getting discriminative rather than full marginals. That lets them run active feature acquisition without labeled data for the model. On the IBD cohort the top-k policy beats existing methods on the hardest patients and also beats the LLM itself when tested against real labels.

It does a clean job separating what the LLM is good at (supplying those statistics) from what it is not (sequential planning). The clinical setting is a reasonable test bed where labels are scarce and heterogeneity is high.

The soft spots are in the evaluation and the closure step. Testing against the LLM's own extracted beliefs introduces circularity that is not fully quantified, even if the real-label results are the main claim. Without seeing the exact derivation of the max-ent closure or error bars on the acquisition curves, it is hard to tell how sensitive the gains are to the choice of closure or to noise in the LLM responses. The assumption that the LLM consistently returns only the discriminative statistics also needs explicit checks.

This is for people working on active learning or decision support in medical domains with limited labels. It is not a broad theoretical advance but a practical engineering step that could be useful.

I would send it to peer review. The idea is coherent and the reported gains on the hardest cases are worth checking with full details and more controls.

Referee Report

2 major / 2 minor

Summary. The paper introduces a zero-shot active feature acquisition framework that elicits unary deviations and pairwise co-variations from LLMs to form sufficient statistics for a Markov random field, uses a maximum-entropy closure to resolve gauge ambiguity caused by the LLM providing only discriminative statistics, and demonstrates its effectiveness on an Inflammatory Bowel Disease patient dataset for binary classification and top-k identification tasks, with notable outperformance on the hardest patients compared to existing methods and the LLM itself.

Significance. This approach has the potential to enable active feature acquisition in domains with limited labeled data by leveraging LLM domain knowledge in a structured manner without requiring the LLM to perform sequential planning. The principled handling of incomplete statistics via max-ent closure is a key technical contribution. The empirical evaluation on a real clinical dataset provides evidence of practical utility, particularly if the gains on hard cases are robust.

major comments (2)

[Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.
[Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.

minor comments (2)

[Abstract] Abstract: the description of the two settings (binary classification and top-k identification) would benefit from naming the primary performance metrics used in each.
[Throughout] Notation: ensure consistent use of symbols for unary deviations and pairwise co-variations across the elicitation and closure steps.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. We appreciate the recognition of the framework's potential and the technical contribution of the max-ent closure. We address each major comment below.

Referee: [Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.

Authors: We thank the referee for this observation. The comparison against real labels is fully independent of the LLM beliefs. To eliminate any risk of perceived circularity in the 'on its own extracted beliefs' results, we will revise the evaluation section to include quantitative alignment metrics (e.g., agreement rates and correlation between LLM-derived statistics and real labels) and explicitly discuss how these metrics inform the interpretation of the LLM-belief comparison.
revision: yes

Referee: [Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.

Authors: We agree that an explicit verification would strengthen the methodological claim. The max-ent closure yields the unique maximum-entropy distribution consistent with the provided statistics and therefore introduces no additional bias beyond those constraints. In the revision we will add a small-scale synthetic verification (new appendix) demonstrating that, when the elicited statistics are consistent with a fully specified ground-truth MRF, the acquisition policy obtained after closure is identical to the policy obtained from the fully specified model.
revision: yes
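
The alignment metrics promised in the first response (agreement rates and correlation between LLM-derived statistics and real labels) could be computed along the following lines. This is an illustrative sketch only: the rebuttal does not define the metrics, so the sign-agreement rate, the Pearson correlation, and the use of class-mean differences as the empirical counterpart of a unary deviation are all assumptions, and the function names are hypothetical.

```python
import math

def class_mean_difference(rows, labels):
    """Empirical unary deviation per feature: mean in class 1 minus mean in class 0."""
    n1 = sum(labels)
    n0 = len(labels) - n1
    d = len(rows[0])
    return [
        sum(r[j] for r, y in zip(rows, labels) if y == 1) / n1
        - sum(r[j] for r, y in zip(rows, labels) if y == 0) / n0
        for j in range(d)
    ]

def sign(v):
    return (v > 0) - (v < 0)

def sign_agreement(elicited, empirical):
    """Fraction of features whose elicited direction matches the empirical one.

    Features where either value is exactly zero are skipped.
    """
    pairs = [(e, m) for e, m in zip(elicited, empirical)
             if sign(e) != 0 and sign(m) != 0]
    return sum(sign(e) == sign(m) for e, m in pairs) / len(pairs)

def pearson(a, b):
    """Pearson correlation between elicited and empirical deviations."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)
```

A high agreement rate would mean that beating the LLM "on its own extracted beliefs" is close to beating it on real labels; a low rate would make the two comparisons genuinely different tests.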

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The central derivation elicits unary/pairwise statistics from the LLM, applies a maximum-entropy closure for gauge resolution, and evaluates the resulting acquisition policy on an external IBD patient cohort with real labels. This provides an independent benchmark. The secondary comparison to the LLM's own extracted beliefs does not reduce the policy derivation or main performance claims to a self-referential fit. No equations, self-citations, or fitted-input-as-prediction steps are visible that would collapse the claimed results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLM-elicited unary and pairwise statistics are sufficient to parameterize an MRF whose max-ent closure yields a useful acquisition policy. The abstract reports no explicitly fitted free parameters, but the closure itself acts as an implicit modeling choice.

axioms (2)

domain assumption: The LLM returns only discriminative (class-separating) statistics rather than full per-class marginals.
Stated in the abstract as the reason classical AFA is precluded and max-ent closure is required.
domain assumption: The maximum-entropy distribution is the appropriate closure for the elicited statistics.
Invoked to resolve gauge ambiguity without further justification visible in the abstract.

pith-pipeline@v0.9.1-grok · 5777 in / 1541 out tokens · 19154 ms · 2026-06-26T21:41:21.668710+00:00 · methodology

Elicitation procedure (excerpts extracted from the paper; truncated at the source)

Phenotype-directed unary elicitation. A clinical collaborator (physician) specifies the phenotype set $\{E_1, \ldots, E_K\}$ of interest. For each phenotype $E$, we query three independent LLMs with unary prompts asking for (i) whether a candidate gene is associated with $E$, and (ii) the direction of effect under $E$ (up- or down-regulation). System prompt: unary elici...

Consensus-and-evidence filtering. The unary responses from the three LLMs are filtered in two passes. First, a cross-LLM consensus check retains only gene–phenotype associations agreed on by the LLMs with consistent direction. Second, a literature-evidence check requires each surviving association to be supported by curated literature sources of sufficient rel...

Pairwise enumeration on survivors. After filtering by the LLM’s responses, we enumerated all gene pairs within each surviving set as the universe of pairwise queries. Each prompt was augmented with retrieval-grounded context: top-ranked articles from PubMed and bioRxiv, filtered for relevance to the gene, the phenotype, and the IBD context. The LLM was ins...

Extracted fragment: "given that the expression of gene l is shifted in direction d, how does this modify the likelihood that gene j is also shifted in the same direction under phenotype E?"
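
The consensus pass and the pairwise enumeration described in these excerpts can be sketched as follows. Several things here are assumptions made for illustration: the data layout (one dict per LLM mapping a gene–phenotype pair to a direction), the reading of "agreed on by the LLMs" as unanimous agreement, and the function names. The literature-evidence pass and the retrieval-grounded prompting are not modeled, and the gene and phenotype identifiers in the usage note are placeholders.

```python
from itertools import combinations

def consensus_filter(responses):
    """Keep associations that every LLM reports with the same direction.

    responses: one dict per LLM, {(gene, phenotype): "up" | "down"};
    a missing key means that LLM did not report the association.
    ASSUMPTION: consensus is taken to mean unanimous agreement.
    """
    first, *rest = responses
    return {
        key: direction
        for key, direction in first.items()
        if all(r.get(key) == direction for r in rest)
    }

def pairwise_queries(kept):
    """Enumerate all unordered gene pairs within each phenotype's surviving set."""
    by_phenotype = {}
    for gene, phenotype in kept:
        by_phenotype.setdefault(phenotype, []).append(gene)
    return {
        phenotype: list(combinations(sorted(genes), 2))
        for phenotype, genes in by_phenotype.items()
    }
```

For example, if three LLMs all report `("G1", "E1")` and `("G3", "E1")` as `"up"` but disagree on `("G2", "E1")`, the filter keeps `G1` and `G3`, and `pairwise_queries` yields the single query pair `("G1", "G3")` for phenotype `E1`. The number of pairwise queries grows quadratically in the size of each surviving set, which is one reason to filter before enumerating.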

[1] [1]

The Claude 3 model family: Opus, Sonnet, Haiku

Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku. An- thropic Technical Report, 2024. URL https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

2024

[2] [2]

Biopsy expression profiling of an adult inflammatory bowel disease cohort

Carmen Argmann, Mayte Suárez-Fariñas, Ruiqi Hou, and Haritz Irizar. Biopsy expression profiling of an adult inflammatory bowel disease cohort. https://www.ncbi.nlm.nih.gov/ geo/query/acc.cgi?acc=GSE193677, 2022. GEO Accession GSE193677

2022

[3] [3]

Preference- based online learning with dueling bandits: A survey.Journal of Machine Learning Research, 22(7):1–108, 2021

Viktor Bengs, Róbert Busa-Fekete, Adil El Mesaoudi-Paul, and Eyke Hüllermeier. Preference- based online learning with dueling bandits: A survey.Journal of Machine Learning Research, 22(7):1–108, 2021. URLhttp://jmlr.org/papers/v22/18-546.html

2021

[4] [4]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-V oss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

2020

[5] [5]

Sparks of artificial general intelligence: Early experiments with GPT-4, 2023

Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023

2023

[6] [6]

Sequential design of experiments.The Annals of Mathematical Statistics, 30 (3):755–770, 1959

Herman Chernoff. Sequential design of experiments.The Annals of Mathematical Statistics, 30 (3):755–770, 1959. doi: 10.1214/aoms/1177706205

work page doi:10.1214/aoms/1177706205 1959

[7] [7]

Gusev, and Andrew I

Abdoulatif Cissé, Xenophon Evangelopoulos, Vladimir V . Gusev, and Andrew I. Cooper. Language-based bayesian optimization research assistant (BORA). InProceedings of the Thirty- Fourth International Joint Conference on Artificial Intelligence, pages 4967–4975. International Joint Conferences on Artificial Intelligence Organization, 2025. doi: 10.24963/ijc...

work page doi:10.24963/ijcai.2025/553 2025

[8] Ian Connick Covert, Wei Qiu, Mingyu Lu, Nayoon Kim, Nathan J. White, and Su-In Lee. Learning to maximize mutual information for dynamic feature selection. In International Conference on Machine Learning (ICML 2023), volume 202 of Proceedings of Machine Learning Research, pages 6424–6447. PMLR, 2023. URL https://proceedings.mlr.press/v202/covert23a.html

[9] Justin Domke. Large language bayes, 2025

[10] Soham Gadgil, Ian Connick Covert, and Su-In Lee. Estimating conditional mutual information for dynamic feature selection. In The Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net, 2024. URL https://openreview.net/forum?id=Oju2Qu9jvn

[11] Aritra Ghosh and Andrew Lan. Difa: Differentiable feature acquisition. Proceedings of the AAAI Conference on Artificial Intelligence, 37(6):7705–7713, 2023. doi: 10.1609/aaai.v37i6.25934

[12] Rushil Gupta, Jason Hartford, and Bang Liu. LLMs for bayesian optimization in scientific domains: Are we there yet? In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 15482–15510. Association for Computational Linguistics, 2025. doi: 10.18653/v1/2025.findings-emnlp.838

[13] Jaromír Janisch, Tomáš Pevný, and Viliam Lisý. Classification with costly features using deep reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3959–3966, 2019. doi: 10.1609/aaai.v33i01.33013959

[14] E. T. Jaynes. Information theory and statistical mechanics. Phys. Rev., 106:620–630, May 1957. doi: 10.1103/PhysRev.106.620. URL https://link.aps.org/doi/10.1103/PhysRev.106.620

[15] Edwin T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620–630, 1957

[16] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec,... Language models (mostly) know what they know, 2022

[17] Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Mudit Verma, Kaya Stechly, Siddhant Bhambri, Lucas Paul Saldyt, and Anil B Murthy. Position: LLMs can’t plan, but can help planning in LLM-modulo frameworks. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 22895–22907. P...

2024

[18] Michael Katz, Harsha Kokel, Kavitha Srinivas, and Shirin Sohrabi. Thought of search: Planning with language models through the lens of efficiency. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=lNCsyA5uS1

[19] Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, and Shalmali Joshi. Learning-to-measure: In-context active feature acquisition. arXiv preprint arXiv:2510.12624, 2025

[20] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html

[21] Yang Li and Junier Oliva. Active feature acquisition with generative surrogate models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6450–6459. PMLR, 2021. URL https://proceedings.mlr.press/v139/li21p.html

[22] Tennison Liu, Nicolás Astorga, Nabeel Seedat, and Mihaela van der Schaar. Large language models to enhance bayesian optimization. In The Twelfth International Conference on Learning Representations (ICLR 2024). OpenReview.net, 2024. URL https://openreview.net/forum?id=OOxotBmGol

[23] Chao Ma, Sebastian Tschiatschek, Konstantina Palla, José Miguel Hernández-Lobato, Sebastian Nowozin, and Cheng Zhang. EDDI: efficient dynamic discovery of high-value information with partial VAE. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pages 4234–4243. PMLR, 2019. URL http://proceedings.mlr.press/v97/ma19c.html

[25] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. Active feature-value acquisition for classifier induction. In Fourth IEEE International Conference on Data Mining (ICDM’04), pages 483–486. IEEE, 2004. doi: 10.1109/ICDM.2004.10075

[26] Joris M. Mooij and Hilbert J. Kappen. Sufficient conditions for convergence of the sum–product algorithm. IEEE Transactions on Information Theory, 53(12):4422–4437, 2007. doi: 10.1109/TIT.2007.909166

[27] Aliakbar Nafar, Kristen Brent Venable, Zijun Cui, and Parisa Kordjamshidi. Extracting probabilistic knowledge from large language models for Bayesian network parameterization, 2025

[28] OpenAI. GPT-4 technical report, 2023

[29] Arman Rahbar, Linus Aronsson, and Morteza Haghir Chehreghani. A survey on active feature acquisition strategies. arXiv preprint arXiv:2502.11067, 2025

[30] Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18, 2013. doi: 10.1214/ECP.v18-2865

[31] Maytal Saar-Tsechansky, Prem Melville, and Foster Provost. Active feature-value acquisition. Management Science, 55(4):664–684, 2009. doi: 10.1287/mnsc.1080.0952

[32] Valter Schütz, Han Wu, Reza Rezvan, Linus Aronsson, and Morteza Haghir Chehreghani. AFABench: A generic framework for benchmarking active feature acquisition, 2025. URL https://doi.org/10.48550/arXiv.2508.14734

[33] Hajin Shim, Sung Ju Hwang, and Eunho Yang. Joint active feature acquisition and classification with variable-size set encoding. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/e5841df2166dd424a57127423d276bbe-Paper.pdf

[34] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. On the self-verification limitations of large language models on reasoning and planning tasks, 2024. URL https://doi.org/10.48550/arXiv.2402.08115. arXiv admin note: text overlap with arXiv:2310.12397

[35] Sekhar Tatikonda and Michael I. Jordan. Loopy belief propagation and gibbs measures. In UAI ’02, Proceedings of the 18th Conference in Uncertainty in Artificial Intelligence, pages 493–500. Morgan Kaufmann, 2002

[36] Michael Valancius, Maxwell Lennon, and Junier Oliva. Acquisition conditioned oracle for nongreedy active feature acquisition. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 48957–48975. PMLR, 2024. URL https://proceedings.mlr.press/v235/valancius24a.html

[37] Karthik Valmeekam, Matthew Marquez, Alberto Olmo, Sarath Sreedharan, and Subbarao Kambhampati. Planbench: An extensible benchmark for evaluating large language models on planning and reasoning about change. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=YXogl4uQUO

[38] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models – a critical investigation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=X6dEqXIsEW

[39] Abraham Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945. doi: 10.1214/aoms/1177731118

[40] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35 (NeurIPS 2022), 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-...

[41] Yisong Yue, Josef Broder, Robert Kleinberg, and Thorsten Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. doi: 10.1016/j.jcss.2011.12.028

[42] Sara Zannone, José Miguel Hernández-Lobato, Cheng Zhang, and Konstantina Palla. ODIN: Optimal discovery of high-value INformation using model-based deep reinforcement learning. In Real-world Sequential Decision Making Workshop at ICML, 2019

[43] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 12697–12706. PMLR, 2021. URL https://proceedings.mlr.press/v139/zhao21c.html

[44] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://ope...

Phenotype-directed unary elicitation. A clinical collaborator (physician) specifies the phenotype set $\{E_1, \dots, E_K\}$ of interest. For each phenotype $E$, we query three independent LLMs with unary prompts asking (i) whether a candidate gene is associated with $E$, and (ii) what the direction of effect is under $E$ (up- or down-regulation). System prompt: unary elici...

Consensus-and-evidence filtering. The unary responses from the three LLMs are filtered in two passes. First, a cross-LLM consensus check retains only gene–phenotype associations agreed on by the LLMs with consistent direction. Second, a literature-evidence check requires each surviving association to be supported by curated literature sources of sufficient rel...
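As an illustration of the first (consensus) pass only, the following is a minimal Python sketch. It assumes each LLM's unary output has been parsed into a dictionary keyed by (gene, phenotype) with a direction label, and it interprets "agreed on by the LLMs" as unanimous agreement; both are assumptions, not details stated in the text. The function name `consensus_filter`, the type alias, and the gene names are hypothetical placeholders. The literature-evidence pass is omitted because its criterion is not fully specified here.

```python
from typing import Dict, List, Tuple

# (gene, phenotype) -> direction ("up" or "down"). A missing key means that
# LLM reported no association. This layout is an assumption for illustration.
UnaryResponse = Dict[Tuple[str, str], str]


def consensus_filter(responses: List[UnaryResponse]) -> UnaryResponse:
    """Keep associations reported by every LLM with the same direction.

    Unanimity is an assumed reading of the consensus rule.
    """
    if not responses:
        return {}
    kept: UnaryResponse = {}
    for key, direction in responses[0].items():
        # Every other LLM must report the same key with the same direction.
        if all(r.get(key) == direction for r in responses[1:]):
            kept[key] = direction
    return kept


# Toy responses from three hypothetical LLMs (placeholder gene names).
llm_a = {("GENE_A", "E1"): "up", ("GENE_B", "E1"): "down", ("GENE_C", "E1"): "up"}
llm_b = {("GENE_A", "E1"): "up", ("GENE_B", "E1"): "up", ("GENE_C", "E1"): "up"}
llm_c = {("GENE_A", "E1"): "up", ("GENE_B", "E1"): "down"}

survivors = consensus_filter([llm_a, llm_b, llm_c])
print(survivors)
```

In this toy input only GENE_A survives: GENE_B has conflicting directions and GENE_C is missing from one response. A majority-vote variant would be an equally plausible reading and would change only the `all(...)` condition.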

given that the expression of gene l is shifted in direction d, how does this modify the likelihood that gene j is also shifted in the same direction under phenotype E?

Pairwise enumeration on survivors. After filtering by the LLM’s responses, we enumerated all gene pairs within each surviving set as the universe of pairwise queries. Each prompt was augmented with retrieval-grounded context: top-ranked articles from PubMed and bioRxiv, filtered for relevance to the gene, the phenotype, and the IBD context. The LLM was ins...
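The enumeration step can be sketched in a few lines of Python. This is a minimal illustration that assumes the surviving genes are stored per phenotype and that pairs are unordered; the function name `enumerate_pairs`, the data layout, and the gene names are hypothetical. If the pairwise queries are directional (conditioning one gene on the other), `itertools.permutations` would be the analogous choice.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def enumerate_pairs(
    survivors: Dict[str, List[str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Map each phenotype to all unordered gene pairs in its surviving set."""
    return {
        # Deduplicate and sort so the pair order is deterministic.
        phenotype: list(combinations(sorted(set(genes)), 2))
        for phenotype, genes in survivors.items()
    }


# Toy surviving sets for two phenotypes (placeholder names).
surviving_sets = {"E1": ["GENE_A", "GENE_B", "GENE_C"], "E2": ["GENE_A"]}
pair_universe = enumerate_pairs(surviving_sets)
print(pair_universe["E1"])
```

A surviving set of n genes yields n(n-1)/2 unordered pairs, so the number of pairwise queries grows quadratically with the size of each filtered set; a set with a single gene yields no pairs.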