Zero-Shot Active Feature Acquisition via LLM-Elicitation
Pith reviewed 2026-06-26 21:41 UTC · model grok-4.3
The pith
Eliciting discriminative unary and pairwise statistics from an LLM and closing them under maximum entropy enables zero-shot active feature acquisition without labeled data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By restricting LLM queries to the sufficient statistics of a Markov random field and resolving the resulting gauge ambiguity via maximum-entropy closure, the framework produces acquisition policies that work in a zero-shot setting and outperform all existing methods on the most ambiguous cases in the IBD cohort.
What carries the argument
Maximum-entropy closure of LLM-elicited unary deviations and pairwise co-variations to construct an MRF for guiding sequential feature acquisition.
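For orientation, the standard pairwise MRF over binary features and its maximum-entropy characterization can be written as below. The symbols $x_i$, $\theta_i$, $\theta_{ij}$, $\mu_i$, $\mu_{ij}$ and the class superscripts are notation assumed here for illustration; the paper's own parameterization may differ.

```latex
p_\theta(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_i \theta_i x_i + \sum_{i<j} \theta_{ij} x_i x_j \Big),
\qquad
p_\theta = \arg\max_{p} H(p)
\quad \text{s.t.} \quad
\mathbb{E}_p[x_i] = \mu_i,\ \ \mathbb{E}_p[x_i x_j] = \mu_{ij}.

\log \frac{p(x \mid y=1)}{p(x \mid y=0)}
= \sum_i \big(\theta^{1}_i - \theta^{0}_i\big) x_i
+ \sum_{i<j} \big(\theta^{1}_{ij} - \theta^{0}_{ij}\big) x_i x_j
- \big(\log Z(\theta^{1}) - \log Z(\theta^{0})\big).
```

On this reading, discriminative statistics pin down only the coefficient differences in the second expression, while the per-class parameters $\theta^{0}$ and $\theta^{1}$ remain individually undetermined. That is one plausible interpretation of the gauge ambiguity the closure is said to resolve.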
If this is right
- The method applies to both binary classification and top-k identification tasks.
- The LLM returns reliable discriminative statistics rather than per-class marginals.
- The closed model outperforms the LLM on both real labels and the LLM's own extracted beliefs.
- On the hardest patients the top-k policy markedly outperforms existing methods.
Where Pith is reading between the lines
- Similar elicitation could apply to other domains with expensive labels such as medical imaging or legal document review.
- The approach may extend to settings where the LLM is queried for higher-order interactions if the closure can be generalized.
- Testing the framework on non-medical datasets would show whether the gains are specific to heterogeneous patient data or more general.
Load-bearing premise
The LLM returns only discriminative statistics and the maximum-entropy closure correctly resolves the gauge ambiguity without introducing bias that invalidates the acquisition policy.
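A small numerical sketch of why this premise matters, under assumptions that are not taken from the paper: two binary features, a brute-force pairwise MRF per class, and two hypothetical "gauges" that share the same class difference but differ by a coupling added identically to both classes. The function names and parameter values are illustrative only.

```python
import itertools
import math

def class_conditional(theta, coupling, n):
    """Return {x: p(x | y)} for a pairwise MRF over x in {0,1}^n by enumeration."""
    weights = {}
    for x in itertools.product((0, 1), repeat=n):
        energy = sum(theta[i] * x[i] for i in range(n))
        energy += sum(coupling[i][j] * x[i] * x[j]
                      for i in range(n) for j in range(i + 1, n))
        weights[x] = math.exp(energy)
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}

def posterior_y1(p0, p1, observed, prior=0.5):
    """P(y=1 | observed), where observed maps feature index -> value."""
    def marginal(p):
        return sum(v for x, v in p.items()
                   if all(x[i] == val for i, val in observed.items()))
    m0, m1 = marginal(p0), marginal(p1)
    return prior * m1 / (prior * m1 + (1 - prior) * m0)

def log_odds(p0, p1, x):
    """Log-likelihood ratio of a fully observed configuration x."""
    return math.log(p1[x] / p0[x])

n = 2
theta0, theta1 = [0.0, 0.0], [1.0, 0.0]      # same class difference in both gauges
no_coupling = [[0.0, 0.0], [0.0, 0.0]]
shared_coupling = [[0.0, 2.0], [0.0, 0.0]]   # added identically to both classes

gauge_a = (class_conditional(theta0, no_coupling, n),
           class_conditional(theta1, no_coupling, n))
gauge_b = (class_conditional(theta0, shared_coupling, n),
           class_conditional(theta1, shared_coupling, n))

# Fully observed contrast between x=(1,1) and x=(0,1): the same in both gauges.
contrast_a = log_odds(*gauge_a, (1, 1)) - log_odds(*gauge_a, (0, 1))
contrast_b = log_odds(*gauge_b, (1, 1)) - log_odds(*gauge_b, (0, 1))

# Observing only feature 1: uninformative in gauge A, informative in gauge B.
post_a = posterior_y1(*gauge_a, {1: 1})
post_b = posterior_y1(*gauge_b, {1: 1})
```

In this toy setup the two gauges agree on how a full observation shifts the log-odds, yet disagree on what a partial observation is worth. Since an acquisition policy reasons about partial observations, the choice of gauge can change the policy, which is why the closure carries so much weight.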
What would settle it
Observing that the acquisition policy derived from the closed MRF performs no better than random selection or supervised baselines on the hardest IBD cases would falsify the central claim.
Original abstract
Active feature acquisition (AFA) sequentially selects which features to observe to reach a classification or ranking decision. Its central limitation is reliance on large amounts of labeled data to fit probabilistic models guiding acquisition. Large language models (LLMs) supply unsupervised domain knowledge, but are poor sequential planners. Asking one to both know and decide conflates capabilities best kept separate. Here, we develop a framework for zero-shot AFA through disciplined elicitation: asking the LLM only for what it can be trusted to return, the unary deviations and pairwise co-variations that are the sufficient statistics of a Markov random field (MRF). We apply our framework to two settings: binary classification and top-$k$ identification. In practice, the LLM reliably returns only discriminative statistics, what distinguishes the classes rather than each class in isolation, which precludes classical AFA. We apply a maximum-entropy closure that resolves this gauge ambiguity. We evaluate on a cohort of Inflammatory Bowel Disease (IBD) patients, an active clinical setting where diagnostic ambiguity and patient heterogeneity obstruct stable treatment strategies. Our framework outperforms the LLM both on real labels and on its own extracted beliefs. Where it matters most, on the hardest patients, our top-$k$ acquisition policy markedly outperforms all existing methods.
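To make the sequential selection step concrete, here is a minimal sketch of greedy acquisition by expected information gain on a tiny class-conditional MRF. This is a common AFA criterion used here as a stand-in; the abstract does not state that it is the paper's policy. All names and parameter values are hypothetical, and brute-force enumeration only scales to a handful of features.

```python
import itertools
import math

def joint_tables(params, n):
    """Brute-force p(x | y) for each class y; params[y] = (theta, coupling)."""
    tables = {}
    for y, (theta, coupling) in params.items():
        w = {}
        for x in itertools.product((0, 1), repeat=n):
            e = sum(theta[i] * x[i] for i in range(n))
            e += sum(coupling[i][j] * x[i] * x[j]
                     for i in range(n) for j in range(i + 1, n))
            w[x] = math.exp(e)
        z = sum(w.values())
        tables[y] = {x: v / z for x, v in w.items()}
    return tables

def evidence(table, observed):
    """p(observed | y): marginal probability of the observed feature values."""
    return sum(p for x, p in table.items()
               if all(x[i] == v for i, v in observed.items()))

def posterior(tables, prior, observed):
    """Posterior over classes given the features observed so far."""
    joint = {y: prior[y] * evidence(tables[y], observed) for y in tables}
    total = sum(joint.values())
    return {y: v / total for y, v in joint.items()}

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def expected_information_gain(tables, prior, observed, i):
    """Mutual information between the class and feature i, given observations."""
    post = posterior(tables, prior, observed)
    gain = entropy(post)
    for v in (0, 1):
        extended = {**observed, i: v}
        # Predictive probability of x_i = v under the current class posterior.
        p_v = sum(post[y] * evidence(tables[y], extended) / evidence(tables[y], observed)
                  for y in tables)
        if p_v > 0:
            gain -= p_v * entropy(posterior(tables, prior, extended))
    return gain

def next_feature(tables, prior, observed, n):
    """Greedy choice: the unobserved feature with the largest expected gain."""
    candidates = [i for i in range(n) if i not in observed]
    return max(candidates,
               key=lambda i: expected_information_gain(tables, prior, observed, i))

# Toy model: feature 0 separates the classes strongly, feature 1 weakly, feature 2 not at all.
n = 3
zeros = [[0.0] * n for _ in range(n)]
params = {0: ([0.0, 0.0, 0.0], zeros), 1: ([2.0, 0.3, 0.0], zeros)}
prior = {0: 0.5, 1: 0.5}
tables = joint_tables(params, n)
first = next_feature(tables, prior, {}, n)
second = next_feature(tables, prior, {0: 1}, n)
```

With these toy parameters the policy asks for feature 0 first and feature 1 second, and never gains anything from feature 2, whose distribution is identical in both classes.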
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a zero-shot active feature acquisition framework that elicits unary deviations and pairwise co-variations from LLMs as the sufficient statistics of a Markov random field. Because the LLM provides only discriminative statistics, the framework applies a maximum-entropy closure to resolve the resulting gauge ambiguity. It is evaluated on an Inflammatory Bowel Disease patient dataset for binary classification and top-k identification tasks. The reported gains over existing methods and over the LLM itself are most notable on the hardest patients.
Significance. This approach has the potential to enable active feature acquisition in domains with limited labeled data by leveraging LLM domain knowledge in a structured manner without requiring the LLM to perform sequential planning. The principled handling of incomplete statistics via max-ent closure is a key technical contribution. The empirical evaluation on a real clinical dataset provides evidence of practical utility, particularly if the gains on hard cases are robust.
major comments (2)
- [Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.
- [Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.
minor comments (2)
- [Abstract] Abstract: the description of the two settings (binary classification and top-k identification) would benefit from naming the primary performance metrics used in each.
- [Throughout] Notation: ensure consistent use of symbols for unary deviations and pairwise co-variations across the elicitation and closure steps.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. We appreciate the recognition of the framework's potential and the technical contribution of the max-ent closure. We address each major comment below.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: the claim that the framework 'outperforms the LLM both on real labels and on its own extracted beliefs' requires clarification on the degree of alignment between the LLM-derived beliefs and the real labels; without this, the 'on its own extracted beliefs' comparison risks partial circularity even though the real-label results are independent.
  Authors: We thank the referee for this observation. The comparison against real labels is fully independent of the LLM beliefs. To eliminate any risk of perceived circularity in the 'on its own extracted beliefs' results, we will revise the evaluation section to include quantitative alignment metrics (e.g., agreement rates and correlation between LLM-derived statistics and real labels) and explicitly discuss how these metrics inform the interpretation of the LLM-belief comparison.
  revision: yes
- Referee: [Method] Method section on max-ent closure: the assertion that the closure resolves gauge ambiguity without introducing bias that affects the acquisition policy is load-bearing for the zero-shot claim; an explicit derivation or small-scale verification that the resulting policy remains unbiased relative to a fully-specified MRF would strengthen the central methodological contribution.
  Authors: We agree that an explicit verification would strengthen the methodological claim. The max-ent closure yields the unique maximum-entropy distribution consistent with the provided statistics and therefore introduces no additional bias beyond those constraints. In the revision we will add a small-scale synthetic verification (new appendix) demonstrating that, when the elicited statistics are consistent with a fully specified ground-truth MRF, the acquisition policy obtained after closure is identical to the policy obtained from the fully specified model.
  revision: yes
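The alignment metrics promised in the first response (agreement rates and correlation between LLM-derived statistics and real labels) could be computed along the following lines. This is a hedged sketch: the toy rows, labels, and elicited values are invented for illustration, and the choice of sign agreement and Pearson correlation on unary deviations is an assumption about what such metrics might look like.

```python
import math

def class_mean_difference(rows, labels, j):
    """Empirical mean of feature j in class 1 minus its mean in class 0."""
    ones = [r[j] for r, y in zip(rows, labels) if y == 1]
    zeros = [r[j] for r, y in zip(rows, labels) if y == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

def sign_agreement(elicited, empirical):
    """Fraction of features whose elicited and empirical deviations share a sign."""
    hits = sum(1 for a, b in zip(elicited, empirical) if a * b > 0)
    return hits / len(elicited)

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

# Toy data, made up for illustration: 3 binary features, 6 labeled rows.
rows = [(1, 0, 1), (1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 1, 0), (0, 0, 1)]
labels = [1, 1, 1, 0, 0, 0]
elicited = [0.8, -0.2, 0.1]  # hypothetical LLM-elicited unary deviations

empirical = [class_mean_difference(rows, labels, j) for j in range(3)]
agreement = sign_agreement(elicited, empirical)
correlation = pearson(elicited, empirical)
```

In this toy example two of the three elicited deviations have the right sign, so the agreement rate is 2/3 even though the correlation is high. Reporting both would help a reader judge how far the "own extracted beliefs" comparison depends on the LLM simply agreeing with itself.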
Circularity Check
No significant circularity in derivation chain
The central derivation elicits unary/pairwise statistics from the LLM, applies a maximum-entropy closure for gauge resolution, and evaluates the resulting acquisition policy on an external IBD patient cohort with real labels. This provides an independent benchmark. The secondary comparison to the LLM's own extracted beliefs does not reduce the policy derivation or main performance claims to a self-referential fit. No equations, self-citations, or fitted-input-as-prediction steps are visible that would collapse the claimed results by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM returns only discriminative (class-separating) statistics rather than full per-class marginals
- domain assumption: Maximum-entropy distribution is the appropriate closure for the elicited statistics
Elicitation pipeline (excerpts)
- Phenotype-directed unary elicitation. A clinical collaborator (physician) specifies the phenotype set $\{E_1, \dots, E_K\}$ of interest. For each phenotype $E$, we query three independent LLMs with unary prompts asking for (i) whether a candidate gene is associated with $E$, and (ii) the direction of effect under $E$ (up- or down-regulation). System prompt: unary elici...
- Consensus-and-evidence filtering. The unary responses from the three LLMs are filtered in two passes. First, a cross-LLM consensus check retains only gene–phenotype associations agreed on by the LLMs with consistent direction. Second, a literature-evidence check requires each surviving association to be supported by curated literature sources of sufficient rel...
- Pairwise enumeration on survivors. After filtering by the LLM's responses, we enumerated all gene pairs within each surviving set as the universe of pairwise queries. Each prompt was augmented with retrieval-grounded context: top-ranked articles from PubMed and bioRxiv, filtered for relevance to the gene, the phenotype, and the IBD context. The LLM was ins...
  Pairwise prompt fragment: "given that the expression of gene l is shifted in direction d, how does this modify the likelihood that gene j is also shifted in the same direction under phenotype E?"
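The consensus check and the pairwise enumeration described above can be sketched as follows. Several things here are assumptions rather than details from the paper: "agreed on by the LLMs" is read as unanimity, directions are encoded as the strings "up" and "down", the gene names are placeholders, and the literature-evidence pass is omitted entirely.

```python
from itertools import combinations

def consensus_filter(responses):
    """Keep genes that every LLM marks as associated, with one shared direction.

    responses: one dict per LLM mapping gene -> direction ("up" or "down");
    a gene missing from a dict means that LLM reported no association.
    """
    shared = set(responses[0])
    for r in responses[1:]:
        shared &= set(r)
    return {g: responses[0][g] for g in sorted(shared)
            if all(r[g] == responses[0][g] for r in responses)}

def pairwise_queries(survivors):
    """Enumerate every unordered gene pair within the surviving set."""
    return list(combinations(sorted(survivors), 2))

# Hypothetical responses from three LLMs for a single phenotype.
llm_a = {"GENE1": "up", "GENE2": "down", "GENE3": "up"}
llm_b = {"GENE1": "up", "GENE2": "up", "GENE3": "up"}
llm_c = {"GENE1": "up", "GENE3": "up", "GENE4": "down"}

survivors = consensus_filter([llm_a, llm_b, llm_c])
pairs = pairwise_queries(survivors)
```

Here GENE2 is dropped because one model omits it and the other two disagree on direction, and GENE4 is dropped because only one model reports it. The number of pairwise queries grows quadratically in the size of the surviving set, which may be one reason to filter before enumerating.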