What Language Models Know But Don't Say: Non-Generative Prior Extraction for Generalization

Mohammad M. Ghassemi; Sara Rezaeimanesh

arxiv: 2601.17609 · v2 · submitted 2026-01-24 · 💻 cs.CL

Pith reviewed 2026-05-16 10:54 UTC · model grok-4.3

keywords LoID · logit-informed distributions · prior extraction · Bayesian logistic regression · out-of-distribution generalization · language model knowledge · tabular data

The pith

LoID extracts informative priors from LLMs by probing token logits on opposing sentences to boost Bayesian logistic regression on OOD data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LoID, a deterministic method for extracting prior distributions from large language models for use in Bayesian logistic regression. It accesses the model's token-level predictions on carefully constructed positive and negative sentences about each feature to determine the strength and reliability of its beliefs. This avoids reliance on generated text and provides informative priors when labeled data is scarce or biased. The method is tested on ten real-world tabular datasets under out-of-distribution conditions and shows substantial gains over baselines.
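To make the role of the extracted priors concrete, here is a minimal sketch of the unnormalized log posterior for logistic regression with an independent Normal prior on each coefficient. The names `softplus` and `log_posterior`, the argument layout, and the toy inputs are illustrative assumptions, not the authors' code; the summary does not show how the paper implements inference.

```python
import math


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_posterior(beta, X, y, prior_mean, prior_std):
    """Unnormalized log posterior: Bernoulli-logit likelihood plus Normal priors.

    beta, prior_mean, prior_std: one entry per feature (hypothetical layout).
    X: list of feature rows; y: list of 0/1 labels.
    """
    log_lik = 0.0
    for row, label in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log sigmoid(z) = -softplus(-z); log(1 - sigmoid(z)) = -softplus(z)
        log_lik -= softplus(-z) if label == 1 else softplus(z)
    log_prior = sum(
        -0.5 * ((b - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for b, m, s in zip(beta, prior_mean, prior_std)
    )
    return log_lik + log_prior


# Toy data (hypothetical): one feature, two positive examples.
X = [[1.0], [2.0]]
y = [1, 1]
# A prior centred on a positive coefficient favours beta = 1 over beta = -1.
print(log_posterior([1.0], X, y, [1.0], [0.5]) > log_posterior([-1.0], X, y, [1.0], [0.5]))
```

With little data the prior term dominates, which is why a well-directed prior can help and a mis-specified one can hurt.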

Core claim

LoID extracts informative prior distributions by measuring how consistently an LLM favors one semantic direction over another in token predictions across varied phrasings, enabling Bayesian logistic regression to recover up to 59% of the performance gap to an oracle model trained on full data while outperforming AutoElicit and LLMProcesses on 8 out of 10 datasets.
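One common way to read "recovering a share of the performance gap to an oracle" is the ratio sketched here. This is an assumed definition for illustration only: the summary does not state the exact formula the authors use, the function name `gap_recovered` is hypothetical, and the AUC values are made up.

```python
def gap_recovered(auc_baseline: float, auc_method: float, auc_oracle: float) -> float:
    """Fraction of the baseline-to-oracle AUC gap closed by a method."""
    gap = auc_oracle - auc_baseline
    if gap <= 0:
        raise ValueError("oracle AUC must exceed baseline AUC")
    return (auc_method - auc_baseline) / gap


# Hypothetical AUC values chosen only to illustrate the arithmetic.
print(round(gap_recovered(0.60, 0.70, 0.80), 3))
```

Under this reading, a value of 0 means no improvement over the OOD-trained baseline and 1 means matching the oracle.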

What carries the argument

LoID (Logit-Informed Distributions), which derives prior parameters from the consistency of LLM logit preferences for positive versus negative feature impact statements.

If this is right

Provides a reproducible way to integrate LLM knowledge into Bayesian inference without text generation.
Outperforms methods that rely on LLM-generated text or in-context numerical predictions on most datasets.
Recovers significant performance under covariate shift in small training sets.
Applicable to domains like medicine and finance where data is costly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

LoID's non-generative approach might generalize to extracting beliefs for other probabilistic models.
It could be combined with other elicitation techniques for more robust priors.
Testing on datasets where LLM knowledge is known to be inaccurate would reveal limits.

Load-bearing premise

The assumption that an LLM's token-level logit preferences on hand-crafted positive/negative sentences reliably encode accurate, domain-appropriate prior beliefs about each feature's causal direction and strength.

What would settle it

A counterexample would be a dataset where Bayesian logistic regression using LoID priors achieves lower AUC than the same model with uninformative priors in the synthetic OOD setting.

Figures

Figures reproduced from arXiv: 2601.17609 by Mohammad M. Ghassemi, Sara Rezaeimanesh.

**Figure 1.** LoID Pipeline: To extract informative priors, LoID uses multiple sentence templates for each feature and ...

**Figure 2.** Hyperparameter ablation results demonstrating that prior uncertainty (beta) is the critical factor for ...

Original abstract

In domains like medicine and finance, large-scale labeled data is costly and often unavailable, leading to models trained on small datasets that struggle to generalize to real-world populations. Large language models contain extensive knowledge from years of research across these domains. We propose LoID (Logit-Informed Distributions), a deterministic method for extracting informative prior distributions for Bayesian logistic regression by directly accessing their token-level predictions. Rather than relying on generated text, we probe the model's confidence in opposing semantic directions (positive vs. negative impact) through carefully constructed sentences. By measuring how consistently the LLM favors one direction across diverse phrasings, we extract the strength and reliability of the model's belief about each feature's influence. We evaluate LoID on ten real-world tabular datasets under synthetic out-of-distribution (OOD) settings characterized by covariate shift, where the training data represents only a subset of the population. We compare our approach against (1) standard uninformative priors, (2) AutoElicit, a recent method that prompts LLMs to generate priors via text completions, (3) LLMProcesses, a method that uses LLMs to generate numerical predictions through in-context learning and (4) an oracle-style upper bound derived from fitting logistic regression on the full dataset. We assess performance using Area Under the Curve (AUC). Across datasets, LoID significantly improves performance over logistic regression trained on OOD data, recovering up to 59% of the performance gap relative to the oracle model. LoID outperforms AutoElicit and LLMProcessesc on 8 out of 10 datasets, while providing a reproducible and computationally efficient mechanism for integrating LLM knowledge into Bayesian inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LoID's non-generative logit probing for priors is a distinct technical move, but the abstract gives too little on probe construction and validation to judge whether the priors are reliable or just picking up text correlations.

The main thing to know is that LoID extracts priors for Bayesian logistic regression by measuring how consistently an LLM's token logits favor positive versus negative statements about each feature, instead of generating text. This deterministic step avoids some of the variability in the generative baselines it cites and looks like a practical way to turn LLM knowledge into usable distributions for small-data settings. The reported results on ten tabular datasets under covariate shift show gains over uninformative priors and the other LLM methods, closing up to 59% of the gap to the oracle in some cases and beating the baselines on eight datasets. The efficiency and reproducibility claims are straightforward advantages for applied work in medicine or finance.

The soft spots are more substantial. The abstract describes carefully constructed sentences but supplies no examples, no sensitivity checks on phrasing, and no statistical tests on the AUC improvements. There is also no evidence presented that the extracted signs or magnitudes match known causal effects in any domain rather than observational correlations from the LLM's training data. If the priors are mis-specified under shift, the method could degrade performance instead of helping.

This work is for researchers who want to inject external knowledge into Bayesian models on tabular data without heavy prompting or generation overhead. A reader focused on prior elicitation or hybrid LLM-statistical methods would find the core idea worth examining even if the current evidence is preliminary. It deserves a serious referee because the problem is real and the non-generative extraction is different enough from prior work to merit full review and requests for probe details plus domain sanity checks.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LoID (Logit-Informed Distributions), a deterministic method to extract prior distributions for Bayesian logistic regression from LLMs by measuring consistency of token-level logit differences on hand-crafted positive versus negative sentences asserting each feature's impact. It evaluates the approach on ten real-world tabular datasets under synthetic covariate-shift OOD regimes, claiming AUC gains that recover up to 59% of the gap to an oracle logistic regression fit on the full data and outperform AutoElicit and LLMProcesses on 8/10 datasets.

Significance. If the logit-derived priors prove to be directionally accurate and domain-appropriate rather than reflections of training-data correlations, the method would supply a reproducible, non-generative route for injecting LLM knowledge into Bayesian models, improving generalization in low-data regimes such as medicine and finance while remaining computationally lighter than generative elicitation baselines.

major comments (3)

§3 (Method): the central extraction procedure relies on 'carefully constructed sentences' whose templates, number of phrasings, and exact positive/negative wording are not supplied; without these, it is impossible to assess whether the reported logit consistency encodes causal feature effects or merely linguistic or correlational artifacts.
§4 (Experiments): the abstract and results claim 'significant' AUC improvements and up to 59% gap recovery, yet no statistical significance tests, standard errors, or ablation on probe phrasing are reported; this leaves the performance claims unverifiable and the comparison to AutoElicit/LLMProcesses on 8/10 datasets difficult to interpret.
§4.2 (Results): no check is performed that the sign or magnitude of the extracted priors matches known causal directions in any of the ten domains; under covariate shift this omission is load-bearing, because LLM logits may encode spurious associations that would mis-specify the Bayesian posterior rather than improve it.

minor comments (2)

Abstract: 'LLMProcessesc' is a typographical error and should read 'LLMProcesses'.
Abstract: the description of the synthetic OOD covariate-shift construction is absent; a brief statement of how the training subset is selected would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment below and describe the planned revisions.

Referee: §3 (Method): the central extraction procedure relies on 'carefully constructed sentences' whose templates, number of phrasings, and exact positive/negative wording are not supplied; without these, it is impossible to assess whether the reported logit consistency encodes causal feature effects or merely linguistic or correlational artifacts.

Authors: We agree that the exact templates and phrasings are necessary for full reproducibility and to evaluate potential artifacts. In the revised manuscript we will add a dedicated appendix containing all sentence templates, the number of phrasings per feature (typically 5–10), and representative positive/negative examples for each of the ten datasets. This will allow readers to inspect the construction and judge whether the logit differences reflect substantive feature effects. revision: yes
Referee: §4 (Experiments): the abstract and results claim 'significant' AUC improvements and up to 59% gap recovery, yet no statistical significance tests, standard errors, or ablation on probe phrasing are reported; this leaves the performance claims unverifiable and the comparison to AutoElicit/LLMProcesses on 8/10 datasets difficult to interpret.

Authors: We accept this criticism. The revised experiments section will report bootstrap standard errors for all AUC values and will include paired statistical tests (t-tests or Wilcoxon signed-rank) to assess the significance of improvements over baselines. We will also add an ablation study that varies the number of probe phrasings and wording diversity, demonstrating that the reported gains remain stable under these changes. revision: yes
Referee: §4.2 (Results): no check is performed that the sign or magnitude of the extracted priors matches known causal directions in any of the ten domains; under covariate shift this omission is load-bearing, because LLM logits may encode spurious associations that would mis-specify the Bayesian posterior rather than improve it.

Authors: We acknowledge the importance of this validation. Because obtaining authoritative causal ground truth for every feature across ten heterogeneous domains would require substantial external expertise not available to the authors, we cannot perform a comprehensive check. In the revision we will add a limitations subsection that (a) discusses the risk of spurious correlations, (b) reports sign agreement for a small number of features where public domain knowledge exists, and (c) notes that the observed OOD performance gains provide indirect support for the priors’ utility while leaving direct causal verification to future domain-specific studies. revision: partial
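The bootstrap standard errors mentioned in the response on §4 could be computed along the following lines. This is a hedged sketch, not the authors' evaluation code: the function names `auc` and `bootstrap_auc_se`, the resampling scheme (resampling test examples with replacement), and the toy labels and scores are all assumptions.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the chance a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_se(labels, scores, n_boot=1000, seed=0):
    """Standard deviation of AUC over bootstrap resamples of the test set."""
    if len(set(labels)) < 2:
        raise ValueError("AUC needs at least one positive and one negative example")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # this resample lost one class, so AUC is undefined; redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    return statistics.stdev(stats)


# Hypothetical labels and model scores, for illustration only.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores), bootstrap_auc_se(labels, scores, n_boot=200, seed=1))
```

A paired test across datasets (for example Wilcoxon signed-rank, as the authors propose) would then be applied to the per-dataset AUC differences rather than to a single test set.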

Circularity Check

0 steps flagged

No significant circularity; priors extracted externally from LLM logits

The LoID method constructs priors for Bayesian logistic regression by measuring token-level logit differences on hand-crafted positive/negative feature sentences. This extraction step operates on an external LLM and is independent of the target dataset's training data or labels. Performance claims (AUC gains, gap recovery) are evaluated empirically against uninformative priors, AutoElicit, LLMProcesses, and an oracle on held-out OOD data under covariate shift. No equations or steps reduce the central result to a fit on the same data, a self-citation chain, or a renaming of known patterns. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The method rests on the domain assumption that LLM token logits on constructed sentences encode useful prior knowledge; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption: LLM token-level predictions on opposing semantic sentences encode reliable prior beliefs about feature influence.
This is the central premise that allows logit scores to be turned into prior distributions.

pith-pipeline@v0.9.0 · 5608 in / 1224 out tokens · 34529 ms · 2026-05-16T10:54:26.476044+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We define the logit preference score for a single prompt p_j as: logit(p_j) = log(p_j / (1 - p_j)) ... μ_j = average across paraphrases ... β_j ~ N(μ_j, σ_j²)
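The quoted passage can be illustrated with a short sketch that turns per-paraphrase probabilities into a Normal prior for one coefficient. Several pieces here are assumptions rather than the paper's method: the passage elides how σ_j is defined, so the sample standard deviation is used as a stand-in; the floor `min_std` and the function names are hypothetical; and the input probabilities are made up.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def prior_from_probs(probs, min_std=0.1):
    """Return (mu, sigma) for a Normal prior from per-paraphrase probabilities.

    probs: probability assigned to the 'positive' direction for each paraphrase.
    sigma: sample standard deviation of the logit scores, floored at min_std
           (an assumed choice; the quoted passage does not define sigma).
    """
    scores = [logit(p) for p in probs]
    mu = statistics.fmean(scores)
    sigma = max(statistics.stdev(scores), min_std) if len(scores) > 1 else min_std
    return mu, sigma


# Hypothetical probabilities for three paraphrases about a single feature.
mu, sigma = prior_from_probs([0.8, 0.9, 0.7])
print(mu > 0, sigma > 0)
```

Consistent preferences across paraphrases give a mean far from zero with a small spread, while an LLM that is indifferent between directions yields a mean near zero.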

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Rethinking tabular data understanding with large language models

Suboptimal capability of individual machine learning algorithms in modeling small-scale imbalanced clinical data of local hospital. PLOS ONE, 19(2):e0298328. Tianyang Liu, Fei Wang, and Muhao Chen. 2023. Rethinking tabular data understanding with large language models. Preprint, arXiv:2312.16702. Hariharan Manikandan, Yiding Jiang, and J Zico Kolter

[2]

Language models are weak learners. Preprint, arXiv:2306.14101. S. Moro, P. Rita, and P. Cortez. 2014. Bank Marketing. UCI Machine Learning Repository. DOI: https://doi.org/10.24432/C5K306. Anthony O’Hagan, Christopher E. Buck, Alireza Daneshkhah, J. Richard Eiser, Paul H. Garthwaite, David J. Jenkinson, Jeremy E. Oakley, and Tim Rakow. 2006. Uncertain Jud...

[3]

Harnessing large language models as post-hoc correctors. Preprint, arXiv:2402.13414. A Appendix A.1 Reproducibility Code Availability: To ensure reproducibility, we will release our full codebase, including detailed instructions for running the experiments and reproducing all reported results. We also provide the posterior files generated by our experi...

[4]

The relationship between and is

[5]

When considering , the effect on is

[6]

positive

The correlation between and is . . . As shown in Figure 2, performance gains plateau after approximately 10 sentences. Around this point, performance also becomes more stable, while incorporating additional sentences incurs increased computational cost. Accordingly, we use 10 sentences in all experiments. A.2.2 Model Selection We compared the performanc...

[7]

Bank Marketing (Moro et al., 2014): Predicts whether a client will subscribe to a term deposit based on direct marketing campaign data from a Portuguese banking institution. Features include client demographics (age, job, marital status, education), campaign information (contact type, number of contacts, previous campaign outcome), and economic indicat...

[8]

Blood Transfusion (Yeh, 2008b): Predicts whether a blood donor will donate within a given time window using data from the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. Features are based on the RFMTC

| Dataset | w/o Desc | w/ Desc | ∆ |
| --- | --- | --- | --- |
| Bank | 0.692 | 0.673 | -0.019 |
| Blood | 0.652 | 0.650 | -0.002 |
| Credit Score | 0.756 | 0.756 | +0.000 |
| Diabetes | 0.836 | 0.836 | +0.000 |
| Heart | 0.88... | | |

[9]

Features include checking account status, credit duration, credit history, credit amount, savings account balance, employment status, and age

Give Me Some Credit (Kaggle, 2011): Predicts the probability of financial distress in the next two years using credit and demographic data from Kaggle. Features include checking account status, credit duration, credit history, credit amount, savings account balance, employment status, and age. The binary classification task evaluates financial risk f...

[10]

Diabetes (Islam et al., 2020): Predicts the likelihood of diabetes based on self-reported symptoms and demographic information. Features include age, gender, polyuria, polydipsia, sudden weight loss, weakness, polyphagia, genital thrush, visual blurring, itching, irritability, delayed healing, partial paresis, muscle stiffness, alopecia, and obesity....

[11]

The binary classification task identifies individuals with high likelihood of heart disease

Heart Disease (Janosi et al., 1988): Predicts heart disease using clinical and demographic attributes such as age, sex, resting blood pressure, cholesterol level, fasting blood sugar, maximum heart rate, exercise-induced angina, ST depression (oldpeak), chest pain type, resting ECG results, and ST slope. The binary classification task identifies indivi...

[12]

Census data

Adult Income (Becker and Kohavi, 1996a): Predicts whether an individual’s annual income exceeds $50K based on 1994 U.S. Census data. Features include age, education level, marital status, occupation, relationship status, workclass, sex, native country, capital gain/loss, and hours worked per week. The binary classification task uses demographic and e...

[13]

King-Pawn chess endgame positions

Chess Endgame (Shapiro, 1983): Predicts game outcomes in King-Rook vs. King-Pawn chess endgame positions. Features encode piece positions using strength rankings, file coordinates, and rank coordinates for white and black pieces. The binary classification task determines winning positions in this chess endgame with 44,820 board configurations

[14]

Indian Liver Patient (Ramana and Venkateswarlu, 2012): Predicts whether a patient has liver disease using clinical measurements collected from northeast Andhra Pradesh, India. Features include age, gender, total bilirubin, direct bilirubin, alkaline phosphatase, alanine aminotransferase (SGPT), aspartate aminotransferase (SGOT), total proteins, albumin...

[15]

Features include temperature (Celsius), relative humidity (%), light (Lux), CO2 concentration (ppm), humidity ratio (kg water-vapor/kg-air), and timestamp information

Occupancy Detection (Candanedo, 2016b): Predicts room occupancy status from environmental sensor measurements in an office setting. Features include temperature (Celsius), relative humidity (%), light (Lux), CO2 concentration (ppm), humidity ratio (kg water-vapor/kg-air), and timestamp information. The binary classification task detects occupancy u...

[16]

Pima Indians Diabetes (Smith et al., 1988): Predicts the onset of diabetes within five years in females of Pima Indian heritage over age

[17]

The binary classification task uses medical measurements from 768 women from a population near Phoenix, Arizona with high diabetes incidence rates

Features include number of pregnancies, glucose level, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function, and age. The binary classification task uses medical measurements from 768 women from a population near Phoenix, Arizona with high diabetes incidence rates. A.4.2 OOD Splits We construct out-of-distribution (OOD) train spl...

[18]

maximum uninformativeness

moderate_20_80, training on the bottom 20%; 4. tail_0_50, training on the 0–25% range; 5. tail_50_100, training on the 50–75% range. For each split, we designate a different feature as the shift variable (cycling through available features), compute the corresponding quantile thresholds, and construct train–test masks while enforcing a minimum of 50 s...

[19]

This absence of gradient-based regularization toward zero can lead to poor MCMC sampling behavior and overfitting

Lack of soft regularization: Uniform distributions do not encode the reasonable prior belief that smaller coefficient magnitudes are more plausible than larger ones. This absence of gradient-based regularization toward zero can lead to poor MCMC sampling behavior and overfitting

[20]

Incompatibility with MCMC: The flat density of Uniform priors provides no gradient information to guide sampling algorithms, resulting in inefficient exploration and convergence issues, particularly visible in the Occupancy dataset collapse

[21]

Uniform priors violate this structure, treating β = 0.1 and β = 4.9 as equally plausible a priori

Prior-likelihood mismatch: For logistic regression, coefficients naturally follow a scale where values near zero are common and extreme values are rare. Uniform priors violate this structure, treating β = 0.1 and β = 4.9 as equally plausible a priori. A.7 Computational Resources

[22]

7985WX CPU with 64 cores (128 threads) at up to 5.37 GHz, and 256 GB of RAM

Infrastructure: Experiments were run on a workstation equipped with 2 NVIDIA RTX 6000 Ada Generation GPUs (each with 50GB VRAM), a Threadripper PRO

| Dataset | N(0,1) | U(−1,1) |
| --- | --- | --- |
| Bank | 0.61 | 0.35 |
| Blood | 0.63 | 0.59 |
| Credit Score | 0.74 | 0.58 |
| Diabetes | 0.82 | 0.32 |
| Heart | 0.91 | 0.84 |
| Income | 0.81 | 0.30 |
| Liver | 0.68 | 0.70 |
| Occupancy | 0.72 | 0.02 |
| Jungle | 0.72 | 0.58 |
| Pima | 0.82 | 0.32 |
| Average | 0.75 | 0.46 |

...

[23]

All inference was performed using both GPUs in parallel

Model Inference: We used Gemma-2-27B as our main LLM. All inference was performed using both GPUs in parallel. Each feature prompt required about 50–70 seconds to complete across 10 paraphrases, leading to approximately 60 seconds * #features per dataset

[24]

Preprocessing: Minimal preprocessing was required and was handled efficiently on a single CPU

[25]

Each call took 2–3 seconds and used one CPU thread

Additional Resources: Exploratory analyses, including feature selection, were performed using GPT-4o via OpenRouter. Each call took 2–3 seconds and used one CPU thread
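The quantile-based OOD split construction quoted in entry [18] can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the helper names `quantile` and `ood_masks` are hypothetical, the test set is assumed to be the complement of the training range, and the default `min_train=50` only echoes the "minimum of 50" in the quoted text, whose unit is cut off in the extraction.

```python
def quantile(sorted_vals, q):
    """Linearly interpolated quantile of an already-sorted list, with q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def ood_masks(values, lo_q, hi_q, min_train=50):
    """Train on rows whose shift-variable value lies in [lo_q, hi_q] quantiles.

    Returns (train_mask, test_mask) as lists of booleans; the test mask is the
    complement of the train mask (an assumption made for this sketch).
    """
    s = sorted(values)
    lo_t, hi_t = quantile(s, lo_q), quantile(s, hi_q)
    train = [lo_t <= v <= hi_t for v in values]
    if sum(train) < min_train:
        raise ValueError("too few training rows for this split")
    if sum(train) == len(values):
        raise ValueError("split leaves no rows for testing")
    test = [not t for t in train]
    return train, test


# Hypothetical shift variable with 1000 distinct values; train on the bottom 20%.
values = list(range(1000))
train, test = ood_masks(values, 0.0, 0.2)
print(sum(train), sum(test))
```

Cycling the shift variable through the available features, as the quoted text describes, would amount to calling such a function once per feature column.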

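The argument in entries [19] to [21] about Uniform versus Normal priors can be checked numerically with a small sketch. The Uniform bounds of (−5, 5) are a hypothetical choice made so that both β = 0.1 and β = 4.9 fall inside the support; the function names are illustrative and nothing here reproduces the paper's experiments.

```python
import math


def normal_logpdf(x, mean=0.0, std=1.0):
    """Log density of a Normal(mean, std^2) distribution at x."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2 * math.pi)


def uniform_logpdf(x, low, high):
    """Log density of a Uniform(low, high) distribution at x."""
    return -math.log(high - low) if low <= x <= high else -math.inf


# Under a flat prior the two coefficient values are equally plausible.
flat_gap = uniform_logpdf(0.1, -5.0, 5.0) - uniform_logpdf(4.9, -5.0, 5.0)
# Under N(0, 1) the small coefficient is strongly preferred.
normal_gap = normal_logpdf(0.1) - normal_logpdf(4.9)
print(flat_gap, round(normal_gap, 6))
```

The Normal log density also has a nonzero slope in β away from its mean, which is the "soft regularization" and gradient signal the quoted text says a flat prior lacks.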