
arxiv: 2606.20538 · v1 · pith:SWB43D2Wnew · submitted 2026-06-18 · 💻 cs.LG

Multi-Task Bayesian In-Context Learning

Pith reviewed 2026-06-26 17:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords: multi-task learning · Bayesian inference · in-context learning · transformer model · amortized inference · predictive distributions · distribution shift · uncertainty quantification

The pith

A transformer matches exact Bayesian predictors on new priors by treating prior information as an in-context prefix.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bayesian predictive inference gives principled uncertainty estimates but exact computation is usually intractable. Prior amortized methods are fast but fail when the prior at test time differs from the training distribution. This work trains a transformer on sequences of prior datasets followed by target tasks so that the model learns to read the prior as context and adjust its predictions accordingly. Evaluations show it reaches the accuracy of oracle Bayesian methods even for out-of-meta-distribution priors and high-dimensional cases, while running much faster. The same approach succeeds on a real-world temperature prediction task.

Core claim

Training a transformer on multi-task sequences that include prior information as an in-context prefix produces an amortized hierarchical Bayesian predictor capable of adapting to new prior families, including those outside the meta-training distribution, with accuracy matching oracle methods at far lower computational cost.
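One way to write down the quantity being amortized, as a sketch under assumed notation (the symbols $\phi$, $\theta$, $q_\psi$ are not taken from the paper): let $\phi$ denote the shared prior parameters, $\theta$ the target task's latent, $D_{\text{prior}}$ the prefix datasets, and $D_{\text{tgt}}$ the target context. Since $\theta$ is independent of $D_{\text{prior}}$ given $\phi$, the hierarchical posterior predictive distribution factorizes as below, and a network $q_\psi$ trained by expected negative log-likelihood would have this distribution as its minimizer.

```latex
% Assumed notation: \phi = prior parameters, \theta = target-task latent.
p(y^\ast \mid x^\ast, D_{\text{tgt}}, D_{\text{prior}})
  = \iint p(y^\ast \mid x^\ast, \theta)\,
          p(\theta \mid D_{\text{tgt}}, \phi)\,
          p(\phi \mid D_{\text{prior}}, D_{\text{tgt}})\,
    \mathrm{d}\theta\, \mathrm{d}\phi

% Standard amortized objective (an assumption about the training loss):
\min_{\psi}\; \mathbb{E}\!\left[ -\log q_\psi\!\left(y^\ast \mid x^\ast, D_{\text{tgt}}, D_{\text{prior}}\right) \right]
```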

What carries the argument

The multi-task in-context learning framework that represents prior information as a prefix of in-context datasets before the target task sequence.
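The prefix construction can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not the authors' code: the linear-regression task family, the Gaussian prior with a randomly drawn mean and scale standing in for the meta-distribution, and the hypothetical helpers `sample_task` and `build_sequence`. The sketch only shows the layout: several prior datasets drawn under one shared prior, followed by the target task, flattened into a single sequence of (x, y) tuples.

```python
import random


def sample_task(rng, prior_mean, prior_scale, n_points, noise_std=0.1):
    """Draw one linear-regression task with weights w ~ N(prior_mean, prior_scale^2 I)."""
    w = [rng.gauss(m, prior_scale) for m in prior_mean]
    task = []
    for _ in range(n_points):
        x = [rng.gauss(0.0, 1.0) for _ in prior_mean]
        y = sum(wi * xi for wi, xi in zip(w, x)) + rng.gauss(0.0, noise_std)
        task.append((x, y))
    return task


def build_sequence(rng, n_prior_tasks=3, n_prior_points=8, n_target_points=5, dim=2):
    """Return [(task_id, x, y), ...]: prior datasets first, target task last."""
    # Hypothetical meta-distribution: one prior is drawn per sequence and
    # shared by every task in that sequence.
    prior_mean = [rng.uniform(-2.0, 2.0) for _ in range(dim)]
    prior_scale = rng.uniform(0.1, 1.0)

    sequence = []
    # Prefix: datasets that carry information about the prior.
    for task_id in range(n_prior_tasks):
        for x, y in sample_task(rng, prior_mean, prior_scale, n_prior_points):
            sequence.append((task_id, x, y))
    # Target task: its own latent weights, drawn from the same prior.
    for x, y in sample_task(rng, prior_mean, prior_scale, n_target_points):
        sequence.append((n_prior_tasks, x, y))
    return sequence


rng = random.Random(0)
sequence = build_sequence(rng)
prefix = [tok for tok in sequence if tok[0] < 3]
target = [tok for tok in sequence if tok[0] == 3]
print(len(prefix), len(target))
```

In this layout a sequence model sees the target context only after the prefix, so any adaptation to the prior has to come from reading the earlier datasets.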

Load-bearing premise

A transformer can learn to adapt its predictive behavior to entirely new prior families simply by receiving prior datasets as an in-context prefix during both training and inference.

What would settle it

Running the method on a high-dimensional latent structure prior outside the training distribution and finding that its predictions deviate substantially from those of an exact Bayesian oracle.
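For intuition about what "deviating from an exact Bayesian oracle" means numerically, here is a small sketch in the simplest conjugate case. It assumes a one-dimensional linear model with a known Gaussian prior and known noise, which is far simpler than the high-dimensional setting named above; the function names and the KL direction (oracle first) are illustrative choices, not the paper's evaluation protocol.

```python
import math


def posterior_predictive(data, x_star, prior_mean, prior_std, noise_std):
    """Exact PPD (mean, variance) for y = w*x + noise, w ~ N(prior_mean, prior_std^2)."""
    precision = 1.0 / prior_std**2 + sum(x * x for x, _ in data) / noise_std**2
    post_mean = (
        prior_mean / prior_std**2 + sum(x * y for x, y in data) / noise_std**2
    ) / precision
    # Predictive variance = parameter uncertainty at x_star + observation noise.
    return post_mean * x_star, x_star**2 / precision + noise_std**2


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL( N(mean_p, var_p) || N(mean_q, var_q) ) for univariate Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)


data = [(1.0, 2.1), (-0.5, -0.9), (2.0, 3.8)]
# Oracle: uses the prior the data were (hypothetically) generated under.
oracle = posterior_predictive(data, 1.5, prior_mean=2.0, prior_std=0.5, noise_std=0.5)
# Comparison: same inference, but with a mis-specified prior mean.
misspecified = posterior_predictive(data, 1.5, prior_mean=0.0, prior_std=0.5, noise_std=0.5)

print(gaussian_kl(*oracle, *oracle))
print(gaussian_kl(*oracle, *misspecified))
```

A predictor that has adapted to the right prior would sit near zero on this measure, while one stuck with the wrong prior would not; with only three context points the mis-specified prior still shifts the predictive mean noticeably.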

Figures

Figures reproduced from arXiv: 2606.20538 by Eric Karl Oermann, Kyunghyun Cho, Qingyang Zhu.

Figure 1
Figure 1: We propose a framework for amortized hierarchical Bayesian predictive inference using in-context learning. Tasks are drawn from a meta-distribution over priors rather than a single fixed prior. A transformer adapts to different priors by conditioning on prior datasets within a single context. In practice, we would create a long sequence of (x, y) tuples, where each consecutive subsequence corresponds to a…
Figure 2
Figure 2: KL divergence between the PPDs produced by different inference methods and the oracle PPD for linear regression, evaluated across multiple target context lengths and test priors. remains fixed. Multi-task ICL (with prefix) closely matches MCMC-HIER and achieves negligible KL across all context lengths under IMD priors. This indicates the neural model learns a predictive mapping from (Dprior, Dtgt) to the…
Figure 3
Figure 3: KL divergence between the PPDs produced by different inference methods and the oracle PPD (approximated from converged MCMC) for logistic regression. The x-axis is the number of posterior samples for MCMC models and the number of optimization steps for SVI models. MCMC models are given 1000 warmup steps prior to collecting any posterior samples. constrained to work with the mis-specified prior. In addit…
Figure 4
Figure 4: Prior adaptability check for logistic regression with a fixed target context of length = 5 and different prior prefixes. systematic changes in the variance of the distribution of predicted logits as a function of the prior prefix. This demonstrates that the prefix exerts a coherent and controllable influence on the predictive distribution. Ruling out evidence pooling. A potential alternative explanation…
Figure 5
Figure 5: Heatmaps of KL divergence to the oracle PPD from (left) multi-task ICL, (middle) hierarchical MCMC, and (right) hierarchical SVI. Rows index the minimum log ν included in the meta-training mixture (increasing tail-heaviness downward), and columns sweep the test prior log ν, covering IMD and OoMD regimes. Lower (darker) values indicate better agreement with the oracle. warmup, its runtime remains dominated …
Figure 6
Figure 6: Predictive KL divergence versus wall-clock inference time for flow-based priors with high-dimensional latents. 5.5. Real-World Evaluation on ERA5 We next evaluate whether this prior-prefix mechanism still provides practical benefits in a real-world setting where data is noisy and latent structure is complex and unclear. We use the spatiotemporal temperature prediction task from ERA5 climate data (Store et …
Figure 7
Figure 7: t-SNE and UMAP (2D) visualizations of N (µ, I) samples after a randomly initialized Spiral Flow transformation. The Spiral Flow is rotationally symmetric, therefore a standard Gaussian distribution will stay standard Gaussian after transformation.
Figure 8
Figure 8: TV divergence between the PPDs produced by different inference methods and the oracle PPD (approximated from converged MCMC) for logistic regression. The x-axis is the number of posterior samples for MCMC models and the number of optimization steps for SVI models. MCMC models are given 1000 warmup steps prior to collecting any posterior samples. (a) Multi-task ICL. (b) MCMC-HIER (upperbound). (c) SVI-hier …
Figure 9
Figure 9: Heatmaps of TV divergence to the oracle PPD from (left) multi-task ICL, (middle) hierarchical MCMC, and (right) hierarchical SVI. Rows index the minimum log ν included in the meta-training mixture (increasing tail-heaviness downward), and columns sweep the test prior log ν, covering IMD and OoMD regimes. Lower (darker) values indicate better agreement with the oracle. G. Supplementary Analyses G.1. Permuta…
Figure 10
Figure 10: TV divergence versus wall-clock inference time for spiral flow priors with high-dimensional latents.
Figure 11
Figure 11: Ablation model of multi-task ICL that has the permutation invariance between prior datasets enforced. Splits. We use 16,000 training episodes, 2,000 validation episodes, and 2,000 test episodes. In the 2019 IID split, train, validation, and test episodes are sampled from the full year of 2019. In the 2019 OOD split, training episodes are sampled from the first six months of 2019 excluding the final 14 day…
Original abstract

Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at https://github.com/martianmartina/multi-task-bayesian-icl/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference. Prior information is explicitly represented as a prefix of in-context datasets, and a transformer is trained on sequences of prior and target tasks to adapt predictions across families of priors. On evaluations with increasing difficulty (including out-of-meta-distribution priors and high-dimensional latent structures), the method is claimed to match oracle Bayesian predictors while being orders of magnitude faster; results are also shown on a real-world spatiotemporal temperature prediction benchmark. Code is provided.

Significance. If the empirical claims hold, the work offers a scalable amortized alternative to exact or approximate Bayesian inference that can adapt to new priors at test time, addressing a key limitation of prior-data fitted models. Matching oracle performance on OOD and high-dimensional cases, combined with substantial speedups and a real-world demonstration, would be a notable contribution to uncertainty quantification in machine learning. Open-sourcing the code supports reproducibility.

major comments (1)
  1. Abstract: the central claim that the method 'matches oracle Bayesian predictors' on out-of-meta-distribution priors and high-dimensional latent structures is load-bearing for the contribution, yet the abstract provides no experimental details, error bars, ablation results, or quantitative metrics to allow verification that the data support this assertion.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. We address the single major comment below and agree that the abstract can be strengthened for greater transparency.

Point-by-point responses
  1. Referee: Abstract: the central claim that the method 'matches oracle Bayesian predictors' on out-of-meta-distribution priors and high-dimensional latent structures is load-bearing for the contribution, yet the abstract provides no experimental details, error bars, ablation results, or quantitative metrics to allow verification that the data support this assertion.

    Authors: We agree that the abstract's central claim would benefit from greater specificity to allow readers to assess its support immediately. The full manuscript contains the requested experimental details, including error bars, ablation studies, and quantitative metrics across the OOD and high-dimensional cases. In the revised version we will update the abstract to incorporate brief quantitative indicators (e.g., relative log-likelihood gaps and runtime ratios) drawn directly from the reported results, while remaining within length constraints. This change will be made. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical training and evaluation are self-contained

full rationale

The paper presents an empirical multi-task in-context learning approach where a transformer is trained on sequences of prior and target tasks to adapt predictions across prior families. No equations, derivations, or parameter-fitting steps are described that reduce any claimed prediction or result to the inputs by construction. Central claims rest on evaluations against oracle Bayesian predictors on OOD and high-dimensional cases, with no load-bearing self-citations or ansatzes that collapse the method to its own fitted quantities. This is a standard non-circular empirical setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility into modeling choices; the core assumption is that the transformer can internalize hierarchical Bayesian adaptation from multi-task sequences.

axioms (1)
  • Domain assumption: a transformer can learn to map sequences of prior datasets plus target data to adapted predictive distributions across families of priors.
    This is the central modeling premise required for the method to generalize beyond the training meta-distribution.


