Recognition: 2 theorem links
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning
Pith reviewed 2026-05-15 06:17 UTC · model grok-4.3
The pith
PIQL integrates train-only privileged information to speed convergence and improve generalization in tabular foundation models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PIQL is the first systematic method to embed privileged information into tabular foundation models. During training it supplies aggregate statistics and encodings of the data-generating program, and it trains the model to recover those signals from observable inputs at test time. Theory shows reduced approximation error and faster convergence under stated conditions, and experiments confirm quicker training, lower loss, and better generalization.
What carries the argument
The PIQL architecture that learns to reconstruct train-time-only privileged information from observed context at inference time.
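The paper's reconstruction component is a learned transformer head; as a hedged toy of the underlying idea (the generator-mean privileged signal and the sample-mean reconstructor are illustrative stand-ins, not the paper's method), the train-time/test-time asymmetry and the way fidelity improves with context length can be made concrete:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for PIQL's setup: the privileged information (PI) is the
# data-generating mean, visible at train time but hidden at test time.
true_pi = np.array([2.0, -1.0, 0.5])

def reconstruct_pi(context):
    # Stand-in for the learned reconstruction head: recover the
    # train-time-only statistic from observable rows alone.
    return context.mean(axis=0)

def reconstruction_mse(n_context):
    context = rng.normal(true_pi, 1.0, size=(n_context, 3))
    return float(np.mean((reconstruct_pi(context) - true_pi) ** 2))

mse_short = reconstruction_mse(20)    # little observable context
mse_long = reconstruction_mse(2000)   # ample observable context
# Fidelity improves as observed context grows -- the property the
# transfer argument leans on.
```

In PIQL itself the reconstructor is trained jointly with the task objective; here the sample mean simply makes the asymmetry between train-time access and test-time inference concrete.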
If this is right
- Tabular foundation models reach target performance with fewer training examples and less compute.
- Final loss decreases and out-of-sample accuracy rises under the same data budget.
- The pretraining stage can be guided by domain knowledge encoded as privileged signals rather than raw data volume alone.
- The same reconstruction mechanism can be applied to other modalities where auxiliary signals exist only during training.
Where Pith is reading between the lines
- The approach may allow smaller tabular models to match the performance of larger ones trained without privileged information.
- Combining PIQL with existing efficiency methods such as parameter-efficient fine-tuning could compound resource savings.
- If reconstruction quality can be monitored, the framework might adaptively decide how much privileged information to supply per batch.
Load-bearing premise
The model can reliably reconstruct the privileged information from regular inputs at inference without adding instability or hidden error.
What would settle it
Run the same training schedule with and without the reconstruction module active; if the version that cannot recover privileged signals shows no convergence speedup or generalization gain, the central claim does not hold.
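One hedged way to make that test concrete (a toy sketch, not the paper's protocol: a linear model stands in for the TFM, and the generator's feature mean stands in for the privileged signal) is to compare epochs-to-target-loss with the reconstructed signal used versus ignored:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
mu = np.full(d, 5.0)                       # PI: the generator's feature mean
X = rng.normal(mu, 1.0, size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def epochs_to_target(A, b, target=0.05, max_epochs=5000):
    # Full-batch gradient descent on mean squared error with a stable step.
    H = A.T @ A / len(A)
    lr = 1.0 / float(np.linalg.eigvalsh(H).max())
    w = np.zeros(A.shape[1])
    for t in range(1, max_epochs + 1):
        w -= lr * (A.T @ (A @ w - b) / len(A))
        if np.mean((A @ w - b) ** 2) < target:
            return t
    return max_epochs

# Without PI: raw features; the large mean inflates the condition number
# and slows gradient descent.
epochs_no_pi = epochs_to_target(X, y)

# With "reconstructed" PI: center by the estimated mean (sample mean as a
# stand-in for the learned reconstruction head) before training.
epochs_with_pi = epochs_to_target(X - X.mean(axis=0), y - y.mean())
```

If disabling the reconstruction (here, skipping the centering) produced no slowdown, the kind of convergence benefit the central claim predicts would be absent.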
read the original abstract
Training foundation models is computationally intensive and often slow to converge. We introduce PIQL (Privileged Information for Quick and Quality Learning), the first framework to systematically integrate privileged information (PI) to simultaneously accelerate learning and improve generalization in tabular foundation models (TFMs). We construct two complementary forms of PI: (i) aggregate dataset-level statistics that reduce the burden on in-context learning, and (ii) encodings of the underlying data-generating program, providing knowledge beyond observable data. We further design an architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference. We provide a theoretical analysis characterizing conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes. Empirical evidence shows that PIQL enables TFMs to achieve faster convergence, lower final loss, and better generalization, in effect reducing data and compute requirements. Our work establishes PI-guided pretraining as a principled and practical paradigm for improving the efficiency and performance of foundation models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PIQL, a framework for integrating privileged information (PI) into tabular foundation models (TFMs) to accelerate learning and improve generalization. Two forms of PI are constructed: aggregate dataset-level statistics and encodings of the data-generating program. An architecture is proposed to reconstruct this PI from observed context at inference time. Theoretical analysis characterizes conditions for reducing approximation gap and accelerating convergence, with empirical results showing faster convergence, lower loss, and better generalization, thereby reducing data and compute needs.
Significance. If the proposed reconstruction mechanism reliably infers the PI and the theoretical conditions translate to practice without instabilities, this work could establish a principled approach to improve efficiency of foundation models using LUPI. The theoretical analysis and empirical evidence are positive aspects, but the practical feasibility of PI transfer is key to the significance.
major comments (2)
- The architecture's ability to learn to reconstruct train-time-only PI (aggregate statistics and data-generating program encodings) from observed context at inference is central to the claims, yet no quantitative measures of reconstruction fidelity (e.g., MSE or mutual information) or ablation studies on reconstruction quality are referenced, leaving the link between theory and observed speed-ups unverified.
- The conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes need to be explicitly stated with any assumptions on the reconstruction error; if reconstruction is imperfect, the finite-data acceleration may not hold as claimed.
minor comments (2)
- The title appears to have a missing space: 'Models:LUPI' should be 'Models: LUPI'.
- The acronym PIQL is introduced but its expansion 'Privileged Information for Quick and Quality Learning' could be clarified earlier for readability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and have revised the manuscript accordingly to strengthen the presentation of the reconstruction mechanism and its theoretical grounding.
read point-by-point responses
-
Referee: The architecture's ability to learn to reconstruct train-time-only PI (aggregate statistics and data-generating program encodings) from observed context at inference is central to the claims, yet no quantitative measures of reconstruction fidelity (e.g., MSE or mutual information) or ablation studies on reconstruction quality are referenced, leaving the link between theory and observed speed-ups unverified.
Authors: We agree that explicit quantitative evaluation of reconstruction fidelity is necessary to connect the theoretical claims to the observed empirical speed-ups. In the revised manuscript we will report MSE for reconstruction of the aggregate dataset-level statistics and mutual information for the data-generating program encodings. We will also add ablation studies that systematically vary reconstruction quality (via controlled noise injection or reduced context length) and measure the resulting effects on convergence rate, final loss, and generalization error. revision: yes
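The fidelity reporting the authors commit to can be sketched as follows (a hedged toy on simulated data: the sample-mean-style reconstructor noise and the Gaussian mutual-information formula are illustrative assumptions, not the paper's estimators):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate many training tasks: each has a scalar privileged statistic,
# which the reconstructor recovers up to estimation noise.
true_pi = rng.normal(size=1000)
recon_pi = true_pi + rng.normal(scale=0.1, size=1000)  # imperfect recovery

# Fidelity metrics of the kind the referee asks for:
mse = float(np.mean((recon_pi - true_pi) ** 2))
rho = float(np.corrcoef(true_pi, recon_pi)[0, 1])
mi_nats = -0.5 * np.log(1.0 - rho ** 2)  # exact MI for jointly Gaussian pairs
```

Reporting both metrics matters: MSE tracks the additive error term in the theory, while mutual information indicates how much of the privileged signal survives reconstruction at all.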
-
Referee: The conditions under which PI reduces the population-level approximation gap and accelerates convergence in finite-data regimes need to be explicitly stated with any assumptions on the reconstruction error; if reconstruction is imperfect, the finite-data acceleration may not hold as claimed.
Authors: We appreciate this observation. While the theoretical section already derives bounds on the approximation gap under privileged information, the dependence on reconstruction error was stated only implicitly. In the revision we will explicitly list the assumptions: (i) the reconstruction error is bounded by a term that vanishes as the length of observed context grows, and (ii) this error term appears additively in the finite-sample convergence rate. We will then show that the acceleration result continues to hold whenever the reconstruction error is o(1/sqrt(n)) in the finite-data regime, thereby clarifying the conditions under which the acceleration survives imperfect reconstruction. revision: yes
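Schematically (our notation, not the paper's; the constants $C$ and $L$ and the additive decomposition are assumptions of this sketch), the promised revision amounts to a bound of the form:

```latex
% Hedged sketch of how reconstruction error enters the finite-data bound.
% \delta_n : reconstruction error of the PI at context length n.
\mathcal{R}\bigl(\hat{f}_n^{\mathrm{PI}}\bigr) - \mathcal{R}^{*}
  \;\le\;
  \underbrace{\frac{C}{\sqrt{n}}}_{\text{estimation}}
  \;+\;
  \underbrace{\varepsilon_{\mathrm{approx}}^{\mathrm{PI}}}_{\text{gap reduced by PI}}
  \;+\;
  \underbrace{L\,\delta_n}_{\text{imperfect reconstruction}},
\qquad
\delta_n = o\!\bigl(n^{-1/2}\bigr)
```

Under the stated condition the $L\,\delta_n$ term is asymptotically dominated by the estimation term, so the claimed acceleration rate is unchanged.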
Circularity Check
No circularity: new framework elements and theory are independent of inputs
full rationale
The paper introduces PIQL as a novel framework, defines two new forms of privileged information (aggregate statistics and data-generating program encodings), designs a reconstruction architecture, and supplies a theoretical characterization of approximation-gap reduction. No equations, self-citations, or fitted parameters are shown to reduce any claimed prediction or gain to the inputs by construction. The derivation chain remains self-contained; empirical gains are presented as external validation rather than tautological outputs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
We introduce PIQL... two complementary forms of PI: (i) aggregate dataset-level statistics... (ii) encodings of the underlying data-generating program... architecture that effectively transfers the train-time-only PI by learning to reconstruct it from observed context at inference.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
Theorem 5.1 (Model-Relative Risk Reduction via Meta-PI under Architectural Constraints)... Theorem 5.2 (Information-Theoretic Risk Reduction via Generator based PI)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.