SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

Abdallah Aboelela; Chonglin Sun; Dong Liang; Ellie Wen; Feifan Gu; Fenggang Wu; Hang Qu; Huayu Li; Jill Pan; Jingxian Huang

arxiv: 2604.12110 · v2 · pith:NBGSRIW6new · submitted 2026-04-13 · 💻 cs.LG

Zikun Liu, Liang Luo, Qianru Li, Zhengyu Zhang, Wei Ling, Jingyi Shen, Zeliang Chen, Yaning Huang, Jingxian Huang, Abdallah Aboelela, Chonglin Sun, Feifan Gu, Fenggang Wu, Hang Qu, Huayu Li, Jill Pan, Kaidi Pei, Laming Chen, Longhao Jin, Qin Huang, Tongyi Tang, Varna Puvvada, Wenlin Chen, Xiaohan Wei, Xu Cao, Yantao Yao, Yuan Jin, Yunchen Pu, Yuxin Chen, Zijian Shen, Zhengkai Zhang, Jing Zhu, Dong Liang, Ellie Wen

Pith reviewed 2026-05-10 15:01 UTC · model grok-4.3

classification 💻 cs.LG

keywords inference scaling · speculative decoding · recommendation systems · embedding precomputation · foundation models · asynchronous serving · latent representations · online advertising

The pith

Predicting future user-item pairs allows precomputing their embeddings to use complex foundation models in real-time serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to handle the high cost of running large recommendation foundation models by moving their inference work ahead of actual requests. It predicts which user-item interactions are likely to occur soon and generates the required latent representations asynchronously. This keeps the expensive computation off the latency-sensitive serving path while still providing high-quality outputs when needed. In a large-scale deployment serving billions of requests daily, the approach produced a 0.67 percent gain in revenue-driving metrics.

Core claim

The central claim is that speculative precomputation of latent representations for forecasted user-item pairs decouples foundation-model inference from the critical serving path. Instead of relying on distillation to smaller models, the method forecasts likely requests, runs the full model on those pairs in the background, and stores the resulting embeddings for instant retrieval during live traffic.

What carries the argument

The request-prediction module that selects which user-item pairs to precompute, combined with asynchronous foundation-model inference to generate and cache their embeddings ahead of time.
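The predict, precompute, and look-up flow described above can be sketched as a toy cache. This is an illustration under stated assumptions, not the authors' implementation: the class name `SpeculativeEmbeddingCache`, the stand-in `expensive_model`, and the `cheap_fallback` used on a cache miss are all hypothetical, and the source does not say what the production system does when a requested pair was not precomputed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]      # (user_id, item_id)
Embedding = List[float]


class SpeculativeEmbeddingCache:
    """Toy illustration: precompute embeddings for predicted pairs, serve from a cache."""

    def __init__(self,
                 foundation_model: Callable[[Pair], Embedding],
                 fallback: Callable[[Pair], Embedding]) -> None:
        self._foundation_model = foundation_model  # expensive; only run off the serving path
        self._fallback = fallback                  # cheap; assumed behaviour on a miss
        self._cache: Dict[Pair, Embedding] = {}
        self.hits = 0
        self.misses = 0

    def precompute(self, predicted_pairs: Iterable[Pair]) -> None:
        """Background step: run the expensive model on forecasted pairs."""
        for pair in predicted_pairs:
            if pair not in self._cache:
                self._cache[pair] = self._foundation_model(pair)

    def serve(self, pair: Pair) -> Embedding:
        """Serving step: a dictionary lookup that never calls the expensive model."""
        cached = self._cache.get(pair)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        return self._fallback(pair)


def expensive_model(pair: Pair) -> Embedding:
    # Hypothetical stand-in for foundation-model inference.
    user, item = pair
    return [float(len(user)), float(len(item)), 1.0]


def cheap_fallback(pair: Pair) -> Embedding:
    # Hypothetical stand-in for whatever is served when no embedding is cached.
    return [0.0, 0.0, 0.0]


cache = SpeculativeEmbeddingCache(expensive_model, cheap_fallback)
cache.precompute([("alice", "ad42"), ("bob", "ad7")])  # forecasted pairs
hit = cache.serve(("alice", "ad42"))    # predicted pair: served from the cache
miss = cache.serve(("carol", "ad42"))   # unpredicted pair: falls back
```

The point of the sketch is that `serve` costs one lookup whether or not the pair was forecast, so the quality of the request predictor decides how often the full-model embedding is actually available.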

If this is right

Larger foundation models can be used for serving without increasing response latency.
Recommendation quality improves because full-model representations replace distilled approximations.
The serving system handles high request volumes without proportional growth in real-time compute.
Business metrics tied to recommendation performance show measurable positive change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prediction-plus-precompute pattern could reduce compute waste in other online ML systems where inputs are somewhat predictable.
If the prediction model itself is lightweight, the overall energy cost of serving might decrease even while model size grows.
Combining this offloading with dynamic caching could further cut wasted precomputation when request patterns shift rapidly.

Load-bearing premise

Future user-item pairs can be predicted accurately enough that the cost of precomputing unused embeddings is outweighed by the benefit of having ready representations for the pairs that actually arrive.
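One way to make this premise concrete is a back-of-the-envelope break-even calculation. The sketch below assumes a simple linear model in which every precomputed embedding costs the same and every embedding that is actually requested yields the same value; the function names, parameters, and numbers are hypothetical and do not come from the paper.

```python
def break_even_hit_rate(cost_per_precompute: float, value_per_hit: float) -> float:
    """Smallest fraction of precomputed embeddings that must be used to break even."""
    if value_per_hit <= 0:
        raise ValueError("value_per_hit must be positive")
    return cost_per_precompute / value_per_hit


def net_value(num_precomputed: int, hit_rate: float,
              cost_per_precompute: float, value_per_hit: float) -> float:
    """Total value of used embeddings minus the cost of computing all of them."""
    return num_precomputed * (hit_rate * value_per_hit - cost_per_precompute)


# Illustrative numbers only (arbitrary units).
threshold = break_even_hit_rate(cost_per_precompute=1.0, value_per_hit=4.0)
above = net_value(1000, hit_rate=0.5, cost_per_precompute=1.0, value_per_hit=4.0)
below = net_value(1000, hit_rate=0.125, cost_per_precompute=1.0, value_per_hit=4.0)
```

Under these assumptions the premise holds whenever the predictor's hit rate exceeds the ratio of cost to value. A real accounting would also need to cover storage and staleness, which the referee report notes are not described.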

What would settle it

A controlled trial in which the prediction accuracy drops such that more than half the precomputed embeddings go unused and the net change in revenue-driving metrics becomes zero or negative.
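The "more than half go unused" condition could be measured from logs roughly as follows. This is a minimal sketch that assumes precomputed and requested pairs are available as plain (user, item) tuples for the same time window; the function name `unused_fraction` and the sample data are hypothetical.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (user_id, item_id)


def unused_fraction(precomputed: Iterable[Pair], requested: Iterable[Pair]) -> float:
    """Fraction of distinct precomputed pairs that never appeared in a request."""
    precomputed_set: Set[Pair] = set(precomputed)
    if not precomputed_set:
        return 0.0  # nothing was precomputed, so nothing was wasted
    used = precomputed_set & set(requested)
    return 1.0 - len(used) / len(precomputed_set)


# Illustrative data only.
precomputed = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u3", "a9")]
requested = [("u1", "a1"), ("u4", "a4")]
waste = unused_fraction(precomputed, requested)
```

In this example one of the four precomputed pairs is requested, so the unused fraction is above the one-half threshold named in the test.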

Figures

Figures reproduced from arXiv: 2604.12110 by Abdallah Aboelela, Chonglin Sun, Dong Liang, Ellie Wen, Feifan Gu, Fenggang Wu, Hang Qu, Huayu Li, Jill Pan, Jingxian Huang, Jingyi Shen, Jing Zhu, Kaidi Pei, Laming Chen, Liang Luo, Longhao Jin, Qianru Li, Qin Huang, Tongyi Tang, Varna Puvvada, Wei Ling, Wenlin Chen, Xiaohan Wei, Xu Cao, Yaning Huang, Yantao Yao, Yuan Jin, Yunchen Pu, Yuxin Chen, Zeliang Chen, Zhengkai Zhang, Zhengyu Zhang, Zijian Shen, Zikun Liu.

**Figure 1.** SOLARIS overview.

Original abstract

Recent advances in recommendation scaling laws have led to foundation models of unprecedented complexity. While these models offer superior performance, their computational demands make real-time serving impractical, often forcing practitioners to rely on knowledge distillation, compromising serving quality for efficiency. To address this challenge, we present SOLARIS (Speculative Offloading of Latent-bAsed Representation for Inference Scaling), a novel framework inspired by speculative decoding. SOLARIS proactively precomputes user-item interaction embeddings by predicting which user-item pairs are likely to appear in future requests, and asynchronously generating their foundation model representations ahead of time. This approach decouples the costly foundation model inference from the latency-critical serving path, enabling real-time knowledge transfer from models previously considered too expensive for online use. Deployed across Meta's advertising system serving billions of daily requests, SOLARIS achieves 0.67% revenue-driving top-line metrics gain, demonstrating its effectiveness at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SOLARIS, a framework inspired by speculative decoding that predicts future user-item pairs in recommendation systems and asynchronously precomputes their foundation-model embeddings. This decouples expensive inference from the latency-critical serving path. The central claim is a production deployment across Meta's advertising system (billions of daily requests) that yields a 0.67% improvement in revenue-driving top-line metrics.

Significance. If the empirical result holds after full methodological disclosure, the work would be significant for large-scale recommendation systems. It offers a practical route to deploy complex foundation models online without latency penalties, extending speculative-execution ideas from language models to embedding generation. A verified net-positive gain at Meta's scale would provide a concrete existence proof for inference offloading in production recsys.

major comments (1)

Abstract: The 0.67% revenue gain is the load-bearing claim, yet the text supplies no description of the speculative predictor (architecture, training, accuracy, coverage fraction), no cost-benefit accounting for precomputation overhead and staleness, and no baselines or statistical tests. Without these elements the net-value assertion cannot be evaluated.

minor comments (1)

Title: The acronym expansion is capitalized inconsistently ('Latent-bAsed'); standardizing to 'Latent-Based' would improve readability.

Simulated Author's Rebuttal

1 response · 3 unresolved

We thank the referee for the constructive review and for recognizing the potential impact of SOLARIS at scale. We address the major comment on the abstract below, providing the strongest honest response possible given the production context at Meta.

Referee: Abstract: The 0.67% revenue gain is the load-bearing claim, yet the text supplies no description of the speculative predictor (architecture, training, accuracy, coverage fraction), no cost-benefit accounting for precomputation overhead and staleness, and no baselines or statistical tests. Without these elements the net-value assertion cannot be evaluated.

Authors: We agree that the abstract is high-level and does not detail the speculative predictor or the supporting evaluation elements. We will revise the abstract to include a concise description of the predictor's role in forecasting user-item pairs for asynchronous precomputation, along with a high-level note on the production A/B testing that supports the reported gain. However, due to confidentiality constraints at Meta, we cannot provide the predictor's architecture, training procedure, accuracy metrics, coverage fraction, cost-benefit details, overhead accounting, staleness handling specifics, baselines, or statistical test results in the public manuscript. These elements involve proprietary infrastructure and internal metrics that cannot be fully disclosed.

revision: partial

standing simulated objections not resolved

Detailed description of the speculative predictor architecture, training, accuracy, and coverage fraction
Cost-benefit accounting for precomputation overhead and staleness
Baselines and statistical tests validating the 0.67% revenue gain

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claim is an empirical production deployment result (0.67% revenue gain in Meta's advertising system serving billions of requests). The provided text contains no equations, derivations, fitted parameters, or self-citations that could form a load-bearing chain. The framework is described at a high level as inspired by speculative decoding, with precomputation of embeddings based on future-pair prediction, but no internal modeling step reduces to its own inputs by construction. The result is externally falsifiable via deployment metrics and does not rely on any self-referential definition or renamed known result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract describes a high-level engineering framework without introducing mathematical derivations, free parameters, axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5581 in / 1131 out tokens · 69107 ms · 2026-05-10T15:01:26.487506+00:00 · methodology
