Latent Cache Flow: Model-to-Model Communication Without Text

Eugene Wu; Maximillian Rossi; Prajwal Raghunath

arxiv: 2605.22863 · v1 · pith:APRAYUSXnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-25 05:44 UTC · model grok-4.3

keywords: Latent Cache Flow · KV cache communication · LLM agents · model-to-model communication · cache adapters · information sharing without text · cache-to-cache translation

The pith

A compact adapter lets LLMs exchange KV cache summaries instead of text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Latent Cache Flow to let LLMs communicate by translating and compressing their internal key-value caches into summaries of new information. This sidesteps the latency and detail loss that come from decoding model states into text that the receiver must then re-encode. By handling keys and values together, the adapter shrinks to roughly 4 percent of the size of earlier cache-translation adapters. Early tests indicate that a 13 MB adapter can exceed the accuracy of a 956 MB predecessor when contexts match and, when contexts differ, deliver 23 percent higher accuracy than text-based communication at 8.5 times the speed.

Core claim

Latent Cache Flow enables model-to-model communication without text by using a small adapter that jointly translates and compresses keys and values from the sharer model's KV cache into a summary of new information the receiver does not already possess, allowing the receiver to incorporate that information even when the two models maintain different contexts.

What carries the argument

The Latent Cache Flow adapter, which jointly translates and compresses KV cache entries to transmit summaries of new information.
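To make the "jointly translates and compresses" idea concrete, here is a minimal sketch in plain Python. It is not the authors' architecture. It assumes a single layer with heads already flattened, mean pooling of sharer tokens into a fixed number of summary slots, and one shared down-projection followed by one up-projection. The names `JointKVAdapter`, `mean_pool`, `d_latent`, and `n_slots` are hypothetical, and random weights stand in for learned ones.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for learned weights or cache entries."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def mean_pool(rows, n_slots):
    """Average contiguous chunks of token rows into n_slots summary rows."""
    n = len(rows)
    if not 0 < n_slots <= n:
        raise ValueError("n_slots must be between 1 and the number of tokens")
    pooled = []
    for s in range(n_slots):
        chunk = rows[s * n // n_slots:(s + 1) * n // n_slots]
        pooled.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return pooled

class JointKVAdapter:
    """Hypothetical joint projector: K and V share one bottleneck (an assumption)."""

    def __init__(self, d_sharer, d_receiver, d_latent, seed=0):
        rng = random.Random(seed)
        self.d_receiver = d_receiver
        # One down-projection sees keys and values side by side.
        self.down = rand_matrix(2 * d_sharer, d_latent, rng)
        # One up-projection emits receiver-sized keys and values together.
        self.up = rand_matrix(d_latent, 2 * d_receiver, rng)

    def summarize(self, keys, values, n_slots):
        joint = [k + v for k, v in zip(keys, values)]  # concatenate K and V per token
        pooled = mean_pool(joint, n_slots)             # fixed-length summary of the sharer
        out = matmul(matmul(pooled, self.down), self.up)
        d = self.d_receiver
        return [row[:d] for row in out], [row[d:] for row in out]

    def num_params(self):
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))

# Toy usage: 12 sharer tokens of width 8, receiver width 6, latent width 4.
rng = random.Random(1)
keys = rand_matrix(12, 8, rng, scale=1.0)
values = rand_matrix(12, 8, rng, scale=1.0)
adapter = JointKVAdapter(d_sharer=8, d_receiver=6, d_latent=4)
k_sum, v_sum = adapter.summarize(keys, values, n_slots=3)
print(len(k_sum), len(k_sum[0]), adapter.num_params())
```

In this toy setup the summary length depends only on `n_slots`, not on the sharer's token count, which is one way a receiver with a different context could consume it. How the real adapter pools, projects, and injects the summary is not specified in the text reviewed here.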

If this is right

A 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings.
For different contexts, LCF is 23 percent more accurate than text-based communication.
LCF communication runs 8.5 times faster than text-based methods.
The adapter size is reduced to about 4 percent of that used in Cache-to-Cache approaches.
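The size figures above are worth checking against each other. The short calculation below uses only numbers quoted in the abstract (13 MB, 956 MB, 4 percent, 8.5 times); the variable names are illustrative.

```python
C2C_MB = 956.0   # C2C adapter size quoted in the abstract
LCF_MB = 13.0    # LCF adapter size quoted in the abstract
SPEEDUP = 8.5    # reported speedup over text-based communication

size_ratio = LCF_MB / C2C_MB          # fraction of C2C's size taken by the 13 MB adapter
four_percent_mb = 0.04 * C2C_MB       # what "about 4 percent of C2C" would be in MB
latency_fraction = 1.0 / SPEEDUP      # LCF latency as a fraction of text latency

print(f"13 MB / 956 MB = {size_ratio:.1%}")
print(f"4% of 956 MB   = {four_percent_mb:.1f} MB")
print(f"1 / 8.5        = {latency_fraction:.1%} of text latency")
```

The 13 MB adapter is about 1.4 percent of 956 MB, while 4 percent of 956 MB is about 38 MB. One possible reading, which is an assumption and not something the abstract states, is that the 4 percent figure describes a larger LCF configuration and the 13 MB figure a smaller one.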

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Networks of LLMs could exchange updates continuously without regenerating text at every step.
The summary approach might extend to cases where models differ in size or architecture.
Testing whether the same adapter works across entirely new model pairs would show how general the translation is.

Load-bearing premise

A learned summary of new information extracted from the sharer KV cache can be translated by a small adapter into a form the receiver model can usefully incorporate without requiring identical context or losing critical details.

What would settle it

An experiment in which the receiver model, after receiving an LCF summary, shows no improvement on questions that require the new information the sharer held.
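One way to operationalize "no improvement" is a paired comparison on the same questions, with and without the LCF summary. The sketch below is an assumption about how such a check could be run, not a protocol from the paper: it applies a one-sided exact sign test to the questions where the two conditions disagree. The function name and the toy outcome lists are hypothetical.

```python
from math import comb

def paired_sign_test(with_summary, without_summary):
    """One-sided exact sign test on paired per-question correctness (1 or 0).

    Returns (gains, losses, p_value), where p_value is the probability of
    seeing at least `gains` wins among the discordant pairs if the summary
    had no effect (each discordant pair a fair coin flip).
    """
    if len(with_summary) != len(without_summary):
        raise ValueError("both conditions must cover the same questions")
    pairs = list(zip(with_summary, without_summary))
    gains = sum(1 for w, b in pairs if w and not b)    # fixed by the summary
    losses = sum(1 for w, b in pairs if b and not w)   # broken by the summary
    n = gains + losses
    if n == 0:
        return gains, losses, 1.0
    p_value = sum(comb(n, k) for k in range(gains, n + 1)) / 2 ** n
    return gains, losses, p_value

# Toy data: correctness on eight questions that need the sharer's information.
with_lcf = [1, 1, 1, 1, 0, 1, 1, 0]
without_lcf = [0, 0, 1, 0, 0, 0, 1, 1]
gains, losses, p_value = paired_sign_test(with_lcf, without_lcf)
print(gains, losses, p_value)
```

A large p-value on a reasonably sized question set would be the kind of null result described above; the toy lists here are far too small to conclude anything.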

Figures

Figures reproduced from arXiv: 2605.22863 by Eugene Wu, Maximillian Rossi, Prajwal Raghunath.

**Figure 1.** Conceptual efficiency–flexibility trade-off. LCF improves prior cache-level transfer efficiency while relaxing C2C’s alignment constraint toward T2T-like flexibility. Adjacent text: "… challenge by treating the sharer’s cache as a learned semantic conditioning signal. C2C trains a fuser module that takes both models’ caches as input and generates a residual update, which is added to the receiver cache before decoding …"

**Figure 3.** LCF/LCF-X architecture. Blue denotes sharer-only states, green denotes receiver-only states, and yellow denotes projection/compression operations. Sharer-side pooling enables cross-context summaries; the shared LCF projector compresses KV inputs and up-projects receiver-cache residuals. Adjacent text: "2. Concat & Flatten. The KV cache tensors from the Sharer and Receiver are concatenated and flattened along the head dime…"

**Figure 4.** Shared-context results, including both bottleneck scaling and layer-pruned variants. LCF establishes an accuracy–efficiency frontier for inter-model communication, outperforming C2C at substantially lower adapter overhead. The frontier is traced by the progression from LCF-128-9L to LCF-128, and finally to LCF-256. Increasing latent capacity improves accuracy while keeping adapter size much …

**Figure 5.** Quality–latency frontier. LCF-X lies above and to the left of the T2T sweep, indicating higher answer quality at lower communication latency.

Original abstract

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing context. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a 13 MB LCF adapter can be more accurate than a 956 MB C2C adapter in shared-context settings; for different contexts, LCF is 23% more accurate and 8.5x faster than text-based communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LCF shrinks the KV adapter to 13 MB via joint compression and adds a summary step for mismatched contexts, but the accuracy and speed claims rest on undetailed early experiments.

The main advance is the joint compression-plus-translation design that cuts the adapter to roughly 4% of the size of C2C's, combined with transmitting a summary of new information rather than translating token by token. This directly targets the identical-context requirement that limits prior cache-sharing work and makes the approach more usable for agents whose models see different histories. The size reduction and the explicit handling of context deltas are the concrete differences from the cited C2C baseline.

If those mechanisms work as described, they address real latency and information-loss issues in multi-model setups. The paper states the design choices clearly and frames the reported gains as experimental outcomes rather than as quantities derived from fitted parameters, which keeps the argument from being circular at the level shown.

The weakest part is the evidence. The abstract gives accuracy and speed numbers but supplies no task descriptions, dataset details, run counts, baselines, or variance measures, so the 23% accuracy and 8.5x speed claims cannot be evaluated yet. The assumption that a learned summary extracted from the sharer cache can be turned by a small adapter into usable information for a receiver with non-identical context is plausible, but it remains an empirical question that the current writeup leaves open.

The paper is aimed at researchers building LLM agent systems who need lower-overhead state sharing. A reader already working on cache translation or multi-agent efficiency would find the size and context-handling ideas worth checking. The work shows clear engagement with the practical constraints and prior limitations, so it deserves a serious referee even if the experiments need substantial expansion and verification.

Referee Report

1 major / 0 minor

Summary. The paper proposes Latent Cache Flow (LCF) as an alternative to text-based or Cache-to-Cache (C2C) communication between LLMs. It jointly compresses and translates KV-cache entries via a small adapter (about 4% of C2C's size, with a 13 MB variant reported) that transmits only a summary of new information, enabling use with non-identical contexts. Early experiments are reported to show a 13 MB LCF adapter outperforming the 956 MB C2C adapter on accuracy in shared-context settings and delivering 23% higher accuracy plus an 8.5x speedup versus text-based baselines when contexts differ.

Significance. If the performance claims are substantiated, LCF would offer a practical route to low-latency, low-loss inter-agent communication that scales to models with mismatched contexts, addressing a clear bottleneck in multi-LLM systems. The size reduction and context-robustness design choices are concrete engineering contributions that could be adopted independently of the specific accuracy numbers.

major comments (1)

[Abstract] The central performance claims (13 MB LCF more accurate than 956 MB C2C; 23% accuracy gain and 8.5x speedup vs. text for differing contexts) rest entirely on “early experiments” for which no methodology, datasets, model pairs, baseline implementations, number of trials, error bars, or statistical tests are supplied. Without these details the empirical support for the design’s advantages cannot be evaluated.
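As an illustration of the kind of error bar this comment asks for, the sketch below computes a percentile bootstrap confidence interval for accuracy from per-question outcomes. It is a generic statistical recipe, not the authors' evaluation code; the function name `bootstrap_ci`, the resample count, and the toy data are all assumptions.

```python
import random

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy.

    `correct` is a list of 0/1 outcomes, one per evaluation question.
    Returns (accuracy, lower, upper) for a (1 - alpha) interval.
    """
    if not correct:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)
    n = len(correct)
    # Resample questions with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo_idx = int(round((alpha / 2) * n_resamples))
    hi_idx = int(round((1 - alpha / 2) * n_resamples)) - 1
    return sum(correct) / n, means[lo_idx], means[hi_idx]

# Toy data: 70 correct answers out of 100 questions.
outcomes = [1] * 70 + [0] * 30
accuracy, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

Reporting an interval like this for each method and context setting, alongside the number of questions and seeds, would let a reader judge whether a gap such as the reported 23% is larger than the noise.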

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for identifying the lack of experimental detail. We agree that the abstract's reference to 'early experiments' requires supporting methodology to allow evaluation of the claims, and we will revise the manuscript accordingly.

Referee: [Abstract] The central performance claims (13 MB LCF more accurate than 956 MB C2C; 23% accuracy gain and 8.5x speedup vs. text for differing contexts) rest entirely on “early experiments” for which no methodology, datasets, model pairs, baseline implementations, number of trials, error bars, or statistical tests are supplied. Without these details the empirical support for the design’s advantages cannot be evaluated.

Authors: We acknowledge that this comment is correct and that the current manuscript does not supply the requested details. The work is presented as preliminary, with the abstract summarizing early results. In the revised version we will add a dedicated Experiments section describing the full methodology, datasets and benchmarks, model pairs, baseline implementations (including how the 956 MB C2C adapter was reproduced), number of trials, error bars, and statistical tests. We will also update the abstract to reference this section.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The manuscript presents LCF as an empirical design for KV-cache communication, motivated by observations on joint translation/compression and summary transmission for differing contexts. No equations, derivations, fitted parameters, or self-citations appear in the provided text that would reduce any claimed result to its inputs by construction. Accuracy and latency numbers are reported as experimental outcomes rather than predictions forced by parameter fitting or definitional equivalence. The central feasibility claim is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that KV-cache information is sufficiently model-agnostic to be summarized and translated by a small learned adapter.

pith-pipeline@v0.9.0 · 5741 in / 1133 out tokens · 18956 ms · 2026-05-25T05:44:56.893576+00:00 · methodology

Reference graph

Works this paper leans on

11 extracted references · 11 canonical work pages · 1 internal anchor

[1] Yao, Shunyu; Zhao, Jeffrey; Yu, Dian; Du, Nan; Shafran, Izhak; Narasimhan, Karthik; Cao, Yuan.

[2] Improving Factuality and Reasoning in Language Models through Multiagent Debate. International Conference on Machine Learning.

[3] Wu, Qingyun; Bansal, Gagan; Zhang, Jieyu; Wu, Yiran; Li, Beibin; Zhu, Erkang; Jiang, Li; Zhang, Xiaoyun; Zhang, Shaokun; Liu, Jiale; Awadallah, Ahmed Hassan; White, Ryen W.; Burger, Doug; Wang, Chi.

[4] Chen, Weize; Su, Yusheng; Zuo, Jingwei; Yang, Cheng; Yuan, Chenfei; Chan, Chi-Min; Yu, Heyang; Lu, Yaxi; Hung, Yi-Hsin; Qian, Chen; Qin, Yujia; Cong, Xin; Xie, Ruobing; Liu, Zhiyuan; Sun, Maosong; Zhou, Jie.

[5] Efficiently Scaling Transformer Inference. Proceedings of Machine Learning and Systems.

[6] Cache-to-Cache: Direct Semantic Communication Between Large Language Models. International Conference on Learning Representations.

[7] arXiv preprint arXiv:2405.04434.

[8] Transferring Linear Features Across Language Models With Model Stitching. Advances in Neural Information Processing Systems.

[9] Yang, Zhilin; Qi, Peng; Zhang, Saizheng; Bengio, Yoshua; Cohen, William W.; Salakhutdinov, Ruslan; Manning, Christopher D. 2018.

[10] Latent Space Communication via K-V Cache Alignment. arXiv preprint arXiv:2601.06123.

[11] OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. 2023.
