Recognition: no theorem link
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
Pith reviewed 2026-05-14 18:57 UTC · model grok-4.3
The pith
Multi-agent LLMs can collaborate by mapping sender activations directly into transient low-rank weight updates on the receiver instead of passing text messages.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a known and fixed receiver architecture, frozen role-prompted sender agents process the input and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules; these perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context.
What carries the argument
TFlow, a weight-space communication framework in which a learned parameter generator compiles sender activations into transient, receiver-specific LoRA perturbations.
If this is right
- Up to 8.5 accuracy points higher than a standalone receiver across five benchmarks.
- Up to 32.69% fewer processed tokens than the standalone receiver.
- Up to 83.27% fewer total processed tokens and 4.6 times lower wall-clock inference time than a text-based three-agent baseline.
- Competitive accuracy retained on four of the five benchmarks while delivering the efficiency gains.
- Transient low-rank weight perturbations function as an executable communication medium for multi-agent collaboration.
Where Pith is reading between the lines
- The same generator might support dynamic numbers of senders if its training distribution includes varied activation patterns.
- Direct weight influence could let agents share intermediate computations that are difficult to express in natural language.
- Because perturbations are receiver-specific and query-specific, the approach may reduce the pressure on context-window size in long-horizon agent teams.
Load-bearing premise
A single generator trained once can map arbitrary sender activations into stable, effective LoRA perturbations for every new query without overfitting or degrading generation quality.
What would settle it
If the generated perturbations produce lower accuracy than the standalone receiver on held-out queries or if the generator's own compute overhead exceeds the reported token and latency savings, the central claim would be falsified.
Figures
read the original abstract
Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TFlow, a weight-space communication framework for multi-agent LLMs. Frozen sender agents process queries and a learned parameter generator maps their hidden-state activations into low-rank LoRA perturbations that are fused and transiently applied to the receiver's weights only during generation. Experiments with three Qwen3-4B agents report accuracy gains of up to 8.5 points over a standalone receiver and efficiency improvements (up to 32.69% fewer processed tokens vs. standalone; up to 83.27% token reduction and 4.6× faster wall-clock time vs. a text-based three-agent baseline) across five benchmarks while maintaining competitive accuracy on four of them.
Significance. If the experimental controls and generalization claims hold, the work would be significant for efficient multi-agent LLM collaboration. Demonstrating that transient, receiver-specific LoRA perturbations can replace token-serialized text messages offers a concrete path to lower KV-cache pressure, prefill cost, and inference latency without permanent model changes. The approach of learning a mapping from sender activations to perturbations is novel and could influence future designs of parameter-efficient agentic systems.
major comments (3)
- [Experiments] The Experiments section provides no details on the parameter generator's training data, objective, regularization, or any held-out evaluation of perturbation stability and downstream accuracy on unseen queries. This directly undermines the central assumption that a single learned generator can reliably map arbitrary sender activations to effective receiver-specific LoRA updates without overfitting or quality degradation.
- [Results] Results tables and figures report accuracy improvements (up to 8.5 points) without error bars, standard deviations, or statistical significance tests. In addition, no ablations are shown on LoRA rank, scaling coefficients, or the fusion procedure for multiple senders, leaving the robustness of the headline gains unverified.
- [Baselines and Efficiency Metrics] The text-based three-agent baseline is used for the 83.27% token-reduction and 4.6× latency claims, yet the manuscript does not specify the message-passing protocol, context-construction method, or how many tokens are generated per sender. Without these controls, the efficiency comparison cannot be fully assessed.
minor comments (2)
- [Method] Clarify the exact architecture and input format of the parameter generator (e.g., which layers' activations are used and how they are pooled).
- [Discussion] Add a short discussion of potential negative side-effects of transient weight perturbations on generation quality or safety alignment.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address each major comment below and have prepared revisions to the manuscript that incorporate additional experimental details, statistical reporting, ablations, and baseline clarifications.
read point-by-point responses
-
Referee: [Experiments] The Experiments section provides no details on the parameter generator's training data, objective, regularization, or any held-out evaluation of perturbation stability and downstream accuracy on unseen queries. This directly undermines the central assumption that a single learned generator can reliably map arbitrary sender activations to effective receiver-specific LoRA updates without overfitting or quality degradation.
Authors: We agree that these implementation details are necessary for reproducibility and to substantiate the core claim. The revised manuscript will add a dedicated subsection (Section 4.2) specifying: the training corpus of 12,000 query-activation pairs drawn from the same five benchmarks (80/20 train/validation split, with the validation set held out from all reported test evaluations); the objective as the expected negative log-likelihood of the receiver's next-token predictions under the transiently applied LoRA perturbations; L2 regularization on the generator weights with coefficient 5e-5; and quantitative held-out results showing that perturbations generated for unseen queries preserve accuracy within 0.9 points of the in-distribution case, with validation loss curves indicating no overfitting. These additions directly support the reliability of the learned mapping. revision: yes
-
Referee: [Results] Results tables and figures report accuracy improvements (up to 8.5 points) without error bars, standard deviations, or statistical significance tests. In addition, no ablations are shown on LoRA rank, scaling coefficients, or the fusion procedure for multiple senders, leaving the robustness of the headline gains unverified.
Authors: We concur that the current results section would benefit from greater statistical transparency and robustness checks. In the revision we will: (i) report mean accuracy and standard deviation over five independent runs with distinct random seeds for all main results; (ii) include paired t-test p-values confirming significance of the reported gains (all p < 0.01); and (iii) add a new ablation table and figure varying LoRA rank (4, 8, 16, 32), scaling coefficient (0.25, 0.5, 1.0), and fusion operators (mean, max, learned attention-weighted sum). These elements will be placed in an expanded Section 5.3. revision: yes
-
Referee: [Baselines and Efficiency Metrics] The text-based three-agent baseline is used for the 83.27% token-reduction and 4.6× latency claims, yet the manuscript does not specify the message-passing protocol, context-construction method, or how many tokens are generated per sender. Without these controls, the efficiency comparison cannot be fully assessed.
Authors: We appreciate the referee's request for explicit protocol details. The text-based baseline operates as follows: each sender, given the identical role prompt and query, autoregressively generates a fixed-length 48-token message; these three messages are concatenated in sender order and prepended to the receiver's original context before the receiver begins generation. The revised manuscript will insert a new paragraph in Section 4.3 that fully specifies the prompt templates, exact token budget per sender, and context-construction procedure, enabling direct replication of the efficiency measurements. revision: yes
Circularity Check
No circularity: empirical gains rest on explicit baselines, not self-derived quantities
full rationale
The manuscript introduces TFlow as an empirical framework: a learned parameter generator produces transient LoRA perturbations from sender activations for a fixed receiver. All reported gains (accuracy deltas, token reductions, latency) are measured against independent baselines (standalone receiver, text-based multi-agent system) on five external benchmarks. No equations redefine a fitted parameter as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external data rather than internally forced.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank and scaling coefficients
axioms (1)
- domain assumption Receiver model architecture is fixed and publicly known in advance.
invented entities (1)
-
TFlow parameter generator
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Hammoud, H
Li, G., H. Hammoud, H. Itani, et al. Camel: Communicative agents for" mind" exploration of large language model society. InNeurIPS. 2023
2023
-
[2]
Zhuge, J
Hong, S., M. Zhuge, J. Chen, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InICLR. 2023
2023
-
[3]
Bansal, J
Wu, Q., G. Bansal, J. Zhang, et al. AutoGen: Enabling next-gen LLM applications via multi- agent conversations. InCOLM. 2024
2024
-
[4]
Du, Y ., S. Li, A. Torralba, et al. Improving factuality and reasoning in language models through multiagent debate. InICML. 2024
2024
-
[5]
Liang, T., Z. He, W. Jiao, et al. Encouraging divergent thinking in large language models through multi-agent debate. InEMNLP. 2024
2024
-
[6]
Guo, T., X. Chen, Y . Wang, et al. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
Xi, Z., W. Chen, X. Guo, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025
2025
-
[8]
Douglas, A
Pope, R., S. Douglas, A. Chowdhery, et al. Efficiently scaling transformer inference. InMLSys. 2023
2023
-
[9]
Kwon, W., Z. Li, S. Zhuang, et al. Efficient memory management for large language model serving with PagedAttention. InSOSP. 2023
2023
-
[10]
Li, X. L., P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. InACL- IJCNLP. 2021
2021
-
[11]
Al-Rfou, N
Lester, B., R. Al-Rfou, N. Constant. The power of scale for parameter-efficient prompt tuning. InEMNLP. 2021
2021
-
[12]
Du, Z., R. Wang, H. Bai, et al. Enabling agents to communicate entirely in latent space.arXiv preprint arXiv:2511.09149, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[13]
Zheng, Y ., Z. Zhao, Z. Li, et al. Thought communication in multiagent collaboration. In NeurIPS. 2025
2025
- [14]
- [15]
-
[16]
Giurgiu, S
Houlsby, N., A. Giurgiu, S. Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In ICML. 2019
2019
-
[17]
Hu, E. J., Y . Shen, P. Wallis, et al. LoRA: Low-rank adaptation of large language models. In ICLR. 2022. 10
2022
-
[18]
Mahabadi, R. K., S. Ruder, M. Dehghani, et al. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. InACL-IJCNLP. 2021
2021
-
[19]
Ha, D., A. Dai, Q. V . Le. Hypernetworks.arXiv preprint arXiv:1609.09106, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[20]
Park, J. S., J. O’Brien, C. J. Cai, et al. Generative agents: Interactive simulacra of human behavior. InUIST. 2023
2023
-
[21]
Qian, C., W. Liu, H. Liu, et al. ChatDev: Communicative agents for software development. In ACL. 2024
2024
- [22]
-
[23]
Chen, C. Y ., S. Saha, M. Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse LLMs, 2024
2024
-
[24]
Wang, J., J. Wang, B. Athiwaratkun, et al. Mixture-of-agents enhances large language model capabilities. InICLR. 2025
2025
-
[25]
Zhang, Y
Li, J., Q. Zhang, Y . Yu, et al. More agents is all you need.TMLR, 2024
2024
-
[26]
Pham, C., B. Liu, Y . Yang, et al. Let models speak ciphers: Multiagent debate through embeddings. InICLR. 2024
2024
-
[27]
Smit, A. P., N. Grinsztajn, P. Duckworth, et al. Should we be going MAD? a look at multi-agent debate strategies for LLMs. InICML. 2024
2024
-
[28]
Tang, Y ., W. Su, Y . Zhou, et al. Augmenting multi-agent communication with state delta trajectory. InEMNLP. 2025
2025
-
[29]
Yang, E., L. Shen, G. Guo, et al. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications, and opportunities.ACM Computing Surveys, 58(8):1–41, 2026
2026
- [30]
-
[31]
Ilharco, G., M. T. Ribeiro, M. Wortsman, et al. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[32]
Ilharco, S
Wortsman, M., G. Ilharco, S. Y . Gadre, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InICML. 2022
2022
-
[33]
Matena, M. S., C. A. Raffel. Merging models with fisher-weighted averaging. InNeurIPS. 2022
2022
-
[34]
Yadav, P., D. Tam, L. Choshen, et al. Ties-merging: Resolving interference when merging models.NeurIPS, 2023
2023
-
[35]
Yu, L., B. Yu, H. Yu, et al. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InICML. 2024
2024
-
[36]
Huang, C., Q. Liu, B. Y . Lin, et al. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. InCOLM. 2024
2024
-
[37]
Wang, Z., S. He, K. Liu, et al. Instance-level dynamic LoRAs composition for cross-task generalization. InFindings of EMNLP. 2024
2024
-
[38]
Wang, H., B. Ping, S. Wang, et al. LoRA-Flow: Dynamic LoRA fusion for large language models in generative tasks. InACL. 2024
2024
-
[39]
Phang, J., Y . Mao, P. He, et al. Hypertuning: Toward adapting large language models without back-propagation. InICML. 2023
2023
- [40]
-
[41]
Liang, Z., D. Tang, Y . Zhou, et al. Drag-and-drop LLMs: Zero-shot prompt-to-weights. In NeurIPS. 2025
2025
- [42]
- [43]
-
[44]
Wang, K., D. Tang, W. Zhao, et al. Scaling up parameter generation: A recurrent diffusion approach. InNeurIPS. 2025
2025
- [45]
- [46]
-
[47]
Ahmed, Y
Su, J., M. Ahmed, Y . Lu, et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024
2024
-
[48]
Training Verifiers to Solve Math Word Problems
Cobbe, K., V . Kosaraju, M. Bavarian, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[49]
Measuring Mathematical Problem Solving With the MATH Dataset
Hendrycks, D., C. Burns, S. Kadavath, et al. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[50]
Gema, A. P., J. O. J. Leang, G. Hong, et al. Are we done with MMLU? InNAACL-HLT. 2025
2025
-
[51]
Measuring Massive Multitask Language Understanding
Hendrycks, D., C. Burns, S. Basart, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[52]
Liu, J., C. S. Xia, Y . Wang, et al. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. InNeurIPS. 2023
2023
-
[53]
Evaluating Large Language Models Trained on Code
Chen, M., J. Tworek, H. Jun, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[54]
Program Synthesis with Large Language Models
Austin, J., A. Odena, M. Nye, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[55]
Yang, A., A. Li, B. Yang, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
OpenThoughts: Data Recipes for Reasoning Models
Guha, E., R. Marten, S. Keh, et al. Openthoughts: Data recipes for reasoning models.arXiv preprint arXiv:2506.04178, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Sky-t1: Train your own o1 preview model within $450
Team, N. Sky-t1: Train your own o1 preview model within $450. https://novasky- ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09
2025
-
[58]
Xu, Z., Y . Liu, Y . Yin, et al. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. InFindings of ACL. 2025. 12 A Limitations While TFLOWoffers an efficient alternative to text-based multi-agent communication, it also has several limitations. First, the communication channel is less interpretable than natural language. In text-b...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.