arxiv: 2605.13839 · v1 · submitted 2026-05-13 · 💻 cs.CL

Recognition: no theorem link

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

Wenrui Bao , Huan Wang , Jian Wang , Zhangyang Wang , Kai Wang , Yuzhang Shang

Authors on Pith no claims yet

Pith reviewed 2026-05-14 18:57 UTC · model grok-4.3

classification 💻 cs.CL

keywords multi-agent LLMsweight-space communicationLoRA perturbationsefficient collaborationparameter generatortransient adaptationtoken efficiency

0 comments

The pith

Multi-agent LLMs can collaborate by mapping sender activations directly into transient low-rank weight updates on the receiver instead of passing text messages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that natural-language exchanges in multi-agent LLM systems force every sender computation to be tokenized and reprocessed by the receiver, inflating token counts, prefill time, and KV-cache usage. Instead, TFlow uses a learned generator to convert the senders' hidden states into low-rank LoRA perturbations that are applied only during the receiver's forward pass for that query. The perturbations are fused on the fly, discarded afterward, and never enlarge the receiver's text context or alter its permanent weights. A reader cares because this interface preserves the computational work already done by the senders while eliminating the serialization tax of language-based handoff.

Core claim

For a known and fixed receiver architecture, frozen role-prompted sender agents process the input and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules; these perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context.

What carries the argument

TFlow, a weight-space communication framework in which a learned parameter generator compiles sender activations into transient, receiver-specific LoRA perturbations.

If this is right

Up to 8.5 accuracy points higher than a standalone receiver across five benchmarks.
Up to 32.69% fewer processed tokens than the standalone receiver.
Up to 83.27% fewer total processed tokens and 4.6 times lower wall-clock inference time than a text-based three-agent baseline.
Competitive accuracy retained on four of the five benchmarks while delivering the efficiency gains.
Transient low-rank weight perturbations function as an executable communication medium for multi-agent collaboration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same generator might support dynamic numbers of senders if its training distribution includes varied activation patterns.
Direct weight influence could let agents share intermediate computations that are difficult to express in natural language.
Because perturbations are receiver-specific and query-specific, the approach may reduce the pressure on context-window size in long-horizon agent teams.

Load-bearing premise

A single generator trained once can map arbitrary sender activations into stable, effective LoRA perturbations for every new query without overfitting or degrading generation quality.

What would settle it

If the generated perturbations produce lower accuracy than the standalone receiver on held-out queries or if the generator's own compute overhead exceeds the reported token and latency savings, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.13839 by Huan Wang, Jian Wang, Kai Wang, Wenrui Bao, Yuzhang Shang, Zhangyang Wang.

**Figure 1.** Figure 1: (i) Comparison between Text-based MAS and the proposed Weight-Collaboration MAS. In Text MAS, auxiliary agents transmit natural language messages to the Executor, incurring costly prefilling overhead and inflated KV cache. In contrast, our proposed paradigm compresses inter-agent communication into lightweight LoRA weight perturbations ∆W, which are directly merged into the parameters, thereby eliminating … view at source ↗

**Figure 2.** Figure 2: Overview of TFLOW. TFLOW realizes multi-agent collaboration through dynamic, instance-specific parameter generation. Given an input question, each sender agent Ai receives the question together with its system prompt pi and performs a single frozen forward pass to produce hidden-state representations that summarize its role-specific reasoning signal. These hidden states serve as conditions for a trainable … view at source ↗

**Figure 3.** Figure 3: Per-layer mean pairwise cosine similarity of the question’s last-token hidden state across decoder layers. Solid curves report within-task similarity and cross-task similarity, respectively. Dashed lines indicate the corresponding similarity computed on the aggregated conditioning vector c that the parameter generator actually consumes. The bottom panel shows the learned layer weights {wℓ}, which peak sha… view at source ↗

**Figure 4.** Figure 4: Static LoRA vs. TFLOW performance. Static LoRA. To isolate instance-conditioned parameter generation from standard parameterefficient adaptation, we replace the parameter generator of TFLOW with a conventional LoRA adapter shared across all inputs, while keeping the backbone, adapted modules, LoRA rank, and training data unchanged. As shown in [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: Mismatched perturbation injection on GSM8K. Mismatched Perturbation Injection. We further investigate whether TFLOW perturbations encode instancelevel information by fixing the receiver input and replacing its matched perturbation with LoRA factors from random sources, other tasks, same-task samples, or the matched sample [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

read the original abstract

Multi-agent LLM systems usually collaborate by exchanging natural-language messages. This interface is simple and interpretable, but it forces each sender's intermediate computation to be serialized into tokens and then reprocessed by the receiver, thereby increasing the generated-token cost, prefill overhead, and KV-cache memory. We study an alternative communication interface: instead of appending a sender's message to the receiver's context, compile the sender's hidden states into a transient, receiver-specific weight perturbation. We introduce TFlow (Thought Flow), a weight-space communication framework for a known and fixed receiver architecture. For each query, frozen role-prompted sender agents process the input, and a learned parameter generator maps their internal activations into low-rank LoRA perturbations targeting the receiver's modules. These perturbations are fused and applied only during the receiver's generation, enabling instance-level adaptation without permanently changing the model or enlarging the receiver's text context. With three Qwen3-4B agents, TFlow improves over a standalone receiver by up to 8.5 accuracy points across five benchmarks while reducing processed tokens by up to 32.69%. Compared with a text-based three-agent baseline, it reduces total processed tokens by up to 83.27% and the wall-clock inference time by up to 4.6$\times$, while maintaining competitive accuracy on four of five benchmarks. These results suggest that transient low-rank weight perturbations can serve as an executable communication medium for efficient multi-agent LLM collaboration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TFlow turns sender hidden states into transient LoRA updates for the receiver instead of text messages, which cuts tokens sharply but leaves the generator's reliability on new queries unproven.

read the letter

The main point is that TFlow lets agents communicate by mapping sender activations through a learned generator into small, query-specific weight perturbations that get applied only during the receiver's run. This avoids serializing everything into tokens and reprocessing them on the receiver side. With three Qwen3-4B agents it reports up to 8.5 accuracy points over a single receiver and big drops in tokens and latency versus text-based multi-agent baselines. The efficiency numbers look useful on the surface, especially the 83% token reduction and 4.6x speed-up, and the approach keeps the receiver's context unchanged while allowing instance-level adaptation without permanent model edits. The core mechanism is distinct from standard parameter sharing or message passing, and it targets a real pain point in current agent frameworks. The soft spot is the learned parameter generator itself. The abstract gives no training data details, objective, regularization, or held-out tests for whether the perturbations stay stable and helpful on queries outside the training distribution. Without error bars, rank ablations, or fusion checks, it's unclear how sensitive the gains are or whether they could flip negative in practice. The fixed-receiver assumption also narrows the scope. This is worth a reading group for anyone working on efficient multi-agent inference. It deserves peer review because the idea is concrete and the efficiency claims are large enough to check, even if the experiments need more controls on generalization and reporting.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TFlow, a weight-space communication framework for multi-agent LLMs. Frozen sender agents process queries and a learned parameter generator maps their hidden-state activations into low-rank LoRA perturbations that are fused and transiently applied to the receiver's weights only during generation. Experiments with three Qwen3-4B agents report accuracy gains of up to 8.5 points over a standalone receiver and efficiency improvements (up to 32.69% fewer processed tokens vs. standalone; up to 83.27% token reduction and 4.6× faster wall-clock time vs. a text-based three-agent baseline) across five benchmarks while maintaining competitive accuracy on four of them.

Significance. If the experimental controls and generalization claims hold, the work would be significant for efficient multi-agent LLM collaboration. Demonstrating that transient, receiver-specific LoRA perturbations can replace token-serialized text messages offers a concrete path to lower KV-cache pressure, prefill cost, and inference latency without permanent model changes. The approach of learning a mapping from sender activations to perturbations is novel and could influence future designs of parameter-efficient agentic systems.

major comments (3)

[Experiments] The Experiments section provides no details on the parameter generator's training data, objective, regularization, or any held-out evaluation of perturbation stability and downstream accuracy on unseen queries. This directly undermines the central assumption that a single learned generator can reliably map arbitrary sender activations to effective receiver-specific LoRA updates without overfitting or quality degradation.
[Results] Results tables and figures report accuracy improvements (up to 8.5 points) without error bars, standard deviations, or statistical significance tests. In addition, no ablations are shown on LoRA rank, scaling coefficients, or the fusion procedure for multiple senders, leaving the robustness of the headline gains unverified.
[Baselines and Efficiency Metrics] The text-based three-agent baseline is used for the 83.27% token-reduction and 4.6× latency claims, yet the manuscript does not specify the message-passing protocol, context-construction method, or how many tokens are generated per sender. Without these controls, the efficiency comparison cannot be fully assessed.

minor comments (2)

[Method] Clarify the exact architecture and input format of the parameter generator (e.g., which layers' activations are used and how they are pooled).
[Discussion] Add a short discussion of potential negative side-effects of transient weight perturbations on generation quality or safety alignment.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We address each major comment below and have prepared revisions to the manuscript that incorporate additional experimental details, statistical reporting, ablations, and baseline clarifications.

read point-by-point responses

Referee: [Experiments] The Experiments section provides no details on the parameter generator's training data, objective, regularization, or any held-out evaluation of perturbation stability and downstream accuracy on unseen queries. This directly undermines the central assumption that a single learned generator can reliably map arbitrary sender activations to effective receiver-specific LoRA updates without overfitting or quality degradation.

Authors: We agree that these implementation details are necessary for reproducibility and to substantiate the core claim. The revised manuscript will add a dedicated subsection (Section 4.2) specifying: the training corpus of 12,000 query-activation pairs drawn from the same five benchmarks (80/20 train/validation split, with the validation set held out from all reported test evaluations); the objective as the expected negative log-likelihood of the receiver's next-token predictions under the transiently applied LoRA perturbations; L2 regularization on the generator weights with coefficient 5e-5; and quantitative held-out results showing that perturbations generated for unseen queries preserve accuracy within 0.9 points of the in-distribution case, with validation loss curves indicating no overfitting. These additions directly support the reliability of the learned mapping. revision: yes
Referee: [Results] Results tables and figures report accuracy improvements (up to 8.5 points) without error bars, standard deviations, or statistical significance tests. In addition, no ablations are shown on LoRA rank, scaling coefficients, or the fusion procedure for multiple senders, leaving the robustness of the headline gains unverified.

Authors: We concur that the current results section would benefit from greater statistical transparency and robustness checks. In the revision we will: (i) report mean accuracy and standard deviation over five independent runs with distinct random seeds for all main results; (ii) include paired t-test p-values confirming significance of the reported gains (all p < 0.01); and (iii) add a new ablation table and figure varying LoRA rank (4, 8, 16, 32), scaling coefficient (0.25, 0.5, 1.0), and fusion operators (mean, max, learned attention-weighted sum). These elements will be placed in an expanded Section 5.3. revision: yes
Referee: [Baselines and Efficiency Metrics] The text-based three-agent baseline is used for the 83.27% token-reduction and 4.6× latency claims, yet the manuscript does not specify the message-passing protocol, context-construction method, or how many tokens are generated per sender. Without these controls, the efficiency comparison cannot be fully assessed.

Authors: We appreciate the referee's request for explicit protocol details. The text-based baseline operates as follows: each sender, given the identical role prompt and query, autoregressively generates a fixed-length 48-token message; these three messages are concatenated in sender order and prepended to the receiver's original context before the receiver begins generation. The revised manuscript will insert a new paragraph in Section 4.3 that fully specifies the prompt templates, exact token budget per sender, and context-construction procedure, enabling direct replication of the efficiency measurements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains rest on explicit baselines, not self-derived quantities

full rationale

The manuscript introduces TFlow as an empirical framework: a learned parameter generator produces transient LoRA perturbations from sender activations for a fixed receiver. All reported gains (accuracy deltas, token reductions, latency) are measured against independent baselines (standalone receiver, text-based multi-agent system) on five external benchmarks. No equations redefine a fitted parameter as a prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain is therefore self-contained against external data rather than internally forced.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach assumes a fixed, known receiver architecture so that perturbations can be targeted at specific modules; it also relies on the existence of a trainable mapping from sender activations to useful weight deltas.

free parameters (1)

LoRA rank and scaling coefficients
Low-rank perturbation size and scaling are chosen to balance expressivity and cost; their values are not stated in the abstract.

axioms (1)

domain assumption Receiver model architecture is fixed and publicly known in advance.
Required to compile perturbations that can be applied to concrete modules without retraining the base model.

invented entities (1)

TFlow parameter generator no independent evidence
purpose: Learned module that converts sender hidden states into receiver-specific LoRA perturbations.
New component introduced to realize the weight-space communication channel.

pith-pipeline@v0.9.0 · 5575 in / 1400 out tokens · 45381 ms · 2026-05-14T18:57:09.200161+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 20 canonical work pages · 11 internal anchors

[1]

Hammoud, H

Li, G., H. Hammoud, H. Itani, et al. Camel: Communicative agents for" mind" exploration of large language model society. InNeurIPS. 2023

2023
[2]

Zhuge, J

Hong, S., M. Zhuge, J. Chen, et al. Metagpt: Meta programming for a multi-agent collaborative framework. InICLR. 2023

2023
[3]

Bansal, J

Wu, Q., G. Bansal, J. Zhang, et al. AutoGen: Enabling next-gen LLM applications via multi- agent conversations. InCOLM. 2024

2024
[4]

Du, Y ., S. Li, A. Torralba, et al. Improving factuality and reasoning in language models through multiagent debate. InICML. 2024

2024
[5]

Liang, T., Z. He, W. Jiao, et al. Encouraging divergent thinking in large language models through multi-agent debate. InEMNLP. 2024

2024
[6]

Guo, T., X. Chen, Y . Wang, et al. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Xi, Z., W. Chen, X. Guo, et al. The rise and potential of large language model based agents: A survey.Science China Information Sciences, 68(2):121101, 2025

2025
[8]

Douglas, A

Pope, R., S. Douglas, A. Chowdhery, et al. Efficiently scaling transformer inference. InMLSys. 2023

2023
[9]

Kwon, W., Z. Li, S. Zhuang, et al. Efficient memory management for large language model serving with PagedAttention. InSOSP. 2023

2023
[10]

Li, X. L., P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. InACL- IJCNLP. 2021

2021
[11]

Al-Rfou, N

Lester, B., R. Al-Rfou, N. Constant. The power of scale for parameter-efficient prompt tuning. InEMNLP. 2021

2021
[12]

Du, Z., R. Wang, H. Bai, et al. Enabling agents to communicate entirely in latent space.arXiv preprint arXiv:2511.09149, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Zheng, Y ., Z. Zhao, Z. Li, et al. Thought communication in multiagent collaboration. In NeurIPS. 2025

2025
[14]

Zou, J., X. Yang, R. Qiu, et al. Latent collaboration in multi-agent systems.arXiv preprint arXiv:2511.20639, 2025

work page arXiv 2025
[15]

Jin, H., K. Peng, Y . Yu, et al. Agent primitives: Reusable latent building blocks for multi-agent systems.arXiv preprint arXiv:2602.03695, 2026

work page arXiv 2026
[16]

Giurgiu, S

Houlsby, N., A. Giurgiu, S. Jastrzebski, et al. Parameter-efficient transfer learning for NLP. In ICML. 2019

2019
[17]

Hu, E. J., Y . Shen, P. Wallis, et al. LoRA: Low-rank adaptation of large language models. In ICLR. 2022. 10

2022
[18]

Mahabadi, R. K., S. Ruder, M. Dehghani, et al. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks. InACL-IJCNLP. 2021

2021
[19]

Ha, D., A. Dai, Q. V . Le. Hypernetworks.arXiv preprint arXiv:1609.09106, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[20]

Park, J. S., J. O’Brien, C. J. Cai, et al. Generative agents: Interactive simulacra of human behavior. InUIST. 2023

2023
[21]

Qian, C., W. Liu, H. Liu, et al. ChatDev: Communicative agents for software development. In ACL. 2024

2024
[22]

Chen, G., S. Dong, Y . Shu, et al. Autoagents: A framework for automatic agent generation. arXiv preprint arXiv:2309.17288, 2023

work page arXiv 2023
[23]

Chen, C. Y ., S. Saha, M. Bansal. Reconcile: Round-table conference improves reasoning via consensus among diverse LLMs, 2024

2024
[24]

Wang, J., J. Wang, B. Athiwaratkun, et al. Mixture-of-agents enhances large language model capabilities. InICLR. 2025

2025
[25]

Zhang, Y

Li, J., Q. Zhang, Y . Yu, et al. More agents is all you need.TMLR, 2024

2024
[26]

Pham, C., B. Liu, Y . Yang, et al. Let models speak ciphers: Multiagent debate through embeddings. InICLR. 2024

2024
[27]

Smit, A. P., N. Grinsztajn, P. Duckworth, et al. Should we be going MAD? a look at multi-agent debate strategies for LLMs. InICML. 2024

2024
[28]

Tang, Y ., W. Su, Y . Zhou, et al. Augmenting multi-agent communication with state delta trajectory. InEMNLP. 2025

2025
[29]

Yang, E., L. Shen, G. Guo, et al. Model merging in LLMs, MLLMs, and beyond: Methods, theories, applications, and opportunities.ACM Computing Surveys, 58(8):1–41, 2026

2026
[30]

Li, W., Y . Peng, M. Zhang, et al. Deep model fusion: A survey.arXiv preprint arXiv:2309.15698, 2023

work page arXiv 2023
[31]

Ilharco, G., M. T. Ribeiro, M. Wortsman, et al. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[32]

Ilharco, S

Wortsman, M., G. Ilharco, S. Y . Gadre, et al. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. InICML. 2022

2022
[33]

Matena, M. S., C. A. Raffel. Merging models with fisher-weighted averaging. InNeurIPS. 2022

2022
[34]

Yadav, P., D. Tam, L. Choshen, et al. Ties-merging: Resolving interference when merging models.NeurIPS, 2023

2023
[35]

Yu, L., B. Yu, H. Yu, et al. Language models are super mario: Absorbing abilities from homologous models as a free lunch. InICML. 2024

2024
[36]

Huang, C., Q. Liu, B. Y . Lin, et al. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. InCOLM. 2024

2024
[37]

Wang, Z., S. He, K. Liu, et al. Instance-level dynamic LoRAs composition for cross-task generalization. InFindings of EMNLP. 2024

2024
[38]

Wang, H., B. Ping, S. Wang, et al. LoRA-Flow: Dynamic LoRA fusion for large language models in generative tasks. InACL. 2024

2024
[39]

Phang, J., Y . Mao, P. He, et al. Hypertuning: Toward adapting large language models without back-propagation. InICML. 2023

2023
[40]

Cetin, S

Charakorn, R., E. Cetin, S. Uesaka, et al. Doc-to-lora: Learning to instantly internalize contexts. arXiv preprint arXiv:2602.15902, 2026. 11

work page arXiv 2026
[41]

Liang, Z., D. Tang, Y . Zhou, et al. Drag-and-drop LLMs: Zero-shot prompt-to-weights. In NeurIPS. 2025

2025
[42]

Jin, X., K. Wang, D. Tang, et al. Conditional LoRA parameter generation.arXiv preprint arXiv:2408.01415, 2024

work page arXiv 2024
[43]

Wang, K., D. Tang, B. Zeng, et al. Neural network diffusion.arXiv preprint arXiv:2402.13144, 2024

work page arXiv 2024
[44]

Wang, K., D. Tang, W. Zhao, et al. Scaling up parameter generation: A recurrent diffusion approach. InNeurIPS. 2025

2025
[45]

Han, X., F. Neri, Z. Jiang, et al. W2t: Lora weights already know what they can do.arXiv preprint arXiv:2603.15990, 2026

work page arXiv 2026
[46]

H., et al

Team, T. H., et al. Hy-wu (part i): An extensible functional neural memory framework and an instantiation in text-guided image editing.arXiv preprint arXiv:2603.07236, 2026

work page arXiv 2026
[47]

Ahmed, Y

Su, J., M. Ahmed, Y . Lu, et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

2024
[48]

Training Verifiers to Solve Math Word Problems

Cobbe, K., V . Kosaraju, M. Bavarian, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[49]

Measuring Mathematical Problem Solving With the MATH Dataset

Hendrycks, D., C. Burns, S. Kadavath, et al. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[50]

Gema, A. P., J. O. J. Leang, G. Hong, et al. Are we done with MMLU? InNAACL-HLT. 2025

2025
[51]

Measuring Massive Multitask Language Understanding

Hendrycks, D., C. Burns, S. Basart, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009
[52]

Liu, J., C. S. Xia, Y . Wang, et al. Is your code generated by ChatGPT really correct? rigorous evaluation of large language models for code generation. InNeurIPS. 2023

2023
[53]

Evaluating Large Language Models Trained on Code

Chen, M., J. Tworek, H. Jun, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[54]

Program Synthesis with Large Language Models

Austin, J., A. Odena, M. Nye, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[55]

Yang, A., A. Li, B. Yang, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[56]

OpenThoughts: Data Recipes for Reasoning Models

Guha, E., R. Marten, S. Keh, et al. Openthoughts: Data recipes for reasoning models.arXiv preprint arXiv:2506.04178, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Sky-t1: Train your own o1 preview model within $450

Team, N. Sky-t1: Train your own o1 preview model within $450. https://novasky- ai.github.io/posts/sky-t1, 2025. Accessed: 2025-01-09

2025
[58]

Xu, Z., Y . Liu, Y . Yin, et al. KodCode: A diverse, challenging, and verifiable synthetic dataset for coding. InFindings of ACL. 2025. 12 A Limitations While TFLOWoffers an efficient alternative to text-based multi-agent communication, it also has several limitations. First, the communication channel is less interpretable than natural language. In text-b...

2025