arxiv: 2505.12741 · v2 · pith:LGTTE77Fnew · submitted 2025-05-19 · 💻 cs.AI

Language Model Networks: Supervision-Efficient Learning through Dense Communication

Pith reviewed 2026-05-22 14:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords language model networks · dense communication · seq2seq modules · end-to-end optimization · limited supervision · multi-model collaboration · differentiable edges

The pith

Language model networks learn dense vector communication between pre-trained nodes to enable end-to-end optimization with limited supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LMNet to connect pre-trained language models as nodes in a larger system. Communication occurs through trainable sequence-to-sequence modules that exchange dense vectors rather than generating text at every step. This bypasses repeated embedding and de-embedding operations so gradients can flow through the entire network from the final task loss. A sympathetic reader would care because the approach promises to combine existing models into collaborative systems that adapt effectively when only small amounts of task-specific data are available.

Core claim

LMNet realizes language model networks by using stripped pre-trained LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary and thereby achieving efficient information transfer, end-to-end gradient optimization, and learned communication protocols beyond hand-designed ones.

What carries the argument

LMNet architecture, in which pre-trained language models function as reusable nodes connected by trainable seq2seq modules that pass dense vectors to support differentiable communication across the network.
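
A toy sketch may make the contrast between a text hop and a dense hop concrete. It is not the paper's implementation: the 2-dimensional hidden states, the 3-token embedding table `VOCAB_EMBED`, and the functions `text_channel` and `dense_channel` are hypothetical stand-ins chosen only for illustration, and the "edge" here is a plain linear map rather than a seq2seq module.

```python
# Toy illustration only. All names and sizes below are assumptions,
# not taken from the paper.
VOCAB_EMBED = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # hypothetical embedding table


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def text_channel(hidden):
    """De-embed to the best-scoring token, then re-embed it (a discrete hop)."""
    scores = [dot(hidden, e) for e in VOCAB_EMBED]
    token = scores.index(max(scores))
    return VOCAB_EMBED[token]


def dense_channel(hidden, weight):
    """Stand-in for a trainable edge: a linear map applied to the hidden state."""
    return [dot(row, hidden) for row in weight]


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
h_a = [0.9, 0.1]  # two different hidden states from an upstream node
h_b = [0.6, 0.4]

text_a, text_b = text_channel(h_a), text_channel(h_b)
dense_a, dense_b = dense_channel(h_a, IDENTITY), dense_channel(h_b, IDENTITY)

print(text_a == text_b)    # both collapse to the same token embedding
print(dense_a == dense_b)  # the dense hop keeps them distinct
```

In this toy setting both hidden states de-embed to the same token, so the downstream node receives identical inputs through the text hop, while the dense hop passes the two states through unchanged. The argmax in `text_channel` is also the step that blocks gradients in a text-only pipeline.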

If this is right

  • The full network can be optimized end-to-end from the final task objective.
  • Performance gains appear with only small additional training cost for the communication modules.
  • The system adapts to new tasks under limited supervision while keeping natural language at the boundaries.
  • Communication protocols emerge automatically instead of relying on manually specified formats.
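
The first two points can be illustrated with a deliberately tiny sketch, assuming the node modules are kept frozen and only the edge is trained (the referee report below describes the nodes as frozen; the scalar functions, the data, and the learning rate are made up for illustration). A single edge weight `w` sits between two fixed stand-in nodes, and the gradient of the final squared error reaches `w` through the downstream node by the chain rule.

```python
# Minimal sketch under stated assumptions: scalar "nodes", one trainable
# edge weight, hand-written gradient. Not the paper's training code.

def node_in(x):
    """Frozen stand-in for the first vertex module."""
    return 2.0 * x


def node_out(m):
    """Frozen stand-in for the second vertex module."""
    return 0.5 * m + 1.0


NODE_OUT_SLOPE = 0.5  # d node_out / d m, used in the chain rule


def train_edge(data, lr=0.05, steps=200):
    w = 0.0  # the only trainable parameter: the edge weight
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            h = node_in(x)            # dense message from the first node
            pred = node_out(w * h)    # edge output fed to the second node
            # d/dw of (pred - y)^2, propagated back through node_out
            grad += 2.0 * (pred - y) * NODE_OUT_SLOPE * h
        w -= lr * grad / len(data)
    return w


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]  # made-up targets, y = x + 1
w = train_edge(data)
print(round(w, 4))
```

Only `w` changes during training; `node_in` and `node_out` are never updated, which is the sense in which the additional training cost is confined to the communication modules in this sketch.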

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar dense links could reduce the number of tokens generated at intermediate steps and thereby lower inference latency in multi-model pipelines.
  • The approach might extend to networks that mix language models with other differentiable modules such as vision encoders.
  • Learned vector protocols could transfer across related tasks if the seq2seq modules are kept frozen after initial training.

Load-bearing premise

Trainable seq2seq modules can learn effective dense communication protocols from end-task supervision alone without degrading the capabilities of the pre-trained LLM nodes or requiring extensive additional data.

What would settle it

Train an LMNet on a concrete task such as multi-step reasoning and compare its accuracy against both the strongest single pre-trained model and a baseline network that communicates only through generated natural language text; if the dense version shows no gain or a loss, the claim of effective learned communication fails.
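
The comparison described above reduces to a simple decision rule, sketched here with hypothetical helper names (`accuracy`, `verdict`) and an optional `margin` parameter that is an assumption of this sketch, not something the paper specifies. The numbers in the usage lines are placeholders, not reported results.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def verdict(acc_dense, acc_single, acc_text, margin=0.0):
    """The dense network must beat both baselines by more than `margin`."""
    best_baseline = max(acc_single, acc_text)
    return "supported" if acc_dense > best_baseline + margin else "not supported"


# Placeholder accuracies for illustration only.
print(verdict(acc_dense=0.62, acc_single=0.55, acc_text=0.58))
print(verdict(acc_dense=0.58, acc_single=0.55, acc_text=0.58))
```

A real evaluation would also need matched compute and training data across the three systems, plus some measure of run-to-run variance, before a nonzero gain is read as evidence.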

Figures

Figures reproduced from arXiv: 2505.12741 by Quanming Yao, Shiguang Wu, Yaqing Wang.

Figure 1. Communication between LLMs through dense vectors eliminates the bottleneck of natural language. Large Language Models (LLMs) have achieved impressive performance in natural language understanding, generation, and reasoning [5]. Modern LLMs exhibit general intelligence capabilities across a wide range of subjects [1, 52, 11], but still face limitations when tackling complex tasks that require domain-specif…
Figure 2. Illustration of the proposed paradigm. (a) A standard LLM processes discrete token inputs …
Figure 3. Visualization of attention weights in the edge modules on the 4 edges at the last layer of …
Figure 4. Visualization of query projection matrix of the attention block on every edge in trained …
Original abstract

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time reasoning to multi-model collaboration. We study language model networks, where pre-trained language models serve as reusable nodes and intelligence emerges from their topology, communication, and optimization. Existing systems mostly communicate through natural language: easy to deploy, but discrete, inefficient, and hard to optimize from end-task supervision. We propose LMNet, a dense and differentiable realization of this paradigm. LMNet uses stripped LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary. By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization, and learned communication beyond hand-designed protocols. Experiments show performance with small additional training cost and effective adaptation under limited supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes LMNet, a network architecture in which pre-trained language models serve as reusable nodes connected by trainable seq2seq modules that exchange dense vector representations. This design bypasses intermediate embedding and de-embedding steps to enable efficient, differentiable communication, end-to-end gradient flow, and learned protocols that adapt under limited supervision, with claims of small additional training cost relative to natural-language baselines.

Significance. If the central claims are substantiated, the work would offer a concrete mechanism for supervision-efficient multi-LLM systems by replacing discrete text exchanges with dense, optimizable channels. The approach directly addresses a practical bottleneck in current multi-model inference pipelines and could influence designs for collaborative reasoning systems.

major comments (2)
  1. [Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.
  2. [Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.
minor comments (1)
  1. [Abstract] The abstract refers to 'stripped LLMs' without defining what layers or components are removed; a brief clarification of the node interface would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our submission. The comments have helped us identify areas where additional clarity and analysis would strengthen the manuscript. We address each major comment below and have incorporated revisions to improve the presentation of our experimental support and architectural assumptions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.

    Authors: We agree that the abstract would benefit from greater specificity to support the central claims. While the full manuscript details the experimental setup—including datasets, model scales, training cardinalities, natural-language baselines, communication ablations, and statistical reporting—in the Experiments section, we have revised the abstract to include a concise high-level summary of these elements along with references to the relevant sections. This change makes the empirical support more immediately evaluable without substantially increasing length. revision: yes

  2. Referee: [Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.

    Authors: We appreciate this observation on the implicit assumptions of the architecture. The manuscript currently supports the claim through end-to-end performance gains under limited supervision, but we concur that direct evidence of alignment and non-degradation would be valuable. In the revised version we have added a dedicated analysis subsection with ablations that quantify distribution alignment between seq2seq outputs and LLM hidden states, measure any capability degradation on the frozen nodes, and report the additional data required for stable training. revision: yes

Circularity Check

0 steps flagged

No circularity: LMNet proposal introduces new architecture with empirical claims

Full rationale

The paper proposes LMNet as a system architecture using stripped pre-trained LLMs as nodes and trainable seq2seq modules as edges. Claims of efficient dense communication, end-to-end optimization, and limited-supervision adaptation rest on the introduction of these components and reported experimental outcomes rather than any derivation that reduces to its own inputs by construction. No equations, predictions, or uniqueness theorems are presented that loop back to fitted parameters or self-referential definitions. The central premise is a methodological suggestion whose value is asserted via performance results, not tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that pre-trained LLMs remain functional when stripped for use as reusable nodes and that seq2seq modules can be trained to handle intermediate dense representations effectively.

free parameters (1)
  • seq2seq module parameters
    Trainable parameters introduced for the communication edges that are optimized end-to-end.
axioms (2)
  • [domain assumption] Pre-trained language models can serve as reusable vertex modules after stripping
    Invoked when describing LMNet nodes in the abstract.
  • [domain assumption] Dense vector exchange preserves necessary information for system-level tasks
    Implicit in the claim that bypassing embedding/de-embedding enables efficient transfer.
invented entities (1)
  • LMNet architecture with seq2seq communication edges [no independent evidence]
    purpose: To realize dense differentiable communication in language model networks
    New proposed system not previously described in the abstract's context

pith-pipeline@v0.9.0 · 5680 in / 1378 out tokens · 32102 ms · 2026-05-22T14:46:34.268094+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 19 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Program Synthesis with Large Language Models

    Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

  3. [3]

    A neural probabilistic language model

    Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003

  4. [4]

    Graph of Thoughts: Solving elaborate problems with large language models

    Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of Thoughts: Solving elaborate problems with large language models. In AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

  5. [5]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

  6. [6]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  7. [7]

    Natural language processing

    KR1442 Chowdhary and KR Chowdhary. Natural language processing. Fundamentals of Artificial Intelligence, pages 603–649, 2020

  8. [8]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  9. [9]

    On the speed of mental processes

    Franciscus Cornelis Donders. On the speed of mental processes. Acta Psychologica, 30:412–431, 1969

  10. [10]

    Improving factuality and reasoning in language models through multiagent debate

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning, 2023

  11. [11]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  12. [12]

    Bonbon alignment for large language models and the sweetness of best-of-n sampling

    Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024

  13. [13]

    Training Large Language Models to Reason in a Continuous Latent Space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

  14. [14]

    Measuring massive multitask language understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

  15. [15]

    Measuring mathematical problem solving with the MATH dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, 2021

  16. [16]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023

  17. [17]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  18. [18]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  19. [19]

    Thinking, fast and slow

    Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011

  20. [20]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  21. [21]

    DSPy: Compiling declarative language model calls into state-of-the-art pipelines

    Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations, 2024

  22. [22]

    ProsocialDialog: A prosocial backbone for conversational agents

    Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, and Maarten Sap. ProsocialDialog: A prosocial backbone for conversational agents. In Conference on Empirical Methods in Natural Language Processing, 2022

  23. [23]

    Prefix-Tuning: Optimizing Continuous Prompts for Generation

    Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021

  24. [24]

    Exploring versatile generative language model via parameter-efficient transfer learning

    Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829, 2020

  25. [25]

    Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering

    Guangyi Liu, Yongqi Zhang, Yong Li, and Quanming Yao. Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering. In Conference on Parsimony and Learning, 2025

  26. [26]

    A dynamic LLM-powered agent network for task-oriented agent collaboration

    Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic LLM-powered agent network for task-oriented agent collaboration. In Conference on Language Modeling, 2024

  27. [27]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, 2013

  28. [28]

    The magical number seven, plus or minus two: Some limits on our capacity for processing information

    George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956

  29. [29]

    Society of mind

    Marvin Minsky. Society of mind. Simon and Schuster, 1986

  30. [30]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

  31. [31]

    The E2E Dataset: New Challenges For End-to-End Generation

    Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254, 2017

  32. [32]

    The language instinct: How the mind creates language

    Steven Pinker. The language instinct: How the mind creates language. Penguin UK, 2003

  33. [33]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

  34. [34]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

  35. [35]

    GPQA: A graduate-level google-proof Q&A benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In Conference on Language Modeling, 2024

  36. [36]

    Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

    Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024

  37. [37]

    Learning multiagent communication with backpropagation

    Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, volume 29, 2016

  38. [38]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  39. [39]

    Reducing hallucinations in large language models: A consensus voting approach using mixture of experts

    Shuhei Suzuoki and Keiko Hatano. Reducing hallucinations in large language models: A consensus voting approach using mixture of experts, 2024

  40. [40]

    Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

    Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023

  41. [41]

    Stanford Alpaca: An instruction-following LLaMA model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  42. [42]

    Gemma 3 Technical Report

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  43. [43]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

  44. [44]

    Toward self-improvement of llms via imagination, searching, and criticizing

    Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. In Advances in Neural Information Processing Systems, volume 37, pages 52723–52748, 2024

  45. [45]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

  46. [46]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

  47. [47]

    MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024

  48. [48]

    Transforming and combining rewards for aligning large language models

    Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D’Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024

  49. [49]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

  50. [50]

    Understanding natural language

    Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972

  51. [51]

    LaMini-LM: A diverse herd of distilled models from large-scale instructions

    Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, 2024

  52. [52]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  53. [53]

    Tree of Thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822, 2023

  54. [54]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

  55. [55]

    ReST-MCTS*: LLM self-training via process reward guided tree search

    Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. ReST-MCTS*: LLM self-training via process reward guided tree search. In Advances in Neural Information Processing Systems, volume 37, pages 64735–64772, 2024

  56. [56]

    AFlow: Automating Agentic Workflow Generation

    Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

  57. [57]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025

  58. [58]

    Multi-agent design: Optimizing agents with better prompts and topologies

    Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533, 2025

  59. [59]

    GPTSwarm: Language agents as optimizable graphs

    Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. In International Conference on Machine Learning, 2024

A Visualization of Edges

We visualize query/key/value/output projection matrix of the attention block on every edge in trained LMNet-1B respective...