Language Model Networks: Supervision-Efficient Learning through Dense Communication
Pith reviewed 2026-05-22 14:46 UTC · model grok-4.3
The pith
Language model networks learn dense vector communication between pre-trained nodes to enable end-to-end optimization with limited supervision.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LMNet realizes language model networks by using stripped pre-trained LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary and thereby achieving efficient information transfer, end-to-end gradient optimization, and learned communication protocols beyond hand-designed ones.
What carries the argument
LMNet architecture, in which pre-trained language models function as reusable nodes connected by trainable seq2seq modules that pass dense vectors to support differentiable communication across the network.
If this is right
- The full network can be optimized end-to-end from the final task objective.
- Performance gains appear with only small additional training cost for the communication modules.
- The system adapts to new tasks under limited supervision while keeping natural language at the boundaries.
- Communication protocols emerge automatically instead of relying on manually specified formats.
Where Pith is reading between the lines
- Similar dense links could reduce the number of tokens generated at intermediate steps and thereby lower inference latency in multi-model pipelines.
- The approach might extend to networks that mix language models with other differentiable modules such as vision encoders.
- Learned vector protocols could transfer across related tasks if the seq2seq modules are kept frozen after initial training.
Load-bearing premise
Trainable seq2seq modules can learn effective dense communication protocols from end-task supervision alone without degrading the capabilities of the pre-trained LLM nodes or requiring extensive additional data.
What would settle it
Train an LMNet on a concrete task such as multi-step reasoning and compare its accuracy against both the strongest single pre-trained model and a baseline network that communicates only through generated natural language text; if the dense version shows no gain or a loss, the claim of effective learned communication fails.
Figures
read the original abstract
Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time reasoning to multi-model collaboration. We study language model networks, where pre-trained language models serve as reusable nodes and intelligence emerges from their topology, communication, and optimization. Existing systems mostly communicate through natural language: easy to deploy, but discrete, inefficient, and hard to optimize from end-task supervision. We propose LMNet, a dense and differentiable realization of this paradigm. LMNet uses stripped LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary. By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization, and learned communication beyond hand-designed protocols. Experiments show performance with small additional training cost and effective adaptation under limited supervision.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes LMNet, a network architecture in which pre-trained language models serve as reusable nodes connected by trainable seq2seq modules that exchange dense vector representations. This design bypasses intermediate embedding and de-embedding steps to enable efficient, differentiable communication, end-to-end gradient flow, and learned protocols that adapt under limited supervision, with claims of small additional training cost relative to natural-language baselines.
Significance. If the central claims are substantiated, the work would offer a concrete mechanism for supervision-efficient multi-LLM systems by replacing discrete text exchanges with dense, optimizable channels. The approach directly addresses a practical bottleneck in current multi-model inference pipelines and could influence designs for collaborative reasoning systems.
major comments (2)
- [Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.
- [Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.
minor comments (1)
- [Abstract] The abstract refers to 'stripped LLMs' without defining what layers or components are removed; a brief clarification of the node interface would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our submission. The comments have helped us identify areas where additional clarity and analysis would strengthen the manuscript. We address each major comment below and have incorporated revisions to improve the presentation of our experimental support and architectural assumptions.
read point-by-point responses
-
Referee: [Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.
Authors: We agree that the abstract would benefit from greater specificity to support the central claims. While the full manuscript details the experimental setup—including datasets, model scales, training cardinalities, natural-language baselines, communication ablations, and statistical reporting—in the Experiments section, we have revised the abstract to include a concise high-level summary of these elements along with references to the relevant sections. This change makes the empirical support more immediately evaluable without substantially increasing length. revision: yes
-
Referee: [Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.
Authors: We appreciate this observation on the implicit assumptions of the architecture. The manuscript currently supports the claim through end-to-end performance gains under limited supervision, but we concur that direct evidence of alignment and non-degradation would be valuable. In the revised version we have added a dedicated analysis subsection with ablations that quantify distribution alignment between seq2seq outputs and LLM hidden states, measure any capability degradation on the frozen nodes, and report the additional data required for stable training. revision: yes
Circularity Check
No circularity: LMNet proposal introduces new architecture with empirical claims
full rationale
The paper proposes LMNet as a system architecture using stripped pre-trained LLMs as nodes and trainable seq2seq modules as edges. Claims of efficient dense communication, end-to-end optimization, and limited-supervision adaptation rest on the introduction of these components and reported experimental outcomes rather than any derivation that reduces to its own inputs by construction. No equations, predictions, or uniqueness theorems are presented that loop back to fitted parameters or self-referential definitions. The central premise is a methodological suggestion whose value is asserted via performance results, not tautological re-labeling of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- seq2seq module parameters
axioms (2)
- domain assumption Pre-trained language models can serve as reusable vertex modules after stripping
- domain assumption Dense vector exchange preserves necessary information for system-level tasks
invented entities (1)
-
LMNet architecture with seq2seq communication edges
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use such stripped LLMs as vertexes and optimizable seq2seq modules as edges to construct LMNet, with similar structure as MLPs.
-
IndisputableMonolith/Foundation/BranchSelection.leanbranch_selection unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[3]
A neural probabilistic language model
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003
work page 2003
-
[4]
Graph of Thoughts: Solving elaborate problems with large language models
Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of Thoughts: Solving elaborate problems with large language models. In AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024
work page 2024
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020
work page 1901
-
[6]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page 2021
-
[7]
KR1442 Chowdhary and KR Chowdhary. Natural language processing. Fundamentals of Artificial Intelligence, pages 603–649, 2020
work page 2020
-
[8]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[9]
On the speed of mental processes.Acta psychologica, 30:412–431, 1969
Franciscus Cornelis Donders. On the speed of mental processes.Acta psychologica, 30:412–431, 1969
work page 1969
-
[10]
Improving factuality and reasoning in language models through multiagent debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning, 2023
work page 2023
-
[11]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Bonbon alignment for large language models and the sweetness of best-of-n sampling
Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024
-
[13]
Training Large Language Models to Reason in a Continuous Latent Space
Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021. 10
work page 2021
-
[15]
Measuring mathematical problem solving with the MATH dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, 2021
work page 2021
-
[16]
MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Parameter-efficient transfer learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019
work page 2019
-
[18]
LoRA: Low-rank adaptation of large language models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022
work page 2022
- [19]
-
[20]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[21]
DSPy: Compiling declarative language model calls into state-of-the-art pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations, 2024
work page 2024
-
[22]
ProsocialDialog: A prosocial backbone for conversational agents
Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, and Maarten Sap. ProsocialDialog: A prosocial backbone for conversational agents. In Conference on Empirical Methods in Natural Language Processing, 2022
work page 2022
-
[23]
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[24]
Exploring versatile generative language model via parameter-efficient transfer learning
Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829, 2020
-
[25]
Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering
Guangyi Liu, Yongqi Zhang, Yong Li, and Quanming Yao. Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering. In Conference on Parsimony and Learning, 2025
work page 2025
-
[26]
A dynamic LLM-powered agent network for task-oriented agent collaboration
Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic LLM-powered agent network for task-oriented agent collaboration. In Conference on Language Modeling, 2024
work page 2024
-
[27]
Distributed repre- sentations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed repre- sentations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, 2013
work page 2013
-
[28]
The magical number seven, plus or minus two: Some limits on our capacity for processing information
George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956
work page 1956
- [29]
-
[30]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[31]
The E2E Dataset: New Challenges For End-to-End Generation
Jekaterina Novikova, Ondˇrej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[32]
The language instinct: How the mind creates language
Steven Pinker. The language instinct: How the mind creates language. Penguin uK, 2003
work page 2003
-
[33]
Language models are unsupervised multitask learners
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019. 11
work page 2019
-
[34]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020
work page 2020
-
[35]
GPQA: A graduate-level google-proof Q&A benchmark
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In Conference on Language Modeling, 2024
work page 2024
-
[36]
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
Learning multiagent communication with backpropa- gation
Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropa- gation. In Advances in Neural Information Processing Systems, volume 29, 2016
work page 2016
-
[38]
Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big- bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[39]
Shuhei Suzuoki and Keiko Hatano. Reducing hallucinations in large language models: A consensus voting approach using mixture of experts, 2024
work page 2024
-
[40]
Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents
Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
- [41]
-
[42]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Gemma 2: Improving Open Language Models at a Practical Size
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[44]
Toward self-improvement of llms via imagination, searching, and criticizing
Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. In Advances in Neural Information Processing Systems, volume 37, pages 52723–52748, 2024
work page 2024
-
[45]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Informa- tion Processing Systems, volume 30, 2017
work page 2017
-
[46]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[47]
MMLU-Pro: A more robust and challenging multi-task language understanding benchmark
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024
work page 2024
-
[48]
Transforming and combining rewards for aligning large language models
Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D’Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024
-
[49]
Chain-of-thought prompting elicits reasoning in large language models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022. 12
work page 2022
-
[50]
Understanding natural language
Terry Winograd. Understanding natural language. Cognitive Pychology, 3(1):1–191, 1972
work page 1972
-
[51]
LaMini-LM: A diverse herd of distilled models from large-scale instructions
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, 2024
work page 2024
-
[52]
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
Tree of Thoughts: Deliberate problem solving with large language models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822, 2023
work page 2023
-
[54]
ReAct: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023
work page 2023
-
[55]
ReST-MCTS*: LLM self-training via process reward guided tree search
Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. ReST-MCTS*: LLM self-training via process reward guided tree search. In Advances in Neural Information Processing Systems, volume 37, pages 64735–64772, 2024
work page 2024
-
[56]
AFlow: Automating Agentic Workflow Generation
Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[58]
arXiv preprint arXiv:2502.02533 , year=
Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vuli´c, Anna Korhonen, and Sercan Ö Arık. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533, 2025
-
[59]
GPTSwarm: Language agents as optimizable graphs
Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. In International Conference on Machine Learning, 2024. 13 A Visualization of Edges We visualize query/key/value/output projection matrix of the attention block on every edge in trained LMNet-1B respective...
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.