Language Model Networks: Supervision-Efficient Learning through Dense Communication

Quanming Yao; Shiguang Wu; Yaqing Wang

arxiv: 2505.12741 · v2 · pith:LGTTE77Fnew · submitted 2025-05-19 · 💻 cs.AI


Shiguang Wu, Yaqing Wang, Quanming Yao

Pith reviewed 2026-05-22 14:46 UTC · model grok-4.3

classification 💻 cs.AI

keywords language model networks · dense communication · seq2seq modules · end-to-end optimization · limited supervision · multi-model collaboration · differentiable edges


The pith

Language model networks learn dense vector communication between pre-trained nodes to enable end-to-end optimization with limited supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LMNet to connect pre-trained language models as nodes in a larger system. Communication occurs through trainable sequence-to-sequence modules that exchange dense vectors rather than generating text at every step. This bypasses repeated embedding and de-embedding operations so gradients can flow through the entire network from the final task loss. A sympathetic reader would care because the approach promises to combine existing models into collaborative systems that adapt effectively when only small amounts of task-specific data are available.

Core claim

LMNet realizes language model networks by using stripped pre-trained LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary and thereby achieving efficient information transfer, end-to-end gradient optimization, and learned communication protocols beyond hand-designed ones.

What carries the argument

LMNet architecture, in which pre-trained language models function as reusable nodes connected by trainable seq2seq modules that pass dense vectors to support differentiable communication across the network.
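
A minimal sketch of the wiring described above, in plain Python. Everything here is a toy assumption for illustration: the names `Vertex`, `Edge`, and `lmnet_forward`, the 2-dimensional hidden size, and the use of a single linear map per module are hypothetical and are not the paper's implementation (the paper's edges are seq2seq modules and its vertices are stripped pre-trained LLMs). The sketch only shows the data flow: text is embedded once at the input boundary, interior modules exchange dense vectors, and de-embedding happens once at the output boundary.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class Vertex:
    """Stand-in for a stripped LLM: maps hidden vectors to hidden vectors.

    Its weights are treated as frozen (an assumption of this sketch).
    """

    def __init__(self, weights: Matrix) -> None:
        self.weights = weights
        self.trainable = False

    def __call__(self, hidden: List[Vector]) -> List[Vector]:
        return [matvec(self.weights, h) for h in hidden]


class Edge:
    """Stand-in for a trainable communication edge on dense vectors."""

    def __init__(self, weights: Matrix) -> None:
        self.weights = weights
        self.trainable = True

    def __call__(self, hidden: List[Vector]) -> List[Vector]:
        return [matvec(self.weights, h) for h in hidden]


def lmnet_forward(token_ids, embedding, modules, unembedding):
    """Embed once, run vertices and edges on dense vectors, de-embed once."""
    hidden = [embedding[t] for t in token_ids]            # input boundary
    for module in modules:                                # dense interior
        hidden = module(hidden)
    logits = [matvec(unembedding, h) for h in hidden]     # output boundary
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy vocabulary of 3 tokens with 2-dimensional embeddings.
embedding = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unembedding = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

modules = [Vertex(identity), Edge(swap), Vertex(identity)]
output_ids = lmnet_forward([0, 1], embedding, modules, unembedding)
print(output_ids)
```

No token is sampled between the two vertices in this toy, so the interior stays a composition of differentiable maps.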

If this is right

The full network can be optimized end-to-end from the final task objective.
Performance gains appear with only small additional training cost for the communication modules.
The system adapts to new tasks under limited supervision while keeping natural language at the boundaries.
Communication protocols emerge automatically instead of relying on manually specified formats.
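
The first two points can be illustrated with a toy numerical sketch. The setup is an assumption made purely for illustration: two frozen scalar "nodes" (`W_NODE_A`, `W_NODE_B`) joined by one trainable scalar "edge" (`w_edge`), a squared-error task loss, and a hand-derived chain-rule gradient. Because the edge is a differentiable function rather than a sampled token, the loss gradient reaches `w_edge` through the frozen downstream node, and only that one parameter is updated.

```python
# Frozen node weights (stand-ins for pre-trained LLM parameters).
W_NODE_A = 2.0
W_NODE_B = 0.5


def forward(x: float, w_edge: float) -> float:
    """node A -> dense edge -> node B, all differentiable."""
    return W_NODE_B * (w_edge * (W_NODE_A * x))


def loss_and_grad(x: float, target: float, w_edge: float):
    """Squared error and its derivative with respect to the edge weight."""
    prediction = forward(x, w_edge)
    error = prediction - target
    # d(loss)/d(w_edge) = 2 * error * d(prediction)/d(w_edge)
    grad = 2.0 * error * W_NODE_B * W_NODE_A * x
    return error ** 2, grad


def train_edge(data, w_edge: float = 0.0, lr: float = 0.05, steps: int = 200):
    """Gradient descent on the edge only; node weights never change."""
    for _ in range(steps):
        for x, target in data:
            _, grad = loss_and_grad(x, target, w_edge)
            w_edge -= lr * grad
    return w_edge


# Four labelled examples of the target mapping y = 3x (limited supervision).
data = [(1.0, 3.0), (2.0, 6.0), (-1.0, -3.0), (0.5, 1.5)]
initial_loss = sum(loss_and_grad(x, y, 0.0)[0] for x, y in data)
w_edge = train_edge(data)
final_loss = sum(loss_and_grad(x, y, w_edge)[0] for x, y in data)
print(w_edge, initial_loss, final_loss)
```

In this toy the composed map reduces to `w_edge * x`, so the edge weight has a single correct value; real seq2seq edges have many parameters and no such guarantee.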

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar dense links could reduce the number of tokens generated at intermediate steps and thereby lower inference latency in multi-model pipelines.
The approach might extend to networks that mix language models with other differentiable modules such as vision encoders.
Learned vector protocols could transfer across related tasks if the seq2seq modules are kept frozen after initial training.

Load-bearing premise

Trainable seq2seq modules can learn effective dense communication protocols from end-task supervision alone without degrading the capabilities of the pre-trained LLM nodes or requiring extensive additional data.

What would settle it

Train an LMNet on a concrete task such as multi-step reasoning and compare its accuracy against both the strongest single pre-trained model and a baseline network that communicates only through generated natural language text; if the dense version shows no gain or a loss, the claim of effective learned communication fails.
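
The comparison above amounts to a simple decision rule over three accuracy figures. The sketch below uses hypothetical names (`accuracy`, `dense_claim_holds`) and an illustrative `margin` parameter that the source does not specify; the predictions in the usage example are made-up placeholders, not results from the paper.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of exact-match predictions."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def dense_claim_holds(acc_dense: float, acc_single: float, acc_text: float,
                      margin: float = 0.0) -> bool:
    """True only if the dense network beats both baselines by more than margin."""
    return acc_dense > max(acc_single, acc_text) + margin


# Placeholder outputs on a four-item evaluation set (illustrative only).
labels = ["4", "9", "16", "25"]
single_model = ["4", "9", "15", "24"]
text_network = ["4", "9", "16", "24"]
dense_network = ["4", "9", "16", "25"]

scores = {
    "single": accuracy(single_model, labels),
    "text": accuracy(text_network, labels),
    "dense": accuracy(dense_network, labels),
}
print(scores, dense_claim_holds(scores["dense"], scores["single"], scores["text"]))
```

A real evaluation would also need repeated runs or confidence intervals before a gain or loss could be called meaningful.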

Figures

Figures reproduced from arXiv: 2505.12741 by Quanming Yao, Shiguang Wu, Yaqing Wang.

**Figure 1.** Communication between LLMs through dense vectors eliminates the bottleneck of natural language. Large Language Models (LLMs) have achieved impressive performance in natural language understanding, generation, and reasoning [5]. Modern LLMs exhibit general intelligence capabilities across a wide range of subjects [1, 52, 11], but still face limitations when tackling complex tasks that require domain-specif…

**Figure 2.** Illustration of the proposed paradigm. (a) A standard LLM processes discrete token inputs…

**Figure 3.** Visualization of attention weights in the edge modules on the 4 edges at the last layer of…

**Figure 4.** Visualization of query projection matrix of the attention block on every edge in trained…

Original abstract

Language models are increasingly used not only as standalone predictors but also as components in larger inference systems, from test-time reasoning to multi-model collaboration. We study language model networks, where pre-trained language models serve as reusable nodes and intelligence emerges from their topology, communication, and optimization. Existing systems mostly communicate through natural language: easy to deploy, but discrete, inefficient, and hard to optimize from end-task supervision. We propose LMNet, a dense and differentiable realization of this paradigm. LMNet uses stripped LLMs as vertex modules and trainable seq2seq modules as communication edges, enabling intermediate nodes to exchange dense vectors while preserving natural-language input and output at the system boundary. By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization, and learned communication beyond hand-designed protocols. Experiments show performance with small additional training cost and effective adaptation under limited supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes LMNet, a network architecture in which pre-trained language models serve as reusable nodes connected by trainable seq2seq modules that exchange dense vector representations. This design bypasses intermediate embedding and de-embedding steps to enable efficient, differentiable communication, end-to-end gradient flow, and learned protocols that adapt under limited supervision, with claims of small additional training cost relative to natural-language baselines.

Significance. If the central claims are substantiated, the work would offer a concrete mechanism for supervision-efficient multi-LLM systems by replacing discrete text exchanges with dense, optimizable channels. The approach directly addresses a practical bottleneck in current multi-model inference pipelines and could influence designs for collaborative reasoning systems.

major comments (2)

[Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.
[Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.
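
One way the alignment question in the second comment could be probed is by comparing summary statistics of the vectors an edge emits against those a node normally sees at the same layer. The sketch below is a hypothetical diagnostic, not an analysis from the paper: `dimension_stats` and `alignment_gap` are illustrative names, and the vectors are made-up toy values.

```python
from statistics import fmean, pstdev
from typing import List, Tuple

Vector = List[float]


def dimension_stats(vectors: List[Vector]) -> List[Tuple[float, float]]:
    """Per-dimension (mean, population std) over a batch of vectors."""
    return [(fmean(col), pstdev(col)) for col in zip(*vectors)]


def alignment_gap(native: List[Vector], edge_out: List[Vector]) -> float:
    """Largest absolute difference in mean or std across dimensions.

    A gap near zero means the edge outputs match the first two moments of
    the hidden states the frozen node normally receives. Matching moments
    is only a coarse check, not proof that the node behaves well.
    """
    gaps = []
    for (m_n, s_n), (m_e, s_e) in zip(dimension_stats(native),
                                      dimension_stats(edge_out)):
        gaps.append(max(abs(m_n - m_e), abs(s_n - s_e)))
    return max(gaps)


# Toy 2-dimensional hidden states (illustrative values only).
native_hidden = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
aligned_edge = [[4.0, 5.0], [0.0, 1.0], [2.0, 3.0]]      # same set, reordered
shifted_edge = [[10.0, 1.0], [12.0, 3.0], [14.0, 5.0]]   # first dim offset by 10

print(alignment_gap(native_hidden, aligned_edge))
print(alignment_gap(native_hidden, shifted_edge))
```

A fuller study would compare complete distributions and downstream node behaviour; first and second moments are simply the cheapest signal to compute.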

minor comments (1)

[Abstract] The abstract refers to 'stripped LLMs' without defining what layers or components are removed; a brief clarification of the node interface would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our submission. The comments have helped us identify areas where additional clarity and analysis would strengthen the manuscript. We address each major comment below and have incorporated revisions to improve the presentation of our experimental support and architectural assumptions.


Referee: [Abstract] Abstract: The statement that 'experiments show performance with small additional training cost and effective adaptation under limited supervision' is load-bearing for the central claim, yet the manuscript provides no information on datasets, model sizes, training-set cardinalities, baselines, ablations isolating the dense-communication benefit, or statistical controls. Without these, the empirical support for supervision efficiency cannot be evaluated.

Authors: We agree that the abstract would benefit from greater specificity to support the central claims. While the full manuscript details the experimental setup (including datasets, model scales, training cardinalities, natural-language baselines, communication ablations, and statistical reporting) in the Experiments section, we have revised the abstract to include a concise high-level summary of these elements along with references to the relevant sections. This change makes the empirical support more immediately evaluable without substantially increasing length.

revision: yes

Referee: [Architecture description] Proposed architecture (implicit in the description of stripped LLMs as nodes and seq2seq as edges): The claim that trainable seq2seq modules can discover communication vectors compatible with the internal hidden-state distributions of frozen pre-trained LLMs rests on the assumption that end-task gradients alone will align the output distribution of the seq2seq modules with the expectations of the transformer layers. No analysis or ablation is supplied to show that this alignment occurs without degrading node capabilities or requiring large additional data.

Authors: We appreciate this observation on the implicit assumptions of the architecture. The manuscript currently supports the claim through end-to-end performance gains under limited supervision, but we concur that direct evidence of alignment and non-degradation would be valuable. In the revised version we have added a dedicated analysis subsection with ablations that quantify distribution alignment between seq2seq outputs and LLM hidden states, measure any capability degradation on the frozen nodes, and report the additional data required for stable training.

revision: yes

Circularity Check

0 steps flagged

No circularity: LMNet proposal introduces new architecture with empirical claims


The paper proposes LMNet as a system architecture using stripped pre-trained LLMs as nodes and trainable seq2seq modules as edges. Claims of efficient dense communication, end-to-end optimization, and limited-supervision adaptation rest on the introduction of these components and reported experimental outcomes rather than any derivation that reduces to its own inputs by construction. No equations, predictions, or uniqueness theorems are presented that loop back to fitted parameters or self-referential definitions. The central premise is a methodological suggestion whose value is asserted via performance results, not tautological re-labeling of inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the domain assumption that pre-trained LLMs remain functional when stripped for use as reusable nodes and that seq2seq modules can be trained to handle intermediate dense representations effectively.

free parameters (1)

seq2seq module parameters
Trainable parameters introduced for the communication edges that are optimized end-to-end.

axioms (2)

domain assumption: Pre-trained language models can serve as reusable vertex modules after stripping
Invoked when describing LMNet nodes in the abstract.
domain assumption: Dense vector exchange preserves necessary information for system-level tasks
Implicit in the claim that bypassing embedding/de-embedding enables efficient transfer.

invented entities (1)

LMNet architecture with seq2seq communication edges (no independent evidence)
purpose: To realize dense differentiable communication in language model networks
New proposed system not previously described in the abstract's context


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We use such stripped LLMs as vertexes and optimizable seq2seq modules as edges to construct LMNet, with similar structure as MLPs."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Paper passage: "By bypassing intermediate embedding and de-embedding, LMNet enables efficient information transfer, end-to-end gradient optimization"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 19 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023

[2]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021

[3]

A neural probabilistic language model

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003

[4]

Graph of Thoughts: Solving elaborate problems with large language models

Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of Thoughts: Solving elaborate problems with large language models. In AAAI Conference on Artificial Intelligence, volume 38, pages 17682–17690, 2024

[5]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901, 2020

[6]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[7]

Natural language processing

K. R. Chowdhary. Natural language processing. Fundamentals of Artificial Intelligence, pages 603–649, 2020

[8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

[9]

On the speed of mental processes

Franciscus Cornelis Donders. On the speed of mental processes. Acta Psychologica, 30:412–431, 1969

[10]

Improving factuality and reasoning in language models through multiagent debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In International Conference on Machine Learning, 2023

[11]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[12]

Bonbon alignment for large language models and the sweetness of best-of-n sampling

Lin Gui, Cristina Gârbacea, and Victor Veitch. Bonbon alignment for large language models and the sweetness of best-of-n sampling. arXiv preprint arXiv:2406.00832, 2024

[13]

Training Large Language Models to Reason in a Continuous Latent Space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space. arXiv preprint arXiv:2412.06769, 2024

[14]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021

[15]

Measuring mathematical problem solving with the MATH dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems, 2021

[16]

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 3(4):6, 2023

[17]

Parameter-efficient transfer learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

[18]

LoRA: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

[19]

Thinking, fast and slow

Daniel Kahneman. Thinking, fast and slow. Macmillan, 2011

[20]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

[21]

DSPy: Compiling declarative language model calls into state-of-the-art pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations, 2024

[22]

ProsocialDialog: A prosocial backbone for conversational agents

Hyunwoo Kim, Youngjae Yu, Liwei Jiang, Ximing Lu, Daniel Khashabi, Gunhee Kim, Yejin Choi, and Maarten Sap. ProsocialDialog: A prosocial backbone for conversational agents. In Conference on Empirical Methods in Natural Language Processing, 2022

[23]

Prefix-Tuning: Optimizing Continuous Prompts for Generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021

[24]

Exploring versatile generative language model via parameter-efficient transfer learning

Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829, 2020

[25]

Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering

Guangyi Liu, Yongqi Zhang, Yong Li, and Quanming Yao. Dual reasoning: A GNN-LLM collaborative framework for knowledge graph question answering. In Conference on Parsimony and Learning, 2025

[26]

A dynamic LLM-powered agent network for task-oriented agent collaboration

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. A dynamic LLM-powered agent network for task-oriented agent collaboration. In Conference on Language Modeling, 2024

[27]

Distributed representations of words and phrases and their compositionality

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, volume 26, 2013

[28]

The magical number seven, plus or minus two: Some limits on our capacity for processing information

George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2):81, 1956

[29]

Society of mind

Marvin Minsky. Society of mind. Simon and Schuster, 1986

[30]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025

[31]

The E2E Dataset: New Challenges For End-to-End Generation

Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. The E2E dataset: New challenges for end-to-end generation. arXiv preprint arXiv:1706.09254, 2017

[32]

The language instinct: How the mind creates language

Steven Pinker. The language instinct: How the mind creates language. Penguin UK, 2003

[33]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019

[34]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67, 2020

[35]

GPQA: A graduate-level google-proof Q&A benchmark

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level google-proof Q&A benchmark. In Conference on Language Modeling, 2024

[36]

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for llm reasoning. arXiv preprint arXiv:2410.08146, 2024

[37]

Learning multiagent communication with backpropagation

Sainbayar Sukhbaatar, Rob Fergus, et al. Learning multiagent communication with backpropagation. In Advances in Neural Information Processing Systems, volume 29, 2016

[38]

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

[39]

Reducing hallucinations in large language models: A consensus voting approach using mixture of experts, 2024

Shuhei Suzuoki and Keiko Hatano. Reducing hallucinations in large language models: A consensus voting approach using mixture of experts, 2024

[40]

Multi-Agent Collaboration: Harnessing the Power of Intelligent LLM Agents

Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023

[41]

Stanford Alpaca: An instruction-following LLaMA model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[42]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

[43]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024

[44]

Toward self-improvement of llms via imagination, searching, and criticizing

Ye Tian, Baolin Peng, Linfeng Song, Lifeng Jin, Dian Yu, Lei Han, Haitao Mi, and Dong Yu. Toward self-improvement of llms via imagination, searching, and criticizing. In Advances in Neural Information Processing Systems, volume 37, pages 52723–52748, 2024

[45]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017

[46]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022

[47]

MMLU-Pro: A more robust and challenging multi-task language understanding benchmark

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems (Datasets and Benchmarks Track), 2024

[48]

Transforming and combining rewards for aligning large language models

Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D’Amour, Sanmi Koyejo, and Victor Veitch. Transforming and combining rewards for aligning large language models. arXiv preprint arXiv:2402.00742, 2024

[49]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837, 2022

[50]

Understanding natural language

Terry Winograd. Understanding natural language. Cognitive Psychology, 3(1):1–191, 1972

[51]

LaMini-LM: A diverse herd of distilled models from large-scale instructions

Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, and Alham Fikri Aji. LaMini-LM: A diverse herd of distilled models from large-scale instructions. In Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 944–964, 2024

[52]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

[53]

Tree of Thoughts: Deliberate problem solving with large language models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems, volume 36, pages 11809–11822, 2023

[54]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023

[55]

ReST-MCTS*: LLM self-training via process reward guided tree search

Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. ReST-MCTS*: LLM self-training via process reward guided tree search. In Advances in Neural Information Processing Systems, volume 37, pages 64735–64772, 2024

[56]

AFlow: Automating Agentic Workflow Generation

Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al. AFlow: Automating agentic workflow generation. arXiv preprint arXiv:2410.10762, 2024

[57]

A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models. arXiv preprint arXiv:2503.24235, 2025

[58]

Multi-agent design: Optimizing agents with better prompts and topologies

Han Zhou, Xingchen Wan, Ruoxi Sun, Hamid Palangi, Shariq Iqbal, Ivan Vulić, Anna Korhonen, and Sercan Ö Arık. Multi-agent design: Optimizing agents with better prompts and topologies. arXiv preprint arXiv:2502.02533, 2025

[59]

GPTSwarm: Language agents as optimizable graphs

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. GPTSwarm: Language agents as optimizable graphs. In International Conference on Machine Learning, 2024

A Visualization of Edges

We visualize query/key/value/output projection matrix of the attention block on every edge in trained LMNet-1B respective...

