AgentPSO: Evolving Agent Reasoning Skill via Multi-agent Particle Swarm Optimization
Pith reviewed 2026-05-12 00:51 UTC · model grok-4.3
The pith
AgentPSO evolves multi-agent reasoning skills by treating each agent's natural-language description as a particle state that updates toward better performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgentPSO models agents as particles whose states are natural-language skill descriptions. In each iteration, an agent revises its skill by blending its prior velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. The process improves reasoning performance across the population without updating any parameters of the underlying language model and produces skills that generalize beyond the training tasks.
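For orientation, the classical numeric particle swarm update that this description parallels can be written as below. This is the standard textbook rule, not a formula taken from the paper; the symbols ($\omega$ for inertia, $c_1$ and $c_2$ for the cognitive and social weights, $r_1$ and $r_2$ for uniform random factors, $p_i$ for the personal best, $g$ for the global best) are the conventional ones and are assumptions here.

```latex
\begin{aligned}
v_i^{t+1} &= \omega\, v_i^{t} + c_1 r_1 \left(p_i - x_i^{t}\right) + c_2 r_2 \left(g - x_i^{t}\right), \\
x_i^{t+1} &= x_i^{t} + v_i^{t+1}.
\end{aligned}
```

In AgentPSO, the position $x_i$ is a natural-language skill and the velocity $v_i$ is a semantic update direction, so the arithmetic is replaced by language-model rewriting. The self-reflective direction is a fourth term with no counterpart in the classical rule.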
What carries the argument
The semantic velocity update that fuses an agent's previous direction, personal-best skill, global-best skill, and self-reflection from collective trajectories to refine natural-language reasoning skills.
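A minimal sketch of one such iteration follows, assuming a text-in/text-out model call. The names `Agent`, `evolve_step`, `llm`, `evaluate`, and `reflect`, as well as the prompt wording, are hypothetical and chosen for illustration; they are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface (not from the paper): prompt text in, response text out.
LLM = Callable[[str], str]


@dataclass
class Agent:
    skill: str                         # particle "position": natural-language skill
    velocity: str = ""                 # semantic update direction
    best_skill: str = ""               # personal-best skill seen so far
    best_score: float = float("-inf")  # score of the personal-best skill


def evolve_step(
    agents: List[Agent],
    llm: LLM,
    evaluate: Callable[[str], float],
    reflect: Callable[[Agent, List[Agent]], str],
) -> str:
    """Run one iteration over the population and return the global-best skill."""
    # 1. Score each current skill and refresh personal bests.
    for agent in agents:
        score = evaluate(agent.skill)
        if score > agent.best_score:
            agent.best_score, agent.best_skill = score, agent.skill

    # 2. The global best is the strongest personal best in the population.
    global_best = max(agents, key=lambda a: a.best_score).best_skill

    # 3. Fuse the four signals into a new velocity, then apply it to the skill.
    for agent in agents:
        direction = reflect(agent, agents)  # assumed to summarize peer trajectories
        agent.velocity = llm(
            "Propose how to revise the current skill.\n"
            f"Current skill: {agent.skill}\n"
            f"Previous direction: {agent.velocity}\n"
            f"Personal-best skill: {agent.best_skill}\n"
            f"Global-best skill: {global_best}\n"
            f"Self-reflection: {direction}"
        )
        agent.skill = llm(
            "Rewrite the skill by applying the revision.\n"
            f"Skill: {agent.skill}\nRevision: {agent.velocity}"
        )
    return global_best
```

Each agent costs two model calls per iteration in this sketch (one for the velocity, one for the rewritten skill); the paper may structure the calls differently.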
If this is right
- Agents achieve higher accuracy on mathematical and general reasoning benchmarks than static single-agent skills or test-time-only multi-agent methods.
- Skills learned during evolution transfer successfully to different benchmarks.
- The same evolved skills retain their benefits when deployed on a different backbone language model.
- Reasoning gains arise from population-level discovery of reusable procedures without access to model gradients or internals.
Where Pith is reading between the lines
- The approach could be applied to evolve skills for non-reasoning tasks such as collaborative code generation or multi-step planning.
- Focusing on skill evolution rather than inference-time aggregation may reduce problems like biased consensus in debate systems.
- Evolved skills could be archived as a reusable library for deployment across multiple models and problem domains.
- Because updates require only text outputs, the method works with black-box API models that provide no internal access.
Load-bearing premise
Combining previous directions with personal and global best skills through natural-language updates produces genuine reasoning gains rather than superficial changes in prompt wording.
What would settle it
If the evolved skills show no accuracy gain on held-out benchmarks compared with random semantic perturbations or fail to transfer when applied to new tasks or models, the claim of reusable reasoning procedures would be falsified.
Original abstract
Multi-agent reasoning has shown promise for improving the problem-solving ability of large language models by allowing multiple agents to explore diverse reasoning paths. However, most existing multi-agent methods rely on inference-time debate or aggregation, which can be vulnerable to incorrect peer influence and biased consensus. Moreover, the agents themselves remain static, as their underlying reasoning skills do not evolve across tasks. In this paper, we introduce AgentPSO, a particle-swarm-inspired framework for evolving multi-agent reasoning skills. AgentPSO treats each agent as a particle-like reasoner whose state is a natural-language skill and whose velocity is a semantic update direction, iteratively guiding agents toward higher-performing skill configurations. Across training iterations, each agent updates its skill by combining its previous velocity, personal-best skill, global-best skill, and a self-reflective direction derived from peer reasoning trajectories. This enables agents to learn reusable reasoning behaviors by drawing on their own experience and on the strongest skills found by the population, without updating the parameters of the backbone language model. Experiments on mathematical and general reasoning benchmarks show that AgentPSO improves over static single-agent skills and test-time-only multi-agent reasoning baselines. The evolved skills further transfer across benchmarks and to another backbone model, suggesting that AgentPSO captures reusable reasoning procedures rather than merely optimizing benchmark-specific prompts. Code is publicly available at https://github.com/HYUNMIN-HWANG/AgentPSO/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentPSO, a multi-agent framework that adapts particle swarm optimization to evolve natural-language reasoning skills for LLMs. Each agent is treated as a particle whose position is a skill description and velocity is a semantic update direction; updates combine prior velocity, personal-best skill, global-best skill, and self-reflective direction derived from peer trajectories. The method runs without gradient updates to the backbone model. Experiments on mathematical and general reasoning benchmarks report gains over static single-agent and test-time multi-agent baselines, with claimed transfer of evolved skills across benchmarks and to a different backbone model.
Significance. If the empirical claims hold under rigorous controls, the work would be significant for demonstrating a gradient-free mechanism to discover reusable, transferable reasoning procedures in multi-agent LLM systems. It directly addresses the static-agent limitation of prior multi-agent reasoning methods and provides an open-source implementation, which supports reproducibility.
major comments (3)
- [Experiments] Experiments section: the reported improvements over baselines lack any mention of the number of independent runs, variance across seeds, or statistical significance tests. Without these, it is impossible to determine whether the gains exceed what would be expected from additional inference budget or random prompt variation.
- [Method and Experiments] Method and Experiments: no component ablations are presented for the four-term update rule (inertia, cognitive/personal-best, social/global-best, self-reflective). This is load-bearing for the central claim that the PSO-style combination produces genuine skill evolution rather than iterative prompt refinement.
- [Experiments] Experiments: the paper provides no examples or qualitative analysis of the evolved skill strings before and after optimization. Without inspection of the actual natural-language content, it remains possible that transfer results reflect accumulation of benchmark-specific fragments rather than reusable procedures.
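The component ablation requested in the second major comment could be organized as a leave-one-out design over the four update terms. The sketch below only enumerates the variants; the term names and the `ablation_variants` helper are hypothetical labels for illustration, not identifiers from the paper or its code.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical labels for the four terms of the update rule.
TERMS: Tuple[str, ...] = (
    "inertia",          # previous velocity
    "personal_best",    # cognitive term
    "global_best",      # social term
    "self_reflection",  # direction derived from peer trajectories
)


def ablation_variants(terms: Tuple[str, ...] = TERMS) -> Dict[str, FrozenSet[str]]:
    """Return the full rule plus one variant per removed term."""
    variants = {"full": frozenset(terms)}
    for term in terms:
        # Each variant keeps every term except the one being ablated.
        variants[f"no_{term}"] = frozenset(terms) - {term}
    return variants


if __name__ == "__main__":
    for name, enabled in ablation_variants().items():
        print(name, sorted(enabled))
```

Running each variant under the same iteration and inference budget as the full rule would separate the contribution of each term from the effect of simply spending more model calls.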
minor comments (2)
- [Abstract and Introduction] The abstract and introduction use the term 'parameter-free' for the evolution process, yet the update rule includes tunable PSO-style coefficients (inertia, cognitive, social weights) whose values are not specified or ablated.
- [Method] Figure captions and algorithm pseudocode would benefit from explicit notation for how natural-language concatenation is performed during velocity and position updates.
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments, which help strengthen the empirical rigor and clarity of our work on AgentPSO. We address each major point below and commit to revisions that directly incorporate the suggested improvements.
- Referee: Experiments section: the reported improvements over baselines lack any mention of the number of independent runs, variance across seeds, or statistical significance tests. Without these, it is impossible to determine whether the gains exceed what would be expected from additional inference budget or random prompt variation.
  Authors: We agree that reporting variance and statistical tests is necessary to rule out effects from random prompt variation or extra inference budget. In the revised manuscript, we will rerun all main experiments across 5 independent runs with distinct random seeds, report mean accuracies with standard deviations, and include paired t-tests (with p-values) comparing AgentPSO against each baseline to establish statistical significance.
  Revision: yes
- Referee: Method and Experiments: no component ablations are presented for the four-term update rule (inertia, cognitive/personal-best, social/global-best, self-reflective). This is load-bearing for the central claim that the PSO-style combination produces genuine skill evolution rather than iterative prompt refinement.
  Authors: We acknowledge that ablations are required to isolate the contribution of the PSO-style four-term rule versus simpler iterative refinement. The revised version will add a dedicated ablation study that evaluates variants with individual terms removed or replaced (e.g., no inertia, no self-reflective component) on the primary math and general reasoning benchmarks, demonstrating that the full combination yields superior skill evolution.
  Revision: yes
- Referee: Experiments: the paper provides no examples or qualitative analysis of the evolved skill strings before and after optimization. Without inspection of the actual natural-language content, it remains possible that transfer results reflect accumulation of benchmark-specific fragments rather than reusable procedures.
  Authors: We agree that qualitative inspection is important to substantiate claims of reusable procedures. The revision will include concrete examples of initial versus evolved skill strings for representative agents, accompanied by a qualitative analysis section that highlights recurring reasoning patterns (e.g., decomposition strategies) that persist across benchmarks and support the transfer results.
  Revision: yes
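The paired t-test promised in the first response compares per-seed accuracies of two methods run under the same seeds. A minimal standard-library sketch of the test statistic is given below; the function name `paired_t_statistic` is hypothetical, and no accuracy values from the paper are used or implied.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees of freedom) for paired samples a and b.

    a[i] and b[i] must come from the same seed or run so the pairing is valid.
    """
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    if len(a) < 2:
        raise ValueError("need at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With 5 runs the test has only 4 degrees of freedom, so small accuracy gaps may not reach significance. If SciPy is available, `scipy.stats.ttest_rel` computes the same statistic together with a p-value.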
Circularity Check
No significant circularity in claimed results
The paper introduces an empirical PSO-inspired framework for updating natural-language agent skills via LLM calls, with central claims resting on benchmark experiments showing performance gains and cross-benchmark/model transfer. No equations, derivations, or first-principles results are presented that reduce outputs to inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The method is self-contained as an experimental procedure evaluated against external baselines, with no self-definitional loops or ansatz smuggling identified in the provided text.
Axiom & Free-Parameter Ledger
free parameters (1)
- PSO-style coefficients for inertia, cognitive, and social components
axioms (1)
- Domain assumption: semantic combinations of natural-language skill descriptions can serve as effective velocity and position updates that improve reasoning performance.
invented entities (2)
- Agent skill state as particle position (no independent evidence)
- Semantic update direction as particle velocity (no independent evidence)
Forward citations
Cited by 1 Pith paper
- "Skill issues": data-centric optimization of lakehouse agents
  Data-centric optimization of skills for agents on a branching lakehouse improves accuracy by 31.9% on 25 tasks via state-verification evaluation.