pith. machine review for the scientific record.

arxiv: 2604.17220 · v1 · submitted 2026-04-19 · 💻 cs.MA · cs.AI


Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based Simulation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:03 UTC · model grok-4.3

classification 💻 cs.MA cs.AI
keywords agents · cognitive · supply · behavioral · chain · dynamics · biases · complex

The pith

Heterogeneous LLM agents in supply chain simulations exhibit myopic self-interested behaviors that worsen inefficiencies, but information sharing mitigates these effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors create a simulation where supply chain agents are powered by different large language models, some more sophisticated than others. These agents make decisions across multiple stages like ordering and production. The setup reveals that agents tend to focus only on their immediate needs and ignore the bigger picture, leading to problems like excess inventory or shortages for the whole system. When agents share information about demand or plans, the overall performance improves. The work treats these LLM behaviors as stand-ins for how real people might act with varying levels of cognitive skill.
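The multi-stage dynamics described above can be illustrated with a toy beer-game-style simulation. This is a minimal sketch in which a simple anchor-and-adjust ordering heuristic stands in for the paper's LLM agents; the function names, parameters, and demand pattern are illustrative assumptions, not the authors' setup or code.

```python
def simulate_chain(rounds=30, tiers=4, delay=2, adjust=0.5, share_demand=False):
    """Toy multi-stage supply chain with myopic tier agents.

    Each tier ships against incoming demand plus its backlog, then
    places an order with a myopic anchor-and-adjust heuristic.
    Without information sharing, a tier anchors on the (possibly
    amplified) order stream from the tier below; with sharing, it
    anchors on the true end-customer demand. Illustrative stand-in
    for the paper's LLM agents, not their actual policy.
    """
    inventory = [4.0] * tiers
    backlog = [0.0] * tiers
    # orders already in transit; each arrives `delay` rounds after placement
    pipeline = [[4.0] * delay for _ in range(tiers)]
    orders = [[] for _ in range(tiers)]
    for t in range(rounds):
        demand = 4.0 if t < 5 else 8.0  # one-time step in customer demand
        incoming = demand
        for i in range(tiers):
            inventory[i] += pipeline[i].pop(0)      # receive a past order
            need = incoming + backlog[i]
            shipped = min(inventory[i], need)
            inventory[i] -= shipped
            backlog[i] = need - shipped
            anchor = demand if share_demand else incoming
            order = anchor + adjust * backlog[i]    # myopic heuristic
            pipeline[i].append(order)
            orders[i].append(order)
            incoming = order  # the next tier up sees only this order stream
    return orders


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)
```

Running both modes reproduces the qualitative pattern the review summarizes: without sharing, order variance grows from the downstream tier to the upstream one; when every tier anchors on the shared customer demand, upstream variance shrinks.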

Core claim

Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects.
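The "systemic inefficiencies" in this claim are conventionally quantified in the operations literature by a variance-amplification (bullwhip) ratio. A minimal sketch, using synthetic series rather than the paper's data:

```python
def bullwhip_ratio(orders, demand):
    """Variance-amplification ratio Var(orders) / Var(demand).

    A value above 1 indicates the bullwhip effect: a tier's orders
    fluctuate more than the demand it faces. Both series here are
    synthetic illustrations, not data from the paper.
    """
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    return var(orders) / var(demand)


# synthetic example: a step in customer demand, with orders that
# overshoot while the tier works off a backlog
demand = [4, 4, 4, 4, 8, 8, 8, 8, 8, 8]
orders = [4, 4, 4, 4, 8, 10, 10, 9, 8, 8]
```

On this reading of the claim, information sharing would be expected to pull the ratio at upstream tiers back toward 1.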

Load-bearing premise

That behaviors observed in LLM-based agents with varying reasoning sophistication accurately proxy and generalize human cognitive biases and decision-making in real multi-stage supply chains.

Figures

Figures reproduced from arXiv: 2604.17220 by Bo Yang, Guang Xiao, Guangxin Jiang, Jin Yang, Jiuyun Jiang, Xiaomeng Guo, Yuecheng Hong.

Figure 1. Overview of the experimental workflow and analytical framework.
Figure 2. The Mechanism of the Information Structure.
Figure 3. Order amplification: bullwhip effect in homo…
Figure 4. Impact of information sharing on order vari…
Figure 5. Ordering dynamics and variance in cognitive heterogeneity groups. The results demonstrate the robust…
Figure 6. Order variance across supply chain stages.
Figure 7. Total system cost and average stage cost.
Figure 8. The Beer Distribution Game Structure.
read the original abstract

Modeling coordination among generative agents in complex multi-round decision-making presents a core challenge for AI and operations management. Although behavioral experiments have revealed cognitive biases behind supply chain inefficiencies, traditional methods face scalability and control limitations. We introduce a scalable experimental paradigm using Large Language Models (LLMs) to simulate multi-stage supply chain dynamics. Grounded in a Hierarchical Reasoning Framework, this study specifically analyzes the impact of cognitive heterogeneity on agent interactions. Unlike prior homogeneous settings, we employ DeepSeek and GPT agents to systematically vary reasoning sophistication across supply chain tiers. Through rigorously replicated and statistically validated simulations, we investigate how this cognitive diversity influences collective outcomes. Results indicate that agents exhibit myopic and self-interested behaviors that exacerbate systemic inefficiencies. However, we demonstrate that information sharing effectively mitigates these adverse effects. Our findings extend traditional behavioral methods and offer new insights into the dynamics of AI-enabled organizations. This work underscores both the potential and limitations of LLM-based agents as proxies for human decision-making in complex operational environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces a scalable LLM-based simulation paradigm for multi-stage supply chains, using a Hierarchical Reasoning Framework with DeepSeek and GPT agents to instantiate cognitive heterogeneity across tiers. It reports that heterogeneous agents exhibit myopic and self-interested behaviors exacerbating systemic inefficiencies in multi-round decisions, but that information sharing mitigates these effects. The work claims rigorous replication, statistical validation, and positions the approach as extending traditional behavioral experiments while highlighting limitations of LLMs as human proxies.

Significance. If the LLM agents faithfully reproduce human-like cognitive biases, the framework offers a scalable, controllable method to study supply-chain coordination beyond the limits of human-subject experiments such as the beer game. It provides concrete evidence on the role of reasoning heterogeneity and the value of information sharing, with direct relevance to designing AI-augmented operational systems. The explicit use of multiple model families and emphasis on replication are strengths that could support reproducible follow-on work.

major comments (1)
  1. The central claim that information sharing mitigates inefficiencies arising from cognitive heterogeneity rests on the assumption that behaviors observed in DeepSeek and GPT agents accurately proxy human decision-making biases. No calibration against human data, no comparison to established beer-game results, and no ablation of prompting or model artifacts are reported, so the attribution of outcomes to intended heterogeneity rather than LLM-specific factors cannot be assessed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important aspect of our work's positioning, and we address it directly below while outlining planned revisions.

read point-by-point responses
  1. Referee: The central claim that information sharing mitigates inefficiencies arising from cognitive heterogeneity rests on the assumption that behaviors observed in DeepSeek and GPT agents accurately proxy human decision-making biases. No calibration against human data, no comparison to established beer-game results, and no ablation of prompting or model artifacts are reported, so the attribution of outcomes to intended heterogeneity rather than LLM-specific factors cannot be assessed.

    Authors: We agree that the manuscript does not include direct calibration to human-subject beer-game data or quantitative comparisons to established results from the literature, nor does it report systematic ablations isolating prompting or model-specific artifacts. Our primary aim was to develop and validate a scalable LLM-based simulation framework for investigating cognitive heterogeneity, rather than to establish LLMs as precise human proxies. The observed myopic and self-interested behaviors are presented as emerging from the instantiated heterogeneity across model families, with information sharing shown to reduce inefficiencies within these simulations; we note qualitative alignment with known supply-chain coordination issues but did not perform formal benchmarking. In revision, we will expand the limitations and discussion sections to explicitly state the absence of human calibration, add a dedicated subsection on potential model artifacts with any feasible additional checks, and reframe the central claims to emphasize dynamics within LLM agent systems while noting implications for human-like bias studies. This preserves the contribution as a complementary methodological tool.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper reports outcomes from LLM-based simulations of multi-stage supply chains, using DeepSeek and GPT agents to instantiate cognitive heterogeneity and observing myopic behaviors mitigated by information sharing. These are generated results from the experimental runs rather than any derivation, fitted parameter, or self-referential definition that reduces to the inputs by construction. No equations, uniqueness theorems, or ansatzes are presented in the provided text that would trigger self-definitional, fitted-prediction, or self-citation load-bearing patterns. The study is self-contained as a simulation paradigm with independent content from its setup and statistical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on treating LLM outputs as faithful proxies for human cognitive heterogeneity without independent human-subject calibration data; this is an untested domain assumption rather than a derived result.

axioms (1)
  • domain assumption LLM agents with different models can reliably simulate distinct levels of human reasoning sophistication and associated behavioral biases in supply chain decisions
    Invoked throughout the abstract to interpret simulation outcomes as evidence of cognitive heterogeneity effects.

pith-pipeline@v0.9.0 · 5491 in / 1161 out tokens · 34311 ms · 2026-05-10T06:03:48.989140+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

50 extracted references · 16 canonical work pages · 9 internal anchors
