Pith · machine review for the scientific record

arXiv:2604.13345 · v1 · submitted 2026-04-14 · 💻 cs.CV

Recognition: unknown

Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-agent systems · object detection · Raspberry Pi · YOLO · Ollama · edge computing · natural language interface · Slack

The pith

A multi-agent object detection system integrates YOLO, local Ollama LLM, and Slack interface on a single Raspberry Pi using event-based orchestration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that object detection and tracking can be achieved through a tightly coupled set of AI agents running entirely on resource-constrained edge hardware. It combines a YOLO-based vision agent with an Ollama LLM reporting agent and a Slack chatbot agent, all coordinated locally via a custom event-driven message exchange. This setup replaces cloud services and fully autonomous LLM control with a centralized but lightweight orchestration layer, while demonstrating how generative AI tools speed up prototyping. A sympathetic reader would see value in the concrete evidence that natural-language control and real-time vision can coexist on low-cost platforms without external dependencies.

Core claim

The central claim is that a multi-agent framework can deliver real-time object detection and tracking on a Raspberry Pi by running YOLO for vision alongside locally hosted Ollama and Slack agents, with coordination handled by an event-based message exchange subsystem that avoids both cloud resources and fully autonomous agent control.

What carries the argument

The event-based message exchange subsystem that routes tasks and data between the YOLO computer vision agent, the Ollama LLM reporting agent, and the Slack chatbot agent on the same hardware.
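The paper describes this subsystem only at the architectural level. A minimal sketch of what such an event-based message exchange could look like in Python (class names, topic names, and payload fields are illustrative assumptions, not the authors' implementation):

```python
# Sketch of an event-based message exchange between co-located agents.
# Each subscriber (e.g. the Ollama reporting agent) gets its own queue;
# a publisher (e.g. the YOLO vision agent) fans events out by topic.
import queue
import threading

class EventBus:
    """Routes published events to per-topic subscriber queues."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of subscriber queues
        self._lock = threading.Lock()

    def subscribe(self, topic):
        q = queue.Queue()
        with self._lock:
            self._subscribers.setdefault(topic, []).append(q)
        return q

    def publish(self, topic, payload):
        with self._lock:
            targets = list(self._subscribers.get(topic, []))
        for q in targets:
            q.put(payload)
        return len(targets)  # number of agents the event reached

bus = EventBus()
inbox = bus.subscribe("detection")  # e.g. the LLM reporting agent's inbox
bus.publish("detection", {"label": "person", "conf": 0.91})
event = inbox.get(timeout=1)
print(event["label"])  # prints "person"
```

Because every agent runs on the same device, plain in-process queues stand in for the network brokers a cloud design would need, which is the point of the paper's centralized layout.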

If this is right

  • Object detection with natural-language control becomes feasible on single low-cost devices without cloud connectivity.
  • Fast prototyping with generative AI tools can accelerate the construction of such integrated systems.
  • Centralized multi-agent designs encounter measurable limits on constrained hardware that differ from cloud-heavy alternatives.
  • Privacy and cost benefits arise from keeping all components local rather than relying on external resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar event-based orchestration could be tested on other edge tasks such as simple navigation or anomaly monitoring.
  • Chat-app interfaces like Slack may offer a practical way to supervise vision systems in field deployments where full autonomy is undesirable.
  • Performance comparisons across different local LLMs or detector variants would reveal scaling rules for this style of integration.

Load-bearing premise

An event-driven messaging system can reliably orchestrate the agents on a low-power Raspberry Pi without external cloud resources or fully autonomous LLM oversight.

What would settle it

Experiments that record frequent coordination failures, such as dropped detection events, stalled LLM reports, or high latency under realistic loads on the Raspberry Pi hardware.
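Such an experiment needs little instrumentation. The sketch below is a hypothetical harness, not the paper's setup: it bursts synthetic detection events into a bounded queue (standing in for a saturated orchestration layer) and records how many are dropped and how long delivered events waited. Queue capacity and event count are arbitrary assumptions.

```python
# Hypothetical stress harness: count dropped events and queueing latency
# when a producer burst overruns a bounded message queue.
import queue
import time

def stress_bus(n_events=100, capacity=16):
    q = queue.Queue(maxsize=capacity)
    dropped = 0
    for _ in range(n_events):  # burst: producer far outpaces the consumer
        try:
            q.put_nowait(time.monotonic())  # enqueue the event's timestamp
        except queue.Full:
            dropped += 1
    latencies = []
    while not q.empty():  # consumer drains whatever survived the burst
        latencies.append(time.monotonic() - q.get_nowait())
    return dropped, latencies

dropped, latencies = stress_bus()
print(f"dropped {dropped} of 100 events")  # with defaults: dropped 84 of 100 events
```

Repeating this with realistic inter-arrival times from the YOLO agent would turn the referee's "high latency under realistic loads" concern into concrete numbers.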

Figures

Figures reproduced from arXiv:2604.13345 by Branko Brkljač, Milan Brkljač, Vladimir Kalušev.

Figure 1. Developed fast prototyping hardware testbed: (a) custom Raspberry Pi …
Figure 2. Multi-agent system architecture: functional elements of the proposed ob…
Figure 3. Software components implementing proposed multi-agent architecture in …
Figure 4. Illustration of the implemented functionalities: (a) vision agent's object …
Figure 5. Characteristics of the messaging application: socket mode communication, …
Figure 6. Multi-agent system operation with user interaction through created nat…
Figure 7. Illustration of system configuration without LLM based reporting agent: …
Original abstract

The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi-agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.
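The abstract leaves the reporting pipeline at the conceptual level. One plausible shape for it is sketched below: raw detections are formatted into a prompt and sent to a locally hosted model. Only the `/api/generate` endpoint shape follows Ollama's public REST API; the model name, prompt wording, and function names are invented for illustration and are not the authors' code.

```python
# Sketch: turn vision-agent detections into a natural-language status
# report via a locally running Ollama instance (no cloud dependency).
import json
import urllib.request

def build_report_prompt(detections):
    """Summarize raw YOLO detections into a prompt for the reporting agent."""
    lines = [f"- {d['label']} (confidence {d['conf']:.2f})" for d in detections]
    return ("Write a one-sentence status report for these detections:\n"
            + "\n".join(lines))

def ollama_report(detections, model="llama3.2", host="http://localhost:11434"):
    payload = json.dumps({
        "model": model,
        "prompt": build_report_prompt(detections),
        "stream": False,  # single JSON response instead of a token stream
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

On a Raspberry Pi the LLM call dominates end-to-end latency, which is why the referee's request for measured inference times matters.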

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents the design and prototype implementation of a multi-agent object detection framework on a single Raspberry Pi. It integrates a YOLO-based vision agent for real-time detection and tracking, a local Ollama LLM reporting agent, and a Slack chatbot agent for natural-language control, all coordinated via a custom event-driven message bus. The work emphasizes fast prototyping with generative AI tools, contrasts the local approach with cloud-dependent alternatives, and reports qualitative insights from experiments on hardware limitations.

Significance. If the integration functions as described, the paper offers a concrete example of accessible, fully local multi-agent AI deployment on low-cost edge hardware using only open-source components. This could be useful for IoT and embedded scenarios avoiding cloud dependencies. The approach receives credit for its practical, construction-based demonstration and for highlighting the role of generative AI in rapid system development, though the absence of quantitative benchmarks restricts its value as a validated advance in the field.

major comments (2)
  1. [Experimental Investigation] The claim that the prototype provides 'valuable insights into limitations of the low cost testbed platforms' is not supported by any reported quantitative metrics such as detection accuracy, inference latency, CPU/memory usage, message throughput, or failure rates; without these, the feasibility assertions for centralized orchestration on resource-constrained hardware remain unverified.
  2. [Agent Orchestration] The event-based message exchange subsystem is positioned as a reliable alternative to fully autonomous LLM control, yet no details are given on concurrency handling, error recovery, or performance under concurrent agent load, which directly bears on the central claim of tight integration without external resources.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from a brief statement of the specific hardware model (e.g., Pi 4 or 5) and YOLO variant used, to allow readers to assess reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions planned for the next version.

read point-by-point responses
  1. Referee: [Experimental Investigation] The claim that the prototype provides 'valuable insights into limitations of the low cost testbed platforms' is not supported by any reported quantitative metrics such as detection accuracy, inference latency, CPU/memory usage, message throughput, or failure rates; without these, the feasibility assertions for centralized orchestration on resource-constrained hardware remain unverified.

    Authors: We acknowledge that the Experimental Investigation section currently relies on qualitative observations. To better support the claims regarding hardware limitations and the feasibility of centralized orchestration, we will revise this section to include quantitative measurements of YOLO inference latency, CPU and memory utilization, and basic detection performance obtained from additional testbed runs. These metrics will be reported alongside the existing qualitative insights. revision: yes

  2. Referee: [Agent Orchestration] The event-based message exchange subsystem is positioned as a reliable alternative to fully autonomous LLM control, yet no details are given on concurrency handling, error recovery, or performance under concurrent agent load, which directly bears on the central claim of tight integration without external resources.

    Authors: The event-based message bus is described at a conceptual level as providing reliable coordination. We agree that additional implementation details are needed to substantiate reliability claims. In the revision, we will expand the Agent Orchestration section with specifics on concurrency management through the message queue, error recovery approaches (including retries for agent communication failures), and observed behavior under concurrent loads during prototype testing. revision: yes
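A hypothetical sketch of the retry behavior the authors promise here (the back-off schedule, function names, and exception type are assumptions for illustration, not their code):

```python
# Sketch: deliver an event with bounded retries and exponential back-off,
# the kind of error recovery a local agent bus needs when a consumer stalls.
import time

def publish_with_retry(send, event, retries=3, base_delay=0.1):
    """Call send(event); on ConnectionError, back off and retry up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return send(event)
        except ConnectionError:
            if attempt == retries:
                raise  # delivery failed even after all retries
            time.sleep(base_delay * 2 ** attempt)
```

Logging each retry would also yield exactly the failure-rate data the referee's first comment asks for.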

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a descriptive engineering report on assembling and integrating existing open-source components (YOLO detector, local Ollama LLM, Slack chatbot interface) on Raspberry Pi hardware via a custom event-driven message bus. No mathematical derivations, equations, fitted parameters, or self-citations appear as load-bearing elements in the central claims. The multi-agent framework is presented as a design and implementation choice demonstrated by construction, with experimental insights limited to platform limitations rather than any predictive or uniqueness assertions that reduce to inputs. This matches the assessment of an implementation prototype without internal circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a software prototype description with no mathematical derivations, fitted parameters, or new theoretical entities.

pith-pipeline@v0.9.0 · 5551 in / 1065 out tokens · 31418 ms · 2026-05-10T14:45:29.907540+00:00 · methodology


Reference graph

Works this paper leans on

28 extracted references · 2 canonical work pages · 2 internal anchors

  1. S. Sedkaoui and R. Benaichouba, "Generative AI as a transformative force for innovation: a review of opportunities, applications and challenges," European Journal of Innovation Management, Aug. 2024.
  2. S. Luu, A. Ravindran, A. D. Pazho, and H. Tabkhi, "VEI: A multicloud edge gateway for computer vision in IoT," in Proceedings of the 1st Workshop on Middleware for the Edge, 2022, pp. 6–11.
  3. Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, "Object detection with deep learning: A review," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, 2019.
  4. L. Wen, D. Du, Z. Cai, Z. Lei, M.-C. Chang, H. Qi, J. Lim, M.-H. Yang, and S. Lyu, "UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking," Computer Vision and Image Understanding, vol. 193, p. 102907, 2020.
  5. Raspberry Pi 4 Model B Specifications. (2026) [Online]. Available: https://www.raspberrypi.com/products/raspberry-pi-4-model-b/specifications/ (Accessed 2026-04-12).
  6. OpenClaw: Autonomous AI Agent Framework. (2026) [Online]. Available: https://openclaw.ai (Accessed 2026-04-14).
  7. Slack Technologies. (2026) Slack. [Online]. Available: https://slack.com (Accessed 2026-04-14).
  8. Ollama. (2026) [Online]. Available: https://ollama.com (Accessed 2026-04-14).
  9. B. Brkljač and M. Janev, "Applications of information geometry driven deep learning," in Pattern Recognition and Computer Vision in the New AI Era, C. H. Chen, Ed., Series in Computer Vision: Volume 9, World Scientific, 2025, pp. 373–397.
  10. S. K. Pal, A. Pramanik, J. Maiti, and P. Mitra, "Deep learning in multi-object detection and tracking: State of the art," Applied Intelligence, vol. 51, no. 9, pp. 6400–6429, 2021.
  11. R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-based convolutional networks for accurate object detection and segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 1, pp. 142–158, 2015.
  12. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137–1149, 2016.
  13. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You Only Look Once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  14. N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J.-M. Frahm, Eds., Cham: Springer International Publishing, 2020, pp. 213–229.
  15. Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023.
  16. D. Papakyriakou and I. S. Barbounakis, "High performance Linpack (HPL) benchmark on Raspberry Pi 4B (8GB) Beowulf cluster," Int. J. Comput. Appl., vol. 185, pp. 11–19, 2023.
  17. Waveshare. (2026) Raspberry Pi Fan Adapter. [Online]. Available: https://www.waveshare.com/pi-fan-3007.htm (Accessed 2026-04-12).
  18. Raspberry Pi Operating System. (2026) [Online]. Available: https://www.raspberrypi.com/software/operating-systems/ (Accessed 2026-04-12).
  19. Ultralytics YOLOv8 Repository. (2026) [Online]. Available: https://github.com/ultralytics/ultralytics (Accessed 2026-04-12).
  20. G. Jocher and J. Qiu. (2026) Ultralytics YOLO26. [Online]. Available: https://github.com/ultralytics/ultralytics
  21. G. Jocher and J. Qiu. (2024) Ultralytics YOLO11. [Online]. Available: https://github.com/ultralytics/ultralytics
  22. Y. Tian, Q. Ye, and D. Doermann, "YOLOv12: Attention-centric real-time object detectors," arXiv preprint arXiv:2502.12524, 2025.
  23. G. Jocher. (2020) Ultralytics YOLOv5. [Online]. Available: https://github.com/ultralytics/yolov5
  24. G. Jocher, A. Chaurasia, and J. Qiu. (2023) Ultralytics YOLOv8. [Online]. Available: https://github.com/ultralytics/ultralytics
  25. P. Zhang, G. Zeng, T. Wang, and W. Lu, "TinyLlama: An open-source small language model," 2024.
  26. A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., "The Llama 3 herd of models," in Neural Information Processing Systems, Curran Associates, 2024.
  27. G. Team, A. Kamath, J. Ferret, S. Pathak, N. V. et al., "Gemma 3 Technical Report," 2025.
  28. K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen et al., "Kimi K2.5: Visual Agentic Intelligence," arXiv preprint arXiv:2602.02276, 2026.