ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Pith reviewed 2026-05-10 15:03 UTC · model grok-4.3
The pith
ARGOS reformulates multi-camera person search as an interactive agent reasoning task with a spatio-temporal graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARGOS is the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When).
What carries the argument
The Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and transition times to ground the agent's planning and tool use.
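The review does not reproduce the paper's STTG schema. As a rough sketch, assuming a directed graph whose edges carry empirically measured transition-time bounds per camera pair, it might look like the following; the names TransitionEdge, t_min, t_max, and the example cameras are all hypothetical, not the paper's actual data structure.

```python
from dataclasses import dataclass

# Hypothetical sketch only: the paper's actual STTG schema is not given in
# this review, so the (t_min, t_max) edge format and all names are assumed.
@dataclass(frozen=True)
class TransitionEdge:
    t_min: float  # fastest empirically observed transition time, in seconds
    t_max: float  # slowest empirically observed transition time, in seconds

# Directed adjacency map: (source camera, target camera) -> transition stats.
STTG = dict[tuple[str, str], TransitionEdge]

example_sttg: STTG = {
    ("cam_A", "cam_B"): TransitionEdge(t_min=12.0, t_max=45.0),
    ("cam_B", "cam_C"): TransitionEdge(t_min=30.0, t_max=120.0),
}

def transition_bounds(sttg: STTG, src: str, dst: str) -> TransitionEdge | None:
    """Return the empirical time bounds for a camera pair, or None when no
    direct edge exists (temporal checks must then abstain)."""
    return sttg.get((src, dst))
```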
If this is right
- Current LLM backbones leave the benchmark far from solved, with peak scores of 0.383 on the spatial track and 0.590 on the temporal track.
- Ablations show that removing domain-specific tools reduces accuracy by as much as 49.6 percentage points.
- The three tracks isolate semantic perception, spatial reasoning, and temporal reasoning for separate measurement.
- The 2,691 tasks across 14 scenarios supply a diverse, realistic testbed for evaluating agent behavior.
Where Pith is reading between the lines
- The same agentic structure could be adapted to other incremental search problems that involve partial observations, such as robotic exploration or medical record querying.
- Stronger vision-language models would likely raise performance most on the semantic perception track.
- Person search deployments may need to add explicit planning layers rather than relying on retrieval alone.
Load-bearing premise
The Spatio-Temporal Topology Graph provides accurate and sufficient grounding for the agent's planning and tool use across the 14 scenarios.
What would settle it
If agents achieve similar success rates when the Spatio-Temporal Topology Graph is removed from their input and they must operate on general knowledge alone, the specific contribution of this graph-based grounding would be falsified.
read the original abstract
We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
Significance. If the STTG validation and task construction hold up under scrutiny, ARGOS could be a significant contribution: it establishes a new benchmark for evaluating agentic reasoning in realistic multi-camera settings that integrate perception, spatial planning, and temporal inference under information asymmetry. The progressive track design, multi-backbone experiments, and tool ablations provide concrete evidence of current LLM limitations and of the value of domain tools; these strengths could drive follow-on work in agent frameworks for surveillance and search tasks.
major comments (2)
- [STTG description] The claim that transition times are "empirically validated" lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.
- [Experiments section] Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.
minor comments (2)
- [Abstract] The acronym TWS is used without prior definition or expansion.
- [Figures] Diagrams of the agent workflow and the STTG would benefit from clearer legends and explicit cross-references in the main text for improved readability.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback. We address the major comments point-by-point below and plan to incorporate revisions to strengthen the manuscript.
read point-by-point responses
- Referee [STTG description]: The claim that transition times are "empirically validated" lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.
Authors: We agree that additional details on the empirical validation process are essential to support the claims. In the revised version, we will include specifics on the data volume (number of person trajectories and time periods collected), error bounds from the validation, coverage of all camera pairs in the 14 scenarios, and robustness checks against variations in crowd density and time of day. These additions will be placed in the STTG description section to better justify its use as grounding for the agent. Revision: yes.
- Referee [Experiments section]: Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.
Authors: We recognize that more comprehensive details on task construction and validation would strengthen the interpretation of the results. We will revise the Experiments section to provide a full description of how the 2,691 tasks were constructed across the three tracks, including the procedures used for validation (such as manual review or consistency checks), and include an error analysis breaking down agent failure modes. This will help demonstrate that the performance gaps are attributable to the challenges of agentic multi-camera reasoning. Revision: yes.
Circularity Check
No significant circularity in ARGOS benchmark and framework
full rationale
The paper introduces a new benchmark (2,691 tasks across 14 scenarios) and agentic framework for multi-camera person search, reformulating it as interactive reasoning grounded in an STTG whose transition times are described as empirically validated. No load-bearing step reduces a claimed prediction, uniqueness theorem, or first-principles result to its own inputs by construction; the reported LLM performance (TWS 0.383/0.590) and tool-ablation drops are direct experimental outcomes rather than tautological fits. The framework definition and benchmark construction remain independent of the evaluation results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Spatio-Temporal Topology Graph (STTG) accurately encodes camera connectivity and empirically validated transition times.
invented entities (2)
- ARGOS agent: no independent evidence
- Spatio-Temporal Topology Graph (STTG): no independent evidence
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- [3] Anthropic. Introducing Claude 4. Online: https://www.anthropic.com/news/claude-4, 2025.
- [4] Yang Bai, Yucheng Ji, Min Cao, Jinqiao Wang, and Mang Ye. Chat-based person retrieval via dialogue-refined cross-modal alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3952–3962.
- [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024.
- [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [7] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.
- [8] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
- [9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [10] Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. Interactive text-to-image retrieval with large language models: A plug-and-play approach. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 791–809, 2024.
- [11] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Chatting makes perfect: Chat-based image retrieval. Advances in Neural Information Processing Systems, 36:61437–61449, 2023.
- [12] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.
- [13] Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, and Xi Peng. LLaVA-ReID: Selective multi-image questioner for interactive person re-identification. arXiv preprint arXiv:2504.10174, 2025.
- [14] Christian A. Meissner, Siegfried L. Sporer, and Jonathan W. Schooler. Person descriptions as eyewitness evidence. In The Handbook of Eyewitness Psychology: Volume II: Memory for People, pages 3–34. Psychology Press, 2007.
- [15] Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian, and Xiangyang Xue. ChatReID: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24245–24254, 2025.
- [16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [17] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [18] Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, and Dapeng Tao. Harnessing the power of MLLMs for transferable text-to-image person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024.
- [19] Rejin Varghese and M Sambath. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pages 1–6. IEEE, 2024.
- [20] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
- [21] Sanghyun Woo, Kwanyong Park, Inkyu Shin, Myungchul Kim, and In So Kweon. MTMMC: A large-scale real-world multi-modal camera tracking benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22335–22346, 2024.
- [22] Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025.
- [23] Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4492–4501, 2023.
- [24] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [25] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2021.
- [26] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
Supplementary material
Multi-Target Multi-Camera (MTMC) Videos & Tra...
Temporal feasibility check, applied per candidate between two cameras (a code sketch follows this list):
- Presence: Was the candidate observed at both cameras? If not: IMPOSSIBLE (NOT PRESENT).
- Time ordering: Is the candidate's inter-camera gap < −5 s? If so: IMPOSSIBLE (TIME REVERSAL).
- Edge existence: Does a direct STTG edge exist? If not: UNKNOWN (excluded).
- Plausibility (margin = 2.0): Is the gap too fast (< t_min / 2.0) or too slow (> t_max × 2.0)? If so: IMPOSSIBLE. Otherwise: FEASIBLE.
Quality filter: Tasks must contain at least one candidate eliminated by genuine temporal reasoning (TIME REVERSAL or TOO SLOW), excluding trivial presence-based filtering. After applying the quality filter, 1,152 of 1,218 c...
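These four rules translate almost directly into code. Below is a minimal sketch, assuming observations are plain timestamps and the STTG edge is a (t_min, t_max) pair as in the earlier sketch; only the −5 s reversal tolerance and the 2.0 plausibility margin come from the extract, while the function signature and the "TOO FAST" verdict label are assumptions.

```python
# Sketch of the temporal feasibility rules listed above. The -5 s reversal
# tolerance and the 2.0x plausibility margin come from the extract; the
# function name mirrors the check_temporal (T5) tool mentioned in the
# agent-role extract below, but its signature is assumed.
MARGIN = 2.0
REVERSAL_TOLERANCE_S = -5.0

def check_temporal(t_src, t_dst, edge):
    """Classify a candidate's transition between two cameras.

    t_src, t_dst: observation timestamps (seconds) at the source and target
    cameras, or None if the candidate was not observed there.
    edge: (t_min, t_max) transition bounds from the STTG, or None if no
    direct edge connects the two cameras.
    """
    # 1. Presence: the candidate must be observed at both cameras.
    if t_src is None or t_dst is None:
        return "IMPOSSIBLE (NOT PRESENT)"
    gap = t_dst - t_src
    # 2. Time ordering: a gap below -5 s is a time reversal.
    if gap < REVERSAL_TOLERANCE_S:
        return "IMPOSSIBLE (TIME REVERSAL)"
    # 3. Edge existence: without a direct STTG edge, abstain.
    if edge is None:
        return "UNKNOWN (excluded)"
    t_min, t_max = edge
    # 4. Plausibility: the gap must fall within the margin-widened bounds.
    if gap < t_min / MARGIN:
        return "IMPOSSIBLE (TOO FAST)"
    if gap > t_max * MARGIN:
        return "IMPOSSIBLE (TOO SLOW)"
    return "FEASIBLE"

# A 20 s gap on an edge with bounds (12, 45) is feasible; a negative gap
# beyond the tolerance is eliminated as a time reversal.
assert check_temporal(100.0, 120.0, (12.0, 45.0)) == "FEASIBLE"
assert check_temporal(120.0, 100.0, (12.0, 45.0)) == "IMPOSSIBLE (TIME REVERSAL)"
```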
Agent roles (sequential pipeline):
1. Analyst queries the gallery, computes attribute distributions over the current candidate set, and retrieves zone structure (Track 2). It identifies which attributes have the highest elimination power among remaining candidates (one plausible scoring rule is sketched after this list).
2. Planner receives the Analyst's summary together with the full dialogue history and decides the next action: ask about an attribute, request a spatial description, or issue a temporal check.
3. Interviewer executes the chosen action by invoking the appropriate tool. For Track 3, temporal feasibility checking via check_temporal (T5) is enforced as the mandatory first action.
4. Interpreter parses the witness's natural-language response into a canonical attribute value and applies the corresponding filter to update the candidate set. This sequentia...
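The extract does not define "elimination power" precisely. One natural reading, sketched below purely as an assumption, scores an attribute by the expected fraction of candidates a truthful answer would eliminate, which reduces to the Gini impurity of the attribute's value distribution.

```python
from collections import Counter

def elimination_power(candidates, attribute):
    """Expected fraction of candidates eliminated by asking about one
    attribute, assuming the target is uniform over the candidate set and
    the witness truthfully reports the target's value.

    A value with frequency p keeps a fraction p of candidates and occurs
    with probability p, so the expected surviving fraction is sum(p**2);
    the eliminated fraction is 1 - sum(p**2), the Gini impurity of the
    attribute's value distribution. This scoring rule is an assumption,
    not the paper's definition.
    """
    values = [c[attribute] for c in candidates]
    n = len(values)
    return 1.0 - sum((k / n) ** 2 for k in Counter(values).values())

# Toy gallery: jacket color splits the set 2/2, while hat is uninformative.
gallery = [
    {"jacket": "red", "hat": "none"},
    {"jacket": "red", "hat": "none"},
    {"jacket": "blue", "hat": "none"},
    {"jacket": "blue", "hat": "none"},
]
best = max(["jacket", "hat"], key=lambda a: elimination_power(gallery, a))
assert best == "jacket"  # asking about the jacket halves the candidate set
```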