pith. machine review for the scientific record.

arxiv: 2604.12762 · v1 · submitted 2026-04-14 · 💻 cs.CV · cs.AI · cs.MA

Recognition: unknown

ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:03 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.MA
keywords multi-camera person search · agentic reasoning · interactive benchmark · spatio-temporal topology graph · LLM agents · information asymmetry · visual reasoning · surveillance

The pith

ARGOS reformulates multi-camera person search as an interactive agent reasoning task with a spatio-temporal graph.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ARGOS as a benchmark that turns the task of locating a person across multiple camera feeds into an interactive problem for an AI agent. Starting from a vague witness statement, the agent must decide which questions to pose, when to consult spatial or temporal tools, and how to discard incorrect possibilities while staying within a turn limit. A reader would care because many real searches begin with incomplete details, and direct matching approaches cannot cope with the planning and uncertainty required. The framework grounds decisions in a graph of camera connections and travel times between them. Experiments across thousands of tasks show that existing language models perform poorly and that the specialized tools account for large gains in accuracy.

Core claim

ARGOS is the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When).
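The loop this describes is mechanical enough to sketch. Below is a minimal, hypothetical rendering of one episode; the names (search_episode, plan_action, parse_response, is_consistent) are illustrative assumptions rather than the paper's interfaces, and the four-module split shown in Figure 1 is collapsed into methods on a single agent object.

```python
# Hedged sketch of one ARGOS-style episode: question, filter, repeat
# until one candidate remains or the turn budget runs out. All names
# here are illustrative, not the paper's actual API.

def search_episode(agent, witness, gallery, sttg, max_turns=10):
    candidates = set(gallery.person_ids())       # start from the full gallery
    history = [witness.initial_statement()]      # vague opening description

    for _ in range(max_turns):
        if len(candidates) == 1:
            return candidates.pop()              # resolved within budget

        # Plan: choose the next question or tool call from history + stats
        action = agent.plan_action(history, candidates, sttg)

        # Act: pose the attribute / spatial / temporal query to the witness
        response = witness.answer(action)
        history.append((action, response))

        # Interpret: parse the free-text reply, then eliminate candidates
        parsed = agent.parse_response(action, response)
        candidates = {c for c in candidates
                      if agent.is_consistent(c, parsed, gallery, sttg)}

    return None                                  # budget exhausted: failure
```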

What carries the argument

The Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and transition times to ground the agent's planning and tool use.
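As a data structure, the graph is compact: cameras are nodes, and typed edges carry transition-time ranges. A minimal sketch using networkx; the camera IDs and times below are invented, and only the edge vocabulary (OVERLAP, SOFT ADJ, TRAVEL) and the median-transition-time labels follow the legend of Figure 2.

```python
import networkx as nx

# Toy STTG fragment. Edge attributes: kind (OVERLAP / SOFT_ADJ / TRAVEL)
# and a feasible transition-time window [t_min, t_max] in seconds.
sttg = nx.Graph()
sttg.add_edge("cam_01", "cam_02", kind="OVERLAP", t_min=0.0, t_max=0.0)
sttg.add_edge("cam_02", "cam_05", kind="SOFT_ADJ", t_min=2.0, t_max=15.0)
sttg.add_edge("cam_05", "cam_11", kind="TRAVEL", t_min=20.0, t_max=90.0)

def transition_window(graph, cam_a, cam_b):
    """Return (t_min, t_max) for a direct edge, or None when no edge
    exists (the 'UNKNOWN (excluded)' case in the paper's checks)."""
    data = graph.get_edge_data(cam_a, cam_b)
    return (data["t_min"], data["t_max"]) if data else None
```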

If this is right

  • Current LLM backbones leave the benchmark far from solved, with peak scores of 0.383 on the spatial track and 0.590 on the temporal track.
  • Ablations show that removing domain-specific tools reduces accuracy by as much as 49.6 percentage points.
  • The three tracks isolate semantic perception, spatial reasoning, and temporal reasoning for separate measurement.
  • The 2,691 tasks across 14 scenarios supply a diverse, realistic testbed for evaluating agent behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same agentic structure could be adapted to other incremental search problems that involve partial observations, such as robotic exploration or medical record querying.
  • Stronger vision-language models would likely raise performance most on the semantic perception track.
  • Person search deployments may need to add explicit planning layers rather than relying on retrieval alone.

Load-bearing premise

The Spatio-Temporal Topology Graph provides accurate and sufficient grounding for the agent's planning and tool use across the 14 scenarios.

What would settle it

If agents achieve similar success rates when the Spatio-Temporal Topology Graph is removed from their input and they must operate on general knowledge alone, the specific contribution of this graph-based grounding would be falsified.
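Operationally this is a straightforward ablation: run the same task set twice, once with the graph and once with it withheld, and compare success rates. A hedged sketch; run_benchmark and its result object are assumptions, not the paper's harness.

```python
# Falsification test for the load-bearing premise: if the gap is near
# zero, the STTG grounding is not doing the work attributed to it.
# run_benchmark is a hypothetical harness returning a result object
# with a success_rate field.

def sttg_contribution(tasks, agent, sttg, run_benchmark):
    with_graph = run_benchmark(tasks, agent, sttg=sttg)
    without_graph = run_benchmark(tasks, agent, sttg=None)  # graph withheld
    return with_graph.success_rate - without_graph.success_rate
```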

Figures

Figures reproduced from arXiv: 2604.12762 by In So Kweon, Junmo Kim, Kwanyong Park, Myungchul Kim.

Figure 1
Figure 1: Overview. Left: An ARGOS agent interacts with an ambiguous witness through multi-turn dialogue, combining appearance, spatial, and temporal queries. Right: The four-module agent architecture (Analyst→Planner→Interviewer→Interpreter) forms an observe-think-act loop over the evaluation environment. view at source ↗
Figure 2
Figure 2: Left: STTG for a 16-camera factory environment. Nodes are grouped into zones by OVERLAP connectivity. Edge types: OVERLAP (blue), SOFT ADJ (orange), TRAVEL (gray); labels show median transition time. Right: 3D camera layout with sample imagery. view at source ↗
Figure 3
Figure 3: Cumulative success rate (SRt@T) vs. turn budget for Track 2 (left) and Track 3 (right), GPT-4o. The ARGOS Agent (blue) resolves tasks substantially faster than w/o Strategy (orange). w/o Spatial/Temporal Tool (red) stays near zero. view at source ↗
Figure 4
Figure 4: Track 3: with vs. without temporal tool. A single temporal check eliminates 16 of 19 candidates by verifying spatio-temporal transition feasibility against the STTG, demonstrating the tool's information density. view at source ↗
Figure 5
Figure 5: ARGOS benchmark construction pipeline. Starting from synchronized 16-camera video (MTMMC [21]), we construct per-person galleries (Sec. A.1), extract 24 visual attributes (Sec. A.2), and build a spatio-temporal transition graph (Sec. A.3). These three components feed into the task generation module (Sec. A.4), which produces 989 (Track 1), 550 (Track 2), and 1,152 (Track 3) task instances. … view at source ↗
Figure 7
Figure 7: STTG verification tools. (a) Trajectory Inspector: a Gantt chart shows one person's movement across cameras over time (top); selecting a specific camera and frame displays the original video frame with GT bounding-box overlay (bottom), confirming the person's identity at that location. (b) STTG Issue Dashboard: suspicious transitions are ranked by severity (top); a side-by-side view compares the exit frame… view at source ↗
Figure 8
Figure 8: Track 2 tool demonstration: agent with spatial tool (left) vs. without spatial tool (right) on task T2 s10 92 (Factory, Medium). The spatial tool enables disambiguation via location queries, providing an orthogonal axis when appearance attributes are ambiguous. view at source ↗
Figure 9
Figure 9: Overview of the three ARGOS tracks. Track 1 receives a completed dialogue for single-turn attribute parsing. Track 2 … view at source ↗
Figure 10
Figure 10: System architecture. Left: the ARGOS Agent consists of four LLM-driven modules. … view at source ↗
Figure 11
Figure 11: Information boundary of the ARGOS Agent. … view at source ↗
Figure 12
Figure 12: Tool registry. Each tool is executed by the environment… view at source ↗
Figure 13
Figure 13: Universal failure analysis. (a) Distribution of per-task model failure counts. (b) Universal failure rate by difficulty level. (c) Task outcome overview (all-correct, partial, universal). (d) Candidate reduction patterns in Track 2 universal failures: 50% narrow to exactly 1 candidate but predict incorrectly. view at source ↗
Figure 14
Figure 14: Qualitative comparison: ARGOS agent with LLM-guided strategy (left) vs. random question ordering (right) on task T2 s20 32 (Factory, Medium). Strategic attribute selection resolves the task in 3 turns; random ordering wastes 7 turns on uninformative attributes before reaching the same discriminative question. view at source ↗
Figure 15
Figure 15: Oracle (ground-truth attributes) vs. ARGOS agent on task T2 s42 21 (School, Hard). The oracle resolves in 2 turns with exact-match filtering. The agent parses NL responses with slight errors, causing 7 stalled turns and convergence to the wrong candidate. This gap highlights NL parsing accuracy as the primary bottleneck for future improvement. view at source ↗
Figure 16
Figure 16: STTG comparison. Left: Factory environment (16 cameras, 110 edges, 9 atomic zones). Right: University campus (16 cameras, 149 edges, 6 atomic zones). Edge types: OVERLAP (blue), SOFT ADJ (orange), TRAVEL (gray). The university environment has denser inter-zone connectivity due to its open outdoor layout. view at source ↗
read the original abstract

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces ARGOS, the first benchmark and framework to reformulate multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.

Significance. If the STTG validation and task construction hold up under scrutiny, ARGOS could be a significant contribution by establishing a new benchmark for evaluating agentic reasoning in realistic multi-camera settings that integrate perception, spatial planning, and temporal inference under information asymmetry. The progressive track design, multi-backbone experiments, and tool ablations provide concrete evidence of current LLM limitations and the value of domain tools, which are strengths that could drive follow-on work in agent frameworks for surveillance and search tasks.

major comments (2)
  1. [STTG description] The claim that transition times are 'empirically validated' lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.
  2. [Experiments section] Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.
minor comments (2)
  1. [Abstract] The acronym TWS is used without prior definition or expansion.
  2. [Figures] Diagrams of the agent workflow and STTG would benefit from clearer legends and explicit cross-references in the main text for improved readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address the major comments point-by-point below and plan to incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [STTG description] The claim that transition times are 'empirically validated' lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.

    Authors: We agree that additional details on the empirical validation process are essential to support the claims. In the revised version, we will include specifics on the data volume (number of person trajectories and time periods collected), error bounds from the validation, coverage of all camera pairs in the 14 scenarios, and robustness checks against variations in crowd density and time of day. These additions will be placed in the STTG description section to better justify its use as grounding for the agent. (revision: yes)

  2. Referee: [Experiments section] Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.

    Authors: We recognize that more comprehensive details on task construction and validation would strengthen the interpretation of the results. We will revise the Experiments section to provide a full description of how the 2,691 tasks were constructed across the three tracks, including the procedures used for validation (such as manual review or consistency checks), and include an error analysis breaking down agent failure modes. This will help demonstrate that the performance gaps are attributable to the challenges of agentic multi-camera reasoning. (revision: yes)

Circularity Check

0 steps flagged

No significant circularity in ARGOS benchmark and framework

full rationale

The paper introduces a new benchmark (2,691 tasks across 14 scenarios) and agentic framework for multi-camera person search, reformulating it as interactive reasoning grounded in an STTG whose transition times are described as empirically validated. No load-bearing step reduces a claimed prediction, uniqueness theorem, or first-principles result to its own inputs by construction; the reported LLM performance (TWS 0.383/0.590) and tool-ablation drops are direct experimental outcomes rather than tautological fits. The framework definition and benchmark construction remain independent of the evaluation results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the STTG as an accurate grounding structure and on the 2,691 tasks across 14 scenarios being representative of real-world multi-camera person search under information asymmetry.

axioms (1)
  • domain assumption: The Spatio-Temporal Topology Graph (STTG) accurately encodes camera connectivity and empirically validated transition times.
    Invoked to ground the agent's spatial and temporal reasoning and tool invocation.
invented entities (2)
  • ARGOS agent · no independent evidence
    purpose: Interactive reasoning entity that plans questions and uses tools under information asymmetry
    Core new component of the framework for reformulating person search.
  • Spatio-Temporal Topology Graph (STTG) · no independent evidence
    purpose: Encoding spatial connectivity and temporal transition times between cameras
    Introduced to support grounded reasoning in the benchmark.

pith-pipeline@v0.9.0 · 5481 in / 1388 out tokens · 36258 ms · 2026-05-10T15:03:50.832146+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 7 canonical work pages · 6 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

  2. [2]

    On Evaluation of Embodied Navigation Agents

    Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.

  3. [3]

    Introducing Claude 4

    Anthropic. Introducing Claude 4. Online: https://www.anthropic.com/news/claude-4, 2025.

  4. [4]

    Chat-based person retrieval via dialogue-refined cross-modal alignment

    Yang Bai, Yucheng Ji, Min Cao, Jinqiao Wang, and Mang Ye. Chat-based person retrieval via dialogue-refined cross-modal alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3952–3962.

  5. [5]

    SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024.

  6. [6]

    InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024.

  7. [7]

    SpatialRGPT: Grounded spatial reasoning in vision-language models

    An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.

  8. [8]

    Visual dialog

    Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017.

  9. [9]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  10. [10]

    Interactive text-to-image retrieval with large language models: A plug-and-play approach

    Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. Interactive text-to-image retrieval with large language models: A plug-and-play approach. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 791–809, 2024.

  11. [11]

    Chatting makes perfect: Chat-based image retrieval

    Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Chatting makes perfect: Chat-based image retrieval. Advances in Neural Information Processing Systems, 36:61437–61449, 2023.

  12. [12]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.

  13. [13]

    LLaVA-ReID: Selective multi-image questioner for interactive person re-identification

    Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, and Xi Peng. LLaVA-ReID: Selective multi-image questioner for interactive person re-identification. arXiv preprint arXiv:2504.10174, 2025.

  14. [14]

    Person descriptions as eyewitness evidence

    Christian A. Meissner, Siegfried L. Sporer, and Jonathan W. Schooler. Person descriptions as eyewitness evidence. In The Handbook of Eyewitness Psychology: Volume II: Memory for People, pages 3–34. Psychology Press, 2007.

  15. [15]

    ChatReID: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models

    Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian, and Xiangyang Xue. ChatReID: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24245–24254, 2025.

  16. [16]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  17. [17]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.

  18. [18]

    Harnessing the power of MLLMs for transferable text-to-image person ReID

    Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, and Dapeng Tao. Harnessing the power of MLLMs for transferable text-to-image person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024.

  19. [19]

    YOLOv8: A novel object detection algorithm with enhanced performance and robustness

    Rejin Varghese and M Sambath. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International conference on advances in data engineering and intelligent computing systems (ADICS), pages 1–6. IEEE, 2024.

  20. [20]

    Person transfer GAN to bridge domain gap for person re-identification

    Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88.

  21. [21]

    MTMMC: A large-scale real-world multi-modal camera tracking benchmark

    Sanghyun Woo, Kwanyong Park, Inkyu Shin, Myungchul Kim, and In So Kweon. MTMMC: A large-scale real-world multi-modal camera tracking benchmark. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22335–22346, 2024.

  22. [22]

    SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

    Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025.

  23. [23]

    Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark

    Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In Proceedings of the 31st ACM international conference on multimedia, pages 4492–4501, 2023.

  24. [24]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.

  25. [25]

    Deep learning for person re-identification: A survey and outlook

    Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence, 44(6):2872–2893, 2021.

  26. [26]

    Scalable person re-identification: A benchmark

    Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.

  27. [27]

    Presence: Was the candidate observed at both cameras? If not: IMPOSSIBLE (NOT PRESENT)

  28. [28]

    Time ordering: Is the candidate's inter-camera gap < −5 s? If so: IMPOSSIBLE (TIME REVERSAL)

  29. [29]

    Edge existence: Does a direct STTG edge exist? If not: UNKNOWN (excluded)

  30. [30]

    almost at the same time,

    Plausibility (margin = 2.0): Is the gap too fast (< t_min/2.0) or too slow (> t_max × 2.0)? If so: IMPOSSIBLE. Otherwise: FEASIBLE. (Rules [27]–[30] are rendered as code in the sketch after this list.) Quality filter: Tasks must contain at least one candidate eliminated by genuine temporal reasoning (TIME REVERSAL or TOO SLOW), excluding trivial presence-based filtering. After applying the quality filter, 1,152 of 1,218 c...

  31. [31]

    It identifies which attributes have the highest elimination power among remaining candidates

    Analyst queries the gallery, computes attribute distributions over the current candidate set, and retrieves zone structure (Track 2). It identifies which attributes have the highest elimination power among remaining candidates.

  32. [32]

    Planner receives the Analyst's summary together with the full dialogue history and decides the next action: ask about an attribute, request a spatial description, or issue a temporal check.

  33. [33]

    I’m not sure

    Interviewer executes the chosen action by invoking the appropriate tool. For Track 3, temporal feasibility checking via check temporal (T5) is enforced as the mandatory first action. 4. Interpreter parses the witness's natural-language response into a canonical attribute value and applies the corresponding filter to update the candidate set. This sequential...
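The feasibility rules in anchors [27] through [30] form a four-step decision procedure. The sketch below is a hypothetical rendering, assuming a sightings table mapping each candidate to per-camera timestamps and a networkx-style graph; the rule order, the −5 s reversal threshold, and the 2.0 plausibility margin are taken from the anchor text. Per Figure 4, a single such check can eliminate 16 of 19 candidates in one turn.

```python
MARGIN = 2.0  # plausibility margin from anchor [30]

def check_temporal(candidate, cam_a, cam_b, sightings, sttg):
    # [27] Presence: was the candidate observed at both cameras?
    seen = sightings[candidate]
    if cam_a not in seen or cam_b not in seen:
        return "IMPOSSIBLE (NOT PRESENT)"

    gap = seen[cam_b] - seen[cam_a]  # inter-camera time gap in seconds

    # [28] Time ordering: a gap below -5 s is a time reversal
    if gap < -5.0:
        return "IMPOSSIBLE (TIME REVERSAL)"

    # [29] Edge existence: no direct STTG edge means the pair is excluded
    edge = sttg.get_edge_data(cam_a, cam_b)
    if edge is None:
        return "UNKNOWN (excluded)"

    # [30] Plausibility: too fast or too slow relative to the edge window
    if gap < edge["t_min"] / MARGIN:
        return "IMPOSSIBLE (TOO FAST)"
    if gap > edge["t_max"] * MARGIN:
        return "IMPOSSIBLE (TOO SLOW)"
    return "FEASIBLE"
```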