ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search
Pith reviewed 2026-05-10 15:03 UTC · model grok-4.3
The pith
ARGOS reformulates multi-camera person search as an interactive agent reasoning task with a spatio-temporal graph.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ARGOS is the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When).
What carries the argument
The Spatio-Temporal Topology Graph (STTG) that encodes camera connectivity and transition times to ground the agent's planning and tool use.
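The review does not reproduce the paper's STTG schema. As a rough sketch, assuming a directed graph whose edges carry empirically measured transition-time bounds per camera pair, it might look like the following; the names TransitionEdge, t_min, t_max, and the example cameras are all hypothetical, not the paper's actual data structure.

```python
from dataclasses import dataclass

# Hypothetical sketch only: the paper's actual STTG schema is not given in
# this review, so the (t_min, t_max) edge format and all names are assumed.
@dataclass(frozen=True)
class TransitionEdge:
    t_min: float  # fastest empirically observed transition time, in seconds
    t_max: float  # slowest empirically observed transition time, in seconds

# Directed adjacency map: (source camera, target camera) -> transition stats.
STTG = dict[tuple[str, str], TransitionEdge]

example_sttg: STTG = {
    ("cam_A", "cam_B"): TransitionEdge(t_min=12.0, t_max=45.0),
    ("cam_B", "cam_C"): TransitionEdge(t_min=30.0, t_max=120.0),
}

def transition_bounds(sttg: STTG, src: str, dst: str) -> TransitionEdge | None:
    """Return the empirical time bounds for a camera pair, or None when no
    direct edge exists (temporal checks must then abstain)."""
    return sttg.get((src, dst))
```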
If this is right
- Current LLM backbones leave the benchmark far from solved, with peak scores of 0.383 on the spatial track and 0.590 on the temporal track.
- Ablations show that removing domain-specific tools reduces accuracy by as much as 49.6 percentage points.
- The three tracks isolate semantic perception, spatial reasoning, and temporal reasoning for separate measurement.
- The 2,691 tasks across 14 scenarios supply a diverse, realistic testbed for evaluating agent behavior.
Where Pith is reading between the lines
- The same agentic structure could be adapted to other incremental search problems that involve partial observations, such as robotic exploration or medical record querying.
- Stronger vision-language models would likely raise performance most on the semantic perception track.
- Person search deployments may need to add explicit planning layers rather than relying on retrieval alone.
Load-bearing premise
The Spatio-Temporal Topology Graph provides accurate and sufficient grounding for the agent's planning and tool use across the 14 scenarios.
What would settle it
If agents achieve similar success rates when the Spatio-Temporal Topology Graph is removed from their input and they must operate on general knowledge alone, the specific contribution of this graph-based grounding would be falsified.
read the original abstract
We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.
Significance. If the STTG validation and task construction hold up under scrutiny, ARGOS could be a significant contribution: it establishes a new benchmark for evaluating agentic reasoning in realistic multi-camera settings that integrate perception, spatial planning, and temporal inference under information asymmetry. The progressive track design, multi-backbone experiments, and tool ablations provide concrete evidence of current LLM limitations and of the value of domain tools; these strengths could drive follow-on work in agent frameworks for surveillance and search tasks.
major comments (2)
- [STTG description] The claim that transition times are "empirically validated" lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.
- [Experiments section] Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.
minor comments (2)
- [Abstract] The acronym TWS is used without prior definition or expansion.
- [Figures] Diagrams of the agent workflow and the STTG would benefit from clearer legends and explicit cross-references in the main text for improved readability.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback. We address the major comments point-by-point below and plan to incorporate revisions to strengthen the manuscript.
read point-by-point responses
- Referee [STTG description]: The claim that transition times are "empirically validated" lacks details on data volume, error bounds, camera-pair coverage, and robustness to dynamic conditions such as crowds or time-of-day variations. This is load-bearing for the central claim that the STTG provides accurate and sufficient grounding for agent planning and tool use across the 14 scenarios.
Authors: We agree that additional details on the empirical validation process are essential to support the claims. In the revised version, we will include specifics on the data volume (number of person trajectories and time periods collected), error bounds from the validation, coverage of all camera pairs in the 14 scenarios, and robustness checks against variations in crowd density and time of day. These additions will be placed in the STTG description section to better justify its use as grounding for the agent. Revision: yes.
- Referee [Experiments section]: Results are reported on 2,691 tasks without full details on task construction, validation procedures for the task set, or error analysis of agent failures. This weakens support for interpreting the low TWS scores (0.383/0.590) as evidence of reasoning difficulty rather than potential issues in task design or STTG priors.
Authors: We recognize that more comprehensive details on task construction and validation would strengthen the interpretation of the results. We will revise the Experiments section to provide a full description of how the 2,691 tasks were constructed across the three tracks, including the procedures used for validation (such as manual review or consistency checks), and include an error analysis breaking down agent failure modes. This will help demonstrate that the performance gaps are attributable to the challenges of agentic multi-camera reasoning. Revision: yes.
Circularity Check
No significant circularity in ARGOS benchmark and framework
full rationale
The paper introduces a new benchmark (2,691 tasks across 14 scenarios) and agentic framework for multi-camera person search, reformulating it as interactive reasoning grounded in an STTG whose transition times are described as empirically validated. No load-bearing step reduces a claimed prediction, uniqueness theorem, or first-principles result to its own inputs by construction; the reported LLM performance (TWS 0.383/0.590) and tool-ablation drops are direct experimental outcomes rather than tautological fits. The framework definition and benchmark construction remain independent of the evaluation results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The Spatio-Temporal Topology Graph (STTG) accurately encodes camera connectivity and empirically validated transition times.
invented entities (2)
- ARGOS agent: no independent evidence
- Spatio-Temporal Topology Graph (STTG): no independent evidence
Reference graph
Works this paper leans on
- [1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
- [3] Anthropic. Introducing Claude 4. Online: https://www.anthropic.com/news/claude-4, 2025.
- [4] Yang Bai, Yucheng Ji, Min Cao, Jinqiao Wang, and Mang Ye. Chat-based person retrieval via dialogue-refined cross-modal alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3952–3962.
- [5] Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024.
- [6] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
- [7] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems, 37:135062–135093, 2024.
- [8] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335, 2017.
- [9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [10] Saehyung Lee, Sangwon Yu, Junsung Park, Jihun Yi, and Sungroh Yoon. Interactive text-to-image retrieval with large language models: A plug-and-play approach. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 791–809, 2024.
- [11] Matan Levy, Rami Ben-Ari, Nir Darshan, and Dani Lischinski. Chatting makes perfect: Chat-based image retrieval. Advances in Neural Information Processing Systems, 36:61437–61449, 2023.
- [12] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. arXiv preprint arXiv:2308.03688, 2023.
- [13] Yiding Lu, Mouxing Yang, Dezhong Peng, Peng Hu, Yijie Lin, and Xi Peng. LLaVA-ReID: Selective multi-image questioner for interactive person re-identification. arXiv preprint arXiv:2504.10174, 2025.
- [14] Christian A. Meissner, Siegfried L. Sporer, and Jonathan W. Schooler. Person descriptions as eyewitness evidence. In The Handbook of Eyewitness Psychology: Volume II: Memory for People, pages 3–34. Psychology Press, 2007.
- [15] Ke Niu, Haiyang Yu, Mengyang Zhao, Teng Fu, Siyang Yi, Wei Lu, Bin Li, Xuelin Qian, and Xiangyang Xue. ChatReID: Open-ended interactive person retrieval via hierarchical progressive tuning for vision language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24245–24254, 2025.
- [16] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [17] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. OpenAI GPT-5 system card. arXiv preprint arXiv:2601.03267.
- [18] Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, and Dapeng Tao. Harnessing the power of MLLMs for transferable text-to-image person ReID. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17127–17137, 2024.
- [19] Rejin Varghese and M Sambath. YOLOv8: A novel object detection algorithm with enhanced performance and robustness. In 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), pages 1–6. IEEE, 2024.
- [20] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 79–88, 2018.
- [21] Sanghyun Woo, Kwanyong Park, Inkyu Shin, Myungchul Kim, and In So Kweon. MTMMC: A large-scale real-world multi-modal camera tracking benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22335–22346, 2024.
- [22] Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, and Yunjian Zhang. SpatialBench: Benchmarking multimodal large language models for spatial cognition. arXiv preprint arXiv:2511.21471, 2025.
- [23] Shuyu Yang, Yinan Zhou, Zhedong Zheng, Yaxiong Wang, Li Zhu, and Yujiao Wu. Towards unified text-based person retrieval: A large-scale multi-attribute and language search benchmark. In Proceedings of the 31st ACM International Conference on Multimedia, pages 4492–4501, 2023.
- [24] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022.
- [25] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2021.
- [26] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
Supplementary material
Multi-Target Multi-Camera (MTMC) Videos & Tra...
Temporal feasibility check, applied per candidate between two cameras (a code sketch follows this list):
- Presence: Was the candidate observed at both cameras? If not: IMPOSSIBLE (NOT PRESENT).
- Time ordering: Is the candidate's inter-camera gap < −5 s? If so: IMPOSSIBLE (TIME REVERSAL).
- Edge existence: Does a direct STTG edge exist? If not: UNKNOWN (excluded).
- Plausibility (margin = 2.0): Is the gap too fast (< t_min / 2.0) or too slow (> t_max × 2.0)? If so: IMPOSSIBLE. Otherwise: FEASIBLE.
Quality filter: Tasks must contain at least one candidate eliminated by genuine temporal reasoning (TIME REVERSAL or TOO SLOW), excluding trivial presence-based filtering. After applying the quality filter, 1,152 of 1,218 c...
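These four rules translate almost directly into code. Below is a minimal sketch, assuming observations are plain timestamps and the STTG edge is a (t_min, t_max) pair as in the earlier sketch; only the −5 s reversal tolerance and the 2.0 plausibility margin come from the extract, while the function signature and the "TOO FAST" verdict label are assumptions.

```python
# Sketch of the temporal feasibility rules listed above. The -5 s reversal
# tolerance and the 2.0x plausibility margin come from the extract; the
# function name mirrors the check_temporal (T5) tool mentioned in the
# agent-role extract below, but its signature is assumed.
MARGIN = 2.0
REVERSAL_TOLERANCE_S = -5.0

def check_temporal(t_src, t_dst, edge):
    """Classify a candidate's transition between two cameras.

    t_src, t_dst: observation timestamps (seconds) at the source and target
    cameras, or None if the candidate was not observed there.
    edge: (t_min, t_max) transition bounds from the STTG, or None if no
    direct edge connects the two cameras.
    """
    # 1. Presence: the candidate must be observed at both cameras.
    if t_src is None or t_dst is None:
        return "IMPOSSIBLE (NOT PRESENT)"
    gap = t_dst - t_src
    # 2. Time ordering: a gap below -5 s is a time reversal.
    if gap < REVERSAL_TOLERANCE_S:
        return "IMPOSSIBLE (TIME REVERSAL)"
    # 3. Edge existence: without a direct STTG edge, abstain.
    if edge is None:
        return "UNKNOWN (excluded)"
    t_min, t_max = edge
    # 4. Plausibility: the gap must fall within the margin-widened bounds.
    if gap < t_min / MARGIN:
        return "IMPOSSIBLE (TOO FAST)"
    if gap > t_max * MARGIN:
        return "IMPOSSIBLE (TOO SLOW)"
    return "FEASIBLE"

# A 20 s gap on an edge with bounds (12, 45) is feasible; a negative gap
# beyond the tolerance is eliminated as a time reversal.
assert check_temporal(100.0, 120.0, (12.0, 45.0)) == "FEASIBLE"
assert check_temporal(120.0, 100.0, (12.0, 45.0)) == "IMPOSSIBLE (TIME REVERSAL)"
```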
Agent roles (sequential pipeline):
1. Analyst queries the gallery, computes attribute distributions over the current candidate set, and retrieves zone structure (Track 2). It identifies which attributes have the highest elimination power among remaining candidates (one plausible scoring rule is sketched after this list).
2. Planner receives the Analyst's summary together with the full dialogue history and decides the next action: ask about an attribute, request a spatial description, or issue a temporal check.
3. Interviewer executes the chosen action by invoking the appropriate tool. For Track 3, temporal feasibility checking via check_temporal (T5) is enforced as the mandatory first action.
4. Interpreter parses the witness's natural-language response into a canonical attribute value and applies the corresponding filter to update the candidate set. This sequentia...
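The extract does not define "elimination power" precisely. One natural reading, sketched below purely as an assumption, scores an attribute by the expected fraction of candidates a truthful answer would eliminate, which reduces to the Gini impurity of the attribute's value distribution.

```python
from collections import Counter

def elimination_power(candidates, attribute):
    """Expected fraction of candidates eliminated by asking about one
    attribute, assuming the target is uniform over the candidate set and
    the witness truthfully reports the target's value.

    A value with frequency p keeps a fraction p of candidates and occurs
    with probability p, so the expected surviving fraction is sum(p**2);
    the eliminated fraction is 1 - sum(p**2), the Gini impurity of the
    attribute's value distribution. This scoring rule is an assumption,
    not the paper's definition.
    """
    values = [c[attribute] for c in candidates]
    n = len(values)
    return 1.0 - sum((k / n) ** 2 for k in Counter(values).values())

# Toy gallery: jacket color splits the set 2/2, while hat is uninformative.
gallery = [
    {"jacket": "red", "hat": "none"},
    {"jacket": "red", "hat": "none"},
    {"jacket": "blue", "hat": "none"},
    {"jacket": "blue", "hat": "none"},
]
best = max(["jacket", "hat"], key=lambda a: elimination_power(gallery, a))
assert best == "jacket"  # asking about the jacket halves the candidate set
```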