UrbanClipAtlas: A Visual Analytics Framework for Event and Scene Retrieval in Urban Videos
Pith reviewed 2026-05-10 09:56 UTC · model grok-4.3
The pith
UrbanClipAtlas aligns LLM text outputs with video clips and object detections to support reliable event retrieval in long urban recordings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
URBANCLIPATLAS combines retrieval-augmented generation, taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification.
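The clip-level indexing and semantic retrieval step described above can be sketched as follows. This is a minimal toy illustration, not the paper's actual pipeline: the captions, clip IDs, and the bag-of-words embedding are placeholders standing in for the vision-language captioner and semantic text encoder the system presumably uses.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words embedding; the real system would use a learned
    # text encoder over VLM-generated clip descriptions.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Index: one generated caption per fixed-length clip (hypothetical captions).
index = {
    "clip_001": "a pedestrian crosses the intersection against the signal",
    "clip_002": "two cars wait at a red light",
    "clip_003": "a cyclist swerves to avoid a turning truck",
}

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank clips by caption similarity to the analyst's query.
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, embed(index[c])), reverse=True)
    return ranked[:k]

print(retrieve("pedestrian crossing against signal"))  # → ['clip_001', 'clip_002']
```

In the full system, the retrieved clips would then feed the retrieval-augmented generation step as context for the chat interface.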
What carries the argument
The taxonomy-aware knowledge graph that maps entities and relations from LLM answers onto a domain-specific taxonomy while aligning them with detected objects and trajectories for visual grounding and verification.
Load-bearing premise
The vision-language model produces sufficiently accurate clip descriptions and the taxonomy-aware knowledge graph correctly aligns entities with detected objects and trajectories.
What would settle it
Running the case studies on the StreetAware dataset and observing that retrieved clips fail to match the events described in the chat outputs or that aligned objects and trajectories do not correspond to the textual reasoning.
Original abstract
Extracting actionable insights from long-duration urban videos is often labor-intensive: analysts must manually sift through raw footage to pinpoint target events or uncover broader behavioral trends. In this work, we present URBANCLIPATLAS, a visual analytics system for exploring long urban videos recorded at street intersections. URBANCLIPATLAS combines retrieval-augmented generation (RAG), taxonomy-aware entity extraction, and video grounding to support event retrieval and interpretation. The system segments extended recordings into short clips, generates textual descriptions with a vision-language model, and indexes them for semantic retrieval. A knowledge graph maps entities and relations from LLM answers onto a domain-specific taxonomy and aligns them with detected objects and trajectories to support visual grounding and verification. URBANCLIPATLAS supports scene retrieval through an augmented chat-based interface and improves scene interpretation by tightly aligning textual outputs with video evidence. This design strengthens the connection between textual reasoning and visual evidence, reducing the effort required to validate model outputs and refine hypotheses. We demonstrate the usefulness of URBANCLIPATLAS on the StreetAware dataset through two case studies involving hazardous scenarios and crossing dynamics at street intersections. URBANCLIPATLAS helps analysts reason about safety- and mobility-related patterns across large urban video collections.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents URBANCLIPATLAS, a visual analytics framework for event and scene retrieval in long urban videos from street intersections. It segments recordings into clips, generates descriptions via vision-language models, indexes them for semantic search using retrieval-augmented generation, builds a taxonomy-aware knowledge graph to extract and align entities/relations with detected objects and trajectories, and offers an augmented chat interface for querying and grounding. Usefulness is shown via two qualitative case studies on the StreetAware dataset involving hazardous scenarios and crossing dynamics.
Significance. If the alignment between textual reasoning and visual evidence holds, the system could meaningfully reduce analyst effort in exploring large urban video collections for safety and mobility insights, advancing practical visual analytics tools in HCI. The integration of domain taxonomy with KG grounding and RAG is a targeted strength for interpretability in video-based urban analysis.
major comments (2)
- [Case studies / demonstration] Case studies (as described in the abstract and demonstration sections): the two case studies consist solely of narrative walkthroughs with no quantitative metrics on VLM description accuracy, entity-trajectory alignment error rates, retrieval precision/recall, grounding success, or user-study measures of reduced validation effort, leaving the central claim that the system 'improves scene interpretation by tightly aligning textual outputs with video evidence' and 'reducing the effort required to validate model outputs' unsupported.
- [Abstract / System Overview] Abstract and system description: the assumption that the taxonomy-aware knowledge graph 'correctly aligns entities with detected objects and trajectories' and that VLM outputs are sufficiently faithful is presented without any validation data or error analysis, despite the skeptic note highlighting that VLMs routinely hallucinate attributes and trajectories.
minor comments (2)
- Add a dedicated evaluation section with at least basic quantitative measures (e.g., alignment accuracy on a sample, retrieval metrics) or a small user study to make the claims falsifiable.
- Clarify the exact pipeline for how LLM answers are mapped onto the KG and then grounded back to video timestamps/objects.
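One plausible form of the grounding step the second comment asks to see clarified is matching an entity extracted from an LLM answer to detected object tracks by taxonomy class and temporal overlap. The names, data shapes, and matching rule below are assumptions for illustration, not the paper's documented pipeline.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    track_id: int
    label: str       # taxonomy class assigned by the object detector
    t_start: float   # seconds into the clip
    t_end: float

def overlap(a0: float, a1: float, b0: float, b1: float) -> float:
    # Length of the intersection of two time intervals (0 if disjoint).
    return max(0.0, min(a1, b1) - max(a0, b0))

def ground_entity(entity_label: str, t0: float, t1: float,
                  detections: list[Detection]) -> list[Detection]:
    """Return detections whose taxonomy class matches the extracted entity
    and whose track overlaps the entity's mentioned time window."""
    return [d for d in detections
            if d.label == entity_label and overlap(t0, t1, d.t_start, d.t_end) > 0]

dets = [Detection(7, "pedestrian", 2.0, 9.5),
        Detection(8, "car", 0.0, 12.0),
        Detection(9, "pedestrian", 11.0, 14.0)]

# Entity "pedestrian" mentioned for seconds 1-10 of the clip:
matches = ground_entity("pedestrian", 1.0, 10.0, dets)
print([d.track_id for d in matches])  # → [7]
```

Making this mapping explicit (class matching, temporal tolerance, tie-breaking between tracks) is exactly the kind of detail the revision should spell out.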
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on URBANCLIPATLAS. We address the concerns about the qualitative nature of the case studies and the lack of validation for system assumptions below, and will revise the manuscript accordingly to strengthen the presentation of evidence.
Point-by-point responses
Referee: [Case studies / demonstration] Case studies (as described in the abstract and demonstration sections): the two case studies consist solely of narrative walkthroughs with no quantitative metrics on VLM description accuracy, entity-trajectory alignment error rates, retrieval precision/recall, grounding success, or user-study measures of reduced validation effort, leaving the central claim that the system 'improves scene interpretation by tightly aligning textual outputs with video evidence' and 'reducing the effort required to validate model outputs' unsupported.
Authors: We agree that the current case studies are qualitative narrative walkthroughs and that this leaves the central claims about improved interpretation and reduced validation effort without quantitative backing. The demonstrations were chosen to illustrate the integrated workflow on real urban video scenarios from the StreetAware dataset. In the revised manuscript we will add quantitative support, including retrieval precision/recall on a set of analyst-style queries and manual verification of entity-trajectory alignment accuracy on sampled clips, to better substantiate the benefits.
Revision: yes
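The retrieval precision/recall measurement promised here is straightforward once a set of relevant clips per query has been hand-labeled. A minimal sketch, with hypothetical clip IDs and labels:

```python
def precision_recall(retrieved: list[str], relevant: list[str]) -> tuple[float, float]:
    # Precision: fraction of retrieved clips that are relevant.
    # Recall: fraction of relevant clips that were retrieved.
    retrieved_s, relevant_s = set(retrieved), set(relevant)
    tp = len(retrieved_s & relevant_s)
    precision = tp / len(retrieved_s) if retrieved_s else 0.0
    recall = tp / len(relevant_s) if relevant_s else 0.0
    return precision, recall

# e.g. the system returns 4 clips, 3 of which annotators marked relevant,
# out of 5 relevant clips overall:
p, r = precision_recall(["c1", "c2", "c3", "c9"], ["c1", "c2", "c3", "c4", "c5"])
print(p, r)  # → 0.75 0.6
```

Averaging these over a query set would give the basic evaluation the referee asks for.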
Referee: [Abstract / System Overview] Abstract and system description: the assumption that the taxonomy-aware knowledge graph 'correctly aligns entities with detected objects and trajectories' and that VLM outputs are sufficiently faithful is presented without any validation data or error analysis, despite the skeptic note highlighting that VLMs routinely hallucinate attributes and trajectories.
Authors: We acknowledge that the abstract and system overview present the alignment mechanism and VLM usage without accompanying validation data or error analysis. The knowledge graph is designed to map extracted entities to detected objects for grounding, but we did not include a dedicated quantitative assessment of alignment accuracy or hallucination rates. In revision we will moderate the language in the abstract and system description to avoid implying guaranteed correctness, add an explicit limitations discussion on VLM hallucinations, and incorporate preliminary alignment verification results drawn from the case-study clips.
Revision: partial
Circularity Check
No circularity: system description without derivations, fits, or self-referential predictions
Full rationale
The paper is a system description of URBANCLIPATLAS that integrates external components (VLMs for clip descriptions, LLMs for RAG, object detection for trajectories, and a taxonomy-aware KG) to support retrieval and grounding in urban videos. No mathematical derivations, parameter fitting, uniqueness theorems, or predictions appear in the provided text; claims about improved interpretation and reduced validation effort are supported only by two qualitative case studies on StreetAware rather than any internal chain that reduces to its own inputs by construction. Self-citations are absent from the load-bearing steps.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Vision-language models generate textual descriptions of video clips that are accurate enough for semantic retrieval.
- Domain assumption: A domain-specific taxonomy can be used to map LLM-extracted entities and relations into a knowledge graph that aligns with detected objects and trajectories.
invented entities (1)
- Taxonomy-aware knowledge graph (no independent evidence)