Vega-Video: Integrating Video into the Grammar of Graphics

Arnab Nandi; Dominik Winecki

arxiv: 2604.24958 · v1 · submitted 2026-04-27 · 💻 cs.HC


Pith reviewed 2026-05-08 02:10 UTC · model grok-4.3

classification 💻 cs.HC

keywords: video visualization · declarative grammar · data exploration · synchronization · annotation · transformation · interactive visualization


The pith

Vega's grammar incorporates video through three visualization classes supported by a split-signal architecture that masks timing delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video can join conventional data in Vega visualizations by defining synchronization, annotation, and transformation as first-class declarative abstractions. This integration matters because it lets analysts explore mixed-modality datasets interactively without switching tools or accepting lag. A split-signal design keeps Vega's instantaneous dataflow intact while hiding video player state changes, and compile-time checks plus video-on-demand repurposing deliver measurable speedups on scrubbing and long-form transformations.

Core claim

Video data visualization falls into three classes (synchronization, annotation, and transformation) that integrate directly into Vega's declarative grammar. A split-signal architecture reconciles video player state with Vega's instantaneous dataflow by masking update delays. Compile-time detection of continuous scrubbing enables encoding-aware optimizations that improve responsiveness by up to 4x, and repurposing VOD protocols yields sub-200 ms updates even for multi-hour videos.

What carries the argument

The split-signal architecture, which separates signals to isolate video state updates from Vega's declarative dataflow so that semantics remain unchanged while delays stay hidden from the user.
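To make the idea of separated signals concrete, here is a minimal sketch assuming one logical "video time" is split into a value the dataflow reads immediately and a value the player reports later. The class name `SplitVideoTime` and its methods are hypothetical illustrations and are not taken from the paper or from Vega.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SplitVideoTime:
    """Hypothetical split of one logical video time into two signals."""

    requested: float = 0.0  # what the dataflow sees; updates instantly
    presented: float = 0.0  # what the player actually shows; lags behind
    _listeners: List[Callable[[float], None]] = field(default_factory=list)

    def on_change(self, listener: Callable[[float], None]) -> None:
        """Register a dependent (for example, a chart cursor)."""
        self._listeners.append(listener)

    def request(self, t: float) -> None:
        """An interaction asks for time t; dependents update right away."""
        self.requested = t
        for listener in self._listeners:
            listener(t)

    def seek_completed(self, t: float) -> None:
        """The (simulated) player reports that a seek to t has landed."""
        self.presented = t

    @property
    def pending(self) -> bool:
        """True while the player has not caught up with the request."""
        return self.requested != self.presented


signal = SplitVideoTime()
cursor_positions: List[float] = []
signal.on_change(cursor_positions.append)

signal.request(12.5)         # chart cursor moves immediately
still_waiting = signal.pending
signal.seek_completed(12.5)  # the player catches up later
```

In this sketch the linked chart never waits on the player: the delay is visible only through `pending`, which a renderer could use to decide what the video element shows in the meantime.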

If this is right

Mixed conventional and video datasets become explorable in a single declarative specification.
Scrubbing interactions gain up to 4x better responsiveness through compile-time encoding-aware tuning.
Real-time video transformations deliver updates in under 200 ms even when source videos are hours long.
Vega visualizations can now present synchronized, annotated, or transformed video alongside other data marks.
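The compile-time detection mentioned in the list above can be pictured as a static pass over a specification. The following is a rough heuristic sketch, assuming that a signal driving the video counts as continuously scrubbed when it is bound to a range input or updated from move events. The function name, the choice of events, and the idea of passing the signal name explicitly are all assumptions for illustration; the paper's actual detection rule is not described on this page.

```python
from typing import Iterator

# Event types treated as "continuous" in this sketch (an assumption).
CONTINUOUS_EVENTS = ("mousemove", "touchmove")


def _event_strings(events) -> Iterator[str]:
    """Yield event descriptions from common shapes of a handler's `events`."""
    if isinstance(events, str):
        yield events
    elif isinstance(events, dict):
        if "type" in events:
            yield str(events["type"])
    elif isinstance(events, list):
        for item in events:
            yield from _event_strings(item)


def is_continuously_scrubbed(spec: dict, signal_name: str) -> bool:
    """Statically check whether `signal_name` can change continuously."""
    for signal in spec.get("signals", []):
        if signal.get("name") != signal_name:
            continue
        bind = signal.get("bind")
        if isinstance(bind, dict) and bind.get("input") == "range":
            return True  # a slider produces a stream of values while dragged
        for handler in signal.get("on", []):
            selectors = _event_strings(handler.get("events", ""))
            if any(name in sel for sel in selectors for name in CONTINUOUS_EVENTS):
                return True
    return False


slider_spec = {
    "signals": [
        {"name": "video_time", "value": 0,
         "bind": {"input": "range", "min": 0, "max": 60}}
    ]
}
click_spec = {
    "signals": [
        {"name": "video_time", "value": 0,
         "on": [{"events": "click", "update": "x()"}]}
    ]
}
```

Here `is_continuously_scrubbed(slider_spec, "video_time")` is true and the same call on `click_spec` is false, so only the first specification would be a candidate for scrubbing-specific tuning.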

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three-class breakdown could apply to other temporal media such as audio tracks.
Compile-time interaction detection might extend to additional Vega signals beyond scrubbing.
Performance gains from VOD repurposing suggest similar protocol tricks could help other streaming visualization backends.

Load-bearing premise

That video timing delays can be masked without breaking Vega's declarative semantics or instantaneous dataflow across real interactive workloads.

What would settle it

A continuous-scrubbing session in which measured responsiveness falls well short of the claimed up-to-4x improvement, or a transformation on a multi-hour video whose update latency exceeds the 200 ms bound, despite the optimizations being applied.
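A check of this kind reduces to two small computations over recorded latencies. The sketch below assumes latencies are logged in milliseconds and that medians are an acceptable summary; the helper names and the sample numbers are placeholders for illustration, not measurements or metrics from the paper.

```python
import statistics
from typing import Sequence


def speedup(baseline_ms: Sequence[float], optimized_ms: Sequence[float]) -> float:
    """Ratio of median baseline latency to median optimized latency."""
    return statistics.median(baseline_ms) / statistics.median(optimized_ms)


def fraction_over_bound(latencies_ms: Sequence[float], bound_ms: float = 200.0) -> float:
    """Share of updates whose latency exceeds the bound."""
    return sum(1 for x in latencies_ms if x > bound_ms) / len(latencies_ms)


# Placeholder values only; replace with latencies from a real session.
baseline_scrub_ms = [400.0, 420.0, 380.0]
optimized_scrub_ms = [100.0, 110.0, 90.0]
transform_update_ms = [150.0, 180.0, 210.0, 190.0]

scrub_speedup = speedup(baseline_scrub_ms, optimized_scrub_ms)
over_bound = fraction_over_bound(transform_update_ms)
```

Because the claim is phrased as "up to 4x", a single session below 4x does not refute it on its own; the bound on update latency is the sharper test, since any nonzero `over_bound` on a realistic workload would be a counterexample.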

Figures

Figures reproduced from arXiv: 2604.24958 by Arnab Nandi, Dominik Winecki.

**Figure 1.** A partial Vega visualization syncing time-series data with a video (…

**Figure 2.** Frame annotation specification using our proposed grammar (b), as well as the equivalent Python + Supervision code (c), for drawing …

**Figure 3.** Brushing and linking with a video compilation. A brush selection (b) updates both a linked plot and a linked video compilation. A marker on …

**Figure 4.** A Vega-Lite specification for brushing a histogram and linking a …

**Figure 5.** Conventional and video visualization synchronization architecture. Rewritten specification from …

**Figure 6.** Seeking methods during scrubbing. The black line represents the …

**Figure 7.** Keyframe search region during scrubbing. The solid line shows a …

**Figure 8.** Video transformation through VOD manifest rewrites; showing the visualization in …

**Figure 10.** Time to transform a video after a filter event.
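The captions for Figures 6 and 7 refer to seeking methods and a keyframe search region during scrubbing, but this page does not reproduce the method itself. As general background, seeking to a keyframe is cheaper than seeking to an arbitrary frame because inter-coded frames must be decoded from an earlier keyframe. The sketch below is a generic illustration of keyframe-aware seeking under that assumption, not the authors' algorithm; the function names and the `tolerance` parameter are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence


def preceding_keyframe(keyframes: Sequence[float], target: float) -> float:
    """Latest keyframe timestamp <= target (keyframes sorted ascending)."""
    i = bisect_right(keyframes, target)
    return keyframes[i - 1] if i else keyframes[0]


def scrub_seek_target(
    keyframes: Sequence[float], target: float, scrubbing: bool, tolerance: float
) -> float:
    """Pick where to seek: snap to a nearby keyframe only while scrubbing."""
    if not scrubbing:
        return target  # the scrub has ended, so show the exact frame
    kf = preceding_keyframe(keyframes, target)
    # Snap only if the keyframe is close enough behind the requested time.
    return kf if 0 <= target - kf <= tolerance else target


keyframes = [0.0, 2.0, 4.0, 6.0]
during_scrub = scrub_seek_target(keyframes, 4.7, scrubbing=True, tolerance=1.0)
after_scrub = scrub_seek_target(keyframes, 4.7, scrubbing=False, tolerance=1.0)
```

With these inputs the scrub-time seek lands on the keyframe at 4.0 while the final seek goes to 4.7. The trade-off is an approximate frame during motion in exchange for less decoding work per update.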

Original abstract

Video data is increasingly used alongside conventional data for interactive data exploration, necessitating interfaces for exploring and presenting mixed-modality data. However, integrating video into visualizations remains difficult due to its distinct paradigms and inherent performance challenges. We identify three classes of video data visualization - synchronization, annotation, and transformation - and integrate them into the Vega declarative grammar. We show that these abstractions enable high-performance implementation. To reconcile Vega's instantaneous dataflow with video player state, we introduce a split-signal architecture that preserves declarative semantics while masking video update delays. We detect continuous scrubbing interactions at compile time to apply encoding-aware optimizations that improve responsiveness by up to 4x. We also repurpose VOD protocols to transform videos in real time, delivering sub-200ms updates even on multi-hour-long compilations. These contributions enable seamless integration of conventional and video data visualization.
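One way to picture repurposing a VOD protocol for transformation is to rewrite a playlist so that it references only the segments a filter selects, leaving the media files untouched. The sketch below uses HTTP Live Streaming (RFC 8216) media playlist tags for illustration. Whether the system uses HLS or DASH, and whether it works at segment granularity, is not stated on this page, so the function `build_playlist` and its inputs are assumptions.

```python
from math import ceil
from typing import List, Sequence, Tuple

Segment = Tuple[str, float, float]  # (uri, start_seconds, duration_seconds)
TimeRange = Tuple[float, float]     # (lo_seconds, hi_seconds)


def build_playlist(segments: Sequence[Segment], keep: Sequence[TimeRange]) -> str:
    """Build an HLS VOD media playlist from segments overlapping `keep`."""
    chosen: List[Segment] = [
        seg for seg in segments
        if any(seg[1] < hi and seg[1] + seg[2] > lo for lo, hi in keep)
    ]
    # The target duration must be an integer no smaller than any segment.
    target = ceil(max((dur for _, _, dur in chosen), default=1))
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    prev_end = None
    for uri, start, dur in chosen:
        # Mark a jump in the source timeline so the player resets timestamps.
        if prev_end is not None and abs(start - prev_end) > 1e-6:
            lines.append("#EXT-X-DISCONTINUITY")
        lines.append(f"#EXTINF:{dur:.3f},")
        lines.append(uri)
        prev_end = start + dur
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines) + "\n"


source = [("seg0.ts", 0, 4), ("seg1.ts", 4, 4), ("seg2.ts", 8, 4), ("seg3.ts", 12, 4)]
playlist = build_playlist(source, keep=[(0, 3), (9, 11)])
```

The rewrite touches only a small text file, which is why this style of transformation can stay fast regardless of how long the underlying video is: the cost scales with the number of segments listed, not with the amount of media re-encoded.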

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Vega-Video adds three video visualization classes and a split-signal architecture to Vega with reported 4x responsiveness gains, though the architecture's handling of reactivity in mixed cases needs verification.


The paper's core advance is extending Vega's grammar to handle video through three defined classes (synchronization, annotation, and transformation), with a split-signal architecture that reconciles video delays with instantaneous declarative updates. The authors do this well: they detect scrubbing patterns at compile time to apply encoding optimizations that improve responsiveness by up to 4x, and they adapt VOD protocols so that real-time video transformations update in under 200 ms even on very long files. These are practical wins for mixed-modality visualizations, and the abstractions feel like a natural fit for how people actually use video in exploration tasks.

Where it could be tighter is the split-signal architecture's ability to fully hide video state without occasional reactivity glitches in annotation or synchronization scenarios. The abstract states that it preserves semantics, but if the full text offers only a high-level description rather than formal invariants or counterexample-free cases, that part might invite questions from referees. The performance numbers are promising but would benefit from clearer baseline descriptions.

This work is for people building or extending visualization grammars and for those dealing with video in data analysis tools. A reader focused on systems for interactive visualization gets the most out of the architecture and optimization details. It is worth a serious referee because the contribution is implementable and the performance edge is demonstrated. I recommend putting it through peer review rather than desk rejecting it.

Referee Report

1 major / 2 minor

Summary. The paper introduces Vega-Video as an extension to the Vega grammar of graphics that incorporates video data through three identified classes of visualization (synchronization, annotation, and transformation). It proposes a split-signal architecture to reconcile video player state with Vega's declarative and instantaneous reactive dataflow, while adding compile-time detection of scrubbing interactions for encoding-aware optimizations and repurposing VOD protocols for real-time transformations, with reported gains of up to 4x responsiveness and sub-200ms updates.

Significance. If the architecture and optimizations hold under scrutiny, the work would meaningfully advance mixed-modality visualization in HCI and data visualization by enabling declarative use of video alongside conventional data without sacrificing reactivity or performance. Credit is due for grounding the approach in concrete use-case classes and for targeting practical deployment concerns like scrubbing and long-form video handling.

major comments (1)

[Abstract] The central claim that the split-signal architecture 'preserves declarative semantics while masking video update delays' is load-bearing for seamless integration, yet the description provides no mechanism details, no consistency invariants, and no account of cases where a video-derived signal changes after a dependent data transform has executed. This leaves open the possibility that reactivity breaks in mixed synchronization/annotation scenarios.

minor comments (2)

The abstract reports performance numbers (4x responsiveness, sub-200ms updates) without reference to specific benchmarks, workloads, or evaluation sections; these should be explicitly linked to figures or tables for verifiability.
The three classes of video data visualization are introduced but not illustrated with even brief examples in the summary text; adding one concrete encoding example per class in the introduction would improve accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential impact of Vega-Video on mixed-modality visualizations. We address the major comment on the abstract below. We believe the split-signal architecture is robust, but agree that the abstract could benefit from additional clarification on the mechanisms to address potential concerns about reactivity in complex scenarios.


Referee: [Abstract] The central claim that the split-signal architecture 'preserves declarative semantics while masking video update delays' is load-bearing for seamless integration, yet the description provides no mechanism details, no consistency invariants, and no account of cases where a video-derived signal changes after a dependent data transform has executed. This leaves open the possibility that reactivity breaks in mixed synchronization/annotation scenarios.

Authors: We appreciate this observation and agree that the abstract, due to its brevity, does not elaborate on the implementation details of the split-signal architecture. In the revised manuscript, we will expand the abstract to include a concise description of the mechanism: the architecture maintains two parallel signals for video state (one for immediate player updates and one for declarative Vega reactivity), ensuring that video changes are propagated as discrete events without interrupting ongoing dataflow computations. Consistency invariants include that all video-derived signals are versioned and updates are atomic with respect to Vega's reactive graph. For cases where a video signal changes after a dependent transform, the system buffers the update and triggers a full re-evaluation on the next frame, preventing partial or inconsistent states. This approach has been validated in mixed synchronization and annotation scenarios, as detailed in Sections 4 and 5 of the paper. We will also add a brief note on these invariants to the abstract.

revision: yes
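The buffering behavior described in this simulated response can be sketched in a few lines. The class below is an illustration of "buffer the update, apply it atomically at the next frame, bump a version", assuming a last-write-wins policy. The name `FrameBufferedSignal` and its methods are hypothetical, and since the rebuttal is itself simulated, this should not be read as the system's implementation.

```python
class FrameBufferedSignal:
    """Hypothetical signal whose updates are applied only at frame boundaries."""

    def __init__(self, value: float) -> None:
        self.value = value       # value visible to dependents this frame
        self.version = 0         # incremented once per applied update
        self._pending = None
        self._has_pending = False

    def post(self, value: float) -> None:
        """Buffer an update that arrives mid-evaluation; the last write wins."""
        self._pending = value
        self._has_pending = True

    def flush(self) -> bool:
        """Apply the buffered update at a frame boundary.

        Returns True when dependents need a full re-evaluation.
        """
        if not self._has_pending:
            return False
        self.value = self._pending
        self.version += 1
        self._pending = None
        self._has_pending = False
        return True


player_time = FrameBufferedSignal(0.0)
player_time.post(1.0)   # arrives while transforms are running
player_time.post(2.0)   # supersedes the earlier update
needs_rerun = player_time.flush()
```

Dependents that read `value` during a frame always see one consistent snapshot, and comparing `version` before and after a flush tells them whether their cached results are stale.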

Circularity Check

0 steps flagged

No circularity: systems implementation paper with no derivations or self-referential reductions


The paper describes an architectural extension to Vega (split-signal design, compile-time scrubbing detection, VOD protocol repurposing) for three video visualization classes. No equations, fitted parameters, or mathematical derivations appear in the abstract or claims. Central assertions rest on described mechanisms and reported measurements rather than any reduction to self-citations or input data by construction. This matches the default case of a self-contained systems paper whose contributions are independent of the patterns that trigger circularity flags.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on standard assumptions about declarative visualization grammars being extensible and on the authors' identification of video visualization patterns; no free parameters are fitted and no new physical entities are postulated.

axioms (1)

domain assumption: Vega's declarative dataflow model can be extended with new data modalities while preserving its core semantics.
This is invoked when integrating the video classes and split-signal design.

invented entities (2)

split-signal architecture · no independent evidence
purpose: To separate video player state from Vega's instantaneous dataflow and mask update delays.
New construct introduced to reconcile continuous video with declarative updates.
three classes of video data visualization (synchronization, annotation, transformation) · no independent evidence
purpose: To categorize and abstract how video can be used in visualizations.
Identified and formalized by the authors as the basis for integration.

pith-pipeline@v0.9.0 · 5436 in / 1444 out tokens · 41684 ms · 2026-05-08T02:10:38.114415+00:00 · methodology


