SiLVi: Simple Interface for Labeling Video Interactions


arxiv: 2511.03819 · v2 · submitted 2025-11-05 · 💻 cs.CV · q-bio.QM


Ozan Kanbertay (1), Richard Vogg (1, 2), Elif Karakoc (2), Peter M. Kappeler (2, 3), Claudia Fichtel (2), Alexander S. Ecker (1)

(1) Institute of Computer Science, Campus Institute Data Science, University of Göttingen; (2) Behavioral Ecology & Sociobiology Unit, German Primate Center, Göttingen, Germany; (3) Department of Sociobiology/Anthropology, Germany


Pith reviewed 2026-05-18 00:37 UTC · model grok-4.3


keywords: video annotation · behavior labeling · interaction detection · computer vision · animal behavior · open-source tool · scene graph · camera trap analysis


The pith

SiLVi lets researchers label both animal positions and their interactions in the same video interface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SiLVi, an open-source tool that combines tracking of individual animals in videos with labeling of their behaviors and social interactions. Existing tools handle either localization or behavioral annotation but not both together, forcing researchers to switch between separate programs. SiLVi produces structured outputs that can directly train computer vision models to detect fine-grained actions and relationships automatically. This integration aims to support larger-scale studies of animal social behavior from camera traps or field observations. The software is presented as a bridge between behavioral ecology and machine learning for video analysis.

Core claim

SiLVi is an open-source labeling software that integrates both localization of individuals and annotation of their interactions within video data, generating structured outputs suitable for training and validating computer vision models for automated fine-grained behavioral analyses.

What carries the argument

SiLVi, the single annotation interface that links spatial localization of animals to behavioral and interaction labels in video frames.

If this is right

Researchers can generate consistent training data for models that detect both positions and interactions without switching tools.
The structured outputs support validation of automated systems for analyzing social behavior in large video collections.
The approach extends beyond animals to labeling human interactions that require dynamic scene graphs.
Behavioral ecologists gain a direct way to produce data usable by computer vision pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Annotation time could decrease because users avoid exporting and re-importing data between separate programs.
Models might learn interaction patterns more reliably when location and label data come from the same consistent interface.
Wildlife monitoring projects could scale analysis by feeding SiLVi-labeled videos into existing detection frameworks.
A natural test would measure whether downstream models show higher precision on interaction classes when trained on integrated versus split-tool datasets.
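
The comparison proposed in the last point can be sketched as a per-class precision computation. This is a minimal illustration, not an evaluation from the paper: the class names, the label lists, and the function `per_class_precision` are all hypothetical.

```python
from collections import Counter


def per_class_precision(y_true, y_pred):
    """Return {class: precision} for every class predicted at least once."""
    true_pos = Counter()
    predicted = Counter()
    for truth, pred in zip(y_true, y_pred):
        predicted[pred] += 1
        if truth == pred:
            true_pos[pred] += 1
    return {cls: true_pos[cls] / n for cls, n in predicted.items()}


# Hypothetical ground-truth interaction labels for six video segments.
y_true = ["gaze", "gaze", "feed", "feed", "none", "none"]

# Hypothetical predictions from a model trained on integrated annotations
# and from a model trained on annotations produced with separate tools.
pred_integrated = ["gaze", "gaze", "feed", "none", "none", "none"]
pred_split = ["gaze", "feed", "feed", "none", "gaze", "none"]

prec_integrated = per_class_precision(y_true, pred_integrated)
prec_split = per_class_precision(y_true, pred_split)

for cls in sorted(prec_integrated):
    print(cls, round(prec_integrated[cls], 2), round(prec_split.get(cls, 0.0), 2))
```

A real test would use held-out annotated videos and report the difference per interaction class, since an aggregate score can hide rare classes.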

Load-bearing premise

Integrating localization and interaction labeling into one tool will produce data that meaningfully improves training of computer vision models for behavioral analysis.

What would settle it

A side-by-side test showing that models trained on data from separate localization and labeling tools achieve equal or better accuracy and require less total annotation effort than models trained on SiLVi outputs.

Figures

Figures reproduced from arXiv: 2511.03819 by Ozan Kanbertay, Richard Vogg, Elif Karakoc, Peter M. Kappeler, Claudia Fichtel, and Alexander S. Ecker.

**Figure 2.** Examples of different types of interaction. Gaze can be detected on single images, while the interactions with the feeding box often require temporal context. We tested the app with videos of redfronted lemurs (Eulemur rufifrons) in the wild. The setup of the experiments with eight cameras filming the lemurs during social learning experiments in Kirindy Forest, Madagascar, described in detail by Karakoc …

Original abstract

Computer vision methods are increasingly used for the automated analysis of large volumes of video data collected through camera traps, drones, or direct observations of animals in the wild. While recent advances have focused primarily on detecting individual actions, much less work has addressed the detection and annotation of interactions -- a crucial aspect for understanding social and individualized animal behavior. Existing open-source annotation tools support either behavioral labeling without localization of individuals, or localization without the capacity to capture interactions. To bridge this gap, we present SiLVi, an open-source labeling software that integrates both functionalities. SiLVi enables researchers to annotate behaviors and interactions directly within video data, generating structured outputs suitable for training and validating computer vision models. By linking behavioral ecology with computer vision, SiLVi facilitates the development of automated approaches for fine-grained behavioral analyses. Although developed primarily in the context of animal behavior, SiLVi could be useful more broadly to annotate human interactions in other videos that require extracting dynamic scene graphs. The software, along with documentation and download instructions, is available at: https://silvi.eckerlab.org.
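
To make the term "dynamic scene graph" concrete, the sketch below expands interval-based interaction annotations into the set of edges active in a given frame. It is an illustration under stated assumptions, not SiLVi's data model: the `Interaction` class, its field names, the track IDs, and the labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    subject: str      # track ID of the acting individual (hypothetical)
    predicate: str    # interaction label (hypothetical)
    target: str       # track ID or object that is acted on
    start_frame: int  # first frame of the interaction, inclusive
    end_frame: int    # last frame of the interaction, inclusive


def scene_graph_at(interactions, frame):
    """Return the set of (subject, predicate, target) edges active in a frame."""
    return {
        (i.subject, i.predicate, i.target)
        for i in interactions
        if i.start_frame <= frame <= i.end_frame
    }


# Two hypothetical annotations with overlapping frame ranges.
annotations = [
    Interaction("lemur_1", "looks_at", "lemur_2", 10, 40),
    Interaction("lemur_2", "manipulates", "feeding_box", 30, 90),
]

graph_frame_35 = scene_graph_at(annotations, 35)  # both edges are active
graph_frame_50 = scene_graph_at(annotations, 50)  # only the second edge remains
print(graph_frame_35)
print(graph_frame_50)
```

The graph is "dynamic" because the edge set changes from frame to frame while the nodes (tracked individuals and objects) persist.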

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

SiLVi combines localization and interaction labeling in one open-source interface for animal videos, but supplies no user tests or comparisons to show the integration actually helps.


SiLVi is a new open-source annotation tool that lets users mark both where animals are in a video frame and what interactions they are having, all in the same interface. The work is new in that it packages localization and interaction labeling together for animal behavior videos, something the authors say prior tools did not do in one place. It generates outputs meant for training computer vision models on fine-grained behaviors and dynamic scene graphs. The fact that the code and docs are publicly available at the given link is helpful, and the tool seems aimed at making it easier to create datasets that link behavioral ecology with CV methods.

What stands out is the practical focus. The abstract explains the gap clearly: most annotation software either tracks individuals without interaction details or labels behaviors without positions. SiLVi tries to handle both so researchers can get structured data for automated analysis.

The soft spot is the missing evaluation. The paper describes the tool and its intended use but does not include any user testing, time comparisons against separate tools, or examples of how the annotations perform when used to train a model. Without that, it's difficult to tell if the single interface actually reduces effort or improves label quality as claimed. This is a common issue with tool description papers, but it means the benefit is still theoretical at this stage.

Readers who work with camera trap or drone videos of animals and need to annotate interactions for machine learning would get the most out of this. It could also apply to human interaction videos if the interface is flexible enough. The paper is not pushing new algorithms or theory, so it is best for people looking for a ready annotation solution rather than a methods paper.

I would recommend sending it to peer review. The integration addresses a documented workflow problem, and referees could help by suggesting additions like basic usability metrics or more comparison to existing software. Even with light evaluation, the release of functional open-source code makes it worth considering for publication in a venue that accepts tool papers.

Referee Report

0 major / 2 minor

Summary. The manuscript presents SiLVi, an open-source labeling tool that integrates localization of individuals within video frames and annotation of their behaviors and interactions. It produces structured outputs intended for training and validating computer vision models, with primary application to animal behavior studies and potential extension to human interaction videos requiring dynamic scene graphs. The software, documentation, and download instructions are provided at https://silvi.eckerlab.org.

Significance. If the described functionality is implemented as stated, the tool fills a practical gap between separate localization and behavioral-labeling tools, offering a unified interface that could streamline annotation workflows for researchers working on fine-grained video analysis. The open-source release with documentation is a clear strength. The stress-test concern (absence of user studies or timing comparisons) does not land as a load-bearing issue here, because the manuscript is a tool-description paper whose central contribution is the presentation of the integrated interface and its intended outputs rather than an empirical claim of measured superiority.

minor comments (2)

[Abstract] The output format is described only at a high level as 'structured outputs suitable for training... models'; a brief concrete example of the exported annotation schema (e.g., JSON keys for bounding boxes, interaction labels, timestamps) in the main text would improve clarity for potential users.
[Introduction / Related Work] The manuscript would benefit from a short 'Related Work' subsection that explicitly names the 'existing open-source annotation tools' referenced in the abstract and states in one sentence how SiLVi differs from each.
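
As an illustration of the kind of example the first comment asks for, the sketch below builds one annotation record with bounding boxes, an interaction label, and frame-based timestamps, then round-trips it through JSON. Every key, value, and file name here is a hypothetical placeholder; this is not SiLVi's actual export format, which the review does not describe.

```python
import json

# Hypothetical record: all keys and values are illustrative assumptions.
record = {
    "video": "example_clip.mp4",
    "fps": 25,
    "tracks": [
        {"track_id": 1, "frame": 120, "bbox_xywh": [34, 50, 80, 120]},
        {"track_id": 2, "frame": 120, "bbox_xywh": [210, 64, 75, 110]},
    ],
    "interactions": [
        {"subject": 1, "target": 2, "label": "gaze",
         "start_frame": 120, "end_frame": 180},
    ],
}


def validate(rec):
    """Check that each interaction refers to annotated tracks and a valid frame span."""
    track_ids = {t["track_id"] for t in rec["tracks"]}
    for inter in rec["interactions"]:
        if inter["subject"] not in track_ids or inter["target"] not in track_ids:
            return False
        if inter["start_frame"] > inter["end_frame"]:
            return False
    return True


serialized = json.dumps(record, indent=2)
restored = json.loads(serialized)
is_valid = validate(restored)

# Convert the frame span of the first interaction into seconds.
first = record["interactions"][0]
duration_s = (first["end_frame"] - first["start_frame"]) / record["fps"]
print(serialized)
```

Keeping track IDs as the link between boxes and interaction labels is what would let a single file serve both detection and interaction models.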

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. We appreciate the recognition that SiLVi addresses a practical gap by integrating localization with behavioral and interaction labeling in a single open-source interface, and that the work is appropriately positioned as a tool-description paper rather than an empirical comparison study.

Circularity Check

0 steps flagged

No circularity: tool-description paper with no derivations or fitted claims


The manuscript is a software-tool description paper. It presents SiLVi as an integrated annotation interface and states its intended outputs and use cases. No equations, parameter fits, predictions, or first-principles derivations appear. Consequently none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, self-citation load-bearing, uniqueness imported from authors, ansatz smuggled via citation, or renaming) can be instantiated. The central claim that the single interface bridges a gap is an untested design assertion, but that is a question of empirical support, not circular reduction of the argument to its own inputs. The paper is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The contribution is a software interface rather than a theoretical result; no free parameters, axioms, or invented entities are introduced or required.

pith-pipeline@v0.9.0 · 5814 in / 1047 out tokens · 25689 ms · 2026-05-18T00:37:19.403339+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1] Chimpvlm: Ethogram-enhanced chimpanzee behaviour recognition

Brookes, Otto et al. (2024a). “Chimpvlm: Ethogram-enhanced chimpanzee behaviour recognition”. In: arXiv preprint arXiv:2404.08937.
Brookes, Otto et al. (2024b). “PanAf20K: A large video dataset for wild ape detection and behaviour recognition”. In: International Journal of Computer Vision 132.8, pp. 3086–3102.
Chen, Zexin et al. (2023). “AlphaTracker: a mult...

work page doi:10.5281/zenodo.7863887 2023

[2] BEHAVE - facilitating behaviour coding from videos with AI-detected animals

Elhorst, Reinoud, Martyna Syposz, and Katarzyna Wojczulanis-Jakubas (2025). “BEHAVE - facilitating behaviour coding from videos with AI-detected animals”. In: Ecological Informatics 87, p. 103106.
Friard, Olivier and Marco Gamba (2016). “BORIS: a free, versatile open-source event-logging software for video/audio coding and live observations”. In: Methods in...

work page arXiv 2025
