
arxiv: 2604.04467 · v1 · submitted 2026-04-06 · 💻 cs.CV

Group-DINOmics: Incorporating People Dynamics into DINO for Self-supervised Group Activity Feature Learning

Pith reviewed 2026-05-10 20:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learning · group activity recognition · DINO · pretext tasks · person flow estimation · feature learning · computer vision

The pith

Adapting DINO with person flow and group object location estimation enables self-supervised learning of group activity features that outperform prior static-feature approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to learn useful features for group activities in video without any group activity labels by extending the DINO self-supervised vision model. It introduces two pretext tasks: person flow estimation to encode how individuals move within the group, and group-relevant object location estimation to encode spatial scene context. These tasks are combined with DINO's local and global features so that the resulting group activity features become sensitive to both motion dynamics and collective relations. A sympathetic reader would care because group activity understanding has historically depended on costly manual annotations or limited static cues, restricting scale and real-world applicability.

Core claim

Group-DINOmics adapts DINO by training it with person flow estimation to represent local person motions and group-relevant object location estimation to capture global scene context, producing group-dynamics-aware features that achieve state-of-the-art performance in group activity retrieval and recognition on public datasets without using group activity annotations.

What carries the argument

The two pretext tasks—person flow estimation for local motion and group-relevant object location estimation for global context—applied to DINO's local and global features.
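To make the local-motion pretext task concrete, the sketch below builds per-person flow targets as the displacement of each tracked person's bounding-box center between two frames. This is only one plausible reading: the summary does not say how the paper defines person flow, and the box format, the track-ID dictionaries, and the name `person_flow_targets` are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical box format (assumption): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def person_flow_targets(
    boxes_t: Dict[int, Box],
    boxes_next: Dict[int, Box],
) -> Dict[int, Tuple[float, float]]:
    """Per-person displacement of box centers between two frames.

    Keys are track IDs (assumed to come from an external tracker).
    People missing from the second frame are skipped, not guessed.
    """
    flows: Dict[int, Tuple[float, float]] = {}
    for track_id, box in boxes_t.items():
        if track_id not in boxes_next:
            continue
        cx0, cy0 = box_center(box)
        cx1, cy1 = box_center(boxes_next[track_id])
        flows[track_id] = (cx1 - cx0, cy1 - cy0)
    return flows
```

A regression head on the per-person local features could then be trained against these displacement vectors; whether the paper uses center displacement, dense optical flow, or another target is not stated here.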

If this is right

  • The learned features reach state-of-the-art results for group activity retrieval and recognition across public datasets.
  • Ablation studies confirm that each added component, including the two pretext tasks, contributes to the final performance.
  • The method shows that dynamics-aware and group-aware signals can be injected into self-supervised learning without requiring activity annotations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pretext-task strategy might transfer to other unlabeled multi-person video settings where motion and spatial relations matter.
  • Large-scale unlabeled group videos could now be used to train features that later support downstream tasks such as anomaly detection in crowds.
  • Combining these dynamics signals with additional self-supervised objectives could further close the gap to fully supervised group activity models.

Load-bearing premise

That person flow estimation and group-relevant object location estimation together supply enough dynamic and contextual information to learn effective group activity features from unlabeled video.

What would settle it

A controlled ablation on a group activity dataset in which the method with either pretext task removed shows no gain over a plain DINO baseline, or in which the full method fails to exceed strong supervised baselines that use activity labels.
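The ablation test described above can be sketched as a small check over per-variant scores. The variant names (`dino_only`, `without_flow`, `without_object`, `full`) and the function `no_gain_variants` are hypothetical, and the numbers in the usage note are placeholders, not results from the paper.

```python
from typing import Dict, List


def no_gain_variants(
    scores: Dict[str, float],
    baseline: str = "dino_only",
    tolerance: float = 0.0,
) -> List[str]:
    """List variants whose score does not beat the baseline by more than `tolerance`.

    `scores` maps a variant name to one metric (for example accuracy in percent).
    A non-empty result for the ablated variants would count against the claim
    that both pretext tasks contribute.
    """
    if baseline not in scores:
        raise KeyError(f"baseline variant {baseline!r} is missing from scores")
    base = scores[baseline]
    return sorted(
        name
        for name, value in scores.items()
        if name != baseline and value - base <= tolerance
    )
```

For instance, with placeholder scores `{"dino_only": 50.0, "without_flow": 50.0, "without_object": 55.0, "full": 60.0}`, the function flags only `without_flow`. In practice the tolerance should reflect run-to-run variance across seeds.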

Figures

Figures reproduced from arXiv: 2604.04467 by Chihiro Nakatani, Norimichi Ukita, Ryuki Tezuka.

Figure 1: Our self-supervised GAF learning augmented by two
Figure 2: Overview of our network. (a) Image feature extractor. Group-relevant objects are inpainted to enhance global feature learning
Figure 3: Inpainting to enhance global feature embedding into a
Figure 4: Pretext tasks: overview of person flow estimation, (a)
Figure 5: Visual comparison of group activity retrieval on VBD.
Figure 6: Visual comparison of group activity retrieval on NBA
Figure 7: Confusion matrices of GAR by nearest neighbor retrieval
Figure 8: Visual comparison of group activity retrieval on VBD.
Figure 9: Visual comparison of group activity retrieval on NBA.
Figure 11: Detail of our pretext tasks. Our pretext tasks comprise person-flow estimation and group-relevant object location estimation.
Figure 12: GAR accuracy curve by the KNN classification on
Figure 13: Performance under different noise levels.
Figure 14: Visualization of the learned GAF space on VBD. Unlike Fig.
Original abstract

This paper proposes Group Activity Feature (GAF) learning without group activity annotations. Unlike prior work, which uses low-level static local features to learn GAFs, we propose leveraging dynamics-aware and group-aware pretext tasks, along with local and global features provided by DINO, for group-dynamics-aware GAF learning. To adapt DINO and GAF learning to local dynamics and global group features, our pretext tasks use person flow estimation and group-relevant object location estimation, respectively. Person flow estimation is used to represent the local motion of each person, which is an important cue for understanding group activities. In contrast, group-relevant object location estimation encourages GAFs to learn scene context (e.g., spatial relations of people and objects) as global features. Comprehensive experiments on public datasets demonstrate the state-of-the-art performance of our method in group activity retrieval and recognition. Our ablation studies verify the effectiveness of each component in our method. Code: https://github.com/tezuka0001/Group-DINOmics.
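Group activity retrieval of the kind evaluated here is commonly scored by nearest-neighbor lookup in feature space. The sketch below computes a hit@k rate with cosine similarity over plain Python lists. It is a generic illustration, not the paper's evaluation protocol: the function names, the choice of cosine similarity, and the hit@k metric are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hit_at_k(
    query_feats: List[Sequence[float]],
    query_labels: List[str],
    gallery_feats: List[Sequence[float]],
    gallery_labels: List[str],
    k: int = 1,
) -> float:
    """Fraction of queries with a same-label gallery item among the top-k neighbors.

    Labels are used only for scoring, never for learning the features.
    """
    if not query_feats:
        raise ValueError("at least one query is required")
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        # Rank gallery indices from most to least similar to the query.
        ranked = sorted(
            range(len(gallery_feats)),
            key=lambda i: cosine(feat, gallery_feats[i]),
            reverse=True,
        )
        if any(gallery_labels[i] == label for i in ranked[:k]):
            hits += 1
    return hits / len(query_feats)
```

Because the activity labels enter only at scoring time, an evaluation of this shape is compatible with features learned without group activity annotations.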

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The paper proposes Group-DINOmics, a self-supervised method extending DINO with two pretext tasks—person flow estimation to capture local person dynamics and group-relevant object location estimation to capture global group context—for learning annotation-free group activity features. It claims state-of-the-art results on group activity retrieval and recognition using public datasets, supported by ablation studies verifying each component.

Significance. If the empirical claims hold, the work contributes to self-supervised video understanding by injecting dynamics and context into DINO features via targeted pretext tasks, offering a practical alternative to annotation-heavy approaches for multi-person activity analysis. The open code link supports reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the significance of our approach, and recommendation for minor revision. We appreciate the acknowledgment of the contributions of Group-DINOmics, the use of public datasets, ablation studies, and the open code for reproducibility.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper augments DINO with two independent pretext tasks (person flow estimation for local dynamics and group-relevant object location estimation for global context) to learn group activity features without annotations. These tasks are defined externally to the target GAFs and evaluated on public datasets with ablations. No self-definitional reduction, fitted-input-as-prediction, or load-bearing self-citation chain appears; the central claim rests on empirical adaptation of standard self-supervised components rather than re-deriving inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the new pretext tasks and the assumption that DINO can be adapted this way for group activities. No specific free parameters or invented entities are detailed in the abstract.

free parameters (1)
  • Loss balancing parameters for pretext tasks
    Standard in multi-task self-supervised learning but unspecified in abstract.
axioms (1)
  • domain assumption: DINO provides useful local and global features suitable for adaptation to group dynamics
    The method explicitly builds on DINO as a base model.
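The loss balancing parameters listed in the ledger can be pictured as weights in a summed multi-task objective. The sketch below is the standard weighted-sum form, offered as an assumption: the abstract does not specify how the paper combines its losses, and the names `total_loss`, `w_flow`, and `w_object` are hypothetical.

```python
def total_loss(
    dino_loss: float,
    flow_loss: float,
    object_loss: float,
    w_flow: float = 1.0,
    w_object: float = 1.0,
) -> float:
    """Weighted sum of a base self-supervised loss and two pretext-task losses.

    w_flow and w_object are the free parameters: they trade off the person
    flow term and the group-relevant object location term against the base loss.
    """
    if w_flow < 0.0 or w_object < 0.0:
        raise ValueError("loss weights must be non-negative")
    return dino_loss + w_flow * flow_loss + w_object * object_loss
```

Setting either weight to zero recovers the corresponding single-task ablation, which is one reason such weights are usually reported alongside ablation results.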

pith-pipeline@v0.9.0 · 5488 in / 1276 out tokens · 60225 ms · 2026-05-10T20:28:03.986352+00:00 · methodology


