arxiv: 1906.12165 · v1 · pith:JTLF7JO7new · submitted 2019-06-28 · 💻 cs.CV · cs.IR

Localizing Unseen Activities in Video via Image Query

Pith reviewed 2026-05-25 13:41 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords image-based activity localization · unseen activities · self-attention · transformer encoder · video localization · ActivityNet · end-to-end retrieval

The pith

A self-attention interaction localizer retrieves unseen activities in videos from image queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called Image-Based Activity Localization, where the goal is to find segments of unseen activities in untrimmed videos given only an image as the query. It highlights three challenges in this setting: filtering out irrelevant parts of the query image, managing localization when the query is not perfectly accurate, and pinpointing exact start and end times. The proposed solution uses a region self-attention method with relative position encoding to get detailed image features, a local transformer encoder to fuse and reason over image and video information in multiple steps, and an order-sensitive localizer to output the target segment directly. The authors also introduce the ActivityIBAL dataset based on ActivityNet and show through experiments that the method can handle activities not encountered during training.

Core claim

The self-attention interaction localizer, built from region self-attention with relative position encoding, a local transformer encoder for multi-step fusion, and an order-sensitive localizer, enables end-to-end retrieval of unseen activities in videos using image queries by solving the problems of inessential content, fuzzy localization, and boundary precision.

What carries the argument

The self-attention interaction localizer performs region-level attention on the image query with relative positions, then uses a local transformer to interact with video features, and finally localizes the segment in an order-sensitive manner.
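The first stage described above can be sketched as scaled dot-product self-attention over image regions with a position-dependent term added to the attention scores. This is a minimal illustration, not the paper's implementation: it assumes identity query/key/value projections (a trained model would learn these) and assumes the relative position information enters as a scalar additive bias `rel_bias[i][j]`. Both choices, and the function names, are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(regions, rel_bias):
    """Self-attention over n region feature vectors of dimension d.

    regions:  list of n lists of d floats (one feature vector per region).
    rel_bias: n x n list of floats; rel_bias[i][j] is an assumed scalar
              bias for the position of region j relative to region i.
    Returns n attended feature vectors of dimension d.
    """
    n, d = len(regions), len(regions[0])
    scale = math.sqrt(d)
    attended = []
    for i in range(n):
        # Content score (scaled dot product) plus relative-position bias.
        scores = [
            sum(a * b for a, b in zip(regions[i], regions[j])) / scale
            + rel_bias[i][j]
            for j in range(n)
        ]
        weights = softmax(scores)
        # Weighted sum of region features (identity value projection).
        attended.append(
            [sum(weights[j] * regions[j][k] for j in range(n)) for k in range(d)]
        )
    return attended


regions = [[1.0, 0.0], [0.0, 1.0]]
no_bias = [[0.0, 0.0], [0.0, 0.0]]
print(region_self_attention(regions, no_bias))
```

A strongly negative bias between two regions pushes their attention weight toward zero, which is one way a position term could suppress regions that are irrelevant to each other.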

If this is right

  • The model can localize activities outside the training vocabulary without additional labels.
  • Image queries allow specification of complex or novel actions that are hard to describe in words.
  • End-to-end training avoids separate proposal generation and classification steps.
  • The approach scales to new activities by changing only the query image.
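The task interface implied by the last bullet (one query feature in, one video segment out, nothing activity-specific in between) can be pictured with a deliberately naive stand-in. The sketch below is not the paper's model: it scores each frame by cosine similarity to the query feature and returns the contiguous span whose scores, after subtracting a threshold, have the largest sum. The feature vectors, the `threshold` value, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def localize_by_similarity(query_feat, frame_feats, threshold=0.5):
    """Return (start, end) frame indices, inclusive, of the best span.

    Frames scoring above `threshold` add to a span's total and frames
    below it subtract, so the maximum-sum contiguous span is found
    with Kadane's algorithm. Returns None for an empty video.
    """
    scores = [cosine(query_feat, f) - threshold for f in frame_feats]
    best_sum, best_seg = float("-inf"), None
    cur_sum, cur_start = 0.0, 0
    for t, s in enumerate(scores):
        if cur_sum <= 0:
            # Start a new candidate span at frame t.
            cur_sum, cur_start = s, t
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_seg = cur_sum, (cur_start, t)
    return best_seg


frames = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(localize_by_similarity([1.0, 0.0], frames))  # -> (1, 2)
```

Swapping in a different query feature changes the returned span without retraining anything, which is the property the bullet describes. A baseline like this has no mechanism for the three challenges the paper names, which is where the proposed components are meant to help.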

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This setup might support interactive video search where users provide example images instead of text.
  • Similar interaction mechanisms could apply to localizing events in other sequential data like audio or time series.
  • Performance on the new dataset suggests potential for zero-shot transfer in video tasks beyond action localization.

Load-bearing premise

That the combination of region self-attention with relative position encoding and the local transformer encoder can remove semantically inessential image content, manage inaccurate queries, and produce precise segment boundaries for unseen activities.

What would settle it

A test case where an image query contains both the target activity and distracting elements, and the model fails to localize the correct segment in a video with multiple similar actions.
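Deciding whether the model "fails to localize the correct segment" needs a concrete overlap measure. A common choice in temporal localization is temporal intersection-over-union (IoU) between the predicted and ground-truth segments; whether the paper uses exactly this metric or a particular threshold is not stated in the abstract, so the sketch below is an assumption.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


# Prediction overlaps the ground truth for 2 s out of a 6 s union.
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
# Disjoint segments score zero.
print(temporal_iou((0.0, 1.0), (2.0, 3.0)))
```

In the distractor scenario, a prediction that lands on a visually similar but wrong action would typically have an IoU near zero with the annotated segment.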

Figures

Figures reproduced from arXiv: 1906.12165 by Deng Cai, Jingkuan Song, Zhijie Lin, Zhou Zhao, Zhu Zhang.

Figure 1: Image-Based Activity Localization.
Figure 2: The Framework of Self-Attention Interaction Localizer for Image-Based Activity Localization.
Figure 3: The variants of scaled dot-product attention. (a) The re…
Figure 5: A typical example of image-based activity localization.
Original abstract

Action localization in untrimmed videos is an important topic in the field of video understanding. However, existing action localization methods are restricted to a pre-defined set of actions and cannot localize unseen activities. Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization. This task faces three inherent challenges: (1) how to eliminate the influence of semantically inessential contents in image queries; (2) how to deal with the fuzzy localization of inaccurate image queries; (3) how to determine the precise boundaries of target segments. We then propose a novel self-attention interaction localizer to retrieve unseen activities in an end-to-end fashion. Specifically, we first devise a region self-attention method with relative position encoding to learn fine-grained image region representations. Then, we employ a local transformer encoder to build multi-step fusion and reasoning of image and video contents. We next adopt an order-sensitive localizer to directly retrieve the target segment. Furthermore, we construct a new dataset ActivityIBAL by reorganizing the ActivityNet dataset. The extensive experiments show the effectiveness of our method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces the task of Image-Based Activity Localization to localize unseen activities in untrimmed videos via image queries. It identifies three challenges (eliminating semantically inessential contents in queries, handling fuzzy localization from inaccurate queries, and determining precise boundaries) and proposes a self-attention interaction localizer with three components: region self-attention with relative position encoding for fine-grained image representations, a local transformer encoder for multi-step fusion and reasoning, and an order-sensitive localizer for direct segment retrieval. The authors also construct the ActivityIBAL dataset by reorganizing ActivityNet and state that extensive experiments demonstrate the method's effectiveness.

Significance. If the experimental results support the claims, the work would be significant as it defines a novel open-set task in video understanding that moves beyond closed vocabularies of actions. The explicit mapping of components to the three stated challenges is a coherent design choice, and the new dataset would provide a useful benchmark for future work on image-query localization.

major comments (1)
  1. [Abstract] The central claim that 'the extensive experiments show the effectiveness of our method' is load-bearing for the contribution, yet the provided manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons, making it impossible to verify whether the three components address the three challenges.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'the extensive experiments show the effectiveness of our method' is load-bearing for the contribution, yet the provided manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons, making it impossible to verify whether the three components address the three challenges.

    Authors: We acknowledge that the version of the manuscript under review did not include the full experimental section. The complete manuscript contains quantitative results on the ActivityIBAL dataset, ablation studies isolating each of the three proposed components, error analysis, and comparisons against baselines. These results are structured to demonstrate how region self-attention mitigates semantically inessential content (challenge 1), the local transformer handles fuzzy localization from inaccurate queries (challenge 2), and the order-sensitive localizer produces precise boundaries (challenge 3). In the revised manuscript we will make these linkages explicit and ensure all supporting evidence appears in the main body rather than being omitted.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper proposes a new task (image-query localization of unseen activities) and an architecture with three components (region self-attention + relative position encoding, local transformer encoder, order-sensitive localizer) plus a reorganized ActivityNet-derived dataset. No equations, derivations, or first-principles results are presented that reduce to fitted parameters, self-definitions, or self-citation chains. The components are introduced as design choices to address explicitly stated challenges; performance is asserted only via “extensive experiments,” with no load-bearing uniqueness theorems or ansatzes imported from the authors’ prior work. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities can be identified in detail.

