Localizing Unseen Activities in Video via Image Query

Deng Cai; Jingkuan Song; Zhijie Lin; Zhou Zhao; Zhu Zhang

arxiv: 1906.12165 · v1 · pith:JTLF7JO7new · submitted 2019-06-28 · 💻 cs.CV · cs.IR

Zhu Zhang, Zhou Zhao, Zhijie Lin, Jingkuan Song, Deng Cai

Pith reviewed 2026-05-25 13:41 UTC · model grok-4.3

classification 💻 cs.CV cs.IR

keywords image-based activity localization · unseen activities · self-attention · transformer encoder · video localization · ActivityNet · end-to-end retrieval

The pith

A self-attention interaction localizer retrieves unseen activities in videos from image queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines a new task called Image-Based Activity Localization, where the goal is to find segments of unseen activities in untrimmed videos given only an image as the query. It highlights three challenges in this setting: filtering out irrelevant parts of the query image, managing localization when the query is not perfectly accurate, and pinpointing exact start and end times. The proposed solution uses a region self-attention method with relative position encoding to get detailed image features, a local transformer encoder to fuse and reason over image and video information in multiple steps, and an order-sensitive localizer to output the target segment directly. The authors also introduce the ActivityIBAL dataset based on ActivityNet and show through experiments that the method can handle activities not encountered during training.
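To make the "local transformer encoder" idea more concrete, here is a minimal sketch of windowed attention in plain Python. It is an illustration under assumptions, not the authors' implementation: the abstract does not say how locality is defined, so the fixed window radius and the names `local_attention_mask` and `masked_softmax` are hypothetical.

```python
import math


def local_attention_mask(length, window):
    """Return mask[i][j] == True when position j is visible from position i.

    Assumption: "local" means a fixed radius `window` around each position.
    """
    return [[abs(i - j) <= window for j in range(length)] for i in range(length)]


def masked_softmax(logits, visible):
    """Softmax over `logits` that gives zero weight to hidden positions."""
    kept = [x for x, v in zip(logits, visible) if v]
    top = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(x - top) if v else 0.0 for x, v in zip(logits, visible)]
    total = sum(exps)
    return [e / total for e in exps]


mask = local_attention_mask(length=5, window=1)
weights = masked_softmax([1.0, 1.0, 1.0, 1.0, 1.0], mask[2])
```

With equal logits, the middle position spreads its attention evenly over itself and its two neighbours and ignores everything else, which is the restriction a local encoder would impose at each fusion step.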

Core claim

The self-attention interaction localizer, built from region self-attention with relative position encoding, a local transformer encoder for multi-step fusion, and an order-sensitive localizer, enables end-to-end retrieval of unseen activities in videos using image queries by solving the problems of inessential content, fuzzy localization, and boundary precision.

What carries the argument

The self-attention interaction localizer performs region-level attention on the image query with relative positions, then uses a local transformer to interact with video features, and finally localizes the segment in an order-sensitive manner.
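The first stage can be sketched as scaled dot-product self-attention over image regions with a position-dependent term added to the logits. This is a hedged illustration only: the page does not give the paper's equations, so the additive bias, the use of raw features as queries, keys, and values (no learned projections), and the names `region_self_attention` and `distance_penalty` are all assumptions.

```python
import math


def softmax(xs):
    top = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def region_self_attention(features, centers, rel_bias):
    """Self-attention over regions with an additive relative-position bias.

    features: one feature vector per image region.
    centers:  (x, y) centre of each region.
    rel_bias: hypothetical function mapping an offset (dx, dy) to a scalar.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        logits = []
        for j, key in enumerate(features):
            score = sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            dx = centers[j][0] - centers[i][0]
            dy = centers[j][1] - centers[i][1]
            logits.append(score + rel_bias(dx, dy))
        weights = softmax(logits)
        # Each output is a weighted average of all region features.
        outputs.append([
            sum(weights[j] * features[j][c] for j in range(len(features)))
            for c in range(dim)
        ])
    return outputs


def distance_penalty(dx, dy):
    """Assumed bias: distant regions contribute less."""
    return -math.hypot(dx, dy)


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
attended = region_self_attention(feats, centers, distance_penalty)
```

In a trained model the bias would be learned rather than fixed; the distance penalty here only shows where relative position enters the computation.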

If this is right

The model can localize activities outside the training vocabulary without additional labels.
Image queries allow specification of complex or novel actions that are hard to describe in words.
End-to-end training avoids separate proposal generation and classification steps.
The approach scales to new activities by changing only the query image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This setup might support interactive video search where users provide example images instead of text.
Similar interaction mechanisms could apply to localizing events in other sequential data like audio or time series.
Performance on the new dataset suggests potential for zero-shot transfer in video tasks beyond action localization.

Load-bearing premise

That the combination of region self-attention with relative position encoding and the local transformer encoder can remove semantically inessential image content, manage inaccurate queries, and produce precise segment boundaries for unseen activities.

What would settle it

A test case where an image query contains both the target activity and distracting elements, and the model fails to localize the correct segment in a video with multiple similar actions.
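Deciding whether the model "fails to localize the correct segment" needs a scoring rule. A common choice in temporal localization is intersection-over-union between the predicted and annotated segments; the sketch below is a generic version of that metric, not taken from the paper, and the 0.5 threshold and the names `temporal_iou` and `is_hit` are assumptions.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) segments in seconds."""
    overlap = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - overlap
    return overlap / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.5):
    """Count a prediction as correct when its IoU reaches the threshold."""
    return temporal_iou(pred, truth) >= threshold


partial = temporal_iou((10.0, 20.0), (15.0, 25.0))   # 5 s overlap, 15 s union
disjoint = temporal_iou((0.0, 5.0), (5.0, 10.0))     # segments only touch
```

Under this rule, a prediction that lands on a neighbouring, similar-looking action scores an IoU near zero against the annotated segment and is counted as a miss.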

Figures

Figures reproduced from arXiv: 1906.12165 by Deng Cai, Jingkuan Song, Zhijie Lin, Zhou Zhao, Zhu Zhang.

**Figure 1.** Image-Based Activity Localization. With an image query as a reference, unseen activities can be localized by matching visual contents with respect to objects, appearances, motions and interactions.

**Figure 2.** The Framework of Self-Attention Interaction Localizer for Image-Based Activity Localization.

**Figure 3.** The variants of scaled dot-product attention. (a) The re…

**Figure 5.** A typical example of image-based activity localization.

Original abstract

Action localization in untrimmed videos is an important topic in the field of video understanding. However, existing action localization methods are restricted to a pre-defined set of actions and cannot localize unseen activities. Thus, we consider a new task to localize unseen activities in videos via image queries, named Image-Based Activity Localization. This task faces three inherent challenges: (1) how to eliminate the influence of semantically inessential contents in image queries; (2) how to deal with the fuzzy localization of inaccurate image queries; (3) how to determine the precise boundaries of target segments. We then propose a novel self-attention interaction localizer to retrieve unseen activities in an end-to-end fashion. Specifically, we first devise a region self-attention method with relative position encoding to learn fine-grained image region representations. Then, we employ a local transformer encoder to build multi-step fusion and reasoning of image and video contents. We next adopt an order-sensitive localizer to directly retrieve the target segment. Furthermore, we construct a new dataset ActivityIBAL by reorganizing the ActivityNet dataset. The extensive experiments show the effectiveness of our method.
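The abstract does not describe the internals of the order-sensitive localizer. As a rough illustration of what an ordering constraint buys, the sketch below picks the best start and end frames from per-frame scores while requiring the start not to come after the end. The scoring scheme and the name `best_ordered_segment` are hypothetical and should not be read as the authors' design.

```python
def best_ordered_segment(start_scores, end_scores):
    """Return (start, end) maximising start_scores[s] + end_scores[e], s <= e.

    Runs in linear time by tracking the best start seen so far.
    """
    best_start = 0
    best_total = start_scores[0] + end_scores[0]
    best_pair = (0, 0)
    for end in range(len(end_scores)):
        if start_scores[end] > start_scores[best_start]:
            best_start = end
        total = start_scores[best_start] + end_scores[end]
        if total > best_total:
            best_total = total
            best_pair = (best_start, end)
    return best_pair


start_scores = [0.1, 0.9, 0.2, 0.1]
end_scores = [0.8, 0.1, 0.3, 0.7]
segment = best_ordered_segment(start_scores, end_scores)
```

Taking the two score peaks independently would give start 1 and end 0, which is not a valid segment; enforcing the order yields frames 1 through 3 instead.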

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New task for localizing unseen video activities from image queries, with a plausible attention-based model, but no results or ablations visible so effectiveness stays unverified.

The main takeaway is that this paper defines Image-Based Activity Localization as a fresh task: finding activities in untrimmed videos when the query is an image rather than a text label or fixed class list. They reorganize ActivityNet into ActivityIBAL and sketch a model that uses region self-attention with relative position encoding, a local transformer encoder for fusion, and an order-sensitive localizer for boundaries. The three challenges they list line up directly with the three components they add, which keeps the proposal coherent on its own terms. That structure is the clearest contribution here. The dataset step is also practical for anyone who wants to test query-driven localization. The architecture choices look like reasonable extensions of existing attention ideas to this cross-modal setting.

The soft spot is obvious and central: the abstract states that extensive experiments show effectiveness, yet supplies zero numbers, baselines, ablations, or error analysis. Without those, it is impossible to judge whether the components actually solve the stated challenges or whether the model beats simpler alternatives. The claim remains a design assertion rather than a demonstrated result.

This work is aimed at researchers in action localization who want to move past closed action vocabularies toward retrieval-style queries. A reader interested in attention mechanisms for image-video alignment could pick up the component ideas. I would flag it for a reading group as a maybe, mainly to talk through the task definition. I would not cite it in the next year because there is no evidence yet to build on. It deserves peer review because the task is new, the challenges are stated plainly, and the proposal has internal logic, even if the experiments will need close checking once the full paper is available.

Referee Report

1 major / 0 minor

Summary. The paper introduces the task of Image-Based Activity Localization to localize unseen activities in untrimmed videos via image queries. It identifies three challenges (eliminating semantically inessential contents in queries, handling fuzzy localization from inaccurate queries, and determining precise boundaries) and proposes a self-attention interaction localizer with three components: region self-attention with relative position encoding for fine-grained image representations, a local transformer encoder for multi-step fusion and reasoning, and an order-sensitive localizer for direct segment retrieval. The authors also construct the ActivityIBAL dataset by reorganizing ActivityNet and state that extensive experiments demonstrate the method's effectiveness.

Significance. If the experimental results support the claims, the work would be significant as it defines a novel open-set task in video understanding that moves beyond closed vocabularies of actions. The explicit mapping of components to the three stated challenges is a coherent design choice, and the new dataset would provide a useful benchmark for future work on image-query localization.

major comments (1)

[Abstract] The central claim that 'the extensive experiments show the effectiveness of our method' is load-bearing for the contribution, yet the provided manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons, making it impossible to verify whether the three components address the three challenges.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive feedback. We address the single major comment below.

Referee: [Abstract] The central claim that 'the extensive experiments show the effectiveness of our method' is load-bearing for the contribution, yet the provided manuscript supplies no quantitative results, ablation studies, error analysis, or comparisons, making it impossible to verify whether the three components address the three challenges.

Authors: We acknowledge that the version of the manuscript under review did not include the full experimental section. The complete manuscript contains quantitative results on the ActivityIBAL dataset, ablation studies isolating each of the three proposed components, error analysis, and comparisons against baselines. These results are structured to demonstrate how region self-attention mitigates semantically inessential content (challenge 1), the local transformer handles fuzzy localization from inaccurate queries (challenge 2), and the order-sensitive localizer produces precise boundaries (challenge 3). In the revised manuscript we will make these linkages explicit and ensure all supporting evidence appears in the main body rather than being omitted.

revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper proposes a new task (image-query localization of unseen activities) and an architecture with three components (region self-attention + relative position encoding, local transformer encoder, order-sensitive localizer) plus a reorganized ActivityNet-derived dataset. No equations, derivations, or first-principles results are presented that reduce to fitted parameters, self-definitions, or self-citation chains. The components are introduced as design choices to address explicitly stated challenges; performance is asserted only via “extensive experiments,” with no load-bearing uniqueness theorems or ansatzes imported from the authors’ prior work. The derivation chain is therefore self-contained and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no free parameters, axioms, or invented entities can be identified in detail.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 2 internal anchors

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.

[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[3] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. Sst: Single-stream temporal action proposals. In CVPR, pages 6373–6382. IEEE, 2017.

[4] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, pages 961–970, 2015.

[5] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In CVPR, pages 1130–1139, 2018.

[6] Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, and Jiebo Luo. Localizing natural language in videos. In AAAI, 2019.

[7] Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. Video re-localization. In ECCV, pages 51–66, 2018.

[8] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. TALL: temporal activity localization via language query. In ICCV, pages 5277–5285. IEEE, 2017.

[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[10] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In ICCV, pages 5803–5812, 2017.

[11] Tianwei Lin, Xu Zhao, and Zheng Shou. Single shot temporal action detection. In MM, pages 988–, 2017.

[12] Lili Mou, Rui Men, Ge Li, Yan Xu, Lu Zhang, Rui Yan, and Zhi Jin. Natural language inference by tree-based convolution and heuristic matching. arXiv preprint arXiv:1512.08422, 2015.

[13] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.

[14] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In CVPR, pages 6752–6761, 2018.

[15] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.

[16] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In CVPR, pages 1049–1058, 2016.

[17] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In CVPR, pages 1417–1426. IEEE, 2017.

[18] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.

[19] Bharat Singh, Tim K Marks, Michael Jones, Oncel Tuzel, and Ming Shao. A multi-stream bi-directional recurrent neural network for fine-grained action detection. In CVPR, pages 1961–1970, 2016.

[20] Khurram Soomro and Mubarak Shah. Unsupervised action discovery and localization in videos. In CVPR, pages 696–705, 2017.

[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, pages 4489–4497, 2015.

[22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.

[23] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, 2017.

[24] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, pages 7794–7803, 2018.

[25] Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: region convolutional 3d network for temporal activity detection. In ICCV, pages 5794–5803, 2017.

[26] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In CVPR, pages 2678–2687, 2016.

[27] Zhu Zhang, Zhijie Lin, Zhou Zhao, and Zhenxin Xiao. Cross-modal interaction networks for query-based moment retrieval in videos. In SIGIR. ACM, 2019.

[28] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, 2017.
