Emergence of Exploratory Look-Around Behaviors through Active Observation Completion

Dinesh Jayaraman; Kristen Grauman; Santhosh K. Ramakrishnan

arxiv: 1906.11407 · v1 · pith:GUSJ2ND7new · submitted 2019-06-27 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-25 15:17 UTC · model grok-4.3

keywords: active perception · reinforcement learning · observation completion · look-around behavior · uncertainty reduction · sidekick policy learning · visual exploration · generalization

The pith

Training an agent to complete partial observations by reducing uncertainty produces policies that generalize to useful look-around behaviors in other active perception tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how an agent can learn to acquire informative visual observations on its own. It trains the agent with reinforcement learning, rewarding it for choosing glimpses that reduce uncertainty about the unseen parts of a scene before inferring the full environment. A sidekick policy learning method handles sparse rewards by using extra information available only during training. If this works, exploratory look-around behavior emerges from one self-supervised objective and transfers to multiple perception tasks without needing new rewards or retraining for each.

Core claim

The paper claims that the proposed reinforcement learning methods, which train agents to complete partial observations via uncertainty reduction and use sidekick policy learning, learn observation policies that not only succeed at the completion task but also generalize to exhibit useful look-around behavior for a range of active perception tasks.

What carries the argument

The central mechanism is a reinforcement learning policy trained to select short sequences of glimpses that minimize uncertainty when inferring the full environment, combined with sidekick policy learning that exploits greater observability at training time.
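
As a rough illustration of the mechanism described above, the following Python sketch runs one glimpse episode on a toy viewgrid. Everything in it is an assumption made for illustration: views are single numbers rather than images, `complete` fills unseen views with the mean of the seen ones instead of using a learned decoder, and the names `run_episode`, `complete`, `completion_error`, `random_policy`, and `sweep_policy` are hypothetical rather than taken from the paper.

```python
import random


def complete(viewgrid, seen):
    """Toy completion: copy seen views, fill unseen ones with the mean of seen views."""
    seen_vals = [viewgrid[i] for i in seen]
    fill = sum(seen_vals) / len(seen_vals)
    return [viewgrid[i] if i in seen else fill for i in range(len(viewgrid))]


def completion_error(viewgrid, estimate):
    """Mean squared error between the true and the inferred viewgrid."""
    return sum((a - b) ** 2 for a, b in zip(viewgrid, estimate)) / len(viewgrid)


def run_episode(viewgrid, policy, num_glimpses, start=0):
    """Take a fixed budget of glimpses, then score the final completion.

    The reward is the negative completion error and is only computed after
    the last glimpse, which is what makes the reward signal sparse.
    """
    seen = {start}
    for _ in range(num_glimpses - 1):
        seen.add(policy(viewgrid, seen))
    estimate = complete(viewgrid, seen)
    return -completion_error(viewgrid, estimate), seen


def random_policy(viewgrid, seen):
    """Baseline: pick any view index uniformly at random (may revisit a view)."""
    return random.randrange(len(viewgrid))


def sweep_policy(viewgrid, seen):
    """Hand-written, non-learned policy: visit the lowest-numbered unseen view."""
    unseen = [i for i in range(len(viewgrid)) if i not in seen]
    return unseen[0] if unseen else 0


if __name__ == "__main__":
    grid = [0.0, 1.0, 2.0, 3.0]
    print(run_episode(grid, sweep_policy, num_glimpses=2))
    print(run_episode(grid, random_policy, num_glimpses=2))
```

In this toy setting a single glimpse of `[0.0, 1.0, 2.0, 3.0]` from view 0 yields a reward of -3.5, and the reward rises toward zero as more distinct views are seen. A learned policy would replace `sweep_policy` and would be trained to maximize this end-of-episode reward.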

If this is right

The policies succeed at the trained task of inferring full scenes from partial glimpses.
The same policies exhibit useful look-around behavior on other active perception tasks without retraining.
Exploratory behavior arises without designing separate rewards for each new task.
Sidekick policy learning mitigates sparse rewards during the initial training phase.
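
One way to picture how extra training-time observability can turn a sparse reward into a denser one is the sketch below. It is a toy under stated assumptions, not the paper's sidekick formulation: views are single numbers, the per-view score and the linear annealing schedule are illustrative choices, and `view_scores`, `shaped_reward`, and `anneal_steps` are hypothetical names.

```python
def view_scores(viewgrid):
    """Score each view using the full viewgrid, which is only available during training.

    Toy score: the negative mean squared error obtained when the whole grid
    is approximated by this one view's value (higher means more informative).
    """
    n = len(viewgrid)
    return [-sum((x - v) ** 2 for x in viewgrid) / n for v in viewgrid]


def shaped_reward(task_reward, visited_view, scores, step, anneal_steps):
    """Add a per-step bonus for the visited view, decayed linearly to zero.

    After `anneal_steps` training steps the bonus vanishes, so the agent ends
    up optimizing only the original task reward and needs no full
    observation at test time.
    """
    weight = max(0.0, 1.0 - step / anneal_steps)
    return task_reward + weight * scores[visited_view]


if __name__ == "__main__":
    scores = view_scores([0.0, 1.0, 2.0])
    print(shaped_reward(0.0, 1, scores, step=0, anneal_steps=100))
    print(shaped_reward(0.0, 1, scores, step=100, anneal_steps=100))
```

The design point this is meant to show is the asymmetry: `view_scores` needs the complete viewgrid, so it can only run while training, whereas the policy that is eventually deployed never calls it.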

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Uncertainty reduction may act as a broadly useful objective for bootstrapping exploration in visual agents.
The training setup could apply in simulation environments where full scene access is available only while learning the policy.
Similar single-objective training might transfer to related problems like active object search or mapping.

Load-bearing premise

That training solely to reduce uncertainty in observation completion will produce exploratory policies that transfer to other active perception tasks without task-specific rewards or fine-tuning.

What would settle it

A direct test where the learned policies show no performance gain over random glimpse selection or non-exploratory baselines on a held-out active perception task such as object classification from limited views.
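
A comparison of this kind could be set up as a paired test over episodes, as in the hedged sketch below. It assumes per-episode scores (for example, classification accuracy after a fixed number of views) have already been collected for the learned policy and for a random-glimpse baseline on the same episodes. The function names are hypothetical, and the exact sign-flip test is a generic statistical choice made here, not a procedure from the paper.

```python
from itertools import product


def paired_gains(candidate_scores, baseline_scores):
    """Per-episode score difference: candidate policy minus baseline policy."""
    return [c - b for c, b in zip(candidate_scores, baseline_scores)]


def sign_flip_p_value(gains):
    """Exact one-sided sign-flip test for 'the candidate is no better than the baseline'.

    Enumerates every way of flipping the sign of each gain and returns the
    fraction of assignments whose total gain is at least the observed total.
    The cost grows as 2**len(gains), so this suits small episode counts only.
    """
    observed = sum(gains)
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(gains)):
        total += 1
        if sum(s * g for s, g in zip(signs, gains)) >= observed:
            count += 1
    return count / total


if __name__ == "__main__":
    gains = paired_gains([3, 5, 4], [1, 5, 2])
    print(gains, sign_flip_p_value(gains))
```

A large p-value on a held-out task would be the outcome described above: no detectable gain over random glimpse selection.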

Figures

Figures reproduced from arXiv: 1906.11407 by Dinesh Jayaraman, Kristen Grauman, Santhosh K. Ramakrishnan.

**Figure 2.** Approach overview: The agent (actor) encodes individual views from the environment…

**Figure 3.** Scene and object completion accuracy under different agent behaviors. Top plots…

**Figure 4.** Episodes of active observation completion for SUN360 (left) and ModelNet (right).

**Figure 5.** Three examples of reconstructions after T = 6 glimpses (in order to generate more complete images). The first column shows the ground-truth viewgrids (equirectangular projections for SUN), the second column shows the corresponding GAN-refined reconstructions of lookaround and rnd-actions agents, and the third column shows handpicked unseen views (marked on the ground-truth) and the corresponding angles. P…

**Figure 6.** The ground truth 360 panorama or viewgrid, agent glimpse inputs, and final GAN…

**Figure 7.** Architecture of our active observation completion system. While the input-output…

Original abstract

Standard computer vision systems assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself. We address the problem of learning to look around: how can an agent learn to acquire informative visual observations? We propose a reinforcement learning solution, where the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment. Specifically, the agent is trained to select a short sequence of glimpses after which it must infer the appearance of its full environment. To address the challenge of sparse rewards, we further introduce sidekick policy learning, which exploits the asymmetry in observability between training and test time. The proposed methods learn observation policies that not only perform the completion task for which they are trained, but also generalize to exhibit useful "look-around" behavior for a range of active perception tasks.
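
The objective described in the abstract can be written down compactly. The notation below is assumed for illustration and is not quoted from the paper: $X$ is the full environment, $x_t$ the glimpse obtained at step $t$, $T$ the glimpse budget, $\pi_\theta$ the glimpse-selection policy, $f_\phi$ the completion model, and $d$ some reconstruction distance such as mean squared pixel error.

```latex
a_t \sim \pi_\theta(\,\cdot \mid x_1, \dots, x_t), \qquad
\hat{X}_T = f_\phi(x_1, \dots, x_T), \qquad
R = -\, d\big(\hat{X}_T,\, X\big), \qquad
\max_{\theta,\,\phi}\; \mathbb{E}\,[R]
```

Under this reading the reward $R$ depends only on the final estimate $\hat{X}_T$, which is why the abstract describes the reward signal as sparse.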

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The sidekick policy learning trick is the actual contribution worth noting, but the zero-shot transfer claim for look-around behavior needs stronger evidence than the abstract provides.

The main thing to know about this paper is that it trains reinforcement learning agents to select short sequences of visual glimpses by rewarding them for completing the full scene observation afterward. It also introduces sidekick policy learning, which deals with sparse rewards by letting a helper policy use full information during training. What is actually new is this sidekick mechanism, which exploits the difference between what the agent can see at training time and at test time. The paper does well in identifying the challenge of learning exploratory behavior without task-specific supervision and in proposing a reward based on uncertainty reduction through completion.

The soft spots are around the generalization claim. The abstract asserts that the learned policies transfer to a range of active perception tasks, but this depends on the exploratory behavior being task-agnostic. If the policy ends up exploiting patterns specific to the completion network, like particular ways of reducing pixel uncertainty, then transfer to unrelated tasks could be limited. The stress-test note is on point here; the experiments would need to demonstrate zero-shot usefulness on things like object search or 3D reconstruction without fine-tuning. Since the initial review was abstract only, the full paper's results section is key to checking whether ablations support the broad claim.

The formulation looks like a standard RL setup with an added training trick, and there is no sign of circular reasoning. The citation pattern covers relevant prior work in active vision. This paper is aimed at researchers in active vision, embodied AI, and RL for robotics. Someone building perception systems for unstructured environments could get ideas from the approach. I would send it to peer review because the core idea is solid and the problem matters, even if the transfer results might require more work to convince.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a reinforcement learning framework in which an agent learns to select short sequences of visual glimpses to reduce uncertainty about unobserved portions of its environment, thereby completing partial observations. To address sparse rewards, the authors introduce sidekick policy learning that exploits an asymmetry in observability between training and test time. The central claim is that policies trained solely on this observation-completion objective generalize zero-shot to exhibit useful exploratory look-around behavior across a range of unrelated active-perception tasks without task-specific rewards or fine-tuning.

Significance. If the zero-shot generalization results are robustly demonstrated, the work would be significant for active vision: it offers a self-supervised route to task-agnostic exploration policies, reducing reliance on hand-crafted rewards for each downstream perception problem. The sidekick technique is a practical contribution to sparse-reward RL in partially observable visual settings.

major comments (2)

[§4] Experiments: the generalization claim requires explicit zero-shot transfer results on tasks that are demonstrably unrelated to observation completion (e.g., object detection or navigation); without quantitative metrics, baselines, and ablations showing that performance does not collapse when the reconstruction head is removed, the claim that the behavior is task-agnostic remains unverified.
[§3.2] Sidekick policy learning: the formulation must clarify whether the sidekick policy is trained with access to ground-truth full observations only during training or whether any auxiliary loss inadvertently leaks test-time information; if the latter, the learned glimpse-selection policy may overfit to completion-specific uncertainty patterns rather than producing general exploration.

minor comments (2)

[§3] Notation for the uncertainty reward and the glimpse-selection action space should be defined once in §3 and used consistently; currently the abstract and method use slightly different phrasing for the same quantities.
[Figures] Figure captions should state whether error bars represent standard deviation across seeds or across environments, and whether the reported numbers are from the same policy checkpoint used for all downstream tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our contributions. We address each major point below and indicate where revisions will be made to the manuscript.

Referee: [§4] Experiments: the generalization claim requires explicit zero-shot transfer results on tasks that are demonstrably unrelated to observation completion (e.g., object detection or navigation); without quantitative metrics, baselines, and ablations showing that performance does not collapse when the reconstruction head is removed, the claim that the behavior is task-agnostic remains unverified.

Authors: The active-perception tasks evaluated in the original manuscript (active object recognition and similar look-around problems) are unrelated to the observation-completion training objective, as they involve different reward structures and goals at test time. Nevertheless, to directly address the concern, the revised manuscript will include additional zero-shot transfer experiments on navigation and object detection, along with the requested quantitative metrics, baselines, and an ablation removing the reconstruction head. This strengthens the evidence without altering the core claims.

Revision: partial

Referee: [§3.2] Sidekick policy learning: the formulation must clarify whether the sidekick policy is trained with access to ground-truth full observations only during training or whether any auxiliary loss inadvertently leaks test-time information; if the latter, the learned glimpse-selection policy may overfit to completion-specific uncertainty patterns rather than producing general exploration.

Authors: The sidekick policy receives ground-truth full observations exclusively during training to generate dense rewards; at test time the policy has no access to full observations or any auxiliary signals derived from them. No auxiliary loss uses or leaks test-time information. We will revise §3.2 to state this asymmetry explicitly and emphasize that the resulting policy is not specialized to completion-specific uncertainty.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained RL formulation

The paper defines an RL objective that rewards uncertainty reduction on an observation-completion task and augments it with sidekick policy learning that exploits an explicit training/test observability asymmetry. No load-bearing step equates a claimed prediction or generalization result to its own fitted inputs or to a self-citation chain; the generalization behavior is presented as an empirical outcome of the learned policy rather than a mathematical identity or renamed input. The derivation therefore remains independent of the target transfer results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard RL assumptions plus the novel sidekick technique; no explicit free parameters or invented physical entities are named in the abstract, but the method implicitly depends on typical RL training choices and the domain assumption of glimpse-based partial observability.

free parameters (1)

RL training hyperparameters
Standard learning rates, discount factors, and reward scaling are required for any RL implementation but are not specified in the abstract.
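
To make the role of two of these hyperparameters concrete, the sketch below computes discounted returns from a reward sequence. It is the textbook definition rather than anything specific to the paper; the function name `discounted_returns` and the `reward_scale` parameter are illustrative assumptions, and no values used by the authors are implied.

```python
def discounted_returns(rewards, gamma, reward_scale=1.0):
    """Return G_t = sum over k of gamma**k * reward_scale * r_{t+k}, for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one computed for the next step.
    for t in reversed(range(len(rewards))):
        running = reward_scale * rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # A sparse episode: the only reward arrives at the final step.
    print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))
```

With a single terminal reward, a smaller discount factor shrinks the return credited to early glimpses, which is one reason these choices matter for a sparse-reward objective.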

axioms (1)

Domain assumption: partial visual observations can be obtained via discrete glimpses, and uncertainty can be quantified for reward computation.
Invoked in the description of the reward signal and the completion task.

invented entities (1)

sidekick policy learning (no independent evidence)
purpose: Exploit asymmetry in observability between training and test time to address sparse rewards
New auxiliary training procedure introduced to make the main RL objective tractable

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment... select a short sequence of glimpses after which it must infer the appearance of its full environment

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

sidekick policy learning, which exploits the asymmetry in observability between training and test time

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1]

remember

The transfer performance of our policies is better than that of rnd-actions on all tasks. This shows that intelligent sequential camera control has scope for improving these perception tasks' efficiency. Overall, our look-around policy transfers well across tasks, competing with or even outperforming the supervised task-specific policies. Furthermore, our l...

[2]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015

[3]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014

[4]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

[5]

Development of three-dimensional object completion in infancy

Kasey C Soska and Scott P Johnson. Development of three-dimensional object completion in infancy. In Child development, 2008

[6]

Systems in development: motor skill acquisition facilitates three-dimensional object completion

Kasey C Soska, Karen E Adolph, and Scott P Johnson. Systems in development: motor skill acquisition facilitates three-dimensional object completion. In Developmental psychology, 2010

[7]

Perception of partly occluded objects in infancy

Philip J Kellman and Elizabeth S Spelke. Perception of partly occluded objects in infancy. In Cognitive psychology, 1983

[8]

Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search

Antonio Torralba, Aude Oliva, Monica S Castelhano, and John M Henderson. Contextual guidance of eye movements and attention in real-world scenes: the role of global features in object search. In Psychological review, 2006

[9]

Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion

Dinesh Jayaraman and Kristen Grauman. Look-ahead before you leap: end-to-end active recognition by forecasting the effect of motion. In European Conference on Computer Vision, 2016

[10]

Deep q-learning for active recognition of germs: Baseline performance on a standardized dataset for active learning

Mohsen Malmir, Karan Sikka, Deborah Forster, Javier R Movellan, and Garison Cottrell. Deep q-learning for active recognition of germs: Baseline performance on a standardized dataset for active learning. In British Machine Vision Conference, 2015

[11]

3d shapenets: A deep representation for volumetric shapes

Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Computer Vision and Pattern Recognition, IEEE Conference on, 2015

[12]

A dataset for developing and benchmarking active vision

Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Košecká, and Alexander C Berg. A dataset for developing and benchmarking active vision. In Robotics and Automation, IEEE International Conference on, 2017

[13]

End-to-end learning of action detection from frame glimpses in videos

Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Computer Vision and Pattern Recognition, IEEE Conference on, 2016

[14]

Reinforcement learning for visual object detection

S. Mathe, A. Pirinen, and C. Sminchisescu. Reinforcement learning for visual object detection. In Computer Vision and Pattern Recognition, IEEE Conference on, 2016

[15]

Timely object recognition

S. Karayev, T. Baumgartner, M. Fritz, and T. Darrell. Timely object recognition. In Advances in Neural Information Processing Systems, 2012

[16]

Curiosity-driven exploration by self-supervised prediction

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017

[17]

Learning exploration policies for navigation

Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In International Conference on Learning Representations, 2019

[18]

Learn-to-score: Efficient 3d scene exploration by predicting view utility

Benjamin Hepp, Debadeepta Dey, Sudipta N. Sinha, Ashish Kapoor, Neel Joshi, and Otmar Hilliges. Learn-to-score: Efficient 3d scene exploration by predicting view utility. In The European Conference on Computer Vision, September 2018

[19]

Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view

Shuran Song, Andy Zeng, Angel X Chang, Manolis Savva, Silvio Savarese, and Thomas Funkhouser. Im2pano3d: Extrapolating 360 structure and semantics beyond the field of view. In Computer Vision and Pattern Recognition, IEEE Conference on, pages 3847–3856, 2018

[20]

Deep view morphing

Dinghuang Ji, Junghyun Kwon, Max McFarland, and Silvio Savarese. Deep view morphing. In Computer Vision and Pattern Recognition, IEEE Conference on, volume 2, 2017

[21]

Deep convolutional inverse graphics network

Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in neural information processing systems, pages 2539–2547, 2015

[22]

Shapecodes: Self-supervised feature learning by lifting views to viewgrids

Dinesh Jayaraman, Ruohan Gao, and Kristen Grauman. Shapecodes: Self-supervised feature learning by lifting views to viewgrids. European Conference on Computer Vision, 2018

[23]

Neural scene representation and rendering

SM Ali Eslami, Danilo Jimenez Rezende, Frederic Besse, Fabio Viola, Ari S Morcos, Marta Garnelo, Avraham Ruderman, Andrei A Rusu, Ivo Danihelka, Karol Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018

[24]

Learning to look around: Intelligently exploring unseen environments for unknown tasks

Dinesh Jayaraman and Kristen Grauman. Learning to look around: Intelligently exploring unseen environments for unknown tasks. In Computer Vision and Pattern Recognition, IEEE Conference on, 2018

[25]

Sidekick Policy Learning for Active Visual Exploration

Santhosh K. Ramakrishnan and Kristen Grauman. Sidekick Policy Learning for Active Visual Exploration. In European Conference on Computer Vision, 2018

[26]

Pairwise decomposition of image sequences for active multi-view recognition

Edward Johns, Stefan Leutenegger, and Andrew J Davison. Pairwise decomposition of image sequences for active multi-view recognition. In Computer Vision and Pattern Recognition, IEEE Conference on, 2016

[27]

Visual Semantic Planning using Deep Successor Representations

Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. Visual Semantic Planning using Deep Successor Representations. In Computer Vision, IEEE International Conference on, 2017

[28]

Unifying Map and Landmark Based Representations for Visual Navigation

Saurabh Gupta, David Fouhey, Sergey Levine, and Jitendra Malik. Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125, 2017

[29]

Target-driven visual navigation in indoor scenes using deep reinforcement learning

Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation, IEEE International Conference on, 2017

[30]

End-to-end policy learning for active visual categorization

D. Jayaraman and K. Grauman. End-to-end policy learning for active visual categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018

[31]

Deep learning for real-time atari game play using offline monte-carlo tree search planning

Xiaoxiao Guo, Satinder Singh, Honglak Lee, Richard L Lewis, and Xiaoshi Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in Neural Information Processing Systems, 2014

[32]

Learning with intelligent teacher

Vladimir Vapnik and Rauf Izmailov. Learning with intelligent teacher. In Symposium on Conformal and Probabilistic Prediction with Applications, 2016

[33]

Recognizing scene viewpoint using panoramic place representation

Jianxiong Xiao, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Recognizing scene viewpoint using panoramic place representation. In Computer Vision and Pattern Recognition, IEEE Conference on, 2012

[34]

Graph-based visual saliency

Jonathan Harel, Christof Koch, and Pietro Perona. Graph-based visual saliency. In Advances in Neural Information Processing Systems, 2006

[35]

Image-to-image translation with conditional adversarial networks

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Computer Vision and Pattern Recognition, IEEE Conference on, pages 5967–5976. IEEE, 2017

[36]

3d-r2n2: A unified approach for single and multi-view 3d object reconstruction

Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), 2016

[37]

A point set generation network for 3d object reconstruction from a single image

Haoqiang Fan, Hao Su, and Leonidas J. Guibas. A point set generation network for 3d object reconstruction from a single image. In Computer Vision and Pattern Recognition, IEEE Conference on, July 2017

[38]

Pixel2Mesh: Generating 3D Mesh Models from Single RGB Images

Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. arXiv preprint arXiv:1804.01654, 2018

[39]

Carla: An open urban driving simulator

Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on Robot Learning, 2017

[40]

Asymmetric actor critic for image-based robot learning

Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning. Robotics: Science and Systems, 2018

[41]

Embodied Question Answering

Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied Question Answering. In Computer Vision and Pattern Recognition, IEEE Conference on, 2018

[42]

Building Generalizable Agents with a Realistic and Rich 3D Environment

Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018

[43]

Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Computer Vision and Pattern Recognition, IEEE Conference on, 2018

[44]

Semi-parametric topological memory for navigation

Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. International Conference on Learning Representations, 2018

[45]

World Models

David Ha and Jürgen Schmidhuber. World models. arXiv preprint arXiv:1803.10122, 2018

[46]

Learning real-world robot policies by dreaming

AJ Piergiovanni, Alan Wu, and Michael S Ryoo. Learning real-world robot policies by dreaming. arXiv preprint arXiv:1805.07813, 2018

[47]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

[48]

Learning stochastic feedforward networks

Radford M Neal. Learning stochastic feedforward networks. Department of Computer Science, University of Toronto, 64(9), 1990

[49]

Reinforcement learning: An introduction

Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction

[50]

End to End Learning for Self-Driving Cars

Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016

[51]

A machine learning approach to visual perception of forest trails for mobile robots

Alessandro Giusti, Jérôme Guzzi, Dan C Cireşan, Fang-Lin He, Juan P Rodríguez, Flavio Fontana, Matthias Faessler, Christian Forster, Jürgen Schmidhuber, Gianni Di Caro, et al. A machine learning approach to visual perception of forest trails for mobile robots. IEEE Robotics and Automation Letters, 2016

[52]

One-shot imitation learning

Yan Duan, Marcin Andrychowicz, Bradly Stadie, OpenAI Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. In Advances in Neural Information Processing Systems, 2017

[53]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014

[54]

Spherenet: Learning spherical representations for detection and classification in omnidirectional images

Benjamin Coors, Alexandru Paul Condurache, and Andreas Geiger. Spherenet: Learning spherical representations for detection and classification in omnidirectional images. In Proceedings of the European Conference on Computer Vision (ECCV), 2018

[55]

environment

For simplicity of presentation, we represent an "environment" as X where the agent explores a novel scene, looking outward in new viewing directions. However, experiments will also use X as an object where the agent moves around an object, looking inward at it from new viewing angles. Figure 1 illustrates the two scenarios

[56]

The angles were selected to break symmetry and reduce redundancy of views

[57]

For the sake of brevity, we report the better performance of the two sidekick variants we proposed in (24)

[58]

grid-of-grids

We refine the decoded viewgrids (for both our method and the baseline) with a pix2pix (34)-style conditional Generative Adversarial Network (GAN), detailed in the Supplementary Materials.

Acknowledgements: We thank Yu-Chuan Su, Kimberly Hsiao, Bo Xiong and Philipp Krähenbühl for helpful discussions. Funding: The University of Texas at Austin is suppo...