
arxiv: 1906.11951 · v1 · pith:KVZXHVLFnew · submitted 2019-06-27 · 💻 cs.LG · cs.CV · stat.ML

Supervise Thyself: Examining Self-Supervised Representations in Interactive Environments

Pith reviewed 2026-05-25 14:29 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords self-supervised learning · representation learning · interactive environments · Flappy Bird · Sonic the Hedgehog · visual features · state capture · generalizability

The pith

The usefulness of self-supervised representations in games depends heavily on the environment's visuals and dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Self-supervised methods let agents learn representations by observing the outcomes of their own actions, which is useful in environments without dense rewards or labels. The paper tests several such methods on Flappy Bird and Sonic the Hedgehog, measuring how well the representations capture the true agent state and how well they generalize to new levels or textures. It also visualizes which parts of the screen the representations attend to. The central result is that no method performs best in all cases; instead, the value of each representation depends on the specific visuals and movement rules of the game being played.

Core claim

Our results show that the utility of the representations is highly dependent on the visuals and dynamics of the environment.

What carries the argument

Two evaluation contexts: the extent to which the representations capture true state information of the agent, and how generalizable the representations are to novel situations such as new levels and textures.
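
One common way to make these two contexts concrete is a linear probe: freeze the encoder, fit a linear map from its features to a ground-truth state variable, and compare probe quality on seen versus unseen conditions. The sketch below illustrates that idea only. It is not the paper's protocol: the synthetic features, the noise model standing in for "novel levels and textures", and the helper names `fit_linear_probe`, `r2_score`, and `make_split` are all assumptions made for the example.

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def fit_linear_probe(feats, targets, lr=0.1, epochs=500):
    """Fit targets ~ w . feats + b by full-batch gradient descent on squared error."""
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, targets):
            err = dot(w, x) + b - y
            grad_b += err / n
            for j in range(dim):
                grad_w[j] += err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def r2_score(w, b, feats, targets):
    """Coefficient of determination of the probe's predictions."""
    mean_y = sum(targets) / len(targets)
    ss_res = sum((dot(w, x) + b - y) ** 2 for x, y in zip(feats, targets))
    ss_tot = sum((y - mean_y) ** 2 for y in targets)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)

def make_split(n, feature_noise):
    """Synthetic stand-in: frozen features are latents plus optional distortion."""
    feats, targets = [], []
    for _ in range(n):
        latent = [rng.uniform(-1, 1) for _ in range(3)]
        # Hypothetical "true state" (e.g. an agent coordinate), linear in the latents.
        targets.append(2.0 * latent[0] - latent[1] + 0.5)
        feats.append([v + rng.gauss(0, feature_noise) for v in latent])
    return feats, targets

train_x, train_y = make_split(200, feature_noise=0.0)  # seen conditions
novel_x, novel_y = make_split(200, feature_noise=0.5)  # assumed shift for unseen conditions

w, b = fit_linear_probe(train_x, train_y)
r2_train = r2_score(w, b, train_x, train_y)  # state capture
r2_novel = r2_score(w, b, novel_x, novel_y)  # generalizability
print(f"R^2 seen: {r2_train:.3f}  R^2 novel: {r2_novel:.3f}")
```

The gap between the two scores is one possible summary of how much state information survives a change in visuals; a real study would replace the synthetic splits with features extracted from held-out levels.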

If this is right

  • Representations from one self-supervised method may suit environments with certain visuals while another method suits environments with different dynamics.
  • State capture and generalizability can trade off, so a representation that scores high on one may score low on the other.
  • Visualizing attention can reveal whether a representation focuses on task-relevant objects or on background elements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pretraining choices for control agents may need to be tuned per environment rather than applied uniformly across games.
  • The same dependency could appear in robotics settings where camera images and physics vary across tasks.
  • Combining multiple self-supervised objectives might reduce sensitivity to a single environment's visuals and dynamics.

Load-bearing premise

That the two evaluation contexts are sufficient proxies for determining which representations best capture meaningful features for downstream tasks such as control or exploration.

What would settle it

Running the learned representations as input features in an actual control or exploration task and finding that the method with highest state-capture and generalizability scores does not produce the best downstream performance.
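
The check described above amounts to comparing two rankings of the same methods. A minimal sketch of that comparison follows, assuming one proxy score and one downstream return per method. The method labels and every number are invented placeholders for illustration, not results from the paper, and the helpers `ranks`, `spearman`, and `same_winner` are hypothetical.

```python
def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def same_winner(xs, ys):
    """True if the same method is top-ranked under both measurements."""
    best_x = max(range(len(xs)), key=lambda i: xs[i])
    best_y = max(range(len(ys)), key=lambda i: ys[i])
    return best_x == best_y

# Placeholder values only: one entry per hypothetical method.
methods = ["method_a", "method_b", "method_c", "method_d"]
proxy_score = [0.9, 0.7, 0.4, 0.2]            # e.g. probe accuracy
downstream_return = [10.0, 30.0, 20.0, 5.0]   # e.g. mean episode return

rho = spearman(proxy_score, downstream_return)
agree = same_winner(proxy_score, downstream_return)
print(f"rank correlation: {rho:.2f}, same top method: {agree}")
```

A low correlation or a mismatch in the top-ranked method would be the kind of evidence that the proxies do not predict downstream performance.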

Figures

Figures reproduced from arXiv: 1906.11951 by Christopher Pal, Evan Racah.

Figure 1: General architecture for self-supervised embedding. Shown for Flappy Bird. Two or three frames are each input to the base encoder, then the outputs from the encoder, φ(x), are concatenated and passed to a linear softmax layer that classifies either a) "how many time steps are between a pair of frames?" for the TDC model (Aytar et al., 2018), b) "what action was taken to go from the first frame to second?" fo…
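
The data flow in this caption can be sketched in a few lines: a shared encoder is applied to each frame, the feature vectors are concatenated, and a linear softmax layer produces class probabilities (temporal-distance buckets or actions, depending on the task). The sketch below is only an illustration of that flow. The single-layer `encode` function is a stand-in for the convolutional encoder, and the sizes, random weights, and function names are assumptions rather than details from the paper.

```python
import math
import random

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def encode(frame, enc_weights):
    """Stand-in for the base encoder phi(x): a single linear layer with ReLU."""
    return [max(0.0, sum(w * p for w, p in zip(row, frame))) for row in enc_weights]

def classify_pair(frame_a, frame_b, enc_weights, head_weights, head_bias):
    """Encode both frames with shared weights, concatenate, apply a linear softmax layer."""
    joint = encode(frame_a, enc_weights) + encode(frame_b, enc_weights)
    logits = [sum(w * h for w, h in zip(row, joint)) + b
              for row, b in zip(head_weights, head_bias)]
    return softmax(logits)

rng = random.Random(0)
NUM_PIXELS, FEAT_DIM, NUM_CLASSES = 16, 4, 3  # hypothetical sizes

enc_weights = [[rng.uniform(-0.5, 0.5) for _ in range(NUM_PIXELS)]
               for _ in range(FEAT_DIM)]
head_weights = [[rng.uniform(-0.5, 0.5) for _ in range(2 * FEAT_DIM)]
                for _ in range(NUM_CLASSES)]
head_bias = [0.0] * NUM_CLASSES

frame_first = [rng.random() for _ in range(NUM_PIXELS)]   # flattened toy frame
frame_second = [rng.random() for _ in range(NUM_PIXELS)]

probs = classify_pair(frame_first, frame_second, enc_weights, head_weights, head_bias)
print(probs)
```

Only the label changes between the tasks named in the caption; the encoder and the concatenate-then-classify head keep the same shape.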
Figure 2: Qualitative Inspection of Feature Maps. Flappy Bird feature maps from the last conv layer of the encoder superimposed on top of a sequence of frames they are a function of. Red pixels are high values, blue are low values.
Figure 3: Sonic feature maps from the last conv layer of the encoder superimposed on top of the frames they are a function of, from left: random CNN, VAE, inverse model, tuple verification, and temporal distance classification. Red is high values, blue is low values. (Mnih et al., 2016), using empirical returns from extrinsic rewards as a measure of utility of each feature space. Lastly, trying to infer the posi…
Figure 4: Predicting in Feature Space. Architecture for predicting in feature space: an embedding at time step t is concatenated with the action at time t and put through a linear layer to get the predicted embedding at time step t+1.
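
As a rough illustration of this caption, the forward model reduces to one linear layer over the concatenation of the current embedding and a one-hot action. The sketch below assumes a one-hot action encoding and a squared-error loss, neither of which is stated in the caption; the weights, dimensions, and function names are hypothetical.

```python
def one_hot(action, num_actions):
    return [1.0 if i == action else 0.0 for i in range(num_actions)]

def predict_next_embedding(z_t, action, num_actions, weights, bias):
    """Linear layer applied to [embedding at t ; one-hot action at t]."""
    inp = z_t + one_hot(action, num_actions)  # list concatenation
    return [sum(w * x for w, x in zip(row, inp)) + b
            for row, b in zip(weights, bias)]

def mse(pred, target):
    """Assumed training loss: mean squared error in feature space."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Toy setup: 2-dim embedding, 2 actions. Each weight row has 2 + 2 entries.
weights = [
    [1.0, 0.0, 0.0, -0.2],  # dim 0: copy z[0], shift by -0.2 when action 1 is taken
    [0.0, 1.0, 0.0, 0.0],   # dim 1: copy z[1]
]
bias = [0.0, 0.1]

z_t = [0.5, 1.0]
z_pred = predict_next_embedding(z_t, action=1, num_actions=2, weights=weights, bias=bias)
loss = mse(z_pred, [0.3, 1.1])  # compare against a made-up "observed" next embedding
print(z_pred, loss)
```

Because the prediction target is itself an embedding, the quality of this model depends on the encoder as much as on the linear layer.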
Figure 5: Qualitative Inspection of Feature Maps (longer version). Flappy Bird feature maps from the last conv layer of the encoder superimposed on top of a sequence of frames they are a function of. Red pixels are high values, blue are low values.
Original abstract

Self-supervised methods, wherein an agent learns representations solely by observing the results of its actions, become crucial in environments which do not provide a dense reward signal or have labels. In most cases, such methods are used for pretraining or auxiliary tasks for "downstream" tasks, such as control, exploration, or imitation learning. However, it is not clear which method's representations best capture meaningful features of the environment, and which are best suited for which types of environments. We present a small-scale study of self-supervised methods on two visual environments: Flappy Bird and Sonic The Hedgehog. In particular, we quantitatively evaluate the representations learned from these tasks in two contexts: a) the extent to which the representations capture true state information of the agent and b) how generalizable these representations are to novel situations, like new levels and textures. Lastly, we evaluate these self-supervised features by visualizing which parts of the environment they focus on. Our results show that the utility of the representations is highly dependent on the visuals and dynamics of the environment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents a small-scale empirical study comparing self-supervised representation learning methods in two visual interactive environments (Flappy Bird and Sonic the Hedgehog). Representations are evaluated quantitatively on (a) extent of true state capture and (b) generalizability to novel levels/textures, supplemented by saliency visualizations; the central claim is that representation utility is highly dependent on the visuals and dynamics of the environment.

Significance. If the proxy-based findings hold, the work usefully demonstrates environment-specific variation in self-supervised representations, providing a concrete basis for method selection in different visual/dynamics regimes. The comparative design across two distinct games and the inclusion of both quantitative proxies and visualizations are strengths for a small-scale study.

major comments (2)
  1. [Abstract] The framing states that the study addresses 'which method's representations best capture meaningful features of the environment, and which are best suited for which types of environments' in the context of downstream tasks (control, exploration, imitation), yet the reported results contain no direct measurements on those tasks and rely solely on the two proxy contexts; this makes the dependence claim less directly supported for the stated practical utility.
  2. [Evaluation sections] State capture and generalizability: the two proxy metrics are presented as sufficient to determine representation utility, but the manuscript provides no correlation analysis, ablation, or discussion showing that performance on these proxies predicts downstream task performance; without this link the central claim that utility 'is highly dependent on the visuals and dynamics' rests on an unverified assumption.
minor comments (2)
  1. The manuscript would benefit from explicit listing of the exact self-supervised methods compared, the precise definitions of the state-capture and generalizability metrics, and any statistical tests used to support the dependence conclusion.
  2. Saliency visualizations are mentioned but their quantitative relation to the proxy metrics is not detailed; adding this would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We respond to each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract] The framing states that the study addresses 'which method's representations best capture meaningful features of the environment, and which are best suited for which types of environments' in the context of downstream tasks (control, exploration, imitation), yet the reported results contain no direct measurements on those tasks and rely solely on the two proxy contexts; this makes the dependence claim less directly supported for the stated practical utility.

    Authors: The abstract motivates the work by referencing downstream tasks but then specifies that the evaluations use two proxy contexts (state capture and generalizability). To better align the framing with the actual results, we will revise the abstract to state explicitly that the study assesses representation utility via these proxies rather than through direct measurements on control, exploration, or imitation. This change will ensure the dependence claim is tied directly to the reported findings.

    Revision: yes

  2. Referee: [Evaluation sections] State capture and generalizability: the two proxy metrics are presented as sufficient to determine representation utility, but the manuscript provides no correlation analysis, ablation, or discussion showing that performance on these proxies predicts downstream task performance; without this link the central claim that utility 'is highly dependent on the visuals and dynamics' rests on an unverified assumption.

    Authors: We agree that the manuscript contains no explicit correlation analysis or ablation linking proxy performance to downstream task results. As a small-scale empirical study, the work centers on the proxies themselves. We will add a short discussion paragraph in the evaluation sections that (a) motivates the proxies by their relevance to feature capture and generalization and (b) acknowledges that predictive validity for downstream tasks is not demonstrated here and would require additional experiments. The central claim will be qualified to refer specifically to the observed variation across the two proxy contexts.

    Revision: partial

Circularity Check

0 steps flagged

Empirical comparison of self-supervised representations contains no circular derivation steps

Full rationale

The paper is an empirical study that trains several self-supervised models on Flappy Bird and Sonic, then measures two proxy quantities (state capture via linear probes or similar, and generalization to novel levels/textures) plus saliency maps. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-citations by construction. The abstract and results sections frame the work as an experimental comparison whose conclusions follow directly from the reported measurements on the chosen environments; no load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear. The evaluation is therefore self-contained against its own experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract contains no mathematical derivations, free parameters, axioms, or postulated entities; the contribution is an empirical comparison of existing self-supervised techniques.




Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 18 internal anchors

  2. [2]

    Learning to see by moving

    Agrawal, P., Carreira, J., and Malik, J. Learning to see by moving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 37–45, 2015

  3. [3]

    Learning to poke by poking: Experiential learning of intuitive physics

    Agrawal, P., Nair, A. V., Abbeel, P., Malik, J., and Levine, S. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pp. 5074–5082, 2016

  4. [4]

    Exploration by random distillation

    Anonymous. Exploration by random distillation. 2018. URL https://openreview.net/pdf?id=H1lJJnR5Ym. Submitted to ICLR 2019

  5. [5]

    A Theoretical Analysis of Contrastive Unsupervised Representation Learning

    Arora, S., Khandeparkar, H., Khodak, M., Plevrakis, O., and Saunshi, N. A theoretical analysis of contrastive unsupervised representation learning. arXiv preprint arXiv:1902.09229, 2019

  6. [6]

    Playing hard exploration games by watching YouTube

    Aytar, Y., Pfaff, T., Budden, D., Paine, T. L., Wang, Z., and de Freitas, N. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592, 2018

  7. [7]

    Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018

  8. [8]

    Contingency-Aware Exploration in Reinforcement Learning

    Choi, J., Guo, Y., Moczulski, M., Oh, J., Wu, N., Norouzi, M., and Lee, H. Contingency-aware exploration in reinforcement learning. arXiv preprint arXiv:1811.01483, 2018

  9. [9]

    SentEval: An Evaluation Toolkit for Universal Sentence Representations

    Conneau, A. and Kiela, D. Senteval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449, 2018

  10. [10]

    Modeling video evolution for action recognition

    Fernando, B., Gavves, E., Oramas, J. M., Ghodrati, A., and Tuytelaars, T. Modeling video evolution for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5378–5387, 2015

  11. [11]

    Self-supervised video representation learning with odd-one-out networks

    Fernando, B., Bilen, H., Gavves, E., and Gould, S. Self-supervised video representation learning with odd-one-out networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5729–5738. IEEE, 2017

  12. [12]

    World Models

    Ha, D. and Schmidhuber, J. World models. arXiv preprint arXiv:1803.10122, 2018

  13. [13]

    Unsupervised feature extraction by time-contrastive learning and nonlinear ICA

    Hyvarinen, A. and Morioka, H. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems, pp. 3765–3773, 2016

  14. [14]

    Hyvarinen, A., Sasaki, H., and Turner, R. E. Nonlinear ICA using auxiliary variables and generalized contrastive learning. arXiv preprint arXiv:1805.08651, 2018

  15. [15]

    Reinforcement Learning with Unsupervised Auxiliary Tasks

    Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016

  16. [16]

    Learning image representations tied to ego-motion

    Jayaraman, D. and Grauman, K. Learning image representations tied to ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1413–1421, 2015

  17. [17]

    Learning state representations with robotic priors

    Jonschkowski, R. and Brock, O. Learning state representations with robotic priors. Autonomous Robots, 39(3):407–428, 2015

  18. [18]

    PVEs: Position-Velocity Encoders for Unsupervised Learning of Structured State Representations

    Jonschkowski, R., Hafner, R., Scholz, J., and Riedmiller, M. PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805, 2017

  19. [19]

    Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  20. [20]

    Learning world models with self-supervised learning, 2018

    LeCun, Y. Learning world models with self-supervised learning, 2018. Presented at ICML workshop on Generative Modeling in RL

  21. [21]

    State representation learning for control: An overview

    Lesort, T., Díaz-Rodríguez, N., Goudou, J.-F., and Filliat, D. State representation learning for control: An overview. Neural Networks, 2018

  22. [22]

    Efficient Estimation of Word Representations in Vector Space

    Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013

  23. [23]

    Learning to Navigate in Complex Environments

    Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al. Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673, 2016

  24. [24]

    Shuffle and learn: unsupervised learning using temporal order verification

    Misra, I., Zitnick, C. L., and Hebert, M. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Springer, 2016

  25. [25]

    Asynchronous methods for deep reinforcement learning

    Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016

  26. [26]

    Gotta Learn Fast: A New Benchmark for Generalization in RL

    Nichol, A., Pfau, V., Hesse, C., Klimov, O., and Schulman, J. Gotta learn fast: A new benchmark for generalization in RL. arXiv preprint arXiv:1804.03720, 2018

  27. [27]

    Curiosity-driven exploration by self-supervised prediction

    Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. 2017a

  28. [28]

    Learning features by watching objects move

    Pathak, D., Girshick, R., Dollár, P., Darrell, T., and Hariharan, B. Learning features by watching objects move. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6024–6033. IEEE, 2017b

  29. [29]

    Deep contextualized word representations

    Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018

  30. [30]

    S-RL Toolbox: Environments, Datasets and Evaluation Metrics for State Representation Learning

    Raffin, A., Hill, A., Traoré, R., Lesort, T., Díaz-Rodríguez, N., and Filliat, D. S-RL toolbox: Environments, datasets and evaluation metrics for state representation learning. arXiv preprint arXiv:1809.09369, 2018

  31. [31]

    Time-Contrastive Networks: Self-Supervised Learning from Video

    Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., and Levine, S. Time-contrastive networks: Self-supervised learning from video. arXiv preprint arXiv:1704.06888, 2017

  32. [32]

    Loss is its own Reward: Self-Supervision for Reinforcement Learning

    Shelhamer, E., Mahmoudieh, P., Argus, M., and Darrell, T. Loss is its own reward: Self-supervision for reinforcement learning. arXiv preprint arXiv:1612.07307, 2016

  33. [33]

    Subramanian, S., Trischler, A., Bengio, Y., and Pal, C. J. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079, 2018

  34. [34]

    Pygame learning environment

    Tasfi, N. Pygame learning environment. https://github.com/ntasfi/PyGame-Learning-Environment, 2016

  35. [35]

    Tracking emerges by colorizing videos

    Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., and Murphy, K. Tracking emerges by colorizing videos. In European Conference on Computer Vision, pp. 402–419. Springer, 2018

  36. [36]

    Unsupervised learning of visual representations using videos

    Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2794–2802, 2015

  37. [37]

    Learning and using the arrow of time

    Wei, D., Lim, J. J., Zisserman, A., and Freeman, W. T. Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060, 2018