{"work":{"id":"022de141-49ac-43d5-9667-e36913ec1950","openalex_id":null,"doi":null,"arxiv_id":"2210.00030","raw_key":null,"title":"VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training","authors":null,"authors_text":"Yecheng Jason Ma, Shagun Sodhani, Dinesh Jayaraman, Osbert Bastani, Vikash Kumar, Amy Zhang","year":2022,"venue":"cs.RO","abstract":"Reward and representation learning are two long-standing challenges for learning an expanding set of robot manipulation skills from sensory observations. Given the inherent cost and scarcity of in-domain, task-specific robot data, learning from large, diverse, offline human videos has emerged as a promising path towards acquiring a generally useful visual representation for control; however, how these human videos can be used for general-purpose reward learning remains an open question. We introduce $\\textbf{V}$alue-$\\textbf{I}$mplicit $\\textbf{P}$re-training (VIP), a self-supervised pre-trained visual representation capable of generating dense and smooth reward functions for unseen robotic tasks. VIP casts representation learning from human videos as an offline goal-conditioned reinforcement learning problem and derives a self-supervised dual goal-conditioned value-function objective that does not depend on actions, enabling pre-training on unlabeled human videos. Theoretically, VIP can be understood as a novel implicit time contrastive objective that generates a temporally smooth embedding, enabling the value function to be implicitly defined via the embedding distance, which can then be used to construct the reward for any goal-image specified downstream task. Trained on large-scale Ego4D human videos and without any fine-tuning on in-domain, task-specific data, VIP's frozen representation can provide dense visual reward for an extensive set of simulated and $\\textbf{real-robot}$ tasks, enabling diverse reward-based visual control methods and significantly outperforming all prior pre-trained representations. Notably, VIP can enable simple, $\\textbf{few-shot}$ offline RL on a suite of real-world robot tasks with as few as 20 trajectories.","external_url":"https://arxiv.org/abs/2210.00030","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T06:44:02.523299+00:00","pith_arxiv_id":"2210.00030","created_at":"2026-05-10T03:03:37.472273+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training","render_title":"VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training"},"hub":{"state":{"work_id":"022de141-49ac-43d5-9667-e36913ec1950","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":26,"external_cited_by_count":null,"distinct_field_count":5,"first_pith_cited_at":"2023-07-12T07:40:48+00:00","last_pith_cited_at":"2026-05-17T14:55:32+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-09T13:54:59.493748+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":12},{"context_role":"extension","n":1}],"polarity_counts":[{"context_polarity":"background","n":12},{"context_polarity":"extend","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}