{"work":{"id":"780aaeee-ac26-46b1-b6ff-64a7a624e694","openalex_id":null,"doi":null,"arxiv_id":"2212.03191","raw_key":null,"title":"InternVideo: General Video Foundation Models via Generative and Discriminative Learning","authors":null,"authors_text":"Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao","year":2022,"venue":"cs.CV","abstract":"The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. 
The code will be released at https://github.com/OpenGVLab/InternVideo .","external_url":"https://arxiv.org/abs/2212.03191","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-23T07:15:28.443300+00:00","pith_arxiv_id":"2212.03191","created_at":"2026-05-10T13:30:26.610278+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"InternVideo: General Video Foundation Models via Generative and Discriminative Learning","render_title":"InternVideo: General Video Foundation Models via Generative and Discriminative Learning"},"hub":{"state":{"work_id":"780aaeee-ac26-46b1-b6ff-64a7a624e694","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":24,"external_cited_by_count":null,"distinct_field_count":1,"first_pith_cited_at":"2023-05-10T17:59:04+00:00","last_pith_cited_at":"2026-05-12T17:59:51+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-11T14:08:11.569646+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":3},{"context_role":"baseline","n":1},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"background","n":3},{"context_polarity":"baseline","n":1},{"context_polarity":"use_method","n":1}],"runs":{},"summary":{},"graph":{},"authors":[]}}