arxiv: 2202.10890 · v2 · pith:RSHDZDU4new · submitted 2022-02-22 · 💻 cs.CV

HiP: Hierarchical Perceiver

classification 💻 cs.CV
keywords: embeddings, generality, hierarchical, high-resolution, images, inputs, learning, model

General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This, however, hinders them from scaling up to the input sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum, our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with exactly the same unchanged model and without specialized preprocessing or any tokenization.
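To give a rough sense of why introducing locality can improve efficiency, the sketch below counts query-key score entries for dense attention over all tokens versus attention restricted to equal-sized groups. This is only an illustrative cost model: the function names, the token count, the group count, and the grouping scheme itself are assumptions made for the example, not details taken from the paper.

```python
def attention_pairs(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries in one dense attention operation."""
    return num_queries * num_keys


def global_self_attention_pairs(num_tokens: int) -> int:
    """Every token attends to every other token."""
    return attention_pairs(num_tokens, num_tokens)


def grouped_self_attention_pairs(num_tokens: int, num_groups: int) -> int:
    """Tokens are split into equal groups; attention stays inside each group.

    Hypothetical grouping used only for illustration.
    """
    if num_groups <= 0 or num_tokens % num_groups != 0:
        raise ValueError("num_tokens must be divisible by a positive num_groups")
    per_group = num_tokens // num_groups
    return num_groups * attention_pairs(per_group, per_group)


if __name__ == "__main__":
    tokens, groups = 4096, 16  # illustrative sizes, not from the paper
    dense = global_self_attention_pairs(tokens)
    local = grouped_self_attention_pairs(tokens, groups)
    print(f"global: {dense} pairs, grouped: {local} pairs, ratio: {dense // local}x")
```

Under this model, splitting N tokens into G groups reduces the number of score entries from N² to N²/G, which is the kind of saving that local processing offers before groups are merged again at coarser levels.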

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

    cs.LG 2026-06 unverdicted novelty 7.0

    LH-NeF learns tokenized neural-field representations via a locality-preserving hierarchical encoder, achieving 42× lower memory and 133× larger batches than modality-agnostic meta-learning baselines while matching or ...