pith. machine review for the scientific record.

arxiv: 2107.14795 · v3 · submitted 2021-07-30 · 💻 cs.LG · cs.CL · cs.CV · cs.SD · eess.AS

Recognition: 1 theorem link · Lean Theorem

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV · cs.SD · eess.AS
keywords Perceiver IO · general architecture · structured inputs · structured outputs · multi-task learning · GLUE benchmark · optical flow · StarCraft II

The pith

Perceiver IO adds a flexible querying mechanism to the Perceiver so one architecture processes arbitrary structured inputs and produces outputs of any size or type while scaling linearly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Perceiver IO as a general-purpose architecture that works across many data domains without baking in assumptions about input or output structure. It augments the original Perceiver, which compresses large inputs into a compact latent space, with a querying step that decodes that space into outputs of varying sizes and meanings. This removes the need for task-specific tokenizers, output heads, or architectural tweaks. The same model then delivers strong results on language benchmarks, visual tasks, multi-modal reasoning, and StarCraft II, scaling linearly rather than quadratically with input and output size. A sympathetic reader would care because current models force repeated engineering for each new domain, limiting reuse and scalability.

Core claim

Perceiver IO augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, allowing the same architecture to handle data from arbitrary settings while scaling linearly with the size of inputs and outputs.

What carries the argument

The flexible querying mechanism added to the Perceiver, which decodes its compressed latent representation into structured outputs of arbitrary size and semantics.
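
A minimal sketch of this decoding step in numpy, under illustrative assumptions (single attention head, toy dimensions, random weights standing in for learned ones); it shows the idea, not the paper's implementation:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention_decode(latents, queries, Wq, Wk, Wv):
        # latents: (N, D)  fixed-size compressed representation of the input
        # queries: (O, Dq) one query per desired output element (pixel, class, ...)
        # returns: (O, Dv) one output per query; cost scales as O * N
        q = queries @ Wq                                 # (O, A)
        k = latents @ Wk                                 # (N, A)
        v = latents @ Wv                                 # (N, Dv)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (O, N)
        return attn @ v

    rng = np.random.default_rng(0)
    N, D, Dq, A, Dv = 256, 512, 64, 128, 2         # Dv=2: e.g. a flow vector per query
    latents = rng.standard_normal((N, D))
    queries = rng.standard_normal((64 * 64, Dq))   # one query per pixel of a 64x64 crop
    Wq, Wk, Wv = (rng.standard_normal(s) * 0.02 for s in [(Dq, A), (D, A), (D, Dv)])
    print(cross_attention_decode(latents, queries, Wq, Wk, Wv).shape)  # (4096, 2)

The output size is set entirely by how many queries are supplied, which is what lets one network emit a dense flow field, a handful of class logits, or anything in between from the same latent array.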

If this is right

  • The same model can be applied to new tasks spanning language, vision, and other domains without architectural redesign.
  • Input tokenization can be removed while still outperforming a Transformer-based BERT baseline on the GLUE benchmark.
  • Explicit multiscale mechanisms are not required to reach state-of-the-art performance on Sintel optical flow estimation.
  • Linear scaling with both input and output size makes the architecture practical for large structured data across modalities.
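
A back-of-envelope operation count makes the last point concrete. This is a scaling sketch, not a benchmark: single-head attention, feature width factored out, and the latent size and depth (N=512, depth=26) are illustrative constants, not the paper's exact settings.

    def transformer_selfattn_ops(M):
        # full self-attention over M input elements: quadratic in M
        return M * M

    def perceiver_io_ops(M, O, N=512, depth=26):
        encode  = M * N           # input -> latent cross-attention, linear in M
        process = depth * N * N   # latent self-attention stack, independent of M and O
        decode  = O * N           # query -> output cross-attention, linear in O
        return encode + process + decode

    for M in (1_000, 10_000, 100_000):
        print(f"M={M:>7,}  transformer ~{transformer_selfattn_ops(M):>15,}"
              f"  perceiver_io ~{perceiver_io_ops(M, O=M):>15,}")

Doubling the input or output size doubles the cross-attention terms while the latent stack stays fixed; a full Transformer's cost quadruples.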

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The querying mechanism may enable unified models for reinforcement learning and planning tasks beyond the StarCraft II results shown.
  • If the approach generalizes further, it could reduce reliance on separate encoder-decoder or modality-specific designs in multi-modal systems.
  • Testing on additional modalities such as audio or 3D point clouds would provide a direct check on whether the linear scaling and output flexibility hold outside the reported domains.

Load-bearing premise

The flexible querying mechanism can produce outputs of arbitrary sizes and semantics across domains without introducing hidden task-specific assumptions.

What would settle it

Showing that Perceiver IO requires per-task changes to the querying mechanism or loses linear scaling on a new large structured-input task would falsify the generality claim.

read the original abstract

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Perceiver IO, an extension of the original Perceiver that adds a flexible cross-attention querying mechanism to produce structured outputs of arbitrary size and semantics. The core claim is that a single architecture, without task-specific engineering, can handle inputs and outputs across language, vision, multi-modal, and game domains while scaling linearly; highlights include outperforming a BERT baseline on GLUE after removing input tokenization and achieving SOTA on Sintel optical flow without explicit multiscale mechanisms.

Significance. If the generality claim holds, the work would be a meaningful contribution toward unified architectures that avoid baking in domain assumptions, potentially reducing the need for modality-specific designs. The reported cross-domain results and linear scaling are strengths that, if supported by clear ablations, could influence follow-up work on scalable attention-based models.

major comments (2)
  1. §3.2 (Flexible Querying): Query construction is described with task-dependent choices: spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.
  2. §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.
minor comments (2)
  1. Figure 2: The diagram of the querying block would be clearer with explicit arrows and labels distinguishing the cross-attention from the latent array update.
  2. §4.3 (StarCraft results): The multi-task setup would benefit from an explicit statement of whether any output-query hyperparameters were tuned per game variant.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of Perceiver IO's generality. We address each major point below and will incorporate revisions to improve detail and emphasis on the unified querying mechanism.

read point-by-point responses
  1. Referee: §3.2 (Flexible Querying): Query construction is described with task-dependent choices: spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.

    Authors: The querying mechanism is a single cross-attention operation applied uniformly across tasks; only the initialization of the query array (to encode desired output structure) differs, which is a minimal input to the model rather than an architectural modification. This is directly analogous to specifying output heads in other general models while keeping the core network fixed. We will revise §3.2 to explicitly frame query initialization as output-structure specification and add a paragraph contrasting it with task-specific components like tokenizers or multiscale pyramids; the distinction is sketched after these responses. revision: partial

  2. Referee: §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.

    Authors: We agree that the current text leaves the exact byte-to-latent mapping and query construction underspecified. In the revision we will expand §4.1 (and the corresponding methods paragraph) to describe: (i) the raw UTF-8 byte encoding and fixed positional embedding used to populate the input array directly from raw text, and (ii) the learned query vectors for GLUE as a small set of class-specific embeddings that are decoded via a linear layer. These additions will make explicit the absence of any vocabulary or tokenization assumptions; the input path is sketched below. revision: yes
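
To illustrate the distinction drawn in response 1, here is a sketch (toy dimensions, random values) of two query arrays that would feed one and the same cross-attention decoder. The constructions paraphrase the choices the paper describes; they are not its code.

    import numpy as np
    rng = np.random.default_rng(0)

    # Task A (dense optical flow): one query per output pixel, built from
    # Fourier position features. This describes the output's structure; it
    # adds no new architectural module.
    H, W = 64, 64
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    freqs = 2.0 ** np.arange(8)
    feats = [f(c * fr) for c in (ys, xs) for fr in freqs for f in (np.sin, np.cos)]
    flow_queries = np.stack(feats, axis=-1).reshape(H * W, -1)   # (4096, 32)

    # Task B (GLUE classification): a single learned query vector for the task.
    class_query = rng.standard_normal((1, flow_queries.shape[-1]))

    # Both arrays feed the same decoder; only the query initialization,
    # i.e. the output specification, differs between tasks.
    print(flow_queries.shape, class_query.shape)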
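
And a minimal sketch of the tokenizer-free input path described in response 2, with illustrative sizes and random tables standing in for the learned embeddings:

    import numpy as np
    rng = np.random.default_rng(0)

    D, MAX_LEN = 256, 2048
    byte_embed = rng.standard_normal((256, D)) * 0.02   # one row per possible byte value
    pos_embed  = rng.standard_normal((MAX_LEN, D)) * 0.02

    def embed_text(text):
        # raw UTF-8 bytes -> input array; no vocabulary, no tokenizer
        ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
        return byte_embed[ids] + pos_embed[:len(ids)]

    x = embed_text("Perceiver IO removes input tokenization.")
    print(x.shape)  # (40, 256): one embedded element per byte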

Circularity Check

0 steps flagged

No significant circularity in the Perceiver IO derivation chain

full rationale

The paper proposes an architectural extension via a cross-attention querying mechanism on top of the base Perceiver latent representation. All load-bearing claims rest on direct empirical evaluation against external public benchmarks (GLUE, Sintel, StarCraft II) rather than any internal derivation that reduces to fitted parameters, self-definitions, or self-citation chains. No equations or uniqueness theorems are invoked that collapse back to the inputs by construction; query construction details are presented as implementation choices for each task without being framed as predictions derived from the model itself. The evaluation is therefore self-contained against independent data sources.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The architecture relies on standard attention and cross-attention mechanisms from prior literature. The main addition is the output querying step, which assumes attention suffices for general input-output mappings without domain-specific engineering.

free parameters (1)
  • latent array size
    Hyperparameter controlling compression of inputs into latents; chosen to balance capacity and compute (a toy cost sketch follows this ledger).
axioms (1)
  • domain assumption: Cross-attention can map latent representations to arbitrary structured outputs without task-specific heads
    This underpins the claim of generality across output formats and semantics.
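
To make the single free parameter's tradeoff concrete, a toy operation count for the latent self-attention stack as the latent array size N varies (same single-head assumptions as the scaling sketch above; a depth of 26 is illustrative):

    # Latent self-attention cost grows as depth * N^2, independent of input size.
    depth = 26
    for N in (128, 256, 512, 1024):
        print(f"N={N:>5}  latent-stack ops ~ {depth * N * N:>12,}")

A larger latent array buys capacity at quadratic cost in N, while leaving the linear input and output terms untouched.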

pith-pipeline@v0.9.0 · 5544 in / 1280 out tokens · 23845 ms · 2026-05-15T19:43:58.764181+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

  2. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.

  3. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  4. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  5. A foundation model of vision, audition, and language for in-silico neuroscience

    q-bio.NC 2026-05 unverdicted novelty 7.0

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  6. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

  7. RoboDreamer: Learning Compositional World Models for Robot Imagination

    cs.RO 2024-04 unverdicted novelty 7.0

    RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

  8. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  9. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  10. TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.

  11. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  12. Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices

    cs.LG 2026-05 unverdicted novelty 6.0

    HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.

  13. MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...

  14. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  15. PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

    cs.LG 2026-04 unverdicted novelty 6.0

    PRiMeFlow is a flow-matching model that approximates the full empirical distribution of single-cell gene expression after perturbations.

  16. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  17. Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training

    hep-ex 2026-04 conditional novelty 6.0

    Self-supervised pre-training on multimodal neutrino detector simulations produces reusable representations that improve downstream classification, regression, and data efficiency over training from scratch.

  18. Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

    cs.MM 2026-04 unverdicted novelty 6.0

    PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.

  19. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  20. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  21. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  22. PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

    cs.LG 2026-04 unverdicted novelty 5.0

    PRiMeFlow applies flow matching in gene expression space with a U-Net velocity field and pretraining-finetuning to model perturbation-induced heterogeneity, showing strong benchmark performance on PerturBench and the ...
