pith. machine review for the scientific record.

arxiv: 2107.14795 · v3 · submitted 2021-07-30 · 💻 cs.LG · cs.CL · cs.CV · cs.SD · eess.AS

Recognition: 1 theorem link · Lean Theorem

Perceiver IO: A General Architecture for Structured Inputs & Outputs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:43 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CV · cs.SD · eess.AS
keywords Perceiver IO · general architecture · structured inputs · structured outputs · multi-task learning · GLUE benchmark · optical flow · StarCraft II

The pith

Perceiver IO adds a flexible querying mechanism to the Perceiver so one architecture processes arbitrary structured inputs and produces outputs of any size or type while scaling linearly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Perceiver IO as a general-purpose architecture that works across many data domains without baking in assumptions about input or output structure. It augments the original Perceiver, which compresses large inputs into a compact latent space, with a querying step that decodes that space into outputs of varying sizes and meanings. This removes the need for task-specific tokenizers, output heads, or architectural tweaks. The same model then delivers strong results on language benchmarks, visual tasks, multi-modal reasoning, and StarCraft II, scaling linearly rather than quadratically with input and output size. A sympathetic reader would care because current models force repeated engineering for each new domain, limiting reuse and scalability.

Core claim

Perceiver IO augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, allowing the same architecture to handle data from arbitrary settings while scaling linearly with the size of inputs and outputs.

What carries the argument

The flexible querying mechanism added to the Perceiver, which decodes its compressed latent representation into structured outputs of arbitrary size and semantics.
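
A minimal sketch of this decoding step in numpy, under illustrative assumptions (single attention head, toy dimensions, random weights standing in for learned ones); it shows the idea, not the paper's implementation:

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention_decode(latents, queries, Wq, Wk, Wv):
        # latents: (N, D)  fixed-size compressed representation of the input
        # queries: (O, Dq) one query per desired output element (pixel, class, ...)
        # returns: (O, Dv) one output per query; cost scales as O * N
        q = queries @ Wq                                 # (O, A)
        k = latents @ Wk                                 # (N, A)
        v = latents @ Wv                                 # (N, Dv)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (O, N)
        return attn @ v

    rng = np.random.default_rng(0)
    N, D, Dq, A, Dv = 256, 512, 64, 128, 2         # Dv=2: e.g. a flow vector per query
    latents = rng.standard_normal((N, D))
    queries = rng.standard_normal((64 * 64, Dq))   # one query per pixel of a 64x64 crop
    Wq, Wk, Wv = (rng.standard_normal(s) * 0.02 for s in [(Dq, A), (D, A), (D, Dv)])
    print(cross_attention_decode(latents, queries, Wq, Wk, Wv).shape)  # (4096, 2)

The output size is set entirely by how many queries are supplied, which is what lets one network emit a dense flow field, a handful of class logits, or anything in between from the same latent array.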

If this is right

  • The same model can be applied to new tasks spanning language, vision, and other domains without architectural redesign.
  • Input tokenization can be removed while still outperforming a Transformer-based BERT baseline on the GLUE benchmark.
  • Explicit multiscale mechanisms are not required to reach state-of-the-art performance on Sintel optical flow estimation.
  • Linear scaling with both input and output size makes the architecture practical for large structured data across modalities.
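
A back-of-envelope operation count makes the last point concrete. This is a scaling sketch, not a benchmark: single-head attention, feature width factored out, and the latent size and depth (N=512, depth=26) are illustrative constants, not the paper's exact settings.

    def transformer_selfattn_ops(M):
        # full self-attention over M input elements: quadratic in M
        return M * M

    def perceiver_io_ops(M, O, N=512, depth=26):
        encode  = M * N           # input -> latent cross-attention, linear in M
        process = depth * N * N   # latent self-attention stack, independent of M and O
        decode  = O * N           # query -> output cross-attention, linear in O
        return encode + process + decode

    for M in (1_000, 10_000, 100_000):
        print(f"M={M:>7,}  transformer ~{transformer_selfattn_ops(M):>15,}"
              f"  perceiver_io ~{perceiver_io_ops(M, O=M):>15,}")

Doubling the input or output size doubles the cross-attention terms while the latent stack stays fixed; a full Transformer's cost quadruples.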

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The querying mechanism may enable unified models for reinforcement learning and planning tasks beyond the StarCraft II results shown.
  • If the approach generalizes further, it could reduce reliance on separate encoder-decoder or modality-specific designs in multi-modal systems.
  • Testing on additional modalities such as audio or 3D point clouds would provide a direct check on whether the linear scaling and output flexibility hold outside the reported domains.

Load-bearing premise

The flexible querying mechanism can produce outputs of arbitrary sizes and semantics across domains without introducing hidden task-specific assumptions.

What would settle it

Showing that Perceiver IO requires per-task changes to the querying mechanism or loses linear scaling on a new large structured-input task would falsify the generality claim.

read the original abstract

A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Perceiver IO, an extension of the original Perceiver that adds a flexible cross-attention querying mechanism to produce structured outputs of arbitrary size and semantics. The core claim is that a single architecture, without task-specific engineering, can handle inputs and outputs across language, vision, multi-modal, and game domains while scaling linearly; highlights include outperforming a BERT baseline on GLUE after removing input tokenization and achieving SOTA on Sintel optical flow without explicit multiscale mechanisms.

Significance. If the generality claim holds, the work would be a meaningful contribution toward unified architectures that avoid baking in domain assumptions, potentially reducing the need for modality-specific designs. The reported cross-domain results and linear scaling are strengths that, if supported by clear ablations, could influence follow-up work on scalable attention-based models.

major comments (2)
  1. §3.2 (Flexible Querying): Query construction is described with task-dependent choices: spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.
  2. §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.
minor comments (2)
  1. Figure 2: The diagram of the querying block would be clearer with explicit arrows and labels distinguishing the cross-attention from the latent array update.
  2. §4.3 (StarCraft results): The multi-task setup would benefit from an explicit statement of whether any output-query hyperparameters were tuned per game variant.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of Perceiver IO's generality. We address each major point below and will incorporate revisions to improve detail and emphasis on the unified querying mechanism.

read point-by-point responses
  1. Referee: §3.2 (Flexible Querying): Query construction is described with task-dependent choices: spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.

    Authors: The querying mechanism is a single cross-attention operation applied uniformly across tasks; only the initialization of the query array (to encode desired output structure) differs, which is a minimal input to the model rather than an architectural modification. This is directly analogous to specifying output heads in other general models while keeping the core network fixed. We will revise §3.2 to explicitly frame query initialization as output-structure specification and add a paragraph contrasting it with task-specific components like tokenizers or multiscale pyramids; the distinction is sketched after these responses. revision: partial

  2. Referee: §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.

    Authors: We agree that the current text leaves the exact byte-to-latent mapping and query construction underspecified. In the revision we will expand §4.1 (and the corresponding methods paragraph) to describe: (i) the raw UTF-8 byte encoding and fixed positional embedding used to populate the input array directly from raw text, and (ii) the learned query vectors for GLUE as a small set of class-specific embeddings that are decoded via a linear layer. These additions will make explicit the absence of any vocabulary or tokenization assumptions; the input path is sketched below. revision: yes
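
To illustrate the distinction drawn in response 1, here is a sketch (toy dimensions, random values) of two query arrays that would feed one and the same cross-attention decoder. The constructions paraphrase the choices the paper describes; they are not its code.

    import numpy as np
    rng = np.random.default_rng(0)

    # Task A (dense optical flow): one query per output pixel, built from
    # Fourier position features. This describes the output's structure; it
    # adds no new architectural module.
    H, W = 64, 64
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    freqs = 2.0 ** np.arange(8)
    feats = [f(c * fr) for c in (ys, xs) for fr in freqs for f in (np.sin, np.cos)]
    flow_queries = np.stack(feats, axis=-1).reshape(H * W, -1)   # (4096, 32)

    # Task B (GLUE classification): a single learned query vector for the task.
    class_query = rng.standard_normal((1, flow_queries.shape[-1]))

    # Both arrays feed the same decoder; only the query initialization,
    # i.e. the output specification, differs between tasks.
    print(flow_queries.shape, class_query.shape)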
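
And a minimal sketch of the tokenizer-free input path described in response 2, with illustrative sizes and random tables standing in for the learned embeddings:

    import numpy as np
    rng = np.random.default_rng(0)

    D, MAX_LEN = 256, 2048
    byte_embed = rng.standard_normal((256, D)) * 0.02   # one row per possible byte value
    pos_embed  = rng.standard_normal((MAX_LEN, D)) * 0.02

    def embed_text(text):
        # raw UTF-8 bytes -> input array; no vocabulary, no tokenizer
        ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
        return byte_embed[ids] + pos_embed[:len(ids)]

    x = embed_text("Perceiver IO removes input tokenization.")
    print(x.shape)  # (40, 256): one embedded element per byte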

Circularity Check

0 steps flagged

No significant circularity in the Perceiver IO derivation chain

full rationale

The paper proposes an architectural extension via a cross-attention querying mechanism on top of the base Perceiver latent representation. All load-bearing claims rest on direct empirical evaluation against external public benchmarks (GLUE, Sintel, StarCraft II) rather than any internal derivation that reduces to fitted parameters, self-definitions, or self-citation chains. No equations or uniqueness theorems are invoked that collapse back to the inputs by construction; query construction details are presented as implementation choices for each task without being framed as predictions derived from the model itself. The evaluation is therefore self-contained against independent data sources.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The architecture relies on standard attention and cross-attention mechanisms from prior literature. The main addition is the output querying step, which assumes attention suffices for general input-output mappings without domain-specific engineering.

free parameters (1)
  • latent array size
    Hyperparameter controlling compression of inputs into latents; chosen to balance capacity and compute (a toy cost sketch follows this ledger).
axioms (1)
  • domain assumption: Cross-attention can map latent representations to arbitrary structured outputs without task-specific heads
    This underpins the claim of generality across output formats and semantics.
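
To make the single free parameter's tradeoff concrete, a toy operation count for the latent self-attention stack as the latent array size N varies (same single-head assumptions as the scaling sketch above; a depth of 26 is illustrative):

    # Latent self-attention cost grows as depth * N^2, independent of input size.
    depth = 26
    for N in (128, 256, 512, 1024):
        print(f"N={N:>5}  latent-stack ops ~ {depth * N * N:>12,}")

A larger latent array buys capacity at quadratic cost in N, while leaving the linear input and output terms untouched.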

pith-pipeline@v0.9.0 · 5544 in / 1280 out tokens · 23845 ms · 2026-05-15T19:43:58.764181+00:00 · methodology

discussion (0)


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.

  2. ENSEMBITS: an alphabet of protein conformational ensembles

    cs.LG 2026-05 unverdicted novelty 8.0

    Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.

  3. Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.

  4. MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation

    cs.GR 2026-05 unverdicted novelty 7.0

    MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...

  5. A foundation model of vision, audition, and language for in-silico neuroscience

    q-bio.NC 2026-05 unverdicted novelty 7.0

    TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.

  6. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

  7. RoboDreamer: Learning Compositional World Models for Robot Imagination

    cs.RO 2024-04 unverdicted novelty 7.0

    RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.

  8. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  9. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  10. TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.

  11. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  12. Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices

    cs.LG 2026-05 unverdicted novelty 6.0

    HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.

  13. MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...

  14. OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens

    q-bio.NC 2026-04 unverdicted novelty 6.0

    OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.

  15. PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

    cs.LG 2026-04 unverdicted novelty 6.0

    PRiMeFlow is a flow-matching model that approximates the full empirical distribution of single-cell gene expression after perturbations.

  16. UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

    cs.CV 2026-04 unverdicted novelty 6.0

    UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...

  17. Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training

    hep-ex 2026-04 conditional novelty 6.0

    Self-supervised pre-training on multimodal neutrino detector simulations produces reusable representations that improve downstream classification, regression, and data efficiency over training from scratch.

  18. Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

    cs.MM 2026-04 unverdicted novelty 6.0

    PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.

  19. V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    cs.AI 2025-06 unverdicted novelty 6.0

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...

  20. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  21. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  22. PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling

    cs.LG 2026-04 unverdicted novelty 5.0

    PRiMeFlow applies flow matching in gene expression space with a U-Net velocity field and pretraining-finetuning to model perturbation-induced heterogeneity, showing strong benchmark performance on PerturBench and the ...
