Recognition: 1 theorem link · Lean Theorem
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Pith reviewed 2026-05-15 19:43 UTC · model grok-4.3
The pith
Perceiver IO adds a flexible querying mechanism to the Perceiver so one architecture processes arbitrary structured inputs and produces outputs of any size or type while scaling linearly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perceiver IO augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, allowing the same architecture to handle data from arbitrary settings while scaling linearly with the size of inputs and outputs.
What carries the argument
The flexible querying mechanism added to the Perceiver, which decodes its compressed latent representation into structured outputs of arbitrary size and semantics.
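The decode step can be sketched in a few lines. This is a minimal single-head cross-attention with the real model's key/value/query projections and multi-head structure omitted, and all sizes illustrative; its point is only that output length is set entirely by how many query rows are supplied, while the latent array stays fixed.

```python
import numpy as np

def cross_attend(queries, latents, d_k):
    """Single-head cross-attention: queries attend to a fixed latent array.

    queries: (num_outputs, d_k) -- one row per desired output element
    latents: (num_latents, d_k) -- compressed representation, size fixed
    Returns one d_k-dim vector per query row, so output size is a free
    choice at decode time. (Projections omitted for brevity.)
    """
    scores = queries @ latents.T / np.sqrt(d_k)         # (num_outputs, num_latents)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over latents
    return weights @ latents                            # (num_outputs, d_k)

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 64))        # fixed-size latent array
flow_queries = rng.normal(size=(1024, 64))  # e.g. one query per output pixel
cls_queries = rng.normal(size=(2, 64))      # e.g. two classification queries

assert cross_attend(flow_queries, latents, 64).shape == (1024, 64)
assert cross_attend(cls_queries, latents, 64).shape == (2, 64)
```

The same function serves both a dense 1024-element output and a 2-element classification output; only the query array changes.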
If this is right
- The same model can be applied to new tasks spanning language, vision, and other domains without architectural redesign.
- Input tokenization can be removed while still outperforming a Transformer-based BERT baseline on the GLUE benchmark.
- Explicit multiscale mechanisms are not required to reach state-of-the-art performance on Sintel optical flow estimation.
- Linear scaling with both input and output size makes the architecture practical for large structured data across modalities.
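The linear-scaling bullet follows from the shape of the attention maps: encoder and decoder cross-attention each have one fixed-size latent axis, so cost grows linearly in input length M and output length O, while latent self-attention is independent of both. A back-of-the-envelope multiply count, with illustrative sizes:

```python
def attention_cost(rows, cols, width):
    """Approximate multiply count for one attention map: QK^T plus weights @ V."""
    return 2 * rows * cols * width

N, d, L = 256, 64, 8  # illustrative: latents, channels, latent self-attention layers

def perceiver_io_cost(M, O):
    encode = attention_cost(N, M, d)        # N latents attend to M inputs: O(M)
    process = L * attention_cost(N, N, d)   # self-attention: independent of M and O
    decode = attention_cost(O, N, d)        # O queries attend to N latents: O(O)
    return encode + process + decode

# Doubling the input length adds exactly one more encoder-sized term,
# i.e. cost is affine-linear in M (and, symmetrically, in O):
assert (perceiver_io_cost(2 * 10_000, 100) - perceiver_io_cost(10_000, 100)
        == attention_cost(N, 10_000, d))
```

A Transformer's self-attention over the raw input would instead carry an M-by-M map, giving quadratic growth in M.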
Where Pith is reading between the lines
- The querying mechanism may enable unified models for reinforcement learning and planning tasks beyond the StarCraft II results shown.
- If the approach generalizes further, it could reduce reliance on separate encoder-decoder or modality-specific designs in multi-modal systems.
- Testing on additional modalities such as audio or 3D point clouds would provide a direct check on whether the linear scaling and output flexibility hold outside the reported domains.
Load-bearing premise
The flexible querying mechanism can produce outputs of arbitrary sizes and semantics across domains without introducing hidden task-specific assumptions.
What would settle it
Showing that Perceiver IO requires per-task changes to the querying mechanism or loses linear scaling on a new large structured-input task would falsify the generality claim.
read the original abstract
A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Perceiver IO, an extension of the original Perceiver that adds a flexible cross-attention querying mechanism to produce structured outputs of arbitrary size and semantics. The core claim is that a single architecture, without task-specific engineering, can handle inputs and outputs across language, vision, multi-modal, and game domains while scaling linearly; highlights include outperforming a BERT baseline on GLUE after removing input tokenization and achieving SOTA on Sintel optical flow without explicit multiscale mechanisms.
Significance. If the generality claim holds, the work would be a meaningful contribution toward unified architectures that avoid baking in domain assumptions, potentially reducing the need for modality-specific designs. The reported cross-domain results and linear scaling are strengths that, if supported by clear ablations, could influence follow-up work on scalable attention-based models.
major comments (2)
- [§3.2] §3.2 (Flexible Querying): Query construction is described with task-dependent choices—spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.
- [§4.1] §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.
minor comments (2)
- [Figure 2] Figure 2: The diagram of the querying block would be clearer with explicit arrows and labels distinguishing the cross-attention from the latent array update.
- [§4.3] §4.3 (StarCraft results): The multi-task setup would benefit from an explicit statement of whether any output-query hyperparameters were tuned per game variant.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of Perceiver IO's generality. We address each major point below and will incorporate revisions to improve detail and emphasis on the unified querying mechanism.
read point-by-point responses
-
Referee: [§3.2] §3.2 (Flexible Querying): Query construction is described with task-dependent choices—spatially arranged queries initialized from image positions for Sintel flow versus a small set of learned classification queries for GLUE. These are not derived from a single domain-agnostic procedure and therefore function as hidden per-task engineering, weakening the central claim that the architecture eliminates task-specific design.
Authors: The querying mechanism is a single cross-attention operation applied uniformly across tasks; only the initialization of the query array (to encode desired output structure) differs, which is a minimal input to the model rather than an architectural modification. This is directly analogous to specifying output heads in other general models while keeping the core network fixed. We will revise §3.2 to explicitly frame query initialization as output-structure specification and add a paragraph contrasting it with task-specific components like tokenizers or multiscale pyramids. revision: partial
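The distinction the response draws can be made concrete with a sketch (shapes and position codes are illustrative, not the paper's): both tasks hand the decoder a plain (num_outputs, d) array, and only how that array is built differs, so the per-task choice lives in the data fed to a fixed cross-attention block rather than in the architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64

# Task A: dense prediction -- queries derived from output positions
# (a toy sinusoidal position code over a 16x16 output grid).
n_pix = 16 * 16
pos = np.arange(n_pix)[:, None] / n_pix
bands = 2.0 ** np.linspace(0, 8, d // 2)
spatial_queries = np.concatenate([np.sin(pos * bands), np.cos(pos * bands)], axis=-1)

# Task B: classification -- a handful of learned query vectors
# (random stand-ins here for learned embeddings).
learned_queries = rng.normal(size=(3, d))

# Both are just (num_outputs, d) arrays; the cross-attention decoder
# consuming them is identical across tasks.
assert spatial_queries.shape == (n_pix, d)
assert learned_queries.shape == (3, d)
```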
-
Referee: [§4.1] §4.1 (GLUE experiments): The claim that Perceiver IO outperforms a Transformer BERT baseline despite removing input tokenization is load-bearing for the generality argument, yet the exact mapping from raw text to the latent array and the precise query semantics for classification outputs are not specified in sufficient detail to confirm that all domain assumptions have been removed.
Authors: We agree that the current text leaves the exact byte-to-latent mapping and query construction underspecified. In the revision we will expand §4.1 (and the corresponding methods paragraph) to describe: (i) the character-level byte encoding and fixed positional embedding used to populate the input latent array directly from raw text, and (ii) the learned query vectors for GLUE as a small set of class-specific embeddings that are decoded via a linear layer. These additions will make explicit the absence of any vocabulary or tokenization assumptions. revision: yes
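A hedged sketch of the kind of tokenizer-free featurization described in (i): the embedding dimensions, the positional scheme, and the random stand-in for the learned byte table are all illustrative, not taken from the paper. The only vocabulary is the 256 possible byte values.

```python
import numpy as np

def bytes_to_input_array(text, d_model=64, max_len=512, seed=0):
    """Map raw UTF-8 bytes to an input array with no tokenizer:
    a per-byte embedding plus a fixed sinusoidal positional embedding."""
    rng = np.random.default_rng(seed)
    byte_embed = rng.normal(size=(256, d_model))          # stand-in for a learned table
    pos = np.arange(max_len)[:, None]
    freqs = np.exp(-np.arange(0, d_model, 2) / d_model)   # fixed sinusoidal frequencies
    pos_embed = np.zeros((max_len, d_model))
    pos_embed[:, 0::2] = np.sin(pos * freqs)
    pos_embed[:, 1::2] = np.cos(pos * freqs)
    ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)[:max_len]
    return byte_embed[ids] + pos_embed[: len(ids)]        # (num_bytes, d_model)

x = bytes_to_input_array("Perceiver IO reads raw bytes.")
assert x.shape == (29, 64)  # one row per byte, no wordpiece vocabulary involved
```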
Circularity Check
No significant circularity in the Perceiver IO derivation chain
full rationale
The paper proposes an architectural extension via a cross-attention querying mechanism on top of the base Perceiver latent representation. All load-bearing claims rest on direct empirical evaluation against external public benchmarks (GLUE, Sintel, StarCraft II) rather than any internal derivation that reduces to fitted parameters, self-definitions, or self-citation chains. No equations or uniqueness theorems are invoked that collapse back to the inputs by construction; query construction details are presented as implementation choices for each task without being framed as predictions derived from the model itself. The evaluation is therefore self-contained against independent data sources.
Axiom & Free-Parameter Ledger
free parameters (1)
- latent array size
axioms (1)
- domain assumption: Cross-attention can map latent representations to arbitrary structured outputs without task-specific heads
Forward citations
Cited by 22 Pith papers
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.
-
ENSEMBITS: an alphabet of protein conformational ensembles
Ensembits creates a discrete vocabulary for protein conformational ensembles that outperforms static tokenizers on dynamics prediction tasks and enables ensemble token prediction from single structures via distillation.
-
Scratchpad Patching: Decoupling Compute from Patch Size in Byte-Level Language Models
Scratchpad Patching decouples compute from patch size in byte-level language models by inserting entropy-triggered scratchpads to update patch context dynamically.
-
MeshFIM: Local Low-Poly Mesh Editing via Fill-in-the-Middle Autoregressive Generation
MeshFIM enables local low-poly mesh editing by autoregressively filling target regions conditioned on context, using boundary markers, positional embeddings, and a gated geometry encoder to enforce attachment, topolog...
-
A foundation model of vision, audition, and language for in-silico neuroscience
TRIBE v2 is a multimodal AI model that predicts human brain activity more accurately than linear encoding models and recovers established neuroscientific findings through in-silico testing.
-
TrajTok: Learning Trajectory Tokens enables better Video Understanding
TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
-
RoboDreamer: Learning Compositional World Models for Robot Imagination
RoboDreamer factorizes video generation using language primitives to achieve compositional generalization in robot world models, outperforming monolithic baselines on unseen goals in RT-X.
-
A Generalist Agent
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
-
High-Resolution Image Synthesis with Latent Diffusion Models
Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
-
TOPOS: High-Fidelity and Efficient Industry-Grade 3D Head Generation
TOPOS creates high-fidelity 3D heads with fixed industry topology from single images via a specialized VAE with Perceiver Resampler and a rectified flow transformer.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Hypergraph and Latent ODE Learning for Multimodal Root Cause Localization in Microservices
HyperODE RCA integrates hypergraph learning with latent ODEs and cross-modal attention to improve root cause localization in microservice architectures on the Tianchi AIOps benchmark.
-
MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement
MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...
-
OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens
OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.
-
PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
PRiMeFlow is a flow-matching model that approximates the full empirical distribution of single-cell gene expression after perturbations.
-
UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing
UHR-BAT is a budget-aware framework that uses text-guided multi-scale importance estimation plus region-wise preserve and merge strategies to compress visual tokens in ultra-high-resolution remote sensing vision-langu...
-
Towards foundation-style models for energy-frontier heterogeneous neutrino detectors via self-supervised pre-training
Self-supervised pre-training on multimodal neutrino detector simulations produces reusable representations that improve downstream classification, regression, and data efficiency over training from scratch.
-
Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis
PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.
-
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 h...
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
-
PRiMeFlow: Capturing Complex Expression Heterogeneity in Perturbation Response Modelling
PRiMeFlow applies flow matching in gene expression space with a U-Net velocity field and pretraining-finetuning to model perturbation-induced heterogeneity, showing strong benchmark performance on PerturBench and the ...
Reference graph
Works this paper leans on
-
[1]
Imitating interactive intelligence
Josh Abramson, Arun Ahuja, Arthur Brussee, Federico Carnevale, Mary Cassin, Stephen Clark, Andrew Dudzik, Petko Georgiev, Aurelia Guy, Tim Harley, Felix Hill, Alden Hung, Zachary Kenton, Jessica Landon, Timothy Lillicrap, Kory Mathewson, Alistair Muldal, Adam Santoro, Nikolay Savinov, Vikrant Varma, Greg Wayne, Nathaniel Wong, Chen Yan, and Rui Zhu. Imita...
-
[2]
VATT: Transformers for multimodal self-supervised learning from raw video, audio and text
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. In Proceedings of Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[3]
Self-supervised multimodal versatile networks
Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. In Proceedings of Neural Information Processing Systems (NeurIPS), 2020
work page 2020
-
[4]
The DeepMind JAX Ecosystem, 2020
Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Claudio Fantacci, Jonathan Godwin, Chris Jones, Tom Hennigan, Matteo Hessel, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Lena Martens, Vladimir Mikulik, Tamara Norman, John Quan, George Papamaka...
work page 2020
-
[5]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document Transformer. arXiv preprint arXiv:2004.05150, 2020
work page 2020
-
[6]
Byte pair encoding is suboptimal for language model pretraining
Kaj Bostrom and Greg Durrett. Byte pair encoding is suboptimal for language model pretraining. In Proceedings of the Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020
work page 2020
-
[7]
JAX: composable transformations of Python+NumPy programs, 2018
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax
work page 2018
-
[8]
High-performance large-scale image recognition without normalization
Andrew Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image recognition without normalization. In Proceedings of International Conference on Machine Learning (ICML), 2021
work page 2021
-
[9]
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,...
work page 2020
-
[10]
A naturalistic open source movie for optical flow evaluation
Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In Proceedings of European Conference on Computer Vision (ECCV), 2012
work page 2012
-
[11]
J. Campbell, R. Sukthankar, and I. Nourbakhsh. Techniques for evaluating optical flow for visual odometry in extreme terrain. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2004
work page 2004
-
[12]
End-to-end object detection with Transformers
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with Transformers . In Proceedings of European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[13]
Generative pretraining from pixels
Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In Proceedings of International Conference on Machine Learning (ICML), 2020
work page 2020
-
[14]
CANINE: Pre-training an efficient tokenization-free encoder for language representation
Jonathan H. Clark, Dan Garrette, Iulia Turc, and John Wieting. CANINE: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022
work page 2022
-
[15]
A unified architecture for natural language processing: Deep neural networks with multitask learning
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of International Conference on Machine Learning (ICML), 2008
work page 2008
-
[16]
Natural language processing (almost) from scratch
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 2011
work page 2011
-
[17]
RandAugment: Practical automated data augmentation with a reduced search space
Ekin D. Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V. Le. RandAugment: Practical automated data augmentation with a reduced search space. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2020
work page 2020
-
[18]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL : Attentive language models beyond a fixed-length context. In Proceedings of the Annual Meetings of the Association for Computational Linguistics (ACL), 2019
work page 2019
-
[19]
ImageNet : A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet : A large-scale hierarchical image database. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009
work page 2009
-
[20]
VirTex: Learning Visual Representations from Textual Annotations
Karan Desai and Justin Johnson. VirTex: Learning Visual Representations from Textual Annotations . In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2021
-
[21]
BERT : Pre-training of deep bidirectional Transformers for language understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT : Pre-training of deep bidirectional Transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), 2019
work page 2019
-
[22]
Multi-task self-supervised visual learning
Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017
work page 2017
-
[23]
Sim2real transfer learning for 3D human pose estimation: motion to the rescue
Carl Doersch and Andrew Zisserman. Sim2real transfer learning for 3D human pose estimation: motion to the rescue. Proceedings of Neural Information Processing Systems (NeurIPS), 2019
work page 2019
-
[24]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[25]
Learning hierarchical features for scene labeling
Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012
work page 2012
-
[26]
FlowNet : Learning optical flow with convolutional networks
Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philipp Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2015
work page 2015
-
[27]
Audio Set : An ontology and human-labeled dataset for audio events
Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio Set : An ontology and human-labeled dataset for audio events. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017
work page 2017
-
[28]
Coordination among neural modules through a shared global workspace
Anirudh Goyal, Aniket Didolkar, Alex Lamb, Kartikeya Badola, Nan Rosemary Ke, Nasim Rahaman, Jonathan Binas, Charles Blundell, Michael Mozer, and Yoshua Bengio. Coordination among neural modules through a shared global workspace. In Proceedings of International Conference on Learning Representations (ICLR), 2022
work page 2022
-
[29]
Generating Sequences With Recurrent Neural Networks
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013
work page 2013
-
[30]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016
work page 2016
-
[31]
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017
work page 2017
-
[32]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs) . arXiv preprint arXiv:1606.08415, 2016
work page 2016
-
[33]
Autoencoders, minimum description length, and Helmholtz free energy
Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length, and Helmholtz free energy. In Proceedings of Neural Information Processing Systems (NeurIPS), 1994
work page 1994
-
[34]
Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial Intelligence, 1981
work page 1981
-
[35]
Drew A. Hudson and C. Lawrence Zitnick. Generative adversarial Transformers . In Proceedings of International Conference on Machine Learning (ICML), 2021
work page 2021
-
[36]
Perceiver: General perception with iterative attention
Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and João Carreira. Perceiver: General perception with iterative attention. In Proceedings of International Conference on Machine Learning (ICML), 2021
work page 2021
-
[37]
Learning to estimate hidden motions with global motion aggregation
Shihao Jiang, Dylan Campbell, Yao Lu, Hongdong Li, and Richard Hartley. Learning to estimate hidden motions with global motion aggregation. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2021
work page 2021
-
[38]
In-datacenter performance analysis of a Tensor Processing Unit
Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a Tensor Processing Unit . In Proceedings of the 44th Annual International Symposium on Computer Architecture , 2017
work page 2017
-
[39]
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all. arXiv preprint arXiv:1706.05137, 2017
work page 2017
-
[40]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020
work page 2020
-
[41]
Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
work page 2017
-
[42]
ImageNet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Proceedings of Neural Information Processing Systems (NeurIPS), 2012
work page 2012
-
[43]
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the Annual Meetings of the Association for Computational Linguistics (ACL), 2018
work page 2018
-
[44]
Set Transformer : A framework for attention-based permutation-invariant neural networks
Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set Transformer : A framework for attention-based permutation-invariant neural networks. In Proceedings of International Conference on Machine Learning (ICML), 2019
work page 2019
-
[45]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019
work page 2019
-
[46]
Object-centric learning with slot attention
Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Proceedings of Neural Information Processing Systems (NeurIPS), 2020
work page 2020
-
[47]
Fully convolutional networks for semantic segmentation
Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
work page 2015
-
[48]
SGDR : Stochastic gradient descent with warm restarts
Ilya Loshchilov and Frank Hutter. SGDR : Stochastic gradient descent with warm restarts. In Proceedings of International Conference on Learning Representations (ICLR), 2017
work page 2017
-
[49]
Pretrained Transformers as universal computation engines
Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained Transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021
-
[50]
An iterative image registration technique with an application to stereo vision
Bruce D Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), 1981
work page 1981
-
[51]
Multi-task sequence to sequence learning
Minh-Thang Luong, Quoc V Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task sequence to sequence learning. In Proceedings of International Conference on Learning Representations (ICLR), 2016
work page 2016
-
[52]
LUNA : Linear unified nested attention
Xuezhe Ma, Xiang Kong, Sinong Wang, Chunting Zhou, Jonathan May, Hao Ma, and Luke Zettlemoyer. LUNA : Linear unified nested attention. In Proceedings of Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[53]
Object scene flow for autonomous vehicles
Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015
work page 2015
-
[54]
Thinking fast and slow: Efficient text-to-visual retrieval with Transformers
Antoine Miech, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic, and Andrew Zisserman. Thinking fast and slow: Efficient text-to-visual retrieval with Transformers . In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
work page 2021
-
[55]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proceedings of Neural Information Processing Systems (NeurIPS), 2013
work page 2013
-
[56]
NeRF: Representing scenes as neural radiance fields for view synthesis
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In Proceedings of European Conference on Computer Vision (ECCV), 2020
work page 2020
-
[57]
Cross-stitch networks for multi-task learning
Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[58]
Stacked hourglass networks for human pose estimation
Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
[59]
Multimodal deep learning
Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of International Conference on Machine Learning (ICML), 2011.
[60]
GloVe: Global Vectors for word representation
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global Vectors for word representation. In Proceedings of the Annual Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[61]
Meta pseudo labels
Hieu Pham, Zihang Dai, Qizhe Xie, Minh-Thang Luong, and Quoc V. Le. Meta pseudo labels. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[62]
[63]
Language models are unsupervised multitask learners
Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[64]
Exploring the limits of transfer learning with a unified text-to-text Transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text Transformer. Journal of Machine Learning Research, 2020.
[65]
Zero-shot text-to-image generation
Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In Proceedings of International Conference on Machine Learning (ICML), 2021.
[66]
Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer
Rene Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[67]
U-Net: Convolutional networks for biomedical image segmentation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), 2015.
[68]
Relational recurrent neural networks
Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In Proceedings of Neural Information Processing Systems (NeurIPS), 2018.
[69]
Neural machine translation of rare words with subword units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2016.
[70]
OverFeat: Integrated recognition, localization and detection using convolutional networks
Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In Proceedings of International Conference on Learning Representations (ICLR), 2014.
[71]
A short note on the Kinetics-700-2020 human action dataset
Lucas Smaira, João Carreira, Eric Noland, Ellen Clancy, Amy Wu, and Andrew Zisserman. A short note on the Kinetics-700-2020 human action dataset. arXiv preprint arXiv:2010.10864, 2020.
[72]
Compositional pattern producing networks: A novel abstraction of development
Kenneth O. Stanley. Compositional pattern producing networks: A novel abstraction of development. Genetic Programming and Evolvable Machines, 8(2):131–162, 2007.
[73]
Revisiting unreasonable effectiveness of data in deep learning era
Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
[74]
PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume
Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[75]
TF-RAFT: A TensorFlow implementation of RAFT
Deqing Sun, Charles Herrmann, Varun Jampani, Michael Krainin, Forrester Cole, Austin Stone, Rico Jonschkowski, Ramin Zabih, William T. Freeman, and Ce Liu. TF-RAFT: A TensorFlow implementation of RAFT. In ECCV Robust Vision Challenge Workshop, 2020.
[76]
AutoFlow: Learning a better training set for optical flow
Deqing Sun, Daniel Vlasic, Charles Herrmann, Varun Jampani, Michael Krainin, Huiwen Chang, Ramin Zabih, William T. Freeman, and Ce Liu. AutoFlow: Learning a better training set for optical flow. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
[77]
Sequence to sequence learning with neural networks
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of Neural Information Processing Systems (NeurIPS), 2014.
[78]
Hash embeddings for efficient word representations
Dan Svenstrup, Jonas Meinertz Hansen, and Ole Winther. Hash embeddings for efficient word representations. In Proceedings of Neural Information Processing Systems (NeurIPS), 2017.
[79]
Going deeper with convolutions
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[80]
Fourier features let networks learn high frequency functions in low dimensional domains
Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proceedings of Neural Information Processing Systems (NeurIPS), 2020.