CanViT is the first task- and policy-agnostic AVFM pretrained via passive-to-active dense latent distillation on 13.2M scenes and 1B random glimpses, achieving 38.5% ADE20K mIoU in one glimpse and 84.5% ImageNet-1k top-1 after fine-tuning.
super hub Mixed citations
Layer Normalization
Mixed citation behavior. Most common role is background (58%).
abstract
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times. It is also straightforward to apply to recurrent neural networks by computing the normalization statistics separately at each time step. Layer normalization is very effective at stabilizing the hidden state dynamics in recurrent networks. Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not
authors
co-cited works
representative citing papers
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Transformers performing in-context learning implicitly implement gradient descent, ridge regression, and least-squares predictors for linear models, with behavior shifting based on model depth, width, and data noise.
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.
Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.
A new probing framework detects moderate parametric memorization signals in tabular in-context learning models under single-task fine-tuning, strongest on low-cardinality tasks, but signals largely disappear under realistic training.
SurGe improves local surface geometry in feedforward point maps via gradient matching loss and Neighborhood Attention Decoder, topping average rank on eight zero-shot monocular geometry benchmarks for global AbsRel while boosting local metrics.
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
A Set-Transformer architecture with self-attention encodes Pauli-string correlations, optimizes via commutation objective, and finds symmetries with near-deterministic success on physical models like Ising and Toric code.
Recurrent trace units enable exact RTRL with linear time/memory for streaming RL under partial observability, sustaining performance on long-chain memory tasks where TBPTT baselines collapse.
CRiSP uses neural-guided MCTS and curriculum learning to insert Clifford prefixes before parameterized rotations in VQAs, yielding mean 3.17x and max 45x gains in energy accuracy on 22-qubit QAOA benchmarks versus prior Clifford initializers.
Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.
Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
Riemannian networks are introduced for the full-rank correlation matrix manifold by extending MLR, FC, and convolutional layers to five geometries with backpropagation methods for two, showing effectiveness over SPD and Grassmannian baselines.
RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.
Nested-GPT is an autoregressive Transformer surrogate that generates variable-multiplicity parton showers while enforcing ordered Markovian branching and matches reference Monte Carlo results for leading-log non-global logarithm resummation in the large-Nc limit.
Domain transfer becomes identifiable from marginals plus one anchor under Jacobian sparsity, enabled by a randomized masked finite-difference regularizer.
The paper proves negative weight drift at initialization under MSE or cross-entropy with asymmetric activations, links it to up to 90% sparsity in GPT-nano, maps the sparsity-accuracy cliff across 79 configurations, and shows clipped ReLU² and GELU² improve validation loss.
FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.
ChangeFlow reformulates remote sensing change detection as latent rectified-flow mask synthesis, reaching 80.4% average F1 across four benchmarks with 1.3-point gain and sampling-based ensembling.
MM-SOLD is a training-free particle sampler whose large-particle limit converges to a moment-matched Gibbs distribution obtained by exponentially tilting a score-smoothed target.
A W-Net deep learning model detects asteroids in TESS data independently of trajectory by rotating training image cubes and using adaptive normalization for data scaling.
citing papers explorer
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
LRM: Large Reconstruction Model for Single Image to 3D
LRM is a large transformer that predicts a NeRF directly from a single image after training on a million-object multi-view dataset.
-
Efficient Memory Management for Large Language Model Serving with PagedAttention
PagedAttention achieves near-zero waste in LLM key-value cache memory and enables 2-4x higher serving throughput than prior systems.
-
Segment Anything
A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
iTransformer: Inverted Transformers Are Effective for Time Series Forecasting
By applying attention and feed-forward networks to inverted variate tokens instead of temporal tokens, iTransformer achieves state-of-the-art performance on real-world time series forecasting datasets.
-
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
-
Retentive Network: A Successor to Transformer for Large Language Models
RetNet is a new sequence modeling architecture that delivers parallel training, constant-time inference, and competitive language modeling performance as a potential replacement for Transformers.
-
On the Role of ViT and CNN in Semantic Communications: Analysis and Prototype Validation
ViT-based semantic communications yields +0.5 dB PSNR over CNN baselines, introduces cosine-similarity and Fourier analysis metrics, and demonstrates an SDR prototype.
-
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
-
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
-
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
AdaLoRA uses SVD-based pruning to allocate the parameter budget for low-rank fine-tuning updates according to per-matrix importance scores, yielding better performance than uniform allocation especially under tight budgets.
-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
mPLUG-Owl2 presents a modular MLLM architecture that enables modality collaboration via shared functional modules and modality-adaptive components, achieving SOTA on both text and multi-modal tasks with one generic model.
-
Baichuan 2: Open Large-scale Language Models
Baichuan 2 presents 7B and 13B LLMs trained on 2.6T tokens that match or exceed similar open models on MMLU, CMMLU, GSM8K, HumanEval and excel in medicine and law.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
- Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey