BERT: Pre-training of deep bidirectional transformers for language understanding
6 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV 6
years: 2026 6
verdicts: UNVERDICTED 6
representative citing papers
citing papers explorer
- Depth Adaptive Efficient Visual Autoregressive Modeling
  DepthVAR adaptively allocates per-token computational depth in VAR models using a cyclic rotated scheduler and dynamic layer masking to achieve 2.3-3.1x inference speedup with minimal quality loss.
- FILTR: Extracting Topological Features from Pretrained 3D Models
  FILTR predicts persistence diagrams from pretrained 3D encoders on the new DONUT benchmark, showing limited topological signals in encoders but successful approximation via learnable feed-forward.
- MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
  MODIX dynamically rescales positional indices in VLMs using intra-modal covariance-based entropy and inter-modal alignment scores to allocate finer granularity to informative content.
- DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection
  DSAA improves fine-grained open-vocabulary object detection by injecting attribute priors via APA in text embeddings, modulating K/V vectors in BERT, and using an attribute-aware contrastive loss, with gains shown on the FG-OVD benchmark.
- On The Application of Linear Attention in Multimodal Transformers
  Linear attention delivers significant computational savings in multimodal transformers and follows the same scaling laws as softmax attention on ViT models trained on LAION-400M with ImageNet-21K zero-shot validation.
- Two-Stage Multimodal Framework for Emotion Mimicry Intensity Prediction
  A staged multimodal fusion model for predicting six continuous emotion intensities from in-the-wild video achieves 0.4722 validation and 0.57 test Pearson correlation in the EMI challenge.