A classification-trained ViT encodes patch boundaries at layers 5-6 and depth at layer 8, with causal interventions showing the depth signal is actively re-derived rather than passively carried.
3 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 3
representative citing papers
Multimodal deep learning for ambivalence/hesitancy recognition in videos yields limited results on the BAH dataset, highlighting the need for improved spatio-temporal and cross-modal fusion methods.
EchoAlign adjusts instances with controllable generative models to match noisy labels and selects reliable subsets, outperforming prior methods on benchmarks especially under 30% instance-dependent noise.
citing papers explorer
- From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
  A classification-trained ViT encodes patch boundaries at layers 5-6 and depth at layer 8, with causal interventions showing the depth signal is actively re-derived rather than passively carried.
- Multimodal Ambivalence/Hesitancy Recognition in Videos for Personalized Digital Health Interventions
  Multimodal deep learning for ambivalence/hesitancy recognition in videos yields limited results on the BAH dataset, highlighting the need for improved spatio-temporal and cross-modal fusion methods.
- EchoAlign: Bridging Generative and Discriminative Learning under Noisy Labels
  EchoAlign adjusts instances with controllable generative models to match noisy labels and selects reliable subsets, outperforming prior methods on benchmarks especially under 30% instance-dependent noise.