Vision as LoRA

2025 · arXiv 2503.20680

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Local Spatiotemporal Convolutional Network for Robust Gait Recognition

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

LSTCN is a dual-branch CNN that extracts temporal gait features by pooling spatial data into strips and applying local spatiotemporal convolutions with asymmetric kernels.

Selective LoRA for Visual Tokens and Attention Heads

cs.CV · 2025-12-22 · unverdicted · novelty 7.0

Image-LoRA selectively adapts only visual tokens and chosen attention heads in VLMs, matching standard LoRA performance with lower parameter count and FLOPs.

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

SAME-Net adds a differentiable soft attention mask embedding module to achieve rectification-free end-to-end scene text spotting with 84.02% H-mean on Total-Text.

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

cs.LG · 2026-04-26 · conditional · novelty 6.0 · 2 refs

Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accuracy to 71-72.5% on Gemma-2B and Mistral-7B.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

cs.CV · 2026-04-20 · unverdicted · novelty 6.0 · 2 refs

OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion

cs.CV · 2026-04-27 · unverdicted · novelty 5.0

CPIFNet decomposes non-homogeneous dehazing into multiple homogeneous sub-problems via specialized IENet branches trained on different haze concentrations, then uses IFNet to fuse advantageous regions through deep feature merging.

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction

cs.CV · 2026-04-03 · unverdicted · novelty 5.0

A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and MatterPort3D with reduced error and fast inference.

Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

cs.CV · 2026-04-28 · unverdicted · novelty 3.0

RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.

citing papers explorer

Local Spatiotemporal Convolutional Network for Robust Gait Recognition
cs.CV · 2026-05-14 · unverdicted · none · ref 41

Selective LoRA for Visual Tokens and Attention Heads
cs.CV · 2025-12-22 · unverdicted · none · ref 23

Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
cs.CV · 2026-05-18 · unverdicted · none · ref 43

The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation
cs.LG · 2026-04-26 · conditional · none · ref 34 · 2 links

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV · 2026-04-20 · unverdicted · none · ref 99 · 2 links

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV · 2026-05-12 · unverdicted · none · ref 131

Multi-Branch Non-Homogeneous Image Dehazing via Concentration Partitioning and Image Fusion
cs.CV · 2026-04-27 · unverdicted · none · ref 59

Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
cs.CV · 2026-04-03 · unverdicted · none · ref 57

Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
cs.CV · 2026-04-28 · unverdicted · none · ref 51
