archive
Every paper Pith has read.
9568 papers in cs.CV · page 3
-
Benchmark shows MLLMs fail on 16-minute continuous video reasoning
VideoOdyssey: A Benchmark for Ultra-Long-Context and Omni-Modal Video Understanding
-
Projector fix lifts Video-LLM motion direction accuracy from 26% to 85%
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs
-
Camera pose tokens lift video model spatial scores 4.5-6.5%
Cambrian-P: Pose-Grounded Video Understanding
-
Reasoning adds secondary motions for natural video
MotiMotion: Motion-Controlled Video Generation with Visual Reasoning
-
Self-awareness module improves language-guided navigation
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation
-
Gestures raise robot object selection accuracy in cluttered scenes
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations
-
Dashcam videos turned into full AV multi-sensor data
Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving
-
Metro suicide risk scored from video by tracking and heatmaps
Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations
-
VLMs keep high scores after most image tokens are deleted
Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?
-
Queries raise PSNR by 3.6 dB and cut convergence time by 3x in frozen autoencoders
DecQ: Detail-Condensing Queries for Enhanced Reconstruction and Generation in Representation Autoencoders
-
Synthetic faces alone match real data for rare pediatric disease AI
Synthetic Data Alone is Enough? Rethinking Data Scarcity in Pediatric Rare Disease Recognition
-
Generated images show anomalous ultra-high-frequency spectral uplift
Spectral Tail Auxiliary Learning for AI-Generated Image Detection
-
Retrieval keeps video worlds consistent at double speed
WorldKV: Efficient World Memory with World Retrieval and Compression
-
Simulated dense placements train IMU model that ignores sensor setup
AnyMo: Geometry-Aware Setup-Agnostic Modeling of Human Motion in the Wild
-
Multiview cues and orientation prompts lift zero-shot action recognition
Cross-Domain Human Action Recognition from Multiview Motion and Textual Descriptions
-
Synthetic viewpoints plus state-space encoding boost action detection
Improving Viewpoint-Invariance and Temporal Consistency for Action Detection
-
Disentangling vision-language embeddings without added dimensions
Conceptualizing Embeddings: Sparse Disentanglement for Vision-Language Models
-
Taylor expansion picks surprising frames in long videos
Swift Sampling: Selecting Temporal Surprises via Taylor Series
-
One ConvNeXt model serves many compute budgets
Slimmable ConvNeXt: Width-Adaptive Inference for Efficient Multi-Device Deployment
-
Coherent behavior vectors let VLA models match top results with half the data
From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
-
SEGA adapts attention scaling to latent frequencies for higher-res DiT outputs
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
-
Sparse autoencoder links reasoning steps to image masks
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
-
Images boost LLM poetry detectors past RoBERTa
Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
-
Nonce substitutions rank captions for better VL data selection
What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining
-
Causal model matches age changes in spine DXA images
From Baseline to Follow-Up: Counterfactual Spine DXA Image Synthesis in UK Biobank Using a Causal Hierarchical Variational Autoencoder
-
CAME-Grad optimizer lifts radiology reports by 2 percent
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
-
CAME-Grad fixes gradient double dilemma in report generation
The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution
-
Five functional body clusters improve full-pose reconstruction from head and hands
AtomicMotion: Learning Human Motion From Different Human Parts
-
Physics priors train dense human scene flow from monocular video
H-Flow: Self-supervised Human Scene Flow via Physics-inspired Joint Multi-modal Learning
-
Graph reasoning turns radiology reports into precise 3D lesion maps
GLeVE: Graph-Guided Lesion Grounding with Proposal Verification in 3D CT
-
Head-conditioned LoRA lifts gaze following on non-salient targets
Enhancing Gaze Reasoning in Vision Foundation Models for Gaze Following
-
Dual-interval motion cues decouple ego-motion for UAV detection
Decoupling Ego-Motion from Target Dynamics via Dual-Interval Motion Cues for UAV Detection
-
No single noisy-label method wins for frozen vision models
Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure
-
3D reconstruction turns floorplan localization into alignment task
SceneAligner: 3D-Grounded Floorplan Localization in the Wild
-
New metric shows detection limits online map accuracy
Beyond Chamfer Distance: Granular Order-aware Evaluation Metric For Online Mapping
-
Attention maps for tumor sub-regions come free in one lightweight 3D model
SegGuidedNet: Sub-Region-Aware Attention Supervision for Interpretable Brain Tumor Segmentation
-
Generative models create controlled videos to test MLLM spatio-temporal reasoning
VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis
-
Fourier shape descriptors create time-consistent cell phantom videos
Cell Phantom Video Generation in Elliptical Fourier Descriptor Domain
-
Geometry must ground visual tokens before reasoning starts
GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
-
Unified model handles many fashion search types at once
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
-
Multimodal data improves two-wheeler rider behavior recognition
MOTOR: A Multimodal Dataset for Two-Wheeler Rider Behavior Understanding
-
Similar cases form graphs that refine medical image diagnoses
Case-Aware Medical Image Classification with Multimodal Knowledge Graphs and Reliability-Guided Refinement
-
Motion and geometry cues boost SAM 2 tracking on nonlinear scenarios
Segment Anything with Motion, Geometry, and Semantic Adaptation for Complex Nonlinear Visual Object Tracking
-
Degraded images break spatial reasoning in current AI
SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation
-
Latent sharing speeds up collaborative driving coordination
LACO: Adaptive Latent Communication for Collaborative Driving
-
Training-free method segments fine-grained fungi in low-data regimes
Training-Free Fine-Grained Semantic Segmentations in Low Data Regimes: A FungiTastic Baseline
-
Discarded classifier weights act as semantic anchors
Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling
-
Multi-agent self-evolution sets SOTA on image retrieval benchmarks
DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval
-
Masked metric improves agreement with humans on concept fidelity
MaSC: A Masked Similarity Metric for Evaluating Concept-Driven Generation
-
Fused geometry and appearance metric predicts synthetic data transfer
SADGE: Structure and Appearance Domain Gap Estimation of Synthetic and Real Data