archive
Every paper Pith has read.
9568 papers in cs.CV · page 10
-
Neural fields guide free Gaussians to capture layered clothing
PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars
-
Free Gaussians in neural space model avatars with complex clothing
PiG-Avatar: Hierarchical Neural-Field-Guided Gaussian Avatars
-
LoRA upgrade turns text-to-image flows bidirectional
FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision-Language Generation
-
Benchmark enables reliable testing of multi-shot audio-video models
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
-
Staged perception training boosts VLM accuracy with shorter reasoning
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
-
AUDITS benchmark tests detectors on 530K manipulated images
Multi-axis Analysis of Image Manipulation Localization
-
New test reveals VLMs ignore camera motion in spatial tasks
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
-
Prototype layer matches ResNet accuracy on composite X-ray defects
Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites
-
Counterfactual tests expose failures in LVLM attribution for chest X-rays
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
-
Billion-scale 3D Gaussians train on one 24 GB GPU
TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization
-
Dataset lets AI models generate native 100MP images
PixVerve: Advancing Native UHR Image Generation to 100MP with a Large-Scale High-Quality Dataset
-
Natural-language concepts replace tokens for multi-target segmentation
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
-
One model handles any-to-any translation across five remote sensing modalities
MetaEarth-MM: Unified Multimodal Remote Sensing Image Generation with Scene-centered Joint Modeling
-
First-frame spatial prompts raise cross-scene trajectory accuracy
Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
-
VLM-guided DPO lifts driving model human alignment by 12%
VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving
-
Adaptive Manifold Guidance conserves probability during strong guidance
Probability-Conserving Flow Guidance
-
Pixel classification hits 95.48% accuracy on angiogram vessels
X-Ray cardiac angiographic vessel segmentation based on pixel classification using machine learning and region growing
-
Small tables bind new visual concepts to word triggers
Tiny-Engram: Trigger-Indexed Concept Tables for Generative Vision
-
Pix2pix network segments heart fat on CT scans with 99% accuracy
Cardiac fat segmentation using computed tomography and an image-to-image conditional generative adversarial neural network
-
SDM improves adversarial attack performance and efficiency by reconstructing the…
SDM: A Powerful Tool for Evaluating Model Robustness
-
Second opacity per Gaussian cleans up object masks
OP2GS: Object-Aware 3D Gaussian Splatting with Dual-Opacity Primitives
-
Pruning 90% non-text tokens cuts omni-LLM cost by 9x
Stage-adaptive Token Selection for Efficient Omni-modal LLMs
-
Nash equilibrium scores filter unstable multimodal reasoning steps
A Nash Equilibrium Framework For Training-Free Multimodal Step Verification
-
Frequency priors guide short-video quality scores
FGSVQA: Frequency-Guided Short-form Video Quality Assessment
-
CryoNet maps debris-covered glaciers at 90 percent IoU
CryoNet: A Deep Learning Framework for Multi-Modal Debris-Covered Glacier Mapping. A Case Study of the Poiqu Basin, Central Himalaya
-
Anime-trained VLM turns sparse sketches into aligned video outputs
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
-
Four photodiodes replace cameras for robot odometry
Minimalist Visual Inertial Odometry
-
Visual encoder spatial detail fix unlocks precise robot tasks
Beyond Binary Success: A Diagnostic Meta-Evaluation Framework for Fine-Grained Manipulation
-
Illumination priors guide selective recovery in dark photos
InterLight: Leveraging Intrinsic Illumination Priors for Low-Light Image Enhancement
-
Video transcript grounding reward lifts planning accuracy by 7-16 points
RECIPE: Procedural Planning via Grounding in Instructional Video
-
Fusing lifted panoramas yields long-range navigable 3D worlds from text
SphericalDreamer: Generating Navigable Immersive 3D Worlds with Panorama Fusion
-
World-ego split lifts long-horizon hybrid robot modeling
World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied Tasks
-
Refined gradient attention rollout identifies surviving semantic regions to guide…
Towards Fine-Grained Robustness: Attention-Guided Test-Time Prompt Tuning for Vision-Language Models
-
VLMs and agents miss over half the score on wild road damage
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
-
Future emotion prediction raises multimodal recognition accuracy
AffectVerse: Emotional World Models for Multimodal Affective Computing
-
One pass turns sparse aerial photos into full 3D city models
Feed-Forward Gaussian Splatting from Sparse Aerial Views
-
Model fuses lidar and plot data for lower-bias forest biomass maps
StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels
-
SplitQ keeps 93.5% accuracy at 3-bit VLM quantization
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models
-
Diversity in memory buffers improves TTA under tight constraints
GoTTA be Diverse: Rethinking Memory Policies for Test-Time Adaptation
-
3D Gaussians replace grids for continuous color mapping
GLUT: 3D Gaussian Lookup Table for Continuous Color Transformation
-
U-Net feature energy cuts Janus rate in text-to-3D
Structural Energy Guidance for View-Consistent Text-to-3D Generation
-
Persona prompts lift construction safety checks by 12 percent
Passive Construction Site Safety Monitoring via Persona-Scaffolded Adversarial Chain-of-Thought VLM Verification
-
New decoder head raises wound segmentation Dice to 81.9%
WoundFormer: Multi-Scale Spatial Feature Fusion for Multi-Class Wound Tissue Segmentation
-
Layout priors raise markdown F1 from 0.37 to 0.92 on OOD docs
Structured Layout Priors for Robust Out-of-Distribution Visual Document Understanding
-
Score-based guidance fixes viewpoint estimation in diffusion models
Landscape-Awareness for Geometric View Diffusion Model
-
VLMs lag on gaze following and social attention benchmarks
Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
-
VLMs trail visual models on gaze following and social attention
Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models
-
Zero-shot image models fall short on concept faithfulness for XAI
A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability
-
Dense benchmark exposes open VLMs' gaps on subtle human actions
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding
-
Open VLMs struggle with fine details in human video actions
FineBench: Benchmarking and Enhancing Vision-Language Models for Fine-grained Human Activity Understanding