archive
Every paper Pith has read. Search by title, abstract, or pith.
9568 papers in cs.CV · page 16
-
Agent reaches 0.90 WISE score in multi-turn image generation
Generation Navigator: A State-Aware Agentic Framework for Image Generation
-
Fewer semantic tokens match full multimodal performance
A More Word-like Image Tokenization for MLLMs
-
Adapted FamNet counts washer parts at 1.96 MAE
Counting Machine Parts
-
Raw patches cut language bias in remote sensing vision models
SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning
-
Benchmark shows agents at 79% on game video questions vs 95% oracle
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
4 Piths -
Agents reach 79% on game video frames
SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain
4 Piths -
New UAV benchmark slashes 3D reconstruction errors by up to 84%
UAVFF3D: A Geometry-Aware Benchmark for Feed-Forward UAV 3D Reconstruction
-
Visual atlases evolve from trajectories to guide VLM agents
AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
-
Transient expert steers MoE updates to cut forgetting
CP-MoE: Consistency-Preserving Mixture-of-Experts for Continual Learning
-
Streaming video model cuts tokens 95% with cascaded control
An Efficient Streaming Video Understanding Framework with Agentic Control
-
One anchor pair identifies domain transfer under Jacobian sparsity
Domain Transfer Becomes Identifiable via a Single Alignment
-
Decoupled geometry and cache yield consistent house panoramas
PanoWorld: A Generative Spatial World Model for Consistent Whole-House Panorama Synthesis
-
Surgical video QA handles full procedures with temporal consolidation
SurgLQA: Scalable Long-Horizon Surgical Video Question Answering
-
Benchmark adds touch, RL training, and real robots to world model tests
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
-
One model translates any sensor features to any other without retraining
One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception
-
Frequency disentanglement plus geodesic matching lifts few-shot medical segmentation
Beyond Euclidean Prototypes: Spectral Disentanglement and Geodesic Matching for Few-Shot Medical Image Segmentation
-
Two-phase sampling matches contradictory audio prompts to video
CounterFlow: A Two-Phase Inference-Time Sampling for Counterfactual Video Foley Generation
-
Mamba model beats SOTA on ECG multi-label scores
HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities
-
Classical SIFT beats learned descriptors on accuracy and speed
PySIFT: GPU-Resident Deterministic SIFT for Deep Learning Vision Pipelines
-
Smartphone LiDAR sees hidden objects with motion sampling
Imaging Hidden Objects with Consumer LiDAR via Motion Induced Sampling
-
Girsanov weights enable unbiased resampling for diffusion models
Simple Approximation and Derivative Free Inference-Time Scaling for Diffusion Models via Sequential Monte Carlo on Path Measures
-
Temporal pruning speeds video diffusion while preserving fidelity
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
-
Temporal smoothing lets pruning speed up video diffusion
Temporal Aware Pruning for Efficient Diffusion-based Video Generation
-
Warm-up trick lets MeanFlow scale to 80B image models
Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation
-
VLMs count by prior instead of image when facts clash
CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models
-
Scene understanding training produces human-like fixations in foveated model
Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding
-
Fourier shapes achieve 88% IR detector attack success past 25 meters
Unleashing the Representational Power of Fourier Shapes for Attacking Infrared Object Detection
-
Reward variance selects learnable prompts for T2I training
Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation
-
Post-hoc sphere normalization lifts long-tailed OOD AUROC
Is Complex Training Necessary for Long-Tailed OOD Detection? A Re-think from Feature Geometry
-
High noisy-label accuracy fails to ensure OOD reliability
When Accuracy Is Not Enough: Uncertainty Collapse between Noisy Label Learning and Out-of-Distribution Detection
-
Saliency consistency loss raises defect detection accuracy
Network Knowledge Prior Guided Learning for Data-Efficient Surface Defect Detection
-
LiteLoc slashes localization storage 94% and speeds pose solving 19x
Efficient Sparse-to-Dense Visual Localization via Compact Gaussian Scene Representation and Accelerated Dense Pose Estimation
-
Tree constraints in training produce consistent plant skeletons
PlantPose: Universal Plant Skeleton Estimation via Tree-constrained Graph Generation
-
Framework makes one physical attack fool multiple AI vision tasks
Towards Universal Physical Adversarial Attacks via a Joint Multi-Objective and Multi-Model Optimization Framework
-
Aligning latent mappings reduces inconsistency in multimodal models
LatentUMM: Dual Latent Alignment for Unified Multimodal Models
-
Pixel diffusion reaches FID 1.60 at 256 resolution in 320 epochs
FrequencyBooster: Full-Frequency Modeling for High-Fidelity Pixel Diffusion
-
Adapter boosts Vision Transformer image quality assessment with fewer parameters
Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction
-
Sparsity experts and distillation enable continual adaptation
MoASE++: Mixture of Activation Sparsity Experts with Domain-Adaptive On-policy Distillation for Continual Test Time Adaptation
-
Uncertainty flow plus point cloud interaction cuts hand pose error
UST-Hand: An Uncertainty-aware Spatiotemporal Point Cloud Interaction Network for 3D Self-supervised Hand Pose Estimation
-
Continual learning adapts X-ray models to new domains at 88.66% accuracy
Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis
-
Prefix length turns frozen VLM embeddings into a semantic dial
GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
-
Patch-MoE Mamba improves segmentation of polyps and skin lesions
Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation
-
STDP rules deliver 78.6% mAP for event cameras on CPU
Brain-inspired spike-timing plasticity for reliable label-efficient event-camera vision
-
1D-2D CNN fusion with attention hits 99-100% on ECG identification
Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition
-
4D Gaussians let you query driving scenes at any future time
GEM: Gaussian Evolution Model for Occupancy Forecasting and Motion Planning
-
Sobel edges match finger knuckles at 17% rate
A simple approach for biometrics: Finger-knuckle prints recognition based on a Sobel filter and similarity measures
-
Deep learning cuts pathology slide file sizes 43-80%
Deep learning-based compression of giga-resolution whole slide images
-
Monocular RGB+IMU matches RGB-D accuracy for indoor scene graphs
Mono-Hydra++: Real-Time Monocular Scene Graph Construction with Multi-Task Learning for 3D Indoor Mapping
-
Three-stage pipeline lifts video RAG retrieval from 0.195 to 0.759 nDCG
MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation
-
System maps hand contacts to surfaces in operating rooms
TouchMap-OR: Multi-View 3D Mapping of Hand-Surface Contacts