archive
Every paper Pith has read.
9568 papers in cs.CV · page 1
-
Geometric reward aligns camera paths in generated videos
Geo-Align: Video Generation Alignment via Metric Geometry Reward
-
Pixel diffusion turns 512x512 latents into 2048x2048 images in 210 ms
PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion
-
Dedicated image editor lifts multimodal reasoning by 5 points
ETCHR: Editing To Clarify and Harness Reasoning
-
Causal tests show many brain localizations are false positives
From Activation to Causality: Discovery of Causal Visual Representations in the Human Brain
-
Token selection speeds geometry transformers over 85 percent
Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers
-
Dual-stream system inserts objects into videos harmoniously
Smart-Insertion-V: Photorealistic Video Insertion via a Closed-Loop Feedback Dual-Stream Framework
-
HorizonStream keeps 3D reconstruction stable past 10,000 frames
HorizonStream: Long-Horizon Attention for Streaming 3D Reconstruction
-
Projection conditioning lifts generative priors to scene reconstruction
GenRecon: Bridging Generative Priors for Multi-View 3D Scene Reconstruction
-
Geometric overlays on images lift MLLM spatial scores by 20%
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
-
Self-supervised priors raise physical fidelity in video generators
LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation
-
Entmax attention lifts ViT segmentation mIoU by up to 6 points
Vision Transformers Need Better Token Interaction
-
Foundation models support zero-shot causal image reasoning
Leveraging Foundation Models for Causal Generative Modeling
-
Dynamics model learns particle motion from real videos alone
Learning a Particle Dynamics Model with Real-world Videos
-
Pretraining on decomposition maps cuts labeled data needs for Mueller polarimetry
MuellerPT: Decomposition Driven Pretraining for Dense Learning in Mueller Polarimetry
-
LLM splits video queries into tool calls merged by boolean logic
Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval
-
Vision models match humans best at balanced generative-discriminative mix
Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot
-
Debiased mining converts OOD detection to Monte-Carlo sampling
Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models
-
Transformer predicts saliency from event camera streams
Exploring deep learning for Event-Based Saliency Prediction with a Transformer-based model
-
ML framework grades emeralds at 98 percent accuracy
Machine learning applied to emerald gemstone grading: framework proposal and creation of a public dataset
-
cGAN counts eucalyptus logs at 92.3 percent accuracy
A Novel Approach for the Counting of Wood Logs Using cGANs and Image Processing Techniques
-
Agent beats baselines at text-guided 3D photo search
PhotoFlow: Agentic 3D Virtual Photography Missions
-
Stabilized SegFormer reaches 0.4572 mIoU on original DMS split
Revitalizing Dense Material Segmentation: Stabilized Vision Transformers and the Generalization Paradox
-
Video models fail physics consistency under viewpoint shifts
CRONOS: Benchmarking Counterfactual Physical Consistency in Video Models
-
RiGS models multi-scale motions with three Gaussian types
RiGS: Rigid-aware 4D Gaussian Splatting from a Single Monocular Video
-
Coupling narrow models cuts 30% FLOPs from wide vision training
Recursive Block-Diagonal Coupling for Resource-Efficient Training of Vision Models
-
Adaptive search fixes blind spots in high-res image perception for LLMs
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
-
3D hand motions predict open-surgery skill with r=0.78
ExpOS: Explainable Open-Surgery Skills Assessment Using 3D Hand Reconstruction
-
Final diagnosis scores hide flawed medical workups in AI
DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
-
Entity patches in memory fix consistency in multi-shot videos
EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation
-
Semantic banks let 3D splatting handle night glow scenes
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
-
Meta-learning yields model performance scores on unlabeled data
Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning
-
Support map shows some regions supply stronger LiDAR-camera cues
Calibration-Informative Region Selection for Online LiDAR-Camera Calibration in Agricultural Environments
-
PathNavigate scans slides for surprises before matching the question
PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA
-
Tri-module augmentation lifts 3D avatar quality from short videos
Generator-Refiner-Examiner: A Tri-Module Data Augmentation Framework for 3D Human Avatar Learning from Monocular Videos
-
PixIE raises low-light PSNR by up to 15% using DINO prompts
PixIE: Prompted Pixel-Space Low-Light Image Enhancement
-
Hand motions guide stable object tracking in RGB video
ComPose: When to Trust Hands for Object Pose Tracking
-
New sampler cuts RL training time for flow models by up to 53%
Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models
-
120K triplets enable instruction editing at 4K+ resolution
VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset
-
Sketches control long video generation via independent shots
DrawVideo: Generating Long Video from Storyboard Keyframe Sketches
-
MDS-DETR gains +2.8 mAP over Deformable-DETR with 5% extra training
MDS-DETR: DETR with Masked Duplicate Suppressor
-
Bootstrapped GRTO unifies RL and tool training for segmentation
B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation
-
MDM distills vision-language datasets into compact synthetic sets
Multimodal Distribution Matching for Vision-Language Dataset Distillation
-
One model forecasts yields for many crops by learning their weather responses
PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction
-
DINOv3 beats ImageNet after finetuning on RGB inspection but loses on X-ray
Rethinking Transfer Learning for Industrial Inspection: DINOv3 vs. ImageNet Pretraining Across RGB and X-ray Tasks
-
One-Forcing scores 83.76 on VBench for one-step video
One-Forcing: Towards Stable One-Step Autoregressive Video Generation
-
32x compression and linear attention enable fast image restoration
Efficient One-Step Diffusion Restoration Model with Compact Token Compression and Linear Attention
-
VAE decoder learns to respect non-commutative latent order
Commutator-Induced Uncertainty in VAEs
-
Dynamic sparse attention delivers 2.1x video generation speedup
DFSAttn: Dynamic Fine-grained Sparse Attention for Efficient Video Generation
-
Semantic scores trigger early stops in video motion search
FAST-ME: Foundation-aware Adaptive Stopping for Motion Estimation for Efficient IoT Video Analysis
-
Sample-wise attacks fool TTA while keeping label counts normal
Sample-wise Targeted Adversarial Attacks on Test-time Adaptation