archive
Every paper Pith has read.
9568 papers in cs.CV · page 6
-
DoRA raises VLA success rates by 10.4 points over SFT
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
-
Seizure video dataset yields 0.96 F1 on epileptic classification
Seizure-Semiology-Suite (S3): A Clinically Multimodal Dataset, Benchmark, and Models for Seizure Semiology Understanding
-
PET/CT model matches full segmentation accuracy with 10% labels
An Open Multi-Center Whole-Body FDG PET/CT Foundation Model for Tumor Segmentation
-
Embeddings support 99% accurate tomato field mapping
Mapping Tomato Cropping Systems in California Using AlphaEarth Geospatial Embeddings and Deep Learning Analysis
-
Context rewrite lifts 3D grounding accuracy by up to 22 points
MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue
3 Piths -
Scene graph matching grounds 3D objects from language without training
SceneGraphGrounder: Zero-Shot 3D Visual Grounding via Structured Scene Graph Matching
-
Diffusion model relights full-body videos consistently under new lights
BodyReLux: Temporally Consistent Full-Body Video Relighting
-
4D geometry supervision lifts robot video models to 81% success
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
-
VLMs give better 3D vehicle dimensions than lidar in occluded cases
Improving 3D Labeling in Self-Driving by Inferring Vehicle Information using Vision Language Models
-
Lightweight cross-encoder matches LLM judges for caption evaluation
BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model
-
Vision-IMU attention fusion cuts hand tracking error by 16%
AVI-HT: Adaptive Vision-IMU Fusion for 3D Hand Tracking
-
HSR methods vary by over 13 dB across degradation types
HyperBench: Standardizing and Scaling Synthetic Evaluation for Hyperspectral Super-Resolution
-
AI turns T1 scans into motion-free high-res MRIs
MRecover: A Conditional Generative Model for Recovering Motion-Corrupted MR images Using AI Generated Contrast
-
Stochastic policy amortizes diffusion guidance for 5x faster sampling
Hierarchical Variational Policies for Reward-Guided Diffusion
-
Ultrasound VQA model learns to zoom closer before diagnosing
Look-Closer-Then-Diagnose: Confidence-Aware Ultrasound VQA via Active Zooming
-
VLMs retain gains after corrupting thought tokens
Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
-
Three-plane aggregation raises stroke lesion Dice score
VRXU-net: A Deep Learning Approach for Brain Ischemic Stroke Lesion Detection and Segmentation in T1W MRI
-
New benchmark shows LVLMs falter on furniture assembly videos
Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly
-
Text rendered on masks improves images and halves inference cost
UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation
-
Agents evolve image generation by distilling trajectory differences
GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation
-
3.8B model rivals larger ones using 19% of the training compute
Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
-
Amortized noise sampling cuts diffusion teacher variance 10x
Variance Reduction for Expectations with Diffusion Teachers
-
Amortized resampling yields 2-3x compute gains for diffusion teachers
Variance Reduction for Expectations with Diffusion Teachers
-
Single editing task lifts understanding
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
-
One editing task improves understanding
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
-
Fixed-point distillation matches multi-step diffusion in one pass
One-Step Distillation of Discrete Diffusion Image Generators via Fixed-Point Iteration
-
Unified model generates simulation-ready 3D assets across object types
PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects
-
WikiVQABench tests VLMs on Wikipedia questions needing external knowledge
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
-
Latent dynamics model yields coherent full-body avatar animations
Latent Dynamics for Full Body Avatar Animation
-
Evidential memory turns frozen 3D generators into streaming systems
Stream3D: Sequential Multi-View 3D Generation via Evidential Memory
-
Few-step streaming adapts generators for video editing without training
StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation
-
Prototypes and pathways fuse for cancer survival prediction with built-in explanations
ProtoPathway: Biologically Structured Prototype-Pathway Fusion for Multimodal Cancer Survival Prediction
-
VLMs miss most time-based glitches in game videos
TempGlitch: Evaluating Vision-Language Models for Temporal Glitch Detection in Gameplay Videos
-
Two-frame recurrent method restores turbulence videos efficiently
ReMATF: Recurrent Motion-Adaptive Multi-scale Turbulence Mitigation for Dynamic Scenes
-
New method masters interactive video try-on with hand and action guidance
iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
-
Smartphone runs full gait analysis locally without cloud upload
AIGaitor: Privacy-preserving and cloud-free motion analysis for everyone, using edge computing
-
Gossip-based critic sharing lifts multi-cell OFDMA sum-rates in 6G
FedCritic: Serverless Federated Critic Learning-based Resource Allocation for Multi-Cell OFDMA in 6G
-
Top-n encoder selection lifts blended emotion accuracy
Ordering Matters: Rank-Aware Selective Fusion for Blended Emotion Recognition
-
3D point clouds lift VLA robot success by 10%
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
-
Road videos now produce captions with chosen tone
RoadTones: Tone Controllable Text Generation from Road Event Videos
-
One model shifts image restoration from precise to creative
Disentangling Generation and Regression in Stochastic Interpolants for Controllable Image Restoration
-
Simulation feedback picks best synthetic scenes for driving models
Closed Loop Dynamic Driving Data Mixture for Real-Synthetic Co-Training
-
Diffusion model fills Antarctic Landsat gaps without references
A Non-Reference Diffusion-Based Restoration Framework for Landsat 7 ETM+ SLC-off Imagery in Antarctica
-
Model fixes occlusion order in overlapping layout-to-image scenes
OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation
-
Hyper-V2X estimates epistemic and aleatoric uncertainty in cooperative BEV segmentation
Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation
-
Adaptive fusion gives linear SSMs flexible vision and 3D fusion
Deformba: Vision State Space Model with Adaptive State Fusion
-
Contrasting patients with controls isolates disease subgroups
Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
-
Reweighting image-negative tokens cuts LVLM hallucinations
Reducing Object Hallucination in LVLMs via Emphasizing Image-negative Tokens
-
Continuous flow matching generates realistic EEG signals
Let EEG Models Learn EEG