archive
Every paper Pith has read.
9568 papers in cs.CV · page 7
-
Meta-actions set new SOTA on Waymo driving challenge
DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions
-
One-step meta-actions set new Waymo driving records
DriveMA: Rethinking Language Interfaces in Driving VLAs with One-Step Meta-Actions
-
105M open image-text pairs train competitive text-to-image model
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
-
CNNs suit small land-use data
Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification
-
Transition vector refines LLM captions for zero-shot image retrieval
STiTch: Semantic Transition and Transportation in Collaboration for Training-Free Zero-Shot Composed Image Retrieval
-
Local tolerance rule reconnects gaps in Frangi vessel maps
Local-sensitive connectivity filter (ls-cf): A post-processing unsupervised improvement of the frangi, hessian and vesselness filters for multimodal vessel segmentation
-
Dataset trains AI to locate and reduce SR artifacts
SR-Ground: Image Quality Grounding for Super-Resolved Content
-
Region-aware VAE completes full heart motion cycle from single frame
RePCM: Region-Specific and Phenotype-Adaptive Bi-Ventricular Cardiac Motion Synthesis
-
Peak calibration lifts AI image detector accuracy 12% on new test
PGC: Peak-Guided Calibration for Generalizable AI-Generated Image Detection
-
Co-evolving decoder with policy fixes quality drop in discrete T2I
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
-
NaviEdit separates edit steps from noise scale for better results
Semantic Granularity Navigation in Image Editing
-
SAM3 turns rough maps into sharp bacteria explanations
SAM-Sode: Towards Faithful Explanations for Tiny Bacteria Detection
-
Manga109 revised to correct 29,000 dialogue annotations
Manga109-v2026: Revisiting Manga109 Annotations for Modern Manga Understanding
-
Fully ternary ViT reaches 82.43% accuracy at 6 MB
FTerViT: Fully Ternary Vision Transformer
-
Weierstrass function supplies 2D patch encodings for vision transformers
Weierstrass Positional Encoding for Vision Transformers
-
YOLOv11 detects military targets in synthetic thermal and night drone images
Comparative Analysis of Military Detection Using Drone Imagery Across Multiple Visual Spectrums
-
Cognitive-physical RL adds foresight to safer driving policies
Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
-
CoPhy RL framework reaches SOTA on NAVSIM with BEV foresight
Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving
-
Streaming model narrates surgery in real time at three workflow levels
SurgOnAir: Hierarchy-Aware Real-Time Surgical Video Commentary
-
One transformer switches between real-time and full 3D reconstruction
UniT: Unified Geometry Learning with Group Autoregressive Transformer
-
Pairwise comparisons improve video quality assessment generalization
VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment
-
Linear utility improves DPO for diffusion and flow image models
Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models
-
Router upgrades single-view 3D models to handle any number of views
ROAR-3D: Routing Arbitrary Views for High-Fidelity 3D Generation
-
Radar tweaks alone match complex camera fusion for 3D detection
RCGDet3D: Rethinking 4D Radar-Camera Fusion-based 3D Object Detection with Enhanced Radar Feature Encoding
-
Method cuts error in labor-progress angle from ultrasound
R2AoP: Reliable and Robust Angle of Progression Estimation from Intrapartum Ultrasound
-
3.2M synthetic pairs advance open scene text editing
TextSculptor: Training and Benchmarking Scene Text Editing
-
New model clears banding from phone screen videos
VDFP: Video Deflickering with Flicker-banding Priors
-
VDFP removes banding from phone screen videos
VDFP: Video Deflickering with Flicker-banding Priors
-
New transformer fuses hyperspectral imagery with other EO sensors
SpectralEarth-FM: Bringing Hyperspectral Imagery into Multimodal Earth Observation Pretraining
-
Quantization method enables efficient autoregressive video diffusion
Q-ARVD: Quantizing Autoregressive Video Diffusion Models
-
0.5B driving model matches 7B models by adding future visual states
Grounding Driving VLA via Inverse Kinematics
-
Pairwise data trains multimodal LLMs without full joint alignments
Multimodal LLMs under Pairwise Modalities
-
Dynamic allocation speeds video diffusion 7x near-losslessly
Dynamic Video Generation: Shaping Video Generation Across Time and Space
-
Orthogonal projection fixes spatial-temporal ambiguity in 4D driving scenes
Towards Physically Consistent 4D Scene Reconstruction for Closed-loop Autonomous Driving Simulation
-
Dynamic sinks raise dynamic degree in long video generation
DySink: Dynamic Frame Sinks for Autoregressive Long Video Generation
-
LiteViLNet reaches 96.36% MaxF with 14M parameters at 164 FPS
LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation
-
Framework turns AI detection metrics into legal evidence thresholds
Verifiable Provenance and Watermarking for Generative AI: An Evidentiary Framework for International Operational Law and Domestic Courts
-
Body-anchored Gaussians let users reorder clothing layers on 3D avatars
DAMA: Disentangled Body-Anchored Gaussians for Controllable Multi-Layered Avatars
-
Landsat addition cuts TanDEM-X forest height RMSE by 13.5%
Hybrid Machine Learning Model for Forest Height Estimation from TanDEM-X and Landsat Data
-
Contact coupling improves 4D hand-object reconstruction from video
CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction
-
Contact signals align hands and objects in monocular 4D videos
CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction
-
3D scans integrate rock bolts with fractures for mine assessment
Towards Integrated Rock Support Visualisation in 3D Point Cloud of Underground Mines
-
VGG16 detects fake images at 91% accuracy
Comparative Evaluation of Deep Learning Models for Fake Image Detection
-
Layer attention gaps reveal fix for LVLM hallucinations
Finding the Correct Visual Evidence Without Forgetting: Mitigating Hallucination in LVLMs via Inter-Layer Visual Attention Discrepancy
-
Multispectral signatures raise small-UAV detection by 6.2%
Towards UAV Detection in the Real World: A New Multispectral Dataset UAVNet-MS and a New Method
-
Role split improves faithful 4D video editing
Preserve, Reveal, Expand: Faithful 4D Video Editing with Region-Aware Conditioning
-
Hand drawings add spatial precision to text-based 3D motion generation
DrawMotion: Generating 3D Human Motions by Freehand Drawing
-
Focus-then-context method trims VLM tokens to 22% with tiny accuracy cost
Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
-
Tiny models master road reasoning from 20-80 graph scenes
Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding
-
AI continues paintings by predicting next strokes from canvas history
PaintCopilot: Modeling Painting as Autonomous Artistic Continuation