archive
Every paper Pith has read.
9568 papers in cs.CV · page 15
-
Prompting methods raise table QA accuracy without training
Efficient Table QA via TableGrid Navigation and Progressive Inference Prompting
-
Occupancy-opacity split in Gaussians renders both reflection and transmission
RT-Splatting: Joint Reflection-Transmission Modeling with Gaussian Splatting
-
Shared codebook bridges modalities without full data pairs
CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook
-
GaussianZoom generates detailed 3D zooms from low-res inputs
GaussianZoom: Progressive Zoom-in Generative 3D Gaussian Splatting with Geometric and Semantic Guidance
-
Gaps in real face embeddings hold 10M virtual identities
Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning
-
Noise alignment lets video models output endless coherent clips
Enhancing Train-Free Infinite-Frame Generation for Consistent Long Videos
-
Speech audio accelerates MRI reconstruction of vocal tracts
SIREM: Speech-Informed MRI Reconstruction with Learned Sampling
-
Synthetic egocentric videos raise real interaction model accuracy
EgoInteract: Synthetic Egocentric Videos Generation for Interaction Understanding and Anticipation
-
Question routing lifts zero-shot spatial video QA by up to 5%
SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning
-
RGB cameras build 3D scene graphs for robots as well as depth sensors
RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots
-
Sensory-bounded reasoning lifts MLLM accuracy on second-order belief tasks
Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks
-
Best Segmentation Buddies match image pixels to 3D shape parts
Best Segmentation Buddies for Image-Shape Correspondence
-
View-aware experts lift aerial-ground re-ID mAP by 10 points
View-Aware Semantic Alignment for Aerial-Ground Person Re-Identification
-
Heavy-light split cuts diffusion sampling cost by 2-4x
Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network
-
External cameras boost robot scene recall by up to 79%
Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation
-
TokenMask predicts masks directly from token affinities
Token-Space Mask Prediction for Efficient Vision Transformer Segmentation
-
Agentic selector ranks second on four-day multimodal challenge
MARS: Technical Report for the CASTLE Challenge at EgoVis 2026
-
Soft attention masks enable text spotting without rectification
Do You Need Text Rectification? Soft Attention Mask Embedding for Rectification-Free Scene Text Spotting
-
Consistency reward lifts VLM spatial reasoning
Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency
-
New module keeps multimodal models true to images during long answers
Vision Inference Former: Sustaining Visual Consistency in Multimodal Large Language Models
-
Semi-supervised method removes nighttime flares from unlabeled images
Semi-LAR: Semi-supervised Contrastive Learning with Linear Attention for Removal of Nighttime Flares
-
Joint model fuses reconstruction and causal video generation for driving
Xiaomi Auto World Model: A Joint World Model Integrating Reconstruction and Generation for Autonomous Driving
-
3D generators leave fingerprints that identify their source
Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models
-
Semantic guidance localizes lesions before segmentation and diagnosis
Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis
-
Hybrid tokenizer splits semantics from pixels for better results
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
-
Bangla medical questions trip up top AI models
How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking
-
Grounding cuts tokens 18x while matching big models on home tasks
TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
-
Residual fusion refines hands while stabilizing body in video meshes
DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos
-
Diffusion model generates aligned urban energy maps from roads
SENSE: Satellite-based ENergy Synthesis for Sustainable Environment
-
Synthetic data cuts mixed-object counting errors by 20%
The MixCount Dataset: Bridging the Data Gap for Open-Vocabulary Object Counting
-
Lightweight ConvNet ensembles match heavy models on Arabic handwriting
Embedded ConvNet Ensembles: A Lightweight Approach to Recognize Arabic Handwritten Characters
-
Pixle attack fools Arabic handwriting AI at 99-100% success
Threats to Arabic Handwriting Recognition: Investigating Black-Box Adversarial Attacks on embedded ConvNet models
-
Triplane features adapt to standard codecs for better volumetric compression
CATRF: Codec-Adaptive TriPlane Radiance Fields for Volumetric Content Delivery
-
Instant3D creates 3D assets in seconds
Efficient 3D Content Reconstruction and Generation
-
Dynamic relevance scores pick pruning regime to shrink omni-modal tokens
OmniSelect: Dynamic Modality-Aware Token Compression for Efficient Omni-modal Large Language Models
-
Template-guided field fuses cues for fast 3D shape matching
SGSoft: Learning Fused Semantic-Geometric Features for 3D Shape Correspondence via Template-Guided Soft Signals
-
Lateral line patches raise salmon re-ID accuracy across cameras
Patch Ensembles for Robust Salmon Re-Identification with Weak Trajectory Labels
-
Filtered data beats bigger models in grocery retrieval
What Matters for Grocery Product Retrieval with Open Source Vision Language Models
-
Dual-stage activation fixes attribute binding in open-vocabulary detection
DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection
-
Training fixes attention so text alone locates video objects
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
-
TinySAM 2 cuts SAM 2 memory tokens to 7 percent at 90 percent accuracy
TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model
-
Semantic scoring refines distilled image datasets
SAS: Semantic-aware Sampling for Generative Dataset Distillation
-
Graph completion turns non-functional 3D models functional
Functionalization via Structure Completion and Motion Rectification
-
Inter-frame learning cuts bits for LiDAR geometry
Inter-LPCM: Learning-based Inter-Frame Predictive Coding for LiDAR Point Cloud Compression
-
Per-module scaling lifts low-bit quantization accuracy
MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization
-
Optical encoder cuts gaze tracking latency to 3.4 ms
Low Latency Gaze Tracking via Latent Optical Sensing
-
Event data fused with frames yields clear silhouettes at kilohertz rates
See Silhouettes in Motion with Neuromorphic Vision
-
Decoupled attention balances references in remote sensing super-resolution
Learning to Balance: Decoupled Siamese Diffusion Transformer for Reference-Based Remote Sensing Image Super-Resolution