archive
Every paper Pith has read. Search by title, abstract, or pith.
9568 papers in cs.CV · page 12
-
Spatial weighting and dual loss create novel text-to-image objects
Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting
-
Sparse anchor fields yield editable SVGs at full raster fidelity
AnchorFlow: Editable SVG Reconstruction via Sparse Anchor Point Fields
-
Evidential head gives reliable uncertainty for 3D pointmaps
Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R
-
RL solver reaches 82.9% on CAPTCHA benchmark
CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision
-
Replace blocks with synthesized operators to cut training costs
Replacement Learning: Training Neural Networks with Fewer Parameters
-
Early core token attention ranks best seeds for text-to-image results
Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection
-
The paper describes a framework for 3D localization in multimodal large language models…
Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
-
Dual prompts help CLIP identify occluded people better
Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification
-
Negative data cuts collisions in driving AI models
SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving
-
Merging LLMs into VLMs boosts instructions but not math
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
-
Dual-branch model wins photo quality challenge via explicit differences
iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment
-
Single photo becomes real-time physics video of interacting objects
TelePhysics: Physics-Grounded Multi-Object Scene Generation from a Single Image with Real-Time Interaction
-
Text-guided edits keep watermarks intact after decoder-loss training
Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing
-
Subtraction module lifts unsupervised video domain adaptation
Return of Frustratingly Easy Unsupervised Video Domain Adaptation
-
Event pruning trims 80% of tokens but raises reasoning accuracy
EventPrune: Cascaded Event-Assisted Token Pruning for Efficient First-Person Dynamic Spatial Reasoning
-
PathCTM cuts pathology patches by 96%
Thinking in Scales: Accelerating Gigapixel Pathology Image Analysis via Adaptive Continuous Reasoning
-
Hybrid platform syncs real CAVs with CARLA-SUMO sims for closed-loop tests
Closed-Loop Hybrid Digital Twin Platform for Connected and Automated Vehicle Validation
-
GUI agents reach only 36% success on media editing tasks
CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
-
Dynamic prompts fuse backdoors with task performance to resist pruning
Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
-
Targeted attacks succeed on encoders without knowing the task
Targeted Downstream-Agnostic Attack
-
CEPO boosts math reasoning to 43.43% at 2B and 60.56% at 4B
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
-
Model fuses layout and netlist to predict cell delay at 0.92% error
FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction
-
Prototype-anchored training halves calibration error in place recognition
KappaPlace: Learning Hyperspherical Uncertainty for Visual Place Recognition via Prototype-Anchored Supervision
-
Vision agent builds ad-hoc segmentations with a working mask
Vision Harnessing Agent for Open Ad-hoc Segmentation
-
JUDO outperforms GPT-4o on industrial anomaly QA with normal image references
JUDO: A Juxtaposed Domain-Oriented Multimodal Reasoner for Industrial Anomaly QA
-
Rebalancing attention reduces reference dominance and increases video motion
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
-
Rebalancing attention boosts motion in image-to-video models
Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models
-
Unlearning methods leave class traces in model representations
Can Vision Models Truly Forget? Mirage: Representation-Level Certification of Visual Unlearning
-
Variance penalty on penultimate neurons cuts medical AI bias
Neuron Incidence Redistribution for Fairness in Medical Image Classification
-
Tracking tokens lift LMM performance on 4D video tasks
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
-
Material codebook yields consistent physics parameters from video
MatPhys: Learning Material-Aware Physics Parameters for Deformable Object Simulation from Videos
-
Concept ontology filters noisy negatives to lift chest X-ray zero-shot tasks
Concept-Guided Noisy Negative Suppression for Zero-Shot Classification and Grounding of Chest X-Ray Findings
-
Heat dissipation flow matching outperforms most baselines
Multi-Scale Generative Modeling with Heat Dissipation Flow Matching
-
Optical pass checks 15 deepfake videos simultaneously
Scalable, Energy-Efficient Optical-Neural Architecture for Multiplexed Deepfake Video Detection
-
Atlas text boosts mammography BI-RADS accuracy
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification
-
Repositioned anchors keep motion contacts across body shapes
Skinned Motion Retargeting with Spatially Adaptive Interaction Guidance
-
Autoregressive codebook tokens sharpen MRI from extreme undersampling
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
-
Autoregressive token prediction sharpens MRI from sparse scans
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction
-
Claim differences as RL rewards balance caption hallucinations and omissions
ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison
-
Integral feedback reduces hallucinations in CT medical reports
Regulating Anatomy-Aware Rewards via Trajectory-Integral Feedback for Volumetric Computed Tomography Analysis
-
Two-stage training adds semantics to latent visual reasoning
Semantic-Enriched Latent Visual Reasoning
-
HERA lifts CD-FSS accuracy by over 4 mIoU points with tiny updates
Selective, Regularized, and Calibrated: Harnessing Vision Foundation Models for Cross-Domain Few-Shot Semantic Segmentation
-
Event streams improve VLM scene understanding in tough conditions
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
-
Event streams lift VLM captioning and VQA scores in low light and motion
RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding
-
DynaTok trims 90% of video tokens with 95% accuracy retained
DynaTok: Temporally Adaptive and Positional Bias-Aware Token Compression for Video-LLMs
-
Hierarchical rewards raise text accuracy in image generators
TextAlign: Preference Alignment for Text Rendering with Hierarchical Rewards
-
Image editing replaces video for robot task planning
SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution
-
Gated CNN detects falls on smartwatches without attention
You Don't Need Attention: Gated Convolutional Modeling for Watch-Based Fall Detection
-
Metamorphic relations reveal hidden VQA failures missed by accuracy
MetaRA: Metamorphic Robustness Assessment for Multimodal Large Language Model-based Visual Question Answering Systems
-
Matérn noise gives flow matching triangulation-agnostic behavior
Matérn Noise for Triangulation-Agnostic Flow Matching on Meshes