archive
Every paper Pith has read. Search by title, abstract, or pith.
9568 papers in cs.CV · page 8
-
3D distillation speeds wheat spike volume estimation by 100x
3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat
-
Oscillatory network scales to ImageNet with high efficiency
Winfree Oscillatory Neural Network
-
RISE makes self-evolving VLMs gain steadily without new labels
RISE: Reliable Improvement in Self-Evolving Vision-Language Models
-
Tweedie matching across overlaps extends short video models to long sequences
FlowLong: Inference-time Long Video Generation via Manifold-constrained Tweedie Matching
-
Hybrid routes inputs to concept or neural branch for accuracy gains
SynCB: A Synergy Concept-Based Model with Dynamic Routing Between Concepts and Complementary Neural Branches
-
Frozen video model plus probe wins kitchen action challenge
JFAA: Technical Report for the EPIC-KITCHENS-100 Action Anticipation Challenge at EgoVis 2026
-
VISTA wins Ego4D STA challenge by fusing frozen video features into detector
VISTA: Technical Report for the Ego4D Short-Term Object Interaction Anticipation at EgoVis 2026
-
MLLM arbitration with ensemble reaches 70.49% on 306 fruits
FruitEnsemble: MLLM-Guided Arbitration for Heterogeneous ensemble in Fine-Grained Fruit Recognition
-
Two-level experts reduce redundancy in multimodal cancer survival models
HDMoE: A Hierarchical Decoupling-Fusion Mixture-of-Experts Framework for Multimodal Cancer Survival Prediction
-
Map anchors egocentric pose to eliminate drift
Map-Mono-Ego: Map-Grounded Global Human Pose Estimation from Monocular Egocentric Video
-
Self-elicited reasoning and critic revision improve sarcasm detection
ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection
-
Polynomial alternatives match activation-based vision models
Activation-Free Backbones for Image Recognition: Polynomial Alternatives within MetaFormer-Style Vision Models
-
224K short videos collected by labels support semantic benchmarks
USV: Towards Understanding the User-generated Short-form Videos
-
New benchmark shows VLMs lag trained humans on building layouts
ArchSIBench: Benchmarking the Architectural Spatial Intelligence of Vision-Language Models
-
Two-stage model turns panoramic X-rays into accurate 3D dental volumes
HyDAR-Pano3D: A Hybrid Disentangled Anatomical Recovery Framework for Panoramic-to-3D Reconstruction
-
Witness cues turn missing 3D relations into usable training signals
RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
-
Visual-geometric cues recover missing 3D relations from incomplete labels
RelWitness: Open-Vocabulary 3D Scene Graph Generation with Visual-Geometric Relation Witnesses
-
TERDNet beats prior models at spotting scene changes
TERDNet: Transformer Encoder-Recurrent Decoder Network for Scene Change Detection
-
Patch alignment spots changes in free-motion videos
VSCD: Video-based Scene Change Detection in Unaligned Scenes
-
Single network pass reconstructs images with 2D Gaussians in 160-300 ms
AIR: Amortized Image Reconstruction Framework for Self-Supervised Feed-Forward 2D Gaussian Splatting
-
Reranking OSGNet candidates with MLLM wins Ego4D challenge
OSGNet with MLLM Reranking @ Ego4D Episodic Memory Challenge 2026
-
Self-similarity alignment fixes high-res diffusion conflicts
Spatial Gram Alignment for Ultra-High-Resolution Image Synthesis
-
Canny map first keeps logos and text intact in subject edits
Decomposing Subject-Driven Image Generation via Intermediate Structural Prediction
-
OlmoEarth models cut training GPU hours by 1.7x
OlmoEarth v1.1: A more efficient family of OlmoEarth models
-
Connector degrades structural semantics in video editing
What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing
-
AI detectors flag fakes well but cannot identify the source model
Findings of the Counter Turing Test: AI-Generated Image Detection
-
Detectors flag AI images reliably but fail to name their model
Findings of the Counter Turing Test: AI-Generated Image Detection
-
Intermediate alignment cuts physics residuals by 66% in diffusion models
Learning to Think in Physics: Breaking Shortcut Learning in Scientific Diffusion via Representation Alignment
-
Attention alignment yields accurate attributes in visual stories
AttriStory: Fine-grained Attribute Realization for Visual Storytelling with Diffusion Models
-
Visual token masking flags hallucinations in medical VQA answers
VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering
-
Diffusion from points creates masks for infrared target detection
Diffuse to Detect: Bi-Level Sample Rebalancing with Pseudo-Label Diffusion for Point-Supervised Infrared Small-Target Detection
-
Lightweight U-Net segments spines in CT scans on basic hardware
SpineContextResUNet: A Computationally Efficient Residual UNet for Spine CT Segmentation
-
New guidance resolves gradient conflicts in flow models
Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
-
Constraint engine turns AI drawings into verifiable geometry reasoning
Draw2Think: Harnessing Geometry Reasoning through Constraint Engine Interaction
-
Scale-decoupled alignment improves remote sensing incremental detection
STAR-IOD: Scale-decoupled Topology Alignment with Pseudo-label Refinement for Remote Sensing Incremental Object Detection
-
Language priors fix long-tail bias in 3D point cloud clustering
Resolving Long-Tail Ambiguity in Unsupervised 3D Point Cloud Segmentation with Language Priors
-
Open-source iris algorithms pass first official IREX evaluation
Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition
-
Method generates editable 3D surfaces from hand sketches
Sketch2MinSurf: Vision-Language Guided Generation of Editable Minimal Surfaces from Hand-Drawn Sketches
-
Attention reweighting suppresses spurious features before CNN pooling
Deep Attention Reweighting: Post-Hoc Attention-Based Feature Aggregation in CNNs for Disentangling Core and Spurious Features under Spurious Correlations
-
Designer ratings dataset lifts AI graphic scorer to 0.611 agreement
TASTE: A Designer-Annotated Multi-Dimensional Preference Dataset for AI-Generated Graphic Design
-
Early high-frequency injection reduces OOD score overlap
Early High-Frequency Injection for Geometry-Sensitive OOD Detection
-
Virtual outliers reshape geometry to handle noisy labels
GAMR: Geometric-Aware Manifold Regularization with Virtual Outlier Synthesis for Learning with Noisy Labels
-
Decoupling reliabilities lifts noisy-label accuracy
Holistic Reliability Propagation: Decoupling Annotation and Prediction for Robust Noisy-Label
-
ReRAM macro reaches 419 TOPS/W for edge neural inference
E-ReCON: An Energy- and Resource-Efficient Precision-Configurable Sparse nvCIM Macro for Conventional and Spiking Neural Edge Inference
-
SAVER selectively activates vision to boost F1 and cut latency in multimodal IE
SAVER: Selective As-Needed Vision Evidence for Multimodal Information Extraction
-
DAR cuts DiT training iterations by 8.75x while improving FID by 2.11
Rethinking Cross-Layer Information Routing in Diffusion Transformers
-
Agent framework hits top zero-shot scores for industrial defect detection
IndusAgent: Reinforcing Open-Vocabulary Industrial Anomaly Detection with Agentic Tools
-
IMU-warped event frames lift action recognition in dark and shaky scenes
DarkShake-DVS: Event-based Human Action Recognition under Low-light and Shaking Camera Conditions
-
VISTAQA benchmark shows models answer but rarely ground correctly
VISTAQA: Benchmarking Joint Visual Question Answering and Pixel-Level Evidence
-
GSA-YOLO hits 189 FPS while cutting compute for X-ray scans
GSA-YOLO: A High-Efficiency Framework via Structured Sparsity and Adaptive Knowledge Distillation for Real-Time X-ray Security Inspection