ImageBind: One Embedding Space to Bind Them All
Canonical reference. 76% of citing Pith papers cite this work as background.
citing papers explorer
-
RS2AD-LiDAR: End-to-End Autonomous Driving LiDAR Data Generation from Roadside Sensor Observations
RS2AD-LiDAR reconstructs vehicle LiDAR data from roadside observations via coordinate transformation, virtual LiDAR modeling and resampling, claimed as the first such method, with experiments showing improved object detection when mixed with real data.
-
3D LULC classification using multispectral LiDAR and deep learning: current and prospective schemes
Introduces NMCA-aligned L1/L2 LULC schemes and the Loosdorf-MSL benchmark dataset, with Point Transformer V3 reaching 79.4% mIoU on 8 classes and 58.9% on 20 classes, plus gains from multispectral inputs.
-
AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding
AgroVG is a new multi-source benchmark for agricultural visual grounding formulated as generalized set prediction, with protocols for box and mask grounding across single-target, multi-target, and target-absent queries from six object families.
-
Oracle Supervision Transfers for Hyperparameter Prediction in Model-Based Image Denoising
HyperDn is a configuration-conditioned predictor that transfers oracle supervision across denoising paradigms to achieve near-oracle hyperparameter prediction with few or zero target labels.
-
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation
PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.
-
LMM-Track4D: Eliciting 4D Dynamic Reasoning in LMMs via Trajectory-Grounded Dialogue
LMM-Track4D formulates a trajectory-grounded dialogue task, releases Track4D-Bench with 526 samples, and proposes RTGE encoding, TRK state token, and OSK-RA decoder to elicit better 4D spatiotemporal reasoning in LMMs.
-
PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media
PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.
-
Vector Scaffolding: Inter-Scale Orchestration for Differentiable Image Vectorization
Vector Scaffolding uses Interior Gradient Aggregation, Progressive Stratification, and Rapid Inflation Scheduling to achieve 2.5x faster optimization and up to 1.4 dB higher PSNR in differentiable vectorization.
-
Martingale-Consistent Self-Supervised Learning
The paper develops a martingale-consistent SSL framework enforcing expected coherence between coarse and refined predictions via new objectives and a Monte Carlo estimator, improving robustness under partial observations.
-
Geometrically Approximated Modeling for Emitter-Centric Ray-Triangle Filtering in Arbitrarily Dynamic LiDAR Simulation
GRCA uses emitter-centric geometric culling of rays per triangle to accelerate LiDAR simulation in arbitrarily dynamic scenes, reporting up to 14.55x speedup over Embree and 7.97x over OptiX.
-
AnomalyClaw: A Universal Visual Anomaly Detection Agent via Tool-Grounded Refutation
AnomalyClaw turns single-step VLM anomaly judgments into a multi-round tool-grounded refutation process, delivering consistent macro-AUROC gains of 3.5-7.9 percentage points over direct inference across 12 cross-domain datasets.
-
Field-Localized Forgery Detection for Digital Identity Documents
FLiD is a field-localized forgery detection method for identity documents that outperforms full-document baselines and general detectors with significantly fewer parameters.
-
ProtoSSL: Interpretable Prototype Learning from Unlabeled Time-Series Data
ProtoSSL discovers generalizable prototypes from unlabeled time-series via self-supervision and assigns them to new tasks for interpretable predictions, outperforming supervised baselines in low-data regimes on ECG datasets.
-
Does it Really Count? Assessing Semantic Grounding in Text-Guided Class-Agnostic Counting
Text-guided class-agnostic counting models exhibit significant weaknesses in grounding textual prompts to visual objects, as demonstrated by new negative-label and distractor tests on a multi-category dataset.
-
ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue
ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.
-
LoopCTR: Unlocking the Loop Scaling Power for Click-Through Rate Prediction
LoopCTR trains CTR models with recursive layer reuse and process supervision so that zero-loop inference outperforms baselines on public and industrial datasets.
-
Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
DHCNet improves ultra-fine-grained visual categorization by progressively building holistic cognition from local discrepancies using self-shuffling and refinement on limited data.
-
BasketHAR: A Multimodal Dataset for Human Activity Recognition and Sport Analysis in Basketball Training Scenarios
BasketHAR is a publicly released multimodal dataset of professional basketball training activities captured with inertial sensors, physiological signals, and video, accompanied by a baseline alignment method.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
DiV-INR: Extreme Low-Bitrate Diffusion Video Compression with INR Conditioning
DiV-INR integrates implicit neural representations as conditioning signals for diffusion models to achieve better perceptual quality than HEVC, VVC, and prior neural codecs at extremely low bitrates under 0.05 bpp.
-
On the Decompositionality of Neural Networks
Neural decompositionality is defined via decision-boundary semantic preservation, and language transformers largely satisfy it under SAVED while vision models often do not.
-
Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.
-
LumaFlux: Lifting 8-Bit Worlds to HDR Reality with Physically-Guided Diffusion Transformers
LumaFlux is a physically and perceptually guided diffusion transformer for SDR-to-HDR conversion that introduces PGA, PCM, and HDR Residual Coupler modules plus a new training corpus and benchmark, outperforming prior ITM methods.
-
CORP: Closed-Form One-shot Representation-Preserving Structured Pruning for Transformers
CORP performs one-shot structured pruning of Transformers by modeling removed components as affine functions of retained ones and solving closed-form ridge regressions on calibration data to fold compensation into weights, retaining 83.27% Top-1 accuracy on DeiT-Huge after 50% pruning.
-
Learning to Build Shapes by Extrusion
Text Encoded Extrusions (TEE) lets LLMs generate and edit manifold 3D meshes by learning sequences of face extrusions from decomposed quadrilateral meshes.
-
FaSTA*: Fast-Slow Toolpath Agent with Subroutine Mining for Efficient Multi-turn Image Editing
FaSTA* combines LLM fast planning with A* search and inductive subroutine mining to create an efficient agent for multi-turn image editing tasks.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
High Volume Rate 3D Ultrasound Reconstruction with Diffusion Models
Diffusion models reconstruct high-resolution 3D cardiac ultrasound volumes from heavily undersampled elevation planes and outperform traditional interpolation and supervised deep learning baselines.
-
MirrorCheck: Efficient Adversarial Defense for Vision-Language Models
MirrorCheck detects adversarial attacks on VLMs via T2I regeneration for semantic consistency checks, using stochastic model selection and one-time perturbations for robustness against adaptive attacks.
-
Hyper-V2X: Hypernetworks for Estimating Epistemic and Aleatoric Uncertainty in Cooperative Bird's-Eye-View Semantic Segmentation
Hyper-V2X uses a Bayesian hypernetwork with partial weight generation and V2X context embedding to produce calibrated epistemic and aleatoric uncertainty estimates for multi-agent BEV segmentation on the OPV2V benchmark.
-
Automatic Discovery of Disease Subgroups by Contrasting with Healthy Controls
Deep UCSL uses a contrastive EM loss on patient-control labels to isolate disease-driven subgroups in medical imaging by suppressing shared healthy variability.
-
Towards Understanding Self-Pretraining for Sequence Classification
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.
-
Mechanisms of Misgeneralization in Physical Sequence Modeling
Generative sequence models for physical tasks exhibit physical misgeneralization, where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a…
-
SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation
SegRAG is a training-free retrieval-augmented framework that extracts class-specific point prompts from a filtered DINOv3 feature bank to boost SAM3 semantic segmentation performance on standard and agricultural benchmarks.
-
MSIQ: Moment-based Scale-Invariant Quality Measure for Single Image Super-Resolution
MSIQ is a scale-invariant, model-free quality metric for single image super-resolution using normalized central geometric moments for direct comparison of different-resolution images.
-
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
-
Decomposed Vision-Language Alignment for Fine-Grained Open-Vocabulary Segmentation
The Decomposed Vision-Language Alignment framework factorizes prompts into concept and attribute tokens and applies Feature-Gated Cross-Attention for better compositional generalization in fine-grained open-vocabulary segmentation.
-
On What We Can Learn from Low-Resolution Data
Low-resolution data improves high-resolution model performance when high-resolution samples are limited, as supported by KL-divergence bounds and experiments on vision transformers and CNNs.
-
EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization
A dual-branch system using frequency edge cues and CLIP-based synthetic patch detection for accurate, resolution-independent image forgery localization.
-
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning
TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.
-
Removing the Watermark Is Not Enough: Forensic Stealth in Generative-AI Watermark Removal
Current AI image watermark removal attacks replace the watermark with a different forensic signal, allowing independent detectors to distinguish processed outputs from clean images at over 98% true-positive rate under a 1% false-positive budget.
-
Probability-Flow Distillation: Exact Wasserstein Gradient Flow for High-Fidelity 3D Generation
Probability-Flow Distillation exactly matches the Wasserstein gradient flow of the target distribution when distilling 2D diffusion priors into 3D models, yielding higher-fidelity results than SDS or SDI.
-
Exposing and Mitigating Temporal Attack in Deepfake Video Detection
SpInShield is a temporal spectral-invariant defense that decouples semantic motion from manipulatable spectral artifacts in deepfake detectors via a learnable adversary and shortcut suppression optimization.
-
EAPFusion: Intrinsic Evolving Auxiliary Prior Guidance for Infrared and Visible Image Fusion
EAPFusion uses self-evolving intrinsic priors to produce dynamic, scene-adaptive convolution kernels and channel-mixing fusion for infrared-visible images, reporting state-of-the-art results and downstream gains.
-
Model Merging: Foundations and Algorithms
New cycle-consistent optimization, task vector theory, singular vector decompositions, adaptive routing, and efficient evolutionary search provide foundations for merging neural network weights across tasks.
-
Neighbor2Inverse: Self-Supervised Denoising for Low-Dose Region-of-Interest Phase Contrast CT
Neighbor2Inverse adapts the Neighbor2Neighbor principle to train a denoising network directly in the image domain for low-dose PBI-CT by using independently noised subsampled projections.
-
CSC: Turning the Adversary's Poison against Itself
CSC identifies backdoored samples via early-epoch latent clustering and conceals them by relabeling to a virtual class, driving attack success rates near zero on benchmarks with little clean accuracy loss.
-
Where are they looking in the operating room?
Gaze-following models on extended 4D-OR and Team-OR datasets reach F1 scores of 0.92 for clinical role prediction and 0.95 for surgical phase recognition while improving team communication detection by over 30%.
-
Explicit Dropout: Deterministic Regularization for Transformer Architectures
Explicit dropout reformulates stochastic dropout as deterministic loss penalties for Transformers, matching or exceeding standard performance with independent control per component.
-
Random Walk on Point Clouds for Feature Detection
RWoDSN extracts feature points from point clouds via a novel DSN descriptor and random walk graph analysis, reporting 22% higher recall than prior state-of-the-art with 0.784 precision.