WildBox provides over 237k 3D wildlife annotations from drone video. Benchmarks reveal zero-shot 3D detection at 0 AP, rising to 8.68 AP-BEV and 13.17 AP3D after fine-tuning, with depth estimation causing most errors.
Emogen: Emotional image content generation with text-to-image diffusion models
Canonical reference. 91% of citing Pith papers cite this work as background.
representative citing papers
SpheRoPE modifies rotary position embeddings in diffusion transformers to enforce spherical topology for zero-shot 360 panorama generation across multiple backbones.
RESOLVE provides a controlled multi-resolution LiDAR and camera benchmark for evaluating 3D detection and tracking under point sparsity variations in roadside cooperative perception.
A method to decompose 3D Gaussian splats into independent albedo and shading components for consistent texture editing in radiance fields.
An asynchronous architecture decouples incremental voxel-based mapping from VLM-based semantic enrichment to produce queryable open-vocabulary 3D scene graphs that match or exceed prior methods on segmentation and grounding benchmarks.
MLLMs drop from over 85% accuracy on action presence to under 50% on matched action-denial videos, exposing a causal verification gap that causal graph prompts partially close.
A regularization technique that treats diffusion model outputs as a similarity kernel during material optimization in inverse rendering, enabling joint reconstruction of geometry, materials, and illumination that satisfies the rendering equation and generalizes to new lighting.
Introduces VG-GUIBench benchmark and TASKER keyframe extraction algorithm that improves performance on VideoQA and video-guided agentic tasks.
ScaLe-INR is a multi-branch INR architecture that applies directional scaling per the Fourier inverse theorem and a directional edge guidance loss to disentangle scales and improve reconstruction fidelity.
MATCH is the first flow matching method for multi-view anomaly detection, reporting SOTA results on Real-IAD and the first comprehensive evaluation on MANTA-Tiny while enabling real-time use by omitting the divergence term.
GeoFidelity-Bench shows text-to-image models gain city-level plausibility from local names but achieve near-zero improvement in exact segment identity, with GPS coordinates adding no benefit.
Arbor attaches constraint mesh tokens to a frozen text-to-3D denoiser to enable controllable generation obeying hull, avoidance, and touch constraints.
Target dynamics provide an intrinsic source of variation equivalent to controlled illumination changes, enabling scattering-compensated reconstruction of dynamic scenes with one acquisition per frame in holographic and fluorescence imaging.
The paper defines the 4DVLT task for worldline-centered 4D scene understanding, releases Instruct-4D with 129.4K QA pairs, and presents 4DTrack achieving 62.68 TGA_Top1, outperforming adapted baselines by 19.62 points.
FLM-Occ reformulates indoor occupancy prediction as feed-forward likelihood maximization over a mixture model with volume-normalized weights, achieving superior accuracy on Occ-ScanNet using only 32 superquadrics.
HERO maps DNA methylation and miRNA to a 16-dimensional intent vector for TF-IDF caption retrieval and cosine-gated repair in VLM-based multi-task breast cancer prediction, claiming SOTA on TCGA-BRCA.
StylisticBias benchmark shows 15 visual attributes explain nearly 80% of bias variation in six MLLMs by isolating single cues like age and fashion in generated images.
CloudLULC-Net is an end-to-end heterogeneous SAR-optical fusion network for LULC mapping under cloud contamination that achieves 86.60% OA, 83.29% F1, and 73.51% mIoU on a new global benchmark of 40,223 samples.
A two-stage generative model (Graph CVAE + flow matching) learns topology-agnostic motion codes from a new 5k-topology dataset and retargets video motion to arbitrary unseen skeletons.
FisherAdapTune uses temporal drift in Fisher geometry, measured by scale-invariant Jensen-Shannon distance, to progressively freeze stabilized parameter groups during fine-tuning, reporting gains on segmentation and zero-shot transfer.
An ILP-based oracle applied to seven VIS methods on YouTube-VIS and OVIS shows tracking instability as the dominant bottleneck, producing gaps exceeding 20 AP under occlusion while classification impact is secondary.
Attributed Feature Graphs (AFGs) represent CAD features as attributed nodes and relations as directed edges to enable GNN surrogate models that predict design performance with feature-level interpretability on the CarHoods10K dataset.
Empirical study of five LVR variants finds cosine alignment negatively correlates with accuracy (r=-0.94), supervised latents are bypassed under corruption (max 4-point shift), and answers are decodable downstream but not at the latent.
OTP-FM extends conditional flow matching by incorporating dynamic optimal transport potentials to enable efficient multimarginal transport learning with intermediate observed marginals.
citing papers explorer
- Multimarginal flow matching with optimal transport potentials
OTP-FM extends conditional flow matching by incorporating dynamic optimal transport potentials to enable efficient multimarginal transport learning with intermediate observed marginals.
- Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging
MERIT enables decentralized instruction tuning via conflict-aware PCA splitting and parameter-space merging, raising average benchmark scores above joint training on multimodal and text mixtures.
- Navigating Potholes with Geometry-Aware Sharpness Minimization
LLQR+SAM pairs a slow learned geometry preconditioner with fast SAM perturbations to amplify escape from locally sharp 'potholes' while stabilizing flat basins, producing consistent gains over SAM and LLQR alone.
- MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion
MuteBench evaluates how robust multimodal fusion is to missing modalities and to missing data within a modality, using 125,000 samples from 9 clinical datasets, and finds that architecture family predicts tolerance better than parameter count.
- MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
- Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization
STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving the highest hypervolume in eight of nine protein engineering evaluations.
- The Geometry of Phase Transitions in Generative Dynamics via Projection Caustics
The paper links phase-transition behavior in continuous generative samplers to projection caustics in the data geometry and introduces the Critical Boundary Detector as a diagnostic tool.
- Silent Failures in Federated Personalization of Foundation Models
Federated personalization of foundation models creates hard-to-detect trustworthiness failures due to privacy constraints, and existing benchmarks cannot adequately evaluate them.
- Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
- Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models
Entropy-based test-time compute (ETTC) in VLM ensembles outperforms majority voting by prioritizing high-confidence predictions from stronger models.
- Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models
MAP-RPS and LMAP-RPS enable stage-wise D-P traversal in diffusion-based zero-shot inverse problems via MAP initialization followed by re-noised posterior sampling, supported by theoretical analysis.
- HypergraphFormer: Learning Hypergraphs from LLMs for Editable Floor Plan Generation
HypergraphFormer trains LLMs via supervised fine-tuning to generate hypergraph textual representations for floor plans, claiming better performance than raster or vector methods on RPLAN and a new out-of-distribution dataset while enabling arbitrary boundaries and high editability.
- Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
- Scaling Laws for Mixture Pretraining Under Data Constraints
Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.
- Generalized Category Discovery in Federated Graph Learning
GCD-FGL mitigates neighborhood absorption and global semantic inconsistency in federated generalized category discovery, delivering +4.86 average HRScore gain over baselines on five graph datasets.
- PrismAgent: Illuminating Harm in Memes via a Zero-Shot Interpretable Multi-Agent Framework
PrismAgent deploys four specialized LLM agents in sequence to analyze meme intent, gather context, make preliminary judgments, and deliver a final harm verdict, outperforming prior zero-shot methods on three public datasets.
- The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density clustering.
- 3D Masked Autoencoders are Robust Learners of Volumetric and Multimodal Cellular Representations for Microscopy
3D masked autoencoders with multimodal alignment to ESM2 outperform 2D variants on single-cell microscopy tasks, reaching ROC-AUC 0.865 on protein-protein interaction and state-of-the-art AUC_micro 0.952 on localization.
- Fitting Unknown Number of Hyperplanes with Manifold Optimization
A two-stage manifold optimization method on the sphere uses Riemannian EM with a heavy-tailed kernel and projected density initialization to fit an unknown number of hyperplanes, claiming better geometric accuracy than prior baselines.
- Pruning Deep Neural Networks via the Marchenko-Pastur Distribution
Marchenko-Pastur random-matrix pruning of DNNs yields theoretical certificates for accuracy preservation under small fine-tuning and empirical ImageNet results with 50-60% MAC reduction and sub-2pp accuracy drops on ViT and CNN models.
- Representation Gap: Explaining the Unreasonable Effectiveness of Neural Networks from a Geometric Perspective
Derives an asymptotic equivalent for the Representation Gap in equivariant diffusion models, showing it depends primarily on the intrinsic dimension of the task.
- Position: State-of-the-Art Claims Require State-of-the-Art Evidence
SOTA claims based on aggregate benchmark scores frequently lack evidence for true model superiority beyond marginal mean improvements.
- Ranking-Aware Calibration for Reliable Multimodal Reinforcement Learning
RAC adds ranking-aware group loss and clean-corrupted pairwise loss to RL post-training to boost both accuracy and calibration in multimodal reasoning without extra annotations.
- SE-GA: Memory-Augmented Self-Evolution for GUI Agents
SE-GA combines Test-Time Memory Extension for dynamic context retrieval with Memory-Augmented Self-Evolution training to reach 89.0% on ScreenSpot and 75.8% on AndroidControl-High.
- TailLoR: Protecting Principal Components in Parameter-Efficient Continual Learning
TailLoR applies low-rank updates to the singular value matrix of pre-trained weights while using a soft spectral penalty to protect dominant singular directions during continual learning.
- When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet
A multiplication-only truncated Neumann approximation for matrix inversion in quantized Gated DeltaNet linear attention delivers up to 5x kernel speedup and 20% decode overhead reduction while preserving accuracy on Qwen3.5 models.
- INAR-VL: Input-Aware Routing for Edge-Cloud Vision-Language Inference
INAR-VL routes 36% of visual question answering requests to the edge using lightweight complexity signals, cutting latency 24% and energy 26% while retaining 97% of cloud accuracy.
- Gated-SwinRMT: Unifying Swin Windowed Attention with Retentive Manhattan Decay via Input-Dependent Gating
Gated-SwinRMT unifies Swin windowed attention with retentive Manhattan decay via gating, reaching 80.22% top-1 accuracy on Mini-ImageNet versus 73.74% for the RMT baseline.
- MidSteer: Optimal Affine Framework for Steering Generative Models