SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
Attention is all you need.Advances in neural information processing systems, 30
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 9verdicts
UNVERDICTED 9roles
background 2polarities
background 2representative citing papers
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
HAD uses multi-view reasoning from a pre-trained feedforward NVS network to estimate and mask hallucination scores in diffusion priors, reducing artifacts and achieving SOTA novel view synthesis in sparse-view 3D reconstruction.
EdgeVTP delivers the lowest measured end-to-end latency on Jetson-class platforms while matching or exceeding state-of-the-art accuracy on highway trajectory benchmarks by using bounded graph interactions and a one-shot curve decoder.
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.
A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
LangFlash introduces a feed-forward model for 3D language Gaussian splatting from sparse unposed images, claiming superior novel view synthesis and semantic consistency via enriched training data and sparse semantic encoding.
DietDelta uses vision-language prompts on paired before-and-after RGB images to localize food items, estimate their weights, and compute consumption differences, reporting better results than prior single-image methods on three public datasets.
citing papers explorer
-
SparseSplat: Towards Applicable Feed-Forward 3D Gaussian Splatting with Pixel-Unaligned Prediction
SparseSplat uses entropy-based probabilistic sampling and a specialized point cloud network to generate compact 3D Gaussian maps that retain high rendering quality with far fewer Gaussians than prior feed-forward methods.
-
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory reduction and 4x faster inference at SOTA quality.
-
HAD: Hallucination-Aware Diffusion Priors for 3D Reconstruction
HAD uses multi-view reasoning from a pre-trained feedforward NVS network to estimate and mask hallucination scores in diffusion priors, reducing artifacts and achieving SOTA novel view synthesis in sparse-view 3D reconstruction.
-
EdgeVTP: Exploration of Latency-efficient Trajectory Prediction for Edge-based Embedded Vision Applications
EdgeVTP delivers the lowest measured end-to-end latency on Jetson-class platforms while matching or exceeding state-of-the-art accuracy on highway trajectory benchmarks by using bounded graph interactions and a one-shot curve decoder.
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
-
FlexAvatar: Learning Complete 3D Head Avatars with Partial Supervision
FlexAvatar introduces bias sinks in a transformer to unify monocular and multi-view training, yielding complete 3D head avatars with strong generalization and view extrapolation from single images.
-
Eevee: Towards Close-up High-resolution Video-based Virtual Try-on
A new dataset with high-fidelity close-up garment images and full/close-up try-on videos plus the VGID metric enables better texture and structure preservation in high-resolution video virtual try-on.
-
LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images
LangFlash introduces a feed-forward model for 3D language Gaussian splatting from sparse unposed images, claiming superior novel view synthesis and semantic consistency via enriched training data and sparse semantic encoding.
-
DietDelta: A Vision-Language Approach for Dietary Assessment via Before-and-After Images
DietDelta uses vision-language prompts on paired before-and-after RGB images to localize food items, estimate their weights, and compute consumption differences, reporting better results than prior single-image methods on three public datasets.