V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation
10 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- ESARBench: A Benchmark for Agentic UAV Embodied Search and Rescue
ESARBench is the first unified benchmark for MLLM-driven UAV agents that must explore, locate clues, and decide on victim positions in photorealistic simulated SAR environments.
- A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features
FastForward represents scenes as collections of 3D-anchored image features and performs camera pose estimation via feed-forward correspondence prediction, achieving competitive accuracy with minimal mapping time.
- DINO-MVR: Multi-View Readout of Frozen DINOv3 for Annotation-Efficient Medical Segmentation
Frozen DINOv3 features with multi-view MLP probes, entropy-weighted fusion, and spatial regularization achieve 0.895 Dice on Kvasir-SEG, 0.897 on ISIC 2018, and 0.908 on BraTS FLAIR, recovering 98.4% of full-data performance with only five annotated patients.
- Moondream Segmentation: From Words to Masks
Moondream Segmentation achieves 80.2% cIoU on RefCOCO by autoregressively decoding paths from referring expressions and using RL to refine masks, plus releases a cleaned RefCOCO-M dataset.
- DSER: Spectral Epipolar Representation for Efficient Light Field Depth Estimation
DSER combines spectral epipolar regularization with a hybrid pipeline of gradient initialization, plane-sweeping, multiscale refinement, and occlusion-aware random walk to produce structurally consistent depth maps from light fields.
- Ada2MS: A Hybrid Optimization Algorithm Based on Exponential Mixing of Elementwise and Global Second-Moment Estimates
Ada2MS is a new optimizer that exponentially mixes elementwise and global second-moment estimates to interpolate between AdamW and momentum-SGD behaviors and reports competitive results on visual tasks under a unified protocol.
- Efficient 3D Content Reconstruction and Generation
Presents Instant3D for rapid text/image-to-3D generation via multi-view diffusion plus feed-forward reconstruction, and FastMap for 10x faster structure-from-motion with comparable accuracy.
- Dual-stream Spatio-Temporal GCN-Transformer Network for 3D Human Pose Estimation
MixTGFormer reports state-of-the-art 3D pose estimation errors of 37.6 mm on Human3.6M and 15.7 mm on MPI-INF-3DHP by using parallel GCN-Transformer streams with SE layers for local-global feature fusion.
- Hierarchical Awareness Adapters with Hybrid Pyramid Feature Fusion for Dense Depth Prediction
A multilevel perceptual CRF model using Swin Transformer, HPF fusion, HA adapters, and dynamic scaling attention achieves state-of-the-art monocular depth estimation on NYU Depth v2, KITTI, and Matterport3D with reduced error and fast inference.
- 4D Radar Semantic Segmentation of People in Field Conditions Using Temporal Multi-View Networks
TMVA4D uses CNN and ConvLSTM encoders on multi-view 2D projections of 4D radar point clouds for semantic segmentation of people, reporting Dice 75.9% and IoU 61.2% in field tests.