SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
End-to-end object detection with transformers
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
FingerEye delivers continuous vision-tactile sensing via binocular RGB cameras and marker-tracked compliant ring deformation, supporting imitation learning policies that generalize across object variations for tasks like coin standing and syringe manipulation.
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
XDecomposer uses set prediction and phase-query decomposition to jointly identify phases and reconstruct multiphase PXRD patterns without priors.
ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to 40% on hallucination benchmarks and 20% under adversarial perturbations.
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
citing papers explorer
-
SAM 3D Animal: Promptable Animal 3D Reconstruction from Images in the Wild
SAM 3D Animal is the first promptable framework for multi-animal 3D reconstruction from single images, built on SMAL+ and trained on the new Herd3D dataset, achieving SOTA results on Animal3D, APTv2, and Animal Kingdom benchmarks.
-
FingerEye: Continuous and Unified Vision-Tactile Sensing for Dexterous Manipulation
FingerEye delivers continuous vision-tactile sensing via binocular RGB cameras and marker-tracked compliant ring deformation, supporting imitation learning policies that generalize across object variations for tasks like coin standing and syringe manipulation.
-
LACE: Latent Visual Representation for Cross-Embodiment Learning
LACE aligns human-robot visual features via semantic distribution matching on corresponding body parts plus Gram loss, yielding 65% better zero-shot policy transfer than baseline DINO.
-
Invaria: Learning Scale and Density Invariance in Point Clouds via Next-Resolution Prediction
Invaria trains point cloud encoders with next-resolution prediction to learn scale and density invariant features, yielding higher mIoU on ScanNet under lower resolution and scaled objects while using a smaller model.
-
XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction
XDecomposer uses set prediction and phase-query decomposition to jointly identify phases and reconstruct multiphase PXRD patterns without priors.
-
ViCrop-Det: Spatial Attention Entropy Guided Cropping for Training-Free Small-Object Detection
ViCrop-Det uses spatial attention entropy from the decoder to dynamically crop and refine small-object regions in transformer detectors during inference.
-
ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
ORCA is an agentic reasoning framework that enhances factual accuracy and adversarial robustness of pretrained LVLMs via an Observe-Reason-Critique-Act loop with small vision models, reporting accuracy gains of up to 40% on hallucination benchmarks and 20% under adversarial perturbations.
-
Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Random label bridge training aligns LLM parameters with vision tasks, and partial training of certain layers often suffices due to their foundational properties.
- Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification