Introduces the TVR active viewpoint-matching task and TVRBench indoor simulation benchmark, where foundation models start at low single-digit success rates but reach 51.4% after visual-action SFT and multi-turn GRPO post-training.
ObjectNav Revisited: On Evaluation of Embodied Agents Navigating to Objects
Canonical reference. 83% of citing Pith papers cite this work as background.
representative citing papers
LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
HM3D offers 1000 building-scale 3D environments that are larger and higher-fidelity than existing datasets, enabling better-performing embodied AI agents for tasks like PointGoal navigation.
POINav-Bench provides the first high-fidelity real-world benchmark for POI-goal VLN using 3DGS reconstructions of 126k m² with 163 POIs, supported by a Brain-Action framework and 70K real signage-entrance dataset.
IntentionNav is a new benchmark showing that VLMs infer intended targets from implicit instructions in 48% of cases but achieve only 25% terminal success and 5.5% grounded success in active navigation.
SynthFun3D generates synthetic 3D functionality segmentation data from action descriptions via object retrieval and scene arrangement, yielding consistent gains of +2.2 mAP, +6.3 mAR, and +5.7 mIoU when augmenting real data for VLM training.
A zero-shot unified agent handles VLN-CE, ObjectNav, EQA and Aerial-VLN on wheeled, quadruped, humanoid and UAV platforms, translating language and vision inputs into actions via MLLMs plus TDM and SCB mechanisms and matching trained foundation models on multiple benchmarks.
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
Node-wise beam search with expected gain and RRAG graph construction outperforms prior active perception methods by at least 20% on representative tasks.
ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.
Habitat-GS integrates 3D Gaussian Splatting scene rendering and Gaussian avatars into Habitat-Sim, yielding agents with stronger cross-domain generalization and effective human-aware navigation.
A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.
HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.
ReMemNav improves zero-shot object navigation success and efficiency by integrating episodic memory and rethinking with VLMs, achieving SR/SPL gains of 1.7%/7.0% on HM3D v0.1, 18.2%/11.1% on HM3D v0.2, and 8.7%/7.9% on MP3D.
MerNav's Memory-Execute-Review framework improves success rates in zero-shot object goal navigation by 5-8% over baselines on four datasets while outperforming both training-free and supervised methods on key benchmarks.
C-Nav is a continual visual navigation framework with dual-path anti-forgetting via feature distillation and replay plus adaptive sampling that outperforms baselines on a new continual object navigation benchmark while using less memory.
Transit-Aware Planning (TAP) enriches navigation policies with object transit data on Dynamic Object Maps, raising success rates by 21.1% in MP3D simulation and 18.3% in real-world tests for finding non-stationary targets.
Qwen-RobotNav provides a parameterized navigation model trained on 15.6M samples with vision-language co-training that achieves SOTA results on benchmarks and zero-shot transfer to real robots.
STEM develops a semantically-guided combinatorial planner and active perception pipeline that propagates object priorities to frontier voxels, enabling MAVs to find targets faster than baselines in simulation and real-world tests.
TravExplorer couples zero-shot semantic guidance with traversability-aware 3-D planning to enable cross-floor object navigation in unseen indoor environments.
CLUE adaptively weights room-type and object-co-location cues from an LLM to construct a unified semantic value map that improves success rate and efficiency in zero-shot object-goal navigation.
MiniVLA-Nav v1 provides 1,174 episodes of language-instructed robot navigation in photorealistic simulations with RGB, depth, segmentation, and expert action data.
The paper defines Agent AI as interactive multimodal systems that perceive grounded data and generate embodied actions, arguing this approach can mitigate hallucinations in foundation models.