pith. sign in

hub Canonical reference

Vln-r1: Vision-language navigation via reinforcement fine-tuning.arXiv preprint arXiv:2506.17221

Canonical reference. 75% of citing Pith papers cite this work as background.

19 Pith papers citing it
Background 75% of classified citations

hub tools

citation-role summary

background 6 baseline 2

citation-polarity summary

years

2026 18 2025 1

clear filters

representative citing papers

World Models as Group Actions

cs.CV · 2026-05-23 · unverdicted · novelty 7.0

Formalizes video world models as group actions on states and uses latent regularization with synthesized supervision to enforce consistency, introducing GAC and GAR metrics that improve structural correctness in SOTA models.

WorldVLN: Autoregressive World Action Model for Aerial Vision-Language Navigation

cs.RO · 2026-05-15 · unverdicted · novelty 7.0

WorldVLN proposes the first autoregressive world action model for aerial vision-language navigation that predicts short-horizon latent world states, decodes them to waypoints in closed loop, and uses two-stage training with Action-aware GRPO to achieve over 12% success-rate gains on benchmarks plus零

Token Warping Helps MLLMs Look from Nearby Viewpoints

cs.CV · 2026-04-03 · unverdicted · novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search

cs.CV · 2026-05-09 · unverdicted · novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency in 360° environments.

HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation

cs.AI · 2026-04-09 · unverdicted · novelty 6.0

HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

cs.CV · 2026-06-05 · unverdicted · novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.

citing papers explorer

Showing 19 of 19 citing papers.