Eagle 2: Building post-training data strategies from scratch for frontier vision-language models

Zhiqi Li, Guo Chen, Shilong Liu, Shihao Wang, Vibashan VS, Yishen Ji, Shiyi Lan, Hao Zhang, Yilin Zhao, Subhashree Radhakrishnan, et al. · 2025 · arXiv 2501.14818

Canonical reference. 100% of citing Pith papers cite this work as background.

14 Pith papers citing it

Background: 100% of classified citations

citation-role summary

background: 6

citation-polarity summary

background: 6

representative citing papers

Being-H0.7: A Latent World-Action Model from Egocentric Videos

cs.RO · 2026-04-30 · unverdicted · novelty 7.0

Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

Cambrian-P: Pose-Grounded Video Understanding

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.

$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills

cs.RO · 2026-04-27 · unverdicted · novelty 6.0

M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

cs.RO · 2026-04-20 · unverdicted · novelty 6.0

State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.

MLLM-as-a-Judge Exhibits Model Preference Bias

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

cs.RO · 2026-04-05 · unverdicted · novelty 6.0

Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

FLARE: Robot Learning with Implicit World Modeling

cs.RO · 2025-05-21 · unverdicted · novelty 6.0

FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

cs.CV · 2025-04-14 · conditional · novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.

SmolVLM: Redefining small and efficient multimodal models

cs.AI · 2025-04-07 · unverdicted · novelty 6.0

SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

cs.RO · 2025-03-18 · unverdicted · novelty 6.0

GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

cs.CV · 2025-07-22 · unverdicted · novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.

PLaMo 2.1-VL Technical Report

cs.CV · 2026-04-21 · unverdicted · novelty 4.0

PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.

citing papers explorer

Showing 14 of 14 citing papers.

Being-H0.7: A Latent World-Action Model from Egocentric Videos · cs.RO · 2026-04-30 · unverdicted · none · ref 28
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training · cs.CV · 2026-04-21 · unverdicted · none · ref 13
EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning · cs.CV · 2025-07-08 · conditional · none · ref 18
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
Cambrian-P: Pose-Grounded Video Understanding · cs.CV · 2026-05-21 · unverdicted · none · ref 52
Cambrian-P adds per-frame camera pose tokens and a regression head to video MLLMs, delivering 4.5-6.5% gains on spatial benchmarks, generalization to other video QA tasks, and SOTA streaming pose estimation on ScanNet.
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills · cs.RO · 2026-04-27 · unverdicted · none · ref 24
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models · cs.RO · 2026-04-20 · unverdicted · none · ref 27
State-of-the-art vision-language-action models catastrophically fail dynamic embodied reasoning due to lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse caused by architectural bottlenecks, as shown by the new BeTTER benchmark with real-world validation.
MLLM-as-a-Judge Exhibits Model Preference Bias · cs.CV · 2026-04-13 · unverdicted · none · ref 31
MLLMs show self-preference bias and family-level mutual bias when judging captions; Philautia-Eval quantifies it and Pomms ensemble reduces it.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models · cs.RO · 2026-04-05 · unverdicted · none · ref 22
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
FLARE: Robot Learning with Implicit World Modeling · cs.RO · 2025-05-21 · unverdicted · none · ref 17
FLARE integrates predictive latent world modeling into diffusion transformer policies for robots, delivering up to 26% gains on multitask manipulation benchmarks and enabling co-training with action-free human videos.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models · cs.CV · 2025-04-14 · conditional · none · ref 69
InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
SmolVLM: Redefining small and efficient multimodal models · cs.AI · 2025-04-07 · unverdicted · none · ref 22
SmolVLM-256M outperforms a 300-times larger model using under 1 GB GPU memory, while the 2.2B version matches state-of-the-art VLMs at half the memory cost.
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots · cs.RO · 2025-03-18 · unverdicted · none · ref 53
GR00T N1 is a new open VLA foundation model for humanoid robots that outperforms imitation learning baselines in simulation and shows strong performance on real-world bimanual manipulation tasks.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning · cs.CV · 2025-07-22 · unverdicted · none · ref 22
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.
PLaMo 2.1-VL Technical Report · cs.CV · 2026-04-21 · unverdicted · none · ref 6
PLaMo 2.1-VL reports 61.5 ROUGE-L on JA-VG-VQA-500, 85.2% on Japanese Ref-L4, 53.9% zero-shot factory accuracy, and raises anomaly detection F1 from 39.7 to 64.9 after fine-tuning.
