Hierarchical task learning from language instructions with unified transformers and self- monitoring

Yichi Zhang, Joyce Chai · 2021 · arXiv 2106.03427

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

representative citing papers

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.

PaLM-E: An Embodied Multimodal Language Model

cs.LG · 2023-03-06 · conditional · novelty 6.0

PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive transfer from joint training on language and robotics data.

Environmental Understanding Vision-Language Model for Embodied Agent

cs.CV · 2026-04-21 · unverdicted · novelty 5.0

EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

citing papers explorer

Showing 3 of 3 citing papers.

ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints cs.AI · 2026-04-16 · unverdicted · none · ref 23
ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.
PaLM-E: An Embodied Multimodal Language Model cs.LG · 2023-03-06 · conditional · none · ref 40
PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive transfer from joint training on language and robotics data.
Environmental Understanding Vision-Language Model for Embodied Agent cs.CV · 2026-04-21 · unverdicted · none · ref 48
EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.

Hierarchical task learning from language instructions with unified transformers and self- monitoring

fields

years

verdicts

representative citing papers

citing papers explorer