
arXiv: 2605.18556 · v1 · pith:5VQMO4GK · submitted 2026-05-18 · 💻 cs.RO · cs.AI

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Pith reviewed 2026-05-20 09:05 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords: embodied manipulation, key-gram, external memory, vision-language-action, compositional instructions, robot control, knowledge extension, transfer learning

The pith

Key-Gram decouples linguistic knowledge from visual reasoning in embodied policies using an external memory of key-grams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Key-Gram to address the coupling of language and visual computation in current vision-language-action policies. It introduces a memory module that decomposes instructions into task-specific key-grams and retrieves static linguistic priors via deterministic hashed lookup. These entries are then injected into selected hidden layers through context-aware gating and lightweight convolutional fusion. This separation lets the backbone focus on visual reasoning and action inference while linguistic knowledge stays in an extensible external store. Experiments demonstrate consistent gains across simulation benchmarks and real-world dual-arm manipulation tasks for two different backbones.

Core claim

Key-Gram is a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning by decomposing an instruction into task-specific key-grams, retrieving static linguistic priors through deterministic hashed lookup, and injecting the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion, allowing the backbone to devote its main capacity to visual reasoning and action inference while reusable instruction knowledge is stored in an extensible external memory.

What carries the argument

Memory module that decomposes instructions into task-specific key-grams, retrieves linguistic priors via deterministic hashed lookup, and injects them into hidden layers via context-aware gating and convolutional fusion.
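The lookup-and-gate path described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the table size, vector width, SHA-256 hashing, and scalar sigmoid gate are all hypothetical choices, and the lightweight convolutional fusion step is omitted for brevity.

```python
import hashlib
import math

TABLE_SIZE = 8  # hypothetical number of memory slots
DIM = 4         # hypothetical hidden width

# Hypothetical static memory table: one prior vector per slot.
MEMORY = [[0.1 * (slot + 1)] * DIM for slot in range(TABLE_SIZE)]


def slot_for(key_gram):
    """Deterministically hash a key-gram to a table slot (same text, same slot)."""
    digest = hashlib.sha256(key_gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % TABLE_SIZE


def lookup(key_gram):
    """Constant-time retrieval: one hash plus one index operation."""
    return MEMORY[slot_for(key_gram)]


def gate(hidden, prior):
    """Assumed context-aware gate: sigmoid of the scaled dot product."""
    score = sum(h * p for h, p in zip(hidden, prior)) / math.sqrt(len(hidden))
    return 1.0 / (1.0 + math.exp(-score))


def inject(hidden, key_grams):
    """Add gated priors to a hidden state; with no key-grams it is returned unchanged."""
    out = list(hidden)
    for key_gram in key_grams:
        prior = lookup(key_gram)
        g = gate(hidden, prior)
        out = [o + g * p for o, p in zip(out, prior)]
    return out


hidden_state = [1.0, 0.0, -1.0, 0.5]
print(inject(hidden_state, ["pick up bowl", "place bowl on plate"]))
```

The gate depends on the current hidden state, so the same retrieved prior can contribute more or less depending on context, which is one plausible reading of "context-aware".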

If this is right

  • Improves both π0 and π0.5 backbones, with average relative gains of 29.5 percent and 9.9 percent, respectively, on RoboTwin2.0.
  • Achieves gains of 35.8 percent (π0) and 4.5 percent (π0.5) on LIBERO-Plus transfer without target-domain fine-tuning.
  • Delivers gains of 15.4 percent (π0) and 8.1 percent (π0.5) on real-world long-horizon dual-arm tasks.
  • Allows the logical memory table to be partitioned during training and placed on host memory with O(1) lookup at inference time.
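The last point, a logical table that is partitioned yet still looked up in constant time, could be sketched as follows. The shard count, slot count, and hashing scheme are assumptions made for illustration; the page does not say how the paper lays out its partitions.

```python
import hashlib

NUM_SLOTS = 12   # hypothetical logical table size
NUM_SHARDS = 3   # hypothetical number of partitions
SHARD_SIZE = NUM_SLOTS // NUM_SHARDS


def logical_slot(key_gram):
    """Deterministic hash of a key-gram into the logical table."""
    digest = hashlib.sha256(key_gram.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SLOTS


def locate(slot):
    """Map a logical slot to (shard index, offset inside that shard)."""
    return divmod(slot, SHARD_SIZE)


# Each shard is a separate list. For the demo, every entry is a
# one-element vector that simply records its own logical slot id.
shards = [
    [[float(shard * SHARD_SIZE + offset)] for offset in range(SHARD_SIZE)]
    for shard in range(NUM_SHARDS)
]


def lookup(key_gram):
    """One hash, one divmod, two index operations: no search over the table."""
    shard, offset = locate(logical_slot(key_gram))
    return shards[shard][offset]


print(lookup("place bowl on plate"))
```

In a real system each shard might sit on a different device during training or in host RAM at inference; that placement is outside what this sketch models.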

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Editing the memory table alone could add new world knowledge to a deployed policy without any backbone retraining.
  • The constant-time lookup pattern may allow the same architecture to scale to much larger instruction sets in real-time control.
  • Partitioning the memory table by domain during training could support rapid adaptation to new task families.

Load-bearing premise

That instructions can be decomposed into task-specific key-grams, and that the static linguistic priors retrieved for them through deterministic hashed lookup can be injected into selected hidden layers via context-aware gating, without losing critical information or introducing new interference with visual reasoning.

What would settle it

A controlled experiment in which Key-Gram is added to the π0 or π0.5 backbone and produces no improvement or a measurable drop in success rates on RoboTwin2.0 or LIBERO-Plus would show that the injection step fails to enhance or actively harms visual reasoning.
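A minimal sketch of how such an outcome could be scored, assuming relative gain is defined as the percentage change in success rate over the base backbone. The function names, the tolerance parameter, and the example rates are hypothetical and are not taken from the paper.

```python
def relative_gain(base_rate, keygram_rate):
    """Relative change (in percent) of the Key-Gram variant over its base backbone."""
    if base_rate <= 0:
        raise ValueError("base success rate must be positive")
    return 100.0 * (keygram_rate - base_rate) / base_rate


def verdict(base_rate, keygram_rate, tolerance=0.0):
    """Classify the outcome of the controlled comparison."""
    gain = relative_gain(base_rate, keygram_rate)
    if gain > tolerance:
        return "improvement"
    if gain < -tolerance:
        return "drop"
    return "no improvement"


# Made-up success rates (%), purely for illustration.
print(relative_gain(40.0, 50.0), verdict(40.0, 50.0))
print(relative_gain(50.0, 40.0), verdict(50.0, 40.0))
```

A nonzero tolerance would let small differences within run-to-run noise count as "no improvement" rather than as a gain or a drop.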

Figures

Figures reproduced from arXiv: 2605.18556 by Botao Ren, Jingjing Fan, Siyuan Li, Zhidong Deng.

Figure 1: Overview of Key-Gram. (a) The framework separates language-derived knowledge ...
Figure 2: Extensible memory allocation of Key-Gram. The memory is a logical table composed of ...
Figure 3: Demonstrations show the execution process of ...
Figure 4: Qualitative examples from real-world expansion tasks. Both ...
Figure 5: Layer-placement ablation on RoboTwin2.0. Shaded curves denote the weighted task score, ...
Original abstract

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $\pi_{0}$ and $\pi_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Key-Gram, a conditional-memory framework for embodied manipulation policies that decouples language-derived world knowledge from visual-state reasoning. Instructions are decomposed into task-specific key-grams whose static linguistic priors are retrieved via deterministic hashed lookup and injected into selected hidden layers of the backbone (π0 or π0.5) through context-aware gating plus lightweight convolutional fusion. The external memory is claimed to be extensible and O(1) lookup efficient. Empirical results report average relative gains of 29.5%/9.9% on RoboTwin2.0, 35.8%/4.5% on LIBERO-Plus zero-shot transfer, and 15.4%/8.1% on real-world long-horizon dual-arm tasks.

Significance. If the central mechanism is shown to deliver the claimed separation without modality interference or capacity-driven artifacts, the work would offer a practical route to extensible linguistic priors in vision-language-action models, reducing the cost of knowledge updates and improving compositional transfer. The reported gains on standard benchmarks and real-world tasks would be noteworthy for the robotics community if properly controlled.

major comments (3)
  1. [§3.2] §3.2 (Context-aware gating and fusion): The manuscript provides no layer-wise activation analysis, content-controlled ablations (e.g., random vs. retrieved key-grams), or interference metrics to verify that the injected priors leave visual reasoning intact and do not introduce modality competition. This is load-bearing for the claim that externalization, rather than added parameters or fusion capacity, drives the reported gains.
  2. [§4] §4 (Experimental protocol): Relative performance gains are stated without reporting number of random seeds, statistical significance tests, error bars, exact baseline implementations, or controls that isolate the contribution of the memory module versus the added gating/fusion parameters. This prevents assessment of whether the data support the mechanism-level claims.
  3. [§4.3] §4.3 (Ablation studies): No ablation removes the retrieved linguistic content while retaining the gating and fusion architecture, leaving open the possibility that performance improvements stem from architectural capacity rather than the extensible memory design.
minor comments (2)
  1. Notation for π0 and π0.5 backbones should be defined on first use and cross-referenced to the original papers.
  2. Figure 2 (memory table visualization) would benefit from an explicit legend distinguishing hashed keys from retrieved priors.
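The reporting requested in major comment 2 (seeds, error bars, a significance test) could look roughly like the sketch below. The per-seed success rates are invented for illustration and do not come from the paper. Welch's t-statistic is used here as one reasonable choice; converting it to a p-value needs the t-distribution CDF, which the Python standard library does not provide, so that step is left out.

```python
import math
import statistics

# Hypothetical per-seed success rates (%), five seeds per method. NOT from the paper.
baseline = [40.0, 42.0, 38.0, 41.0, 39.0]
key_gram = [50.0, 52.0, 49.0, 51.0, 48.0]


def summarize(rates):
    """Mean and sample standard deviation, the values an error bar would show."""
    return statistics.mean(rates), statistics.stdev(rates)


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


for name, rates in (("baseline", baseline), ("key-gram", key_gram)):
    mean, std = summarize(rates)
    print(f"{name}: {mean:.1f} +/- {std:.2f} over {len(rates)} seeds")
print(f"Welch t = {welch_t(key_gram, baseline):.2f}")
```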

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments in detail below, indicating the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Context-aware gating and fusion): The manuscript provides no layer-wise activation analysis, content-controlled ablations (e.g., random vs. retrieved key-grams), or interference metrics to verify that the injected priors leave visual reasoning intact and do not introduce modality competition. This is load-bearing for the claim that externalization, rather than added parameters or fusion capacity, drives the reported gains.

    Authors: We agree that demonstrating the lack of modality interference is important for validating our central claim. In the revised version, we will add layer-wise activation analysis showing the impact of key-gram injection on visual features. Additionally, we will include content-controlled ablations using random key-grams and report quantitative interference metrics, such as the change in visual feature norms and cross-modal attention scores. These will help confirm that the gains arise from the externalized knowledge rather than capacity increases. revision: yes

  2. Referee: [§4] §4 (Experimental protocol): Relative performance gains are stated without reporting number of random seeds, statistical significance tests, error bars, exact baseline implementations, or controls that isolate the contribution of the memory module versus the added gating/fusion parameters. This prevents assessment of whether the data support the mechanism-level claims.

    Authors: We acknowledge the need for more rigorous statistical reporting. The experiments were run with 5 random seeds; we will report mean and standard deviation with error bars in the updated figures. We will also include statistical significance tests (e.g., t-tests) comparing Key-Gram to baselines. We will clarify the baseline implementations by referencing the exact code versions and hyperparameters used. To isolate the memory contribution, we plan to add a control where the fusion modules are active but fed with non-informative inputs. revision: yes

  3. Referee: [§4.3] §4.3 (Ablation studies): No ablation removes the retrieved linguistic content while retaining the gating and fusion architecture, leaving open the possibility that performance improvements stem from architectural capacity rather than the extensible memory design.

    Authors: This observation is correct, and we will address it by adding the requested ablation in the revised Section 4.3. Specifically, we will train and evaluate a variant where the key-gram lookup returns empty or random vectors, while keeping the gating and convolutional fusion layers intact. The performance difference between this variant and the full Key-Gram will quantify the benefit of the linguistic content over mere architectural additions. revision: yes
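The content-controlled ablation promised in these responses can be pictured as swapping only the lookup result while the gating path stays the same. The sketch below is a toy illustration under assumptions: the table, the hash, the scalar sigmoid gate, and the mode names are hypothetical, and the convolutional fusion layer is not modeled.

```python
import hashlib
import math
import random

DIM = 4         # hypothetical hidden width
TABLE_SIZE = 8  # hypothetical number of memory slots
MEMORY = [[0.1 * (slot + 1)] * DIM for slot in range(TABLE_SIZE)]


def retrieved(key_gram):
    """Full Key-Gram condition: deterministic hashed lookup."""
    digest = hashlib.sha256(key_gram.encode("utf-8")).digest()
    return MEMORY[int.from_bytes(digest[:8], "big") % TABLE_SIZE]


def control_lookup(key_gram, mode, rng):
    """Return the real prior, an empty (zero) vector, or a random vector."""
    if mode == "retrieved":
        return retrieved(key_gram)
    if mode == "empty":
        return [0.0] * DIM
    if mode == "random":
        return [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    raise ValueError(f"unknown mode: {mode}")


def inject(hidden, prior):
    """Gating path kept identical across all three conditions."""
    score = sum(h * p for h, p in zip(hidden, prior)) / math.sqrt(len(hidden))
    g = 1.0 / (1.0 + math.exp(-score))
    return [h + g * p for h, p in zip(hidden, prior)]


rng = random.Random(0)  # fixed seed so the random control is reproducible
hidden_state = [1.0, 0.0, -1.0, 0.5]
for mode in ("retrieved", "empty", "random"):
    print(mode, inject(hidden_state, control_lookup("pick up bowl", mode, rng)))
```

Because the architecture and parameter count are the same in all three modes, any performance gap between "retrieved" and the two controls would be attributable to the linguistic content rather than to added capacity.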

Circularity Check

0 steps flagged

No significant circularity: empirical gains from design choice, not self-referential derivation

Full rationale

The paper introduces Key-Gram as an architectural design that decomposes instructions into key-grams, retrieves priors via hashed lookup, and injects them via gating and fusion to separate linguistic memory from visual reasoning. Reported improvements (e.g., 29.5%/9.9% on RoboTwin2.0) are presented as outcomes of experiments on standard benchmarks rather than predictions derived from equations or first principles. No load-bearing step reduces a claimed result to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled through prior work. The central mechanism is a proposed engineering separation whose effectiveness is tested externally on held-out tasks and real-world scenarios, keeping the derivation self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Based solely on the abstract, no specific free parameters, axioms, or invented entities with independent evidence are detailed. The framework itself introduces a new memory structure for linguistic priors as the core contribution.

invented entities (1)
  • key-grams (no independent evidence)
    purpose: Task-specific decomposition of language instructions for deterministic memory retrieval
    Introduced in the abstract as the core mechanism for breaking down instructions, but no independent validation or external evidence is provided.

pith-pipeline@v0.9.0 · 5837 in / 1258 out tokens · 47342 ms · 2026-05-20T09:05:28.309582+00:00 · methodology


