pith. machine review for the scientific record.

arxiv: 2501.17161 · v2 · submitted 2025-01-28 · 💻 cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 21:25 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords supervised fine-tuning · reinforcement learning · generalization · memorization · foundation models · post-training · arithmetic reasoning · visual navigation

The pith

Reinforcement learning enables foundation models to generalize to unseen variants while supervised fine-tuning leads to memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the distinct effects of supervised fine-tuning and reinforcement learning during the post-training of foundation models on their ability to generalize. Using a new arithmetic card game called GeneralPoints and the V-IRL navigation setting, it tests performance on rule changes and visual changes not seen during training. The key finding is that reinforcement learning with outcome-based rewards allows models to adapt to these new situations in both text and vision, while supervised fine-tuning makes models rely on memorizing specific training instances. This distinction matters for building models that can handle novel real-world scenarios rather than repeating what they saw in training. The analysis also shows that reinforcement learning boosts visual recognition skills, and that supervised fine-tuning is still required first to keep outputs in a consistent format so reinforcement learning can succeed.

Core claim

Reinforcement learning, particularly with outcome-based rewards, generalizes across both rule-based textual and visual variants in tasks like GeneralPoints and V-IRL. In contrast, supervised fine-tuning tends to memorize training data and struggles with out-of-distribution scenarios. Reinforcement learning enhances underlying visual recognition capabilities, but supervised fine-tuning is essential to stabilize output formats before reinforcement learning can achieve its gains.

What carries the argument

Outcome-based reward signals in reinforcement learning applied after supervised fine-tuning, tested on the GeneralPoints card game for arithmetic reasoning and V-IRL for real-world navigation.
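The outcome-based reward at the heart of the argument can be pictured as a binary verifier for GeneralPoints: did the model's arithmetic expression use the dealt cards and hit the target? This is a minimal sketch under stated assumptions (target value, card encoding, and checking rules are illustrative, not the paper's implementation):

```python
import re

def outcome_reward(cards, expression, target=24):
    """Binary outcome-based reward for a GeneralPoints-style round:
    1 if the proposed arithmetic expression uses exactly the four
    card values and evaluates to the target, else 0. A simplified
    stand-in for the verifier described in the paper."""
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(cards):
        return 0  # must use each card value exactly once
    try:
        value = eval(expression, {"__builtins__": {}})  # arithmetic only
    except Exception:
        return 0  # malformed expression earns no reward
    return int(abs(value - target) < 1e-6)
```

For instance, `outcome_reward([4, 6, 10, 10], "(10 - 4) * (10 - 6)")` returns 1, while an expression that ignores two of the cards returns 0. Because the reward scores only the outcome, any derivation that reaches the target is reinforced, which is the property the paper credits for RL's rule-level generalization.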

If this is right

  • RL-trained models perform well on unseen rule variants in text and new visual conditions without additional training.
  • SFT alone results in poor generalization to new scenarios in multi-modal tasks.
  • A two-stage process of SFT followed by RL combines format stability with generalization ability.
  • RL training improves core perceptual skills such as visual recognition that support broader task performance.
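The two-stage recipe in these bullets can be illustrated with a toy categorical policy: SFT fits demonstration frequencies (locking in the output format), then REINFORCE with a binary outcome reward shifts mass toward the correct response. Everything here (the responses, rewards, and learning rates) is an illustrative assumption, not the paper's setup:

```python
import math
import random

random.seed(0)

# Four candidate responses; only the first is correct, and only the
# last breaks the expected output format.
responses = ["ans: 24", "ans: 23", "ans: 25", "no format"]
reward = [1.0, 0.0, 0.0, 0.0]          # outcome-based reward
logits = [0.0, 0.0, 0.0, 0.0]

def probs():
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [p / s for p in z]

# Stage 1: SFT on demonstrations (all well-formatted, mixed correctness).
demos = [0, 1, 2] * 10                  # demo indices; "no format" absent
for _ in range(200):
    i = random.choice(demos)
    p = probs()
    for j in range(4):                  # gradient of log-likelihood
        logits[j] += 0.1 * ((j == i) - p[j])

# Stage 2: REINFORCE with the outcome reward on the SFT-initialized policy.
for _ in range(2000):
    p = probs()
    i = random.choices(range(4), weights=p)[0]
    for j in range(4):                  # policy-gradient update
        logits[j] += 0.1 * reward[i] * ((j == i) - p[j])
```

After stage 1 the policy almost never emits the unformatted response; after stage 2 nearly all mass sits on the correct one. The SFT stage matters because RL's reward signal only flows through sampled, parseable outputs.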

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results point to the value of incorporating outcome-based RL in post-training pipelines for better adaptability in changing environments.
  • The pattern may extend to other foundation model applications like language understanding or decision making where rules can vary.
  • Developers could experiment with different reward designs to see if they further enhance generalization beyond the tested games and navigation tasks.

Load-bearing premise

The performance gaps observed between SFT and RL on GeneralPoints and V-IRL truly indicate differences in generalization ability rather than being specific to the design of those two test environments.

What would settle it

Training new models with RL on GeneralPoints or V-IRL and then testing them on a completely different task with novel rules and visuals, finding no generalization advantage over SFT models, would falsify the central claim.

read the original abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrates the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RL (especially outcome-based) generalizes better than SFT to out-of-distribution rule-based textual and visual variants in the newly introduced GeneralPoints arithmetic card game and the adopted V-IRL navigation environment. SFT is said to memorize training data and fail on OOD cases, while RL also improves underlying visual recognition; SFT is nevertheless required to stabilize output formats before effective RL training.

Significance. If the central empirical distinction holds after addressing setup details, the work would usefully separate the memorization/generalization roles of SFT versus RL in multi-modal post-training and demonstrate that outcome rewards can yield rule-like behavior across modalities. The introduction of GeneralPoints and controlled variants of V-IRL supplies concrete testbeds that future studies could reuse.

major comments (3)
  1. [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.
  2. [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target.
  3. [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.
minor comments (2)
  1. [Abstract] Abstract contains a subject-verb agreement error: 'These findings demonstrates' should read 'These findings demonstrate'.
  2. [Figures and tables] All result tables and figures should include error bars, exact sample sizes, and explicit comparison of SFT-only, RL-only, and SFT-then-RL pipelines.
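The seed-level analysis major comment 1 calls for amounts to a paired test on per-seed performance gaps. A minimal sketch with hypothetical numbers (the paper reports no per-seed results; these values are invented for illustration):

```python
import math
from statistics import mean, stdev

# Hypothetical per-seed OOD success rates for matched RL and SFT runs.
rl_seeds = [0.62, 0.58, 0.65, 0.60, 0.63]
sft_seeds = [0.21, 0.25, 0.19, 0.23, 0.22]

# Paired t statistic on the RL - SFT gaps across seeds.
gaps = [r - s for r, s in zip(rl_seeds, sft_seeds)]
n = len(gaps)
t_stat = mean(gaps) / (stdev(gaps) / math.sqrt(n))

# With n - 1 = 4 degrees of freedom, |t| > 2.776 rejects the null of a
# zero mean gap at the two-sided 5% level.
```

Reporting the mean gap with its standard deviation and this statistic per variant would let readers judge whether the advantage survives run-to-run variance.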

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We provide detailed responses to each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.

    Authors: We agree with this observation. The original manuscript omitted these details due to space constraints, but we recognize their importance. In the revised version, we will report the number of independent training runs (conducted with different random seeds), include error bars (standard deviations) on all performance figures, and add statistical significance tests (such as paired t-tests) to confirm that the generalization gaps between SFT and RL are significant. revision: yes

  2. Referee: [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target.

    Authors: The variants were designed with the explicit goal of isolating changes to the rule or visual feature. For GeneralPoints, textual variants modify the arithmetic operations while keeping card values, game mechanics, and the outcome-based reward identical. The V-IRL visual variants likewise alter only visual appearance, leaving the navigation task and reward unchanged. We will revise the environment sections to provide more explicit descriptions of these controls and of how other factors are held constant. While we did not perform additional ablations beyond the main experiments, the controlled construction supports our claims; we will add a paragraph discussing this design choice. revision: partial

  3. Referee: [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.

    Authors: Our analysis in the manuscript includes probes on visual element recognition in V-IRL scenes after training. To address the separation from policy optimization, we will add explicit auxiliary metrics, such as accuracy on a visual recognition task using frozen policy models or separate evaluation on image classification of navigation-relevant objects. This will be included in the revised analysis section to demonstrate gains independent of the full navigation policy. revision: yes
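The frozen-feature probe proposed in response 3 could look like the following: freeze the policy's visual encoder and fit only a linear classifier on its features for a recognition label (e.g. landmark visibility). The encoder, features, and labels here are synthetic stand-ins, not V-IRL data:

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))          # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical frozen encoder: deterministic 8-d "features" per image id,
# with a separable signal in the first coordinate standing in for a
# landmark-visibility cue learned during RL training.
def frozen_encoder(image_id):
    rng = random.Random(image_id)
    feats = [rng.gauss(0, 1) for _ in range(8)]
    feats[0] += 2.0 if image_id % 2 == 0 else -2.0
    return feats

def label(image_id):
    return 1 if image_id % 2 == 0 else 0   # e.g. "landmark visible"

train = [(frozen_encoder(i), label(i)) for i in range(200)]

# Linear probe: logistic regression trained on frozen features only, so
# any accuracy reflects the representation, not the navigation policy.
w, b = [0.0] * 8, 0.0
for _ in range(50):
    for x, y in train:
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        w = [wi - 0.1 * g * xi for wi, xi in zip(w, x)]
        b -= 0.1 * g

correct = 0
for i in range(200, 300):                  # held-out probe accuracy
    x = frozen_encoder(i)
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5
    correct += pred == (label(i) == 1)
acc = correct / 100
```

Comparing this probe accuracy between SFT-trained and RL-trained encoders would isolate recognition gains from policy-optimization effects, which is exactly the separation the referee asks for.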

Circularity Check

0 steps flagged

No circularity: empirical results on new environments

full rationale

The paper is an empirical study that introduces GeneralPoints and adopts V-IRL, then reports direct experimental comparisons of SFT versus RL on in-distribution and out-of-distribution textual and visual variants. No mathematical derivation, prediction, or first-principles claim is present that reduces by construction to fitted parameters, self-definitions, or self-citations. Generalization claims rest on measured performance gaps rather than any re-labeling of inputs as outputs. The work is self-contained against external benchmarks via explicit task variants and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical comparison study; no free parameters, invented entities, or new mathematical axioms are introduced in the abstract. It relies on standard domain assumptions about measuring generalization via held-out variants.

axioms (1)
  • domain assumption Performance differences on held-out variants of GeneralPoints and V-IRL indicate true generalization rather than benchmark-specific effects.
    The paper uses these environments as proxies for broader capabilities without additional validation of their representativeness.

pith-pipeline@v0.9.0 · 5536 in / 1342 out tokens · 52948 ms · 2026-05-12T21:25:13.756846+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  4. TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

  5. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  6. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  7. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  8. Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

    cs.AI 2026-04 unverdicted novelty 7.0

    Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...

  9. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  10. LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO

    cs.DB 2026-04 unverdicted novelty 7.0

    LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.

  11. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  12. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  13. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  14. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  15. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  16. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  17. Generalization in LLM Problem Solving: The Case of the Shortest Path

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

  18. SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Disjoint SFT and GRPO data for autoformalization yields up to 10.4pp semantic accuracy gains over full overlap, which renders the GRPO stage redundant.

  19. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  20. Watch Before You Answer: Learning from Visually Grounded Post-Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

  21. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  22. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  23. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  24. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  25. Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy

    cs.CY 2026-05 unverdicted novelty 5.0

    AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.

  26. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  27. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  28. Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

  29. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  30. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 30 Pith papers · 5 internal anchors
