pith. machine review for the scientific record.

arxiv: 2501.17161 · v2 · submitted 2025-01-28 · 💻 cs.AI · cs.CV · cs.LG

Recognition: 2 theorem links

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 21:25 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG

keywords supervised fine-tuning · reinforcement learning · generalization · memorization · foundation models · post-training · arithmetic reasoning · visual navigation

The pith

Reinforcement learning enables foundation models to generalize to unseen variants while supervised fine-tuning leads to memorization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates the distinct effects of supervised fine-tuning and reinforcement learning during the post-training of foundation models on their ability to generalize. Using a new arithmetic card game called GeneralPoints and the V-IRL navigation setting, it tests performance on rule changes and visual changes not seen during training. The key finding is that reinforcement learning with outcome-based rewards allows models to adapt to these new situations in both text and vision, while supervised fine-tuning makes models rely on memorizing specific training instances. This distinction matters for building models that can handle novel real-world scenarios rather than repeating what they saw in training. The analysis also shows that reinforcement learning boosts visual recognition skills, and that supervised fine-tuning is still required first to keep outputs in a consistent format so reinforcement learning can succeed.

Core claim

Reinforcement learning, particularly with outcome-based rewards, generalizes across both rule-based textual and visual variants in tasks like GeneralPoints and V-IRL. In contrast, supervised fine-tuning tends to memorize training data and struggles with out-of-distribution scenarios. Reinforcement learning enhances underlying visual recognition capabilities, but supervised fine-tuning is essential to stabilize output formats before reinforcement learning can achieve its gains.

What carries the argument

Outcome-based reward signals in reinforcement learning applied after supervised fine-tuning, tested on the GeneralPoints card game for arithmetic reasoning and V-IRL for real-world navigation.
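The outcome-based reward at the heart of the argument can be pictured as a binary verifier for GeneralPoints: did the model's arithmetic expression use the dealt cards and hit the target? This is a minimal sketch under stated assumptions (target value, card encoding, and checking rules are illustrative, not the paper's implementation):

```python
import re

def outcome_reward(cards, expression, target=24):
    """Binary outcome-based reward for a GeneralPoints-style round:
    1 if the proposed arithmetic expression uses exactly the four
    card values and evaluates to the target, else 0. A simplified
    stand-in for the verifier described in the paper."""
    used = sorted(int(n) for n in re.findall(r"\d+", expression))
    if used != sorted(cards):
        return 0  # must use each card value exactly once
    try:
        value = eval(expression, {"__builtins__": {}})  # arithmetic only
    except Exception:
        return 0  # malformed expression earns no reward
    return int(abs(value - target) < 1e-6)
```

For instance, `outcome_reward([4, 6, 10, 10], "(10 - 4) * (10 - 6)")` returns 1, while an expression that ignores two of the cards returns 0. Because the reward scores only the outcome, any derivation that reaches the target is reinforced, which is the property the paper credits for RL's rule-level generalization.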

If this is right

  • RL-trained models perform well on unseen rule variants in text and new visual conditions without additional training.
  • SFT alone results in poor generalization to new scenarios in multi-modal tasks.
  • A two-stage process of SFT followed by RL combines format stability with generalization ability.
  • RL training improves core perceptual skills such as visual recognition that support broader task performance.
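The two-stage recipe in these bullets can be illustrated with a toy categorical policy: SFT fits demonstration frequencies (locking in the output format), then REINFORCE with a binary outcome reward shifts mass toward the correct response. Everything here (the responses, rewards, and learning rates) is an illustrative assumption, not the paper's setup:

```python
import math
import random

random.seed(0)

# Four candidate responses; only the first is correct, and only the
# last breaks the expected output format.
responses = ["ans: 24", "ans: 23", "ans: 25", "no format"]
reward = [1.0, 0.0, 0.0, 0.0]          # outcome-based reward
logits = [0.0, 0.0, 0.0, 0.0]

def probs():
    z = [math.exp(l) for l in logits]
    s = sum(z)
    return [p / s for p in z]

# Stage 1: SFT on demonstrations (all well-formatted, mixed correctness).
demos = [0, 1, 2] * 10                  # demo indices; "no format" absent
for _ in range(200):
    i = random.choice(demos)
    p = probs()
    for j in range(4):                  # gradient of log-likelihood
        logits[j] += 0.1 * ((j == i) - p[j])

# Stage 2: REINFORCE with the outcome reward on the SFT-initialized policy.
for _ in range(2000):
    p = probs()
    i = random.choices(range(4), weights=p)[0]
    for j in range(4):                  # policy-gradient update
        logits[j] += 0.1 * reward[i] * ((j == i) - p[j])
```

After stage 1 the policy almost never emits the unformatted response; after stage 2 nearly all mass sits on the correct one. The SFT stage matters because RL's reward signal only flows through sampled, parseable outputs.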

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These results point to the value of incorporating outcome-based RL in post-training pipelines for better adaptability in changing environments.
  • The pattern may extend to other foundation model applications like language understanding or decision making where rules can vary.
  • Developers could experiment with different reward designs to see if they further enhance generalization beyond the tested games and navigation tasks.

Load-bearing premise

The performance gaps observed between SFT and RL on GeneralPoints and V-IRL truly indicate differences in generalization ability rather than being specific to the design of those two test environments.

What would settle it

Training new models with RL on GeneralPoints or V-IRL and then testing them on a completely different task with novel rules and visuals, finding no generalization advantage over SFT models, would falsify the central claim.

read the original abstract

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrates the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that RL (especially outcome-based) generalizes better than SFT to out-of-distribution rule-based textual and visual variants in the newly introduced GeneralPoints arithmetic card game and the adopted V-IRL navigation environment. SFT is said to memorize training data and fail on OOD cases, while RL also improves underlying visual recognition; SFT is nevertheless required to stabilize output formats before effective RL training.

Significance. If the central empirical distinction holds after addressing setup details, the work would usefully separate the memorization/generalization roles of SFT versus RL in multi-modal post-training and demonstrate that outcome rewards can yield rule-like behavior across modalities. The introduction of GeneralPoints and controlled variants of V-IRL supplies concrete testbeds that future studies could reuse.

major comments (3)
  1. [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.
  2. [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target.
  3. [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.
minor comments (2)
  1. [Abstract] Abstract contains a subject-verb agreement error: 'These findings demonstrates' should read 'These findings demonstrate'.
  2. [Figures and tables] All result tables and figures should include error bars, exact sample sizes, and explicit comparison of SFT-only, RL-only, and SFT-then-RL pipelines.
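The seed-level analysis major comment 1 calls for amounts to a paired test on per-seed performance gaps. A minimal sketch with hypothetical numbers (the paper reports no per-seed results; these values are invented for illustration):

```python
import math
from statistics import mean, stdev

# Hypothetical per-seed OOD success rates for matched RL and SFT runs.
rl_seeds = [0.62, 0.58, 0.65, 0.60, 0.63]
sft_seeds = [0.21, 0.25, 0.19, 0.23, 0.22]

# Paired t statistic on the RL - SFT gaps across seeds.
gaps = [r - s for r, s in zip(rl_seeds, sft_seeds)]
n = len(gaps)
t_stat = mean(gaps) / (stdev(gaps) / math.sqrt(n))

# With n - 1 = 4 degrees of freedom, |t| > 2.776 rejects the null of a
# zero mean gap at the two-sided 5% level.
```

Reporting the mean gap with its standard deviation and this statistic per variant would let readers judge whether the advantage survives run-to-run variance.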

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments on our work. We provide detailed responses to each major comment below and outline the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Experiments / Results sections] The manuscript provides no details on the number of training runs, random seeds, statistical significance tests, or error bars for the reported performance gaps between SFT and RL on GeneralPoints and V-IRL variants. Without these, it is impossible to judge whether the observed generalization advantage is robust or could be explained by variance or implementation differences.

    Authors: We agree with this observation. The original manuscript omitted these details due to space constraints, but we recognize their importance. In the revised version, we will report the number of independent training runs (conducted with different random seeds), include error bars (standard deviations) on all performance figures, and add statistical significance tests (such as paired t-tests) to confirm that the generalization gaps between SFT and RL are significant. revision: yes

  2. Referee: [GeneralPoints and V-IRL environment descriptions] The claim that performance differences reflect acquisition of generalizable rules rather than RL exploiting the outcome reward's tolerance for exploration (while SFT overfits surface patterns) rests on the untested assumption that the textual and visual variants in GeneralPoints and V-IRL alter only the intended rule or visual feature. No ablations are described that hold reward structure, visual complexity, or required computation constant while changing only the generalization target.

    Authors: The variants were designed with the explicit goal of isolating changes to the rule or visual feature. For GeneralPoints, textual variants modify the arithmetic operations while keeping card values, game mechanics, and the outcome-based reward identical. The V-IRL visual variants likewise alter only visual appearance, leaving the navigation task and reward unchanged. We will revise the environment sections to provide more explicit descriptions of these controls and of how other factors are held constant. While we did not perform additional ablations beyond the main experiments, the controlled construction supports our claims; we will add a paragraph discussing this design choice. revision: partial

  3. Referee: [Analysis of visual recognition] The additional assertion that RL improves underlying visual recognition in V-IRL is not separated from policy optimization effects; the paper does not report auxiliary recognition metrics or controlled probes that would show gains independent of the navigation policy.

    Authors: Our analysis in the manuscript includes probes on visual element recognition in V-IRL scenes after training. To address the separation from policy optimization, we will add explicit auxiliary metrics, such as accuracy on a visual recognition task using frozen policy models or separate evaluation on image classification of navigation-relevant objects. This will be included in the revised analysis section to demonstrate gains independent of the full navigation policy. revision: yes
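The frozen-feature probe proposed in response 3 could look like the following: freeze the policy's visual encoder and fit only a linear classifier on its features for a recognition label (e.g. landmark visibility). The encoder, features, and labels here are synthetic stand-ins, not V-IRL data:

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))          # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical frozen encoder: deterministic 8-d "features" per image id,
# with a separable signal in the first coordinate standing in for a
# landmark-visibility cue learned during RL training.
def frozen_encoder(image_id):
    rng = random.Random(image_id)
    feats = [rng.gauss(0, 1) for _ in range(8)]
    feats[0] += 2.0 if image_id % 2 == 0 else -2.0
    return feats

def label(image_id):
    return 1 if image_id % 2 == 0 else 0   # e.g. "landmark visible"

train = [(frozen_encoder(i), label(i)) for i in range(200)]

# Linear probe: logistic regression trained on frozen features only, so
# any accuracy reflects the representation, not the navigation policy.
w, b = [0.0] * 8, 0.0
for _ in range(50):
    for x, y in train:
        g = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        w = [wi - 0.1 * g * xi for wi, xi in zip(w, x)]
        b -= 0.1 * g

correct = 0
for i in range(200, 300):                  # held-out probe accuracy
    x = frozen_encoder(i)
    pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5
    correct += pred == (label(i) == 1)
acc = correct / 100
```

Comparing this probe accuracy between SFT-trained and RL-trained encoders would isolate recognition gains from policy-optimization effects, which is exactly the separation the referee asks for.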

Circularity Check

0 steps flagged

No circularity: empirical results on new environments

full rationale

The paper is an empirical study that introduces GeneralPoints and adopts V-IRL, then reports direct experimental comparisons of SFT versus RL on in-distribution and out-of-distribution textual and visual variants. No mathematical derivation, prediction, or first-principles claim is present that reduces by construction to fitted parameters, self-definitions, or self-citations. Generalization claims rest on measured performance gaps rather than any re-labeling of inputs as outputs. The work is self-contained against external benchmarks via explicit task variants and metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical comparison study; no free parameters, invented entities, or new mathematical axioms are introduced in the abstract. It relies on standard domain assumptions about measuring generalization via held-out variants.

axioms (1)
  • domain assumption Performance differences on held-out variants of GeneralPoints and V-IRL indicate true generalization rather than benchmark-specific effects.
    The paper uses these environments as proxies for broader capabilities without additional validation of their representativeness.

pith-pipeline@v0.9.0 · 5536 in / 1342 out tokens · 52948 ms · 2026-05-12T21:25:13.756846+00:00 · methodology

discussion (0)


Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

    cs.SE 2026-05 unverdicted novelty 7.0

    StepCodeReasoner aligns code reasoning with verifiable stepwise execution traces via print anchors and bi-level GRPO reinforcement learning, reaching SOTA results on CRUXEval (91.1%) and LiveCodeBench (86.5%) for a 7B model.

  3. From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation

    cs.LG 2026-05 conditional novelty 7.0

    Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.

  4. TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

  5. Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    Omni-Persona benchmark with 18 tasks shows open-source models have audio-visual grounding gaps, RLVR narrows them but leads to conservative outputs, and scale or recall alone fail as diagnostics.

  6. Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

    cs.RO 2026-05 unverdicted novelty 7.0

    ConSFT prevents catastrophic forgetting in fine-tuning flow-matching VLAs by dynamically scaling gradients based on model confidence, retaining over 20% more pre-trained capability than standard SFT without prior data...

  7. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  8. Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

    cs.AI 2026-04 unverdicted novelty 7.0

    Rule-VLN is the first large-scale benchmark injecting 177 regulatory categories into an urban environment, and the proposed SNRM module equips pre-trained VLN agents with zero-shot semantic reasoning and detour planni...

  9. S-GRPO: Unified Post-Training for Large Vision-Language Models

    cs.LG 2026-04 unverdicted novelty 7.0

    S-GRPO unifies SFT and RL for LVLMs via conditional ground-truth injection that supplies a maximal-reward anchor when group exploration fails completely.

  10. LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO

    cs.DB 2026-04 unverdicted novelty 7.0

    LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.

  11. Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    cs.LG 2026-01 unverdicted novelty 7.0

    A single LLM improves its own reasoning by self-distilling from privileged verified traces as teacher to its question-only student policy, outperforming off-policy distillation and RL on math benchmarks with better to...

  12. ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving

    cs.CV 2025-06 unverdicted novelty 7.0

    ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.

  13. Video-R1: Reinforcing Video Reasoning in MLLMs

    cs.CV 2025-03 conditional novelty 7.0

    Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.

  14. Teacher-Guided Policy Optimization for LLM Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    TGPO improves on-policy LLM distillation by using teacher predictions conditioned on student rollouts to supply informative guidance when the two distributions diverge.

  15. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  16. AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.

  17. Generalization in LLM Problem Solving: The Case of the Shortest Path

    cs.AI 2026-04 unverdicted novelty 6.0

    LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.

  18. SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization

    cs.LG 2026-04 unverdicted novelty 6.0

    Disjoint SFT and GRPO data for autoformalization yields up to 10.4pp semantic accuracy gains over full overlap, which renders the GRPO stage redundant.

  19. ReRec: Reasoning-Augmented LLM-based Recommendation Assistant via Reinforcement Fine-tuning

    cs.IR 2026-04 unverdicted novelty 6.0

    ReRec uses reinforcement fine-tuning with dual-graph reward shaping, reasoning-aware advantage estimation, and online curriculum scheduling to improve LLM reasoning and performance in recommendation tasks.

  20. Watch Before You Answer: Learning from Visually Grounded Post-Training

    cs.CV 2026-04 unverdicted novelty 6.0

    Filtering post-training data to visually grounded questions improves VLM video understanding performance by up to 6.2 points using 69% of the data.

  21. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

  22. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  23. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  24. On-Policy Distillation with Best-of-N Teacher Rollout Selection

    cs.CV 2026-05 unverdicted novelty 5.0

    BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.

  25. Cheap Expertise: Mapping and Challenging Industry Perspectives in the Expert Data Gig Economy

    cs.CY 2026-05 unverdicted novelty 5.0

    AI data firms view human expertise as an extractable, low-cost resource to feed AI systems while treating institutional expertise as something needing liberation or reform to fit this model.

  26. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

    cs.AI 2026-05 unverdicted novelty 5.0

    Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.

  27. Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    cs.LG 2026-04 unverdicted novelty 5.0

    A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.

  28. Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models

    cs.AI 2026-04 unverdicted novelty 5.0

    Newer LLM backbones in VLMs do not always improve performance; gains are task-dependent, with VQA models solving different questions due to better confidence calibration and stable representations.

  29. Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    cs.AI 2025-03 unverdicted novelty 5.0

    The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.

  30. DeepSight: Long-Horizon World Modeling via Latent States Prediction for End-to-End Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 4.0

    DeepSight uses parallel latent feature prediction in BEV for long-horizon world modeling and adaptive text reasoning to reach state-of-the-art closed-loop performance on the Bench2drive benchmark.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 30 Pith papers · 5 internal anchors
