STAMP: Training Explicit Memory for Mobile GUI Agents in Controllable and Scalable Virtual Environments

Haiyang Xu; Jieping Ye; Jitao Sang; Junyang Wang; Ming Yan; Xi Zhang; Zhaoqing Zhu

arxiv: 2605.29324 · v1 · pith:HBPNMDHTnew · submitted 2026-05-28 · 💻 cs.CL · cs.CV


Junyang Wang, Haiyang Xu, Xi Zhang, Zhaoqing Zhu, Ming Yan, Jieping Ye, Jitao Sang

Pith reviewed 2026-06-29 07:42 UTC · model grok-4.3


keywords: mobile GUI agents · explicit memory · virtual environments · long-horizon tasks · reinforcement learning · memory benchmark · context window limits · supervised memory data


The pith

Virtual environments with injected memory variables train GUI agents to explicitly remember information across long tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that mobile GUI agents lose critical past information because screenshot inputs overwhelm context windows, forcing them to discard history. To fix this, STAMP creates simulated environments where specific facts are programmatically inserted into tasks at controlled moments, generating labeled examples of what to encode and when to retrieve it. This produces scalable supervised data plus environment rewards for reinforcement learning. The resulting agent then performs better on long-horizon mobile tasks that require memory while keeping general navigation skills intact.
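The context conflict described above can be illustrated with a minimal sketch. It assumes a hypothetical agent loop that keeps only the most recent screenshots while carrying short text notes forward; `AgentContext`, `max_screens`, and the note format are illustrative names, not taken from the paper.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentContext:
    """Hypothetical agent context: bounded screenshot window plus explicit text memory."""
    max_screens: int = 3
    screens: deque = field(default_factory=deque)
    memory: list = field(default_factory=list)

    def observe(self, screenshot: str, note: Optional[str] = None) -> None:
        # Older screenshots are dropped once the window budget is exceeded.
        self.screens.append(screenshot)
        while len(self.screens) > self.max_screens:
            self.screens.popleft()
        # An explicit note survives after its source screenshot is discarded.
        if note:
            self.memory.append(note)


ctx = AgentContext(max_screens=2)
ctx.observe("screen_0: order page", note="order code = 1234")
ctx.observe("screen_1: home")
ctx.observe("screen_2: submit form")
```

After the third observation the order page is no longer in the window, but the note written while it was visible is still available to the agent.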

Core claim

STAMP trains explicit memory in mobile GUI agents by building controllable virtual environments that inject deterministic memory variables into synthesized tasks, specifying exactly what must be memorized, when it must be encoded, and when it must be retrieved, thereby yielding verifiable training signals at scale and enabling online reinforcement learning with environment-driven rewards.

What carries the argument

Programmatically injected deterministic memory variables in virtual environments that control encoding and retrieval timing to create supervised memory data.
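One way to picture an injected memory variable is as a small record that fixes what is memorized and at which steps it is encoded and retrieved. The sketch below is an assumption-laden illustration: `MemoryVariable`, `synthesize_task`, `supervision_labels`, the `order_code` name, and the step-based timing are all hypothetical and do not reproduce the paper's actual task generator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryVariable:
    name: str           # what must be memorized (hypothetical field)
    value: str          # deterministic ground truth shown in the UI
    encode_step: int    # step at which the value is visible on screen
    retrieve_step: int  # step at which the value must be used


def synthesize_task(seed: int, horizon: int = 10) -> MemoryVariable:
    # A seeded generator makes the injected value and timing reproducible.
    rng = random.Random(seed)
    value = f"{rng.randrange(10_000):04d}"
    encode_step = rng.randrange(0, horizon // 2)
    retrieve_step = rng.randrange(horizon // 2, horizon)
    return MemoryVariable("order_code", value, encode_step, retrieve_step)


def supervision_labels(var: MemoryVariable, horizon: int = 10) -> list:
    # Per-step labels: what to write to memory, and when recall is required.
    return [
        {
            "step": step,
            "memory_target": var.value if step == var.encode_step else None,
            "must_retrieve": step == var.retrieve_step,
        }
        for step in range(horizon)
    ]


var = synthesize_task(seed=0)
labels = supervision_labels(var)
```

Because the generator is seeded, the same seed always yields the same variable, which is what makes the resulting labels checkable without human annotation.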

If this is right

Agents gain the ability to retain transient details across dozens of steps without exhausting context windows.
Training data for memory behavior can be generated programmatically instead of collected from expensive real interactions.
Online reinforcement learning becomes feasible because environment rewards directly score correct recall timing.
General mobile navigation performance remains intact while memory accuracy improves on the Memory-World benchmark.
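The environment-driven reward mentioned in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `episode_reward`, the substring check on memory notes, and the two weights are hypothetical, and the paper's actual reward design is not given in the text reviewed here.

```python
def episode_reward(
    answer: str,
    gold: str,
    memory_notes: list,
    answer_weight: float = 1.0,   # assumed weight, not from the paper
    memory_weight: float = 0.5,   # assumed weight, not from the paper
) -> float:
    """Hypothetical reward: credit for recording the gold value and for submitting it."""
    reward = 0.0
    # Credit for having written the injected value into explicit memory.
    if any(gold in note for note in memory_notes):
        reward += memory_weight
    # Credit for submitting exactly the gold answer (ignoring outer whitespace).
    if answer.strip() == gold:
        reward += answer_weight
    return reward
```

Since the environment knows the injected value, both checks can be computed without a learned judge.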

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same injection technique could be adapted to train memory in non-GUI agents such as web or desktop controllers.
If virtual memory variables transfer well, real-device data collection for long-horizon tasks could be reduced.
Benchmark results on Memory-World may understate failure modes that appear only when memory variables are not explicitly provided.

Load-bearing premise

That the memory variables and rewards programmed into virtual environments match the memory demands of actual mobile GUI use.

What would settle it

A test showing that an agent trained only on the virtual memory tasks performs no better than baseline agents when given the same long real-world mobile navigation sequences.

Figures

Figures reproduced from arXiv: 2605.29324 by Haiyang Xu, Jieping Ye, Jitao Sang, Junyang Wang, Ming Yan, Xi Zhang, Zhaoqing Zhu.

**Figure 2.** The end-to-end framework of STAMP. The proposed pipeline consists of three sequential phases: [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** Example of Stamp-GUI predicting memory content when appropriate and using memory when needed on an [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** General capability evaluation on the full 116 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png]

Original abstract

Mobile GUI agents excel at immediate reactive control but frequently fail in realistic, long-horizon tasks that require memory. This failure stems from a fundamental conflict between limited context windows and token-heavy screenshots. To save the limited context, agents must progressively discard older visual history, permanently losing crucial transient information. Furthermore, existing action-centric datasets fail to teach agents what or when to explicitly memorize, and augmenting static real-world data is prohibitively expensive and lacks interactive verification. To resolve this, we present STAMP, a framework that trains explicit memory in mobile agents through controllable virtual environments, where deterministic memory variables are programmatically injected into synthesized tasks to control what must be memorized, when it should be encoded, and when it must later be retrieved, thereby producing verifiable supervised data at scale and enabling online reinforcement learning through environment-driven reward feedback. Evaluated on our newly introduced Memory-World benchmark, the resulting Stamp-GUI agent achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark, demonstrating exceptional memory accuracy and task resilience while maintaining strong general mobile navigation capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

STAMP gives a practical way to generate scalable memory training data for GUI agents via injected variables in virtual envs, but the transfer gap to real apps is unaddressed.


The core idea here is using controllable virtual environments where memory variables are programmatically injected into tasks. This lets them create supervised signals for what and when to memorize, plus environment rewards for reinforcement learning, all at scale without expensive real data collection.

What stands out is the explicit control over memory encoding and retrieval in the training loop. Existing GUI datasets are mostly action traces that do not teach agents to decide on memory use, and real screenshots eat context fast. The synthetic setup directly targets that by making memory requirements deterministic and verifiable during training. The new Memory-World benchmark then measures accuracy and resilience on those controlled tasks while checking that general navigation does not degrade.

The main soft spot is the untested assumption that deterministic injected scalars in virtual environments match the memory demands of actual mobile apps. Real GUIs have non-deterministic state, visual layout changes, and app-internal variables that are never exposed as clean variables. The abstract gives no cross-domain results on unmodified device traces or human memory-annotated tasks, so it is unclear whether the learned policies will hold up outside the synthetic distribution. The SOTA claim is also limited to their own benchmark with no external baselines or metric details visible.

This paper is aimed at groups working on long-horizon mobile agents who need better memory mechanisms. It is worth a serious referee because the data-generation approach is concrete and the benchmark could become a useful testbed, even if the real-world transfer question will need more experiments in revision.

Referee Report

2 major / 0 minor

Summary. The paper introduces the STAMP framework for training explicit memory capabilities in mobile GUI agents. It uses controllable virtual environments in which deterministic memory variables are programmatically injected into synthesized tasks to generate verifiable supervised data at scale and to supply environment-driven rewards for online reinforcement learning. The work also introduces the Memory-World benchmark; the resulting Stamp-GUI agent is claimed to achieve state-of-the-art performance among GUI-specialized models, set a new high watermark on Memory-World, exhibit exceptional memory accuracy and task resilience, and retain strong general mobile navigation capabilities.

Significance. If the central claim holds and the virtual-environment training signal transfers, the approach would supply a scalable, controllable route to teaching long-horizon memory policies that current context-window-limited agents lack, addressing a recognized bottleneck in realistic GUI task execution.

major comments (2)

[Abstract] The claim that Stamp-GUI 'achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark' is asserted without any reported metrics, baselines, ablation results, or experimental protocol, rendering the central empirical claim unsupported in the provided text.
[Method description, Abstract] The training signal rests on the assumption that programmatically injected deterministic memory variables and environment rewards accurately capture the memory requirements of real mobile GUI tasks. Real GUIs contain non-deterministic state, implicit visual/layout cues, and app-internal state not exposed as variables; no cross-domain evaluation on unmodified real-device traces or human-annotated memory tasks is described to test transfer of the learned policies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below.


Referee: [Abstract] The claim that Stamp-GUI 'achieves state-of-the-art performance among GUI-specialized models and sets a new high watermark on our Memory-World benchmark' is asserted without any reported metrics, baselines, ablation results, or experimental protocol, rendering the central empirical claim unsupported in the provided text.

Authors: We agree that the abstract asserts the central empirical claims without including supporting metrics or protocol details. The full manuscript reports these results in the experiments section. We will revise the abstract to include key quantitative results (e.g., Memory-World accuracy scores and baseline comparisons) and a concise reference to the evaluation setup.

revision: yes

Referee: [Method description, Abstract] The training signal rests on the assumption that programmatically injected deterministic memory variables and environment rewards accurately capture the memory requirements of real mobile GUI tasks. Real GUIs contain non-deterministic state, implicit visual/layout cues, and app-internal state not exposed as variables; no cross-domain evaluation on unmodified real-device traces or human-annotated memory tasks is described to test transfer of the learned policies.

Authors: We acknowledge the assumption underlying the virtual-environment training signal and the absence of cross-domain transfer experiments on real devices. The manuscript deliberately centers on controllable virtual environments to enable scalable, verifiable memory supervision; it does not claim or evaluate transfer to unmodified real-device traces. We will revise the method description and abstract to state this scope more explicitly and will add a limitations paragraph discussing the gap to real-world non-deterministic GUIs as future work.

revision: partial

Circularity Check

0 steps flagged

No circularity: training and benchmark both rely on explicit injected variables but the performance metric is not forced by construction


The paper describes a methodological framework that programmatically injects deterministic memory variables into synthesized virtual tasks to generate training signals and then evaluates the resulting agent on the newly introduced Memory-World benchmark. No equations, self-citations, or derivations are presented that reduce the reported SOTA memory accuracy or task resilience to a tautological fit or renaming of the injected variables themselves. The central claim remains an empirical statement about agent behavior in the constructed environments rather than a self-referential prediction. This is the most common honest non-finding for a methods paper that introduces both its training distribution and its evaluation distribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities.


Reference graph

Works this paper leans on

39 extracted references · 5 canonical work pages · 2 internal anchors

[1]

Qwen3-VL Technical Report

Qwen3-vl technical report. arXiv preprint arXiv:2511.21631.
Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Guozhi Wang, Dingyu Zhang, Shuai Ren, and Hongsheng Li. 2025. Amex: Android multi-annotation expo dataset for mobile gui agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 2138–2156.
Tongbo Chen, Zhen...
[2]

Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, and 1 others

Memgui-bench: Benchmarking memory of mobile gui agents in dynamic environments. arXiv preprint arXiv:2602.06075.
Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, and 1 others. 2024. Autoglm: Autonomous foundation agents for guis. arXiv preprint arXiv:2411.00820.
Quanfeng Lu, Wenqi S...
[3]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458.
Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, and 1 others. 2024. Gui agents: A survey. ArXiv preprint, abs/2412.13501.
OpenAI. 2026. Gpt 5.4. Technical report, OpenAI. System Card. ...
[4]

Ui-venus-1.5 technical report. arXiv preprint arXiv:2602.09082.
Tongyi. 2026. Qwen-3.5. Technical report, Tongyi. System Card.
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, and 1 others. 2025a. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcemen...
[5]

goal": ...,

Appagent: Multimodal agents as smartphone users. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pages 1–20.
Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. 2024b. Android in the zoo: Chain-of-action-thought for gui agents. In Findings of the Association for Computational Lingu...
[6]

Action: a short imperative describing what to do in the UI
[7]

Memory: the key information need to memory in the screenshot
[8]

name": <function-name>,

A single <tool_call>...</tool_call> block containing only the JSON: {"name": <function-name>, "arguments": <args-json-object>}.
Rules:
- Output exactly in the order: Action, Memory, <tool_call>.
- Be brief: one for Action.
- Do not output anything else outside those three parts.
- If finishing, use action=terminate in the tool call.
Table 15: System prom...
[9]

Do not perform exploratory actions based on your own assumptions

Follow the task guideline strictly. Do not perform exploratory actions based on your own assumptions
[10]

The input action already includes one click, so you do not need to click the input box separately before inputting
[11]

If a popup appears, try closing it first
[12]

You must first browse all information according to the detailed guideline, and only then input the answer

Do not try to input the final answer directly. You must first browse all information according to the detailed guideline, and only then input the answer
[13]

Table 16: Prompt used by the dynamic testing planner

Before entering the final answer, you must review the action history and screenshots to ensure that all required information for the task has already appeared.
Table 16: Prompt used by the dynamic testing planner.
Memory-Acc Evaluation Prompt
I have a task to perform using Chrome on my phone, as follows: {} Please note that this task requires memorizing s...
[14]

Complete Match

If the mobile phone agent’s memory completely matches the provided content to be memorized, output “Complete Match”
[15]

Partial Match

If the mobile phone agent’s memory partially matches the provided content to be memorized, output “Partial Match”
[16]

No Match

If the mobile phone agent’s memory does not match the provided content to be memorized at all, output “No Match”. “Match” means that both expressions convey the same meaning. They do not need to be completely identical. As long as the core content to be memorized, such as numbers or strings, is the same, it is acceptable. Now, after careful consideration,...
[17]

task.final_action.gold must be exactly identical to scenario.data.truth.gold
[18]

task.grading.gold must be exactly identical to task.final_action.gold
[19]

task.grading.pass_regex must match the gold string, preferably using ^...$
[20]

required_testids must include at least:
- all tabs: tab-*
- search input: search-input (if the scenario has a search page)
- final submission: go-submit-answer, answer-input, answer-submit, result
- fact display containers used by memory_items, such as fact-xxx / followers-name / price-sku
[21]

task.natural_language must clearly describe the task flow, such as which pages to visit and what kind of information to look for, but it must not reveal the actual values that need to be remembered
[22]

task.guideline must be a more detailed version of task.natural_language, and must describe the full workflow, all memory targets, and the final submission process in detail
[23]

Find the code 1234 in My > Orders > History

task.guideline must include the exact location and exact content for every memory item, for example: “Find the code 1234 in My > Orders > History” and “Find the event code 5678 in Home > Activities > Popular Events”.
Table 19: Prompt used by the task generation agent.
Webpage Generation Agent Prompt
You are a web app generator. You must generate a sing...
[24]

A mobile app style bottom tab bar using task_spec.platform.tabs, with corresponding unique data-testid values in the form tab-*
[25]

Every testid listed in task_spec.ui_contract.required_testids must exist, be unique, and be interactable when appropriate
[26]

The navigation behavior described in task_spec.ui_contract.navigation_contract must be implemented
[27]

Do not alter truth values or truth text

The truth facts in scenario.data.truth.facts must be displayed on the appropriate pages exactly as specified. Do not alter truth values or truth text
[28]

go-submit-answer

There must be an entry point with data-testid="go-submit-answer" that leads to a dedicated answer submission page containing:
- data-testid="answer-input"
- data-testid="answer-submit"
- data-testid="result"
[29]

Example: xxx

Do not use the <select> element to implement the filter function. If you need to implement the filter function, please use multiple clickable buttons instead.
Submission and validation requirements:
- The one and only correct answer is: {gold}
- Make sure no distractor content creates ambiguity.
- {gold} must remain the unique correct answer under all cir...
[30]

predicted_answer must comply with task_spec.task.final_action.required_output_format, or at least follow the same exact formatting style as task_spec.task.final_action.gold
[31]

ok=true if and only if:
- predicted_answer is exactly equal to gold
- uniqueness.is_unique is true
- the task flow is executable and the required information can actually be found in the HTML without contradiction
[32]

evidence.key_observations must contain enough information for a third party to verify how predicted_answer was obtained without running the page
[33]

Use the smallest possible fix strategy, with the following priority:
(a) first adjust distractor values, ordering, labels, or section membership so they no longer affect the task result
(b) second move distractors out of the relevant comparison or calculation area, such as from a ranking list into a recommendation or ad section
(c) avoid changing scenario...
[34]

For common bugs involving highest value, top-k, sorting, filtering, or ranking, you must check all candidate items that look comparable, including newly introduced distractors, and ensure they do not change the result away from gold
[35]

Comparison scope for this task: Popularity ranking

You may modify the HTML to make the task interpretation clearer, for example by renaming a section title to “Comparison scope for this task: Popularity ranking”, but do not leak the answer in instructional text
[36]

If the final answer is formed by combining multiple strings, list each minimal component separately

evidence.key_notes must list all minimal memory units required for the final answer. If the final answer is formed by combining multiple strings, list each minimal component separately
[37]

All output content must be in English only
[38]

Do not include any Chinese characters anywhere, including in the repaired HTML
[39]

It must be implemented through multiple clickable buttons

Please ensure that the filter functionality of the webpage is not implemented through the <select> element. It must be implemented through multiple clickable buttons.
scenario.truth.facts: {json.dumps(facts, ensure_ascii=False, indent=2)}
gold: {gold}
scenario JSON: {json.dumps(scn, ensure_ascii=False, indent=2)}
task_spec JSON: {json.dumps(task_spec, ens...
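The consistency rules quoted in entries [17] to [19] above (matching gold values and a `pass_regex` that accepts the gold string) lend themselves to a mechanical check. The following is a minimal sketch, assuming the scenario and task are plain nested dictionaries with the field paths named in those entries; the function name `validate_task_spec` and the example values are hypothetical.

```python
import re


def validate_task_spec(scenario: dict, task: dict) -> list:
    """Return human-readable violations of the gold-consistency rules."""
    errors = []
    truth_gold = scenario["data"]["truth"]["gold"]
    final_gold = task["final_action"]["gold"]
    grading = task["grading"]

    # task.final_action.gold must equal scenario.data.truth.gold
    if final_gold != truth_gold:
        errors.append("task.final_action.gold != scenario.data.truth.gold")
    # task.grading.gold must equal task.final_action.gold
    if grading["gold"] != final_gold:
        errors.append("task.grading.gold != task.final_action.gold")
    # task.grading.pass_regex must match the gold string
    if re.search(grading["pass_regex"], final_gold) is None:
        errors.append("task.grading.pass_regex does not match gold")
    return errors


# Illustrative values only; not taken from the paper's data.
scenario = {"data": {"truth": {"gold": "1234"}}}
good_task = {
    "final_action": {"gold": "1234"},
    "grading": {"gold": "1234", "pass_regex": "^1234$"},
}
bad_task = {
    "final_action": {"gold": "1234"},
    "grading": {"gold": "12345", "pass_regex": "^12345$"},
}
```

An anchored pattern such as `^1234$` rejects answers that merely contain the gold string, which is presumably why the quoted rule prefers it.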
