arxiv: 2603.20340 · v3 · submitted 2026-03-20 · 💻 cs.SE · cs.AI


ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

Zijian Lu , Yiping Zuo , Yupeng Nie , Xin He , Weibei Fan , Lianyong Qi , Shi Jin


Pith reviewed 2026-05-15 08:55 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords: web agents, skill refinement, contract-based skills, multimodal agents, repairable artifacts, VisualWebArena, implicit skills, agent reliability


The pith

ContractSkill converts implicit web agent skills into explicit, repairable artifacts for local fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Web agents often generate unstable skills that hurt their performance because those skills stay implicit and hard to check or fix. ContractSkill addresses this by turning a draft skill into an executable artifact with clear procedural steps. This structure allows deterministic verification, pinpointing faults, and making small local repairs instead of rewriting everything. Experiments on VisualWebArena demonstrate gains in realistic settings, and the repaired skills stay usable even without the original model. The work suggests the real issue is not just creating skills but making them explicit and maintainable.

Core claim

The central claim is that web skills become stable and improvable once converted into contract-based artifacts with explicit structure. This enables verification and localized editing, shifting refinement from complete rewrites to targeted changes. Results from VisualWebArena and MiniWoB confirm the approach works in both realistic and controlled environments, with evidence that the artifacts transfer across models within the benchmark.

What carries the argument

The ContractSkill framework, which wraps a skill draft into an explicit procedural structure for verification, fault localization, and repair.
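The verify, localize, and repair loop described here can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `verify`, and `repair`, the dictionary-based page state, and the toy search task are all hypothetical, chosen only to show how an explicit step structure makes a fault addressable by index.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

# Hypothetical stand-in for browser/page state; a real agent would observe a live page.
State = dict


@dataclass(frozen=True)
class Step:
    name: str
    action: Callable[[State], State]        # what the step does to the state
    postcondition: Callable[[State], bool]  # contract checked right after the action


def verify(steps: list[Step], state: State) -> Optional[int]:
    """Run the steps in order and return the index of the first step whose
    postcondition fails, or None if every contract holds."""
    for i, step in enumerate(steps):
        state = step.action(state)
        if not step.postcondition(state):
            return i
    return None


def repair(steps: list[Step], index: int, new_action: Callable[[State], State]) -> list[Step]:
    """Local repair: swap the action of one step and leave all others untouched."""
    patched = list(steps)
    patched[index] = replace(steps[index], action=new_action)
    return patched


# Toy draft skill (assumed example): the last step is a no-op, so its contract fails.
draft = [
    Step("open_search", lambda s: {**s, "page": "search"},
         lambda s: s.get("page") == "search"),
    Step("type_query", lambda s: {**s, "query": "red bike"},
         lambda s: s.get("query") == "red bike"),
    Step("submit", lambda s: s,
         lambda s: s.get("page") == "results"),
]

fault = verify(draft, {})                                   # index of the failing step
fixed = repair(draft, fault, lambda s: {**s, "page": "results"})
print(fault, verify(fixed, {}))
```

Because the checks are plain predicates over state, rerunning `verify` on the same input gives the same verdict, and the repair touches one step rather than regenerating the whole skill.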

Load-bearing premise

That web skills have enough structure to be captured in explicit procedural contracts without losing necessary adaptability.

What would settle it

The claim would be undermined if applying ContractSkill produced no measurable increase in task completion rates on VisualWebArena relative to direct acting or implicit skills.

Figures

Figures reproduced from arXiv: 2603.20340 by Lianyong Qi, Shi Jin, Weibei Fan, Xin He, Yiping Zuo, Yupeng Nie, Zijian Lu.

**Figure 1.** Overview of ContractSkill. The pipeline proceeds through five stages. It starts with input and draft generation, then [...]

**Figure 2.** Representative repair case on VisualWebArena. The verifier localizes the failure to the post-exploration stage, and [...]

**Figure 3.** Template-level MiniWoB pass rates across the main [...]

**Figure 4.** VWA failure analysis across Qwen3.5-Plus and GLM [...]

Original abstract

Self-generated skills for web agents are often unstable and can even hurt performance relative to direct acting. We argue that the key bottleneck is not only skill generation quality, but the fact that web skills remain implicit and therefore cannot be checked or locally repaired. To address this, we present ContractSkill, a framework that converts a draft skill into an executable artifact with explicit procedural structure, enabling deterministic verification, fault localization, and minimal local repair. This turns skill refinement from full rewriting into localized editing of a single skill artifact. Experiments on VisualWebArena show that ContractSkill is effective in realistic web environments, while MiniWoB provides a controlled test of the mechanism behind the gain. Under matched transfer layers, repaired artifacts also remain reusable after removing the source model from the loop, providing evidence of portability within the same benchmark family rather than full-benchmark generalization. These results suggest that the central challenge is not merely generating skills, but making them explicit, executable, and repairable. Code is available at https://github.com/underfitting-lu/contractskill.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ContractSkill gives web agents explicit repairable skills instead of implicit ones, with decent benchmark results, but the experiments do not isolate whether the repair step or just the explicit structure drives the gains.


The core move here is turning self-generated web skills from opaque things that need full regeneration into explicit contract artifacts that support verification, fault finding, and small local fixes. That shift is the main contribution, and the VisualWebArena results plus the MiniWoB controlled test give it some grounding. The reusability claim after removing the source model is also useful, as it shows the artifacts can travel without the original generator in the loop. Code release is a plus for anyone who wants to check the implementation.

The paper does a reasonable job of motivating why implicit skills are unstable and why making them checkable matters for practical agent work. The argument stays consistent with the benchmarks it uses and does not collapse into circular definitions.

The main gap is the missing ablation that would separate the benefit of explicit contracts from the actual repair mechanism. Without that, it is hard to know whether the reported improvements come from the structure itself or from the localized editing step the authors emphasize. Baselines and variance details are also light in the available summary, which makes it tougher to judge effect sizes.

This work is aimed at people building multimodal web agents who already care about skill stability and reuse. A reader in that area would get a concrete framework and some evidence that explicit contracts help, even if more controls would strengthen the case. It is solid enough on its own terms to deserve peer review rather than a desk reject, mainly because the idea is clear and the benchmarks are relevant.

Referee Report

2 major / 2 minor

Summary. The paper introduces ContractSkill, a framework that converts draft skills for multimodal web agents into executable artifacts with explicit procedural structure and contracts. This enables deterministic verification, fault localization, and minimal local repair, reframing skill refinement as localized editing of a single artifact rather than full rewriting. Experiments on VisualWebArena demonstrate effectiveness in realistic web environments, while MiniWoB provides a controlled test of the underlying mechanism; repaired artifacts remain reusable after source-model removal under matched transfer layers, suggesting portability within benchmark families.

Significance. If the results hold, the work offers a concrete path to improving stability and reusability of self-generated skills in web agents by tackling implicitness as the core bottleneck. Code availability supports reproducibility and strengthens the contribution for the software-engineering community working on agentic systems.

major comments (2)

[Experiments] Experiments section (likely §4): No ablation isolates the performance contribution of the local-repair mechanism from the benefits of the initial explicit contract structure alone. This is load-bearing for the central claim that repairability (rather than procedural structuring) drives the VisualWebArena gains and the reusability result after model removal; without the control, the argument that implicitness is the key bottleneck does not follow from the data.
[§4] §4 / Table 1 (or equivalent results table): The abstract reports positive results but the manuscript must supply exact metrics, full baseline descriptions, error bars or statistical tests, and the precise definition of 'matched transfer layers' to substantiate the effectiveness and portability claims.

minor comments (2)

[Abstract] Abstract: 'Contract Skill' appears inconsistently as two words; standardize to 'ContractSkill'.
[Abstract] Abstract: Typo 'evi dence' should be 'evidence'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested changes.


Referee: [Experiments] Experiments section (likely §4): No ablation isolates the performance contribution of the local-repair mechanism from the benefits of the initial explicit contract structure alone. This is load-bearing for the central claim that repairability (rather than procedural structuring) drives the VisualWebArena gains and the reusability result after model removal; without the control, the argument that implicitness is the key bottleneck does not follow from the data.

Authors: We agree that the current experiments lack an ablation isolating the local-repair mechanism from the benefits of explicit contract structure alone. This is a valid concern for substantiating the central claim. In the revised manuscript we will add a controlled ablation comparing (i) draft skills, (ii) explicit contracts without repair, and (iii) explicit contracts with the local-repair procedure on both VisualWebArena and MiniWoB. The new results will be reported with the same metrics used in the main experiments.
Revision: yes

Referee: [§4] §4 / Table 1 (or equivalent results table): The abstract reports positive results but the manuscript must supply exact metrics, full baseline descriptions, error bars or statistical tests, and the precise definition of 'matched transfer layers' to substantiate the effectiveness and portability claims.

Authors: We will expand Section 4 and Table 1 (and any supplementary tables) to report exact numerical metrics for all conditions, full descriptions of every baseline, error bars or results of statistical significance tests where multiple runs were performed, and a precise definition of 'matched transfer layers' (including the layer indices and transfer procedure). These details will be added to make the experimental claims fully verifiable.
Revision: yes
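As an illustration of the kind of error bar the referee asks for, the sketch below computes a Wilson score interval for a pass rate. This is one common choice for binomial outcomes, not a method the paper is known to use; the function name `wilson_interval` and the example counts are assumptions made for this illustration, not reported results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Placeholder counts for illustration only (not taken from the paper).
low, high = wilson_interval(50, 100)
print(f"pass rate 0.50, 95% interval [{low:.3f}, {high:.3f}]")
```

Applied per condition in an ablation (draft skills, contracts without repair, contracts with repair), intervals of this kind would make it easier to judge whether differences in pass rate exceed sampling noise.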

Circularity Check

0 steps flagged

No circularity; framework and empirical results are self-contained against external benchmarks


The paper introduces ContractSkill as a method to convert implicit web skills into explicit, contract-structured artifacts for verification and local repair. Its central claims rest on experiments conducted on the independent VisualWebArena and MiniWoB benchmarks rather than on any internal fitting, self-definition, or self-citation chain. No equations, parameter estimations, or uniqueness theorems are presented that reduce the reported gains to the inputs by construction. The distinction between implicit and explicit skills is treated as a testable hypothesis, not a definitional premise, leaving the derivation chain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that implicit skills are the primary source of instability in web agents and that explicit contracts can be generated without introducing new failure modes.

axioms (1)

Domain assumption: Web skills remain implicit and therefore cannot be checked or locally repaired.
Directly stated as the key bottleneck in the abstract.

invented entities (1)

ContractSkill framework (no independent evidence)
Purpose: converts draft skills into executable artifacts with explicit procedural structure.
Newly introduced framework; the only supporting evidence is the reported benchmark gains.


