ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
Pith reviewed 2026-05-15 08:55 UTC · model grok-4.3
The pith
ContractSkill converts implicit web agent skills into explicit, repairable artifacts for local fixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that web skills become stable and improvable once they are converted into contract-based artifacts with explicit structure. This enables verification and localized editing, shifting refinement from complete rewrites to targeted changes. Results on VisualWebArena and MiniWoB are presented as support in realistic and controlled environments respectively, with evidence that repaired artifacts remain reusable without the source model within the same benchmark family.
What carries the argument
The ContractSkill framework, which wraps a skill draft into an explicit procedural structure for verification, fault localization, and repair.
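The idea of a skill with explicit procedural structure can be illustrated with a minimal sketch. This is not the authors' implementation: the `Step` type, the dictionary-based `State`, and the functions `first_failing_step` and `repair` are hypothetical names chosen for illustration, and the toy "search" skill stands in for a real web task. The sketch assumes each step carries a precondition and a postcondition, so that verification is a deterministic pass over the steps, fault localization is the index of the first violated contract, and repair replaces only that step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a browser/page state; a real agent would use a DOM or screenshot.
State = Dict[str, object]


@dataclass
class Step:
    """One step of a skill, with an explicit contract around its action."""
    name: str
    precondition: Callable[[State], bool]
    action: Callable[[State], State]
    postcondition: Callable[[State], bool]


def first_failing_step(steps: List[Step], state: State) -> Optional[int]:
    """Run steps in order; return the index of the first contract violation, or None."""
    for i, step in enumerate(steps):
        if not step.precondition(state):
            return i
        state = step.action(state)
        if not step.postcondition(state):
            return i
    return None


def repair(steps: List[Step], index: int, patched: Step) -> List[Step]:
    """Local repair: swap in a patched step at `index`, leaving all other steps untouched."""
    return steps[:index] + [patched] + steps[index + 1:]


# Toy skill (illustrative only): open a search page, type a query, submit.
open_search = Step(
    name="open_search",
    precondition=lambda s: s.get("page") == "home",
    action=lambda s: {**s, "page": "search"},
    postcondition=lambda s: s.get("page") == "search",
)
type_query_buggy = Step(
    name="type_query",
    precondition=lambda s: s.get("page") == "search",
    action=lambda s: {**s, "q": "shoes"},  # bug: writes to the wrong field
    postcondition=lambda s: s.get("query") == "shoes",
)
type_query_fixed = Step(
    name="type_query",
    precondition=lambda s: s.get("page") == "search",
    action=lambda s: {**s, "query": "shoes"},
    postcondition=lambda s: s.get("query") == "shoes",
)
submit = Step(
    name="submit",
    precondition=lambda s: "query" in s,
    action=lambda s: {**s, "page": "results"},
    postcondition=lambda s: s.get("page") == "results",
)

draft = [open_search, type_query_buggy, submit]
fault = first_failing_step(draft, {"page": "home"})  # localizes the faulty step
repaired = repair(draft, fault, type_query_fixed)    # edits only that step
```

In this sketch the draft skill fails at the second step because its postcondition is violated, and the repaired skill reuses the first and third steps unchanged. How the actual framework produces the patched step (for example, by prompting a model with the failing contract) is not specified here.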
Load-bearing premise
That web skills have enough structure to be captured in explicit procedural contracts without losing necessary adaptability.
What would settle it
The claim would be undermined if applying ContractSkill produced no measurable increase in task completion rates on VisualWebArena relative to direct acting or implicit skills.
Original abstract
Self-generated skills for web agents are often unstable and can even hurt performance relative to direct acting. We argue that the key bottleneck is not only skill generation quality, but the fact that web skills remain implicit and therefore cannot be checked or locally repaired. To address this, we present ContractSkill, a framework that converts a draft skill into an executable artifact with explicit procedural structure, enabling deterministic verification, fault localization, and minimal local repair. This turns skill refinement from full rewriting into localized editing of a single skill artifact. Experiments on VisualWebArena show that Contract Skill is effective in realistic web environments, while MiniWoB provides a controlled test of the mechanism behind the gain. Under matched transfer layers, repaired artifacts also remain reusable after removing the source model from the loop, providing evi dence of portability within the same benchmark family rather than full-benchmark generalization. These results suggest that the central challenge is not merely generating skills, but making them explicit, executable, and repairable. Code is available at https://github.com/underfitting-lu/contractskill.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ContractSkill, a framework that converts draft skills for multimodal web agents into executable artifacts with explicit procedural structure and contracts. This enables deterministic verification, fault localization, and minimal local repair, reframing skill refinement as localized editing of a single artifact rather than full rewriting. Experiments on VisualWebArena demonstrate effectiveness in realistic web environments, while MiniWoB provides a controlled test of the underlying mechanism; repaired artifacts remain reusable after source-model removal under matched transfer layers, suggesting portability within benchmark families.
Significance. If the results hold, the work offers a concrete path to improving stability and reusability of self-generated skills in web agents by tackling implicitness as the core bottleneck. Code availability supports reproducibility and strengthens the contribution for the software-engineering community working on agentic systems.
major comments (2)
- [Experiments] Experiments section (likely §4): No ablation isolates the performance contribution of the local-repair mechanism from the benefits of the initial explicit contract structure alone. This is load-bearing for the central claim that repairability (rather than procedural structuring) drives the VisualWebArena gains and the reusability result after model removal; without the control, the argument that implicitness is the key bottleneck does not follow from the data.
- [§4] §4 / Table 1 (or equivalent results table): The abstract reports positive results but the manuscript must supply exact metrics, full baseline descriptions, error bars or statistical tests, and the precise definition of 'matched transfer layers' to substantiate the effectiveness and portability claims.
minor comments (2)
- [Abstract] Abstract: 'Contract Skill' appears inconsistently as two words; standardize to 'ContractSkill'.
- [Abstract] Abstract: Typo 'evi dence' should be 'evidence'.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate the requested changes.
- Referee: [Experiments] Experiments section (likely §4): No ablation isolates the performance contribution of the local-repair mechanism from the benefits of the initial explicit contract structure alone. This is load-bearing for the central claim that repairability (rather than procedural structuring) drives the VisualWebArena gains and the reusability result after model removal; without the control, the argument that implicitness is the key bottleneck does not follow from the data.
  Authors: We agree that the current experiments lack an ablation isolating the local-repair mechanism from the benefits of explicit contract structure alone. This is a valid concern for substantiating the central claim. In the revised manuscript we will add a controlled ablation comparing (i) draft skills, (ii) explicit contracts without repair, and (iii) explicit contracts with the local-repair procedure on both VisualWebArena and MiniWoB. The new results will be reported with the same metrics used in the main experiments.
  Revision: yes
- Referee: [§4] §4 / Table 1 (or equivalent results table): The abstract reports positive results but the manuscript must supply exact metrics, full baseline descriptions, error bars or statistical tests, and the precise definition of 'matched transfer layers' to substantiate the effectiveness and portability claims.
  Authors: We will expand Section 4 and Table 1 (and any supplementary tables) to report exact numerical metrics for all conditions, full descriptions of every baseline, error bars or results of statistical significance tests where multiple runs were performed, and a precise definition of 'matched transfer layers' (including the layer indices and transfer procedure). These details will be added to make the experimental claims fully verifiable.
  Revision: yes
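One way the requested error bars could be produced for an ablation between two conditions is a percentile bootstrap over per-task outcomes. The sketch below is an illustration under stated assumptions, not the authors' evaluation code: `success_rate` and `bootstrap_diff_ci` are hypothetical helper names, outcomes are assumed to be recorded as 1 (success) or 0 (failure) per task, and the two lists of outcomes are made-up placeholders rather than results from the paper.

```python
import random
from typing import List, Sequence, Tuple


def success_rate(outcomes: Sequence[int]) -> float:
    """Fraction of tasks completed, given per-task outcomes coded as 1 or 0."""
    return sum(outcomes) / len(outcomes)


def bootstrap_diff_ci(
    a: Sequence[int],
    b: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for success_rate(b) - success_rate(a)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample each condition independently, with replacement.
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(success_rate(resampled_b) - success_rate(resampled_a))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Placeholder per-task outcomes (1 = success). These are NOT results from the paper.
contract_only = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
contract_with_repair = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

low, high = bootstrap_diff_ci(contract_only, contract_with_repair)
print(f"95% bootstrap interval for the gain from repair: [{low:.2f}, {high:.2f}]")
```

An interval that excludes zero would suggest the repair step contributes beyond explicit structure alone. If the same tasks are run under both conditions, a paired resampling scheme would likely be more appropriate than the independent resampling assumed here.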
Circularity Check
No circularity; framework and empirical results are self-contained against external benchmarks
The paper introduces ContractSkill as a method to convert implicit web skills into explicit, contract-structured artifacts for verification and local repair. Its central claims rest on experiments conducted on the independent VisualWebArena and MiniWoB benchmarks rather than on any internal fitting, self-definition, or self-citation chain. No equations, parameter estimations, or uniqueness theorems are presented that reduce the reported gains to the inputs by construction. The distinction between implicit and explicit skills is treated as a testable hypothesis, not a definitional premise, leaving the derivation chain non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Web skills remain implicit and therefore cannot be checked or locally repaired
invented entities (1)
- ContractSkill framework (no independent evidence)