LLM Based Web Accessibility Repair: An Empirical Study of Detection, Remediation, and Cost

Diego Elias Costa; Durjoy Dey; Ghada Abushaqra; Oluwatoyosi Oyelayo; Parham Asadi

arxiv: 2605.27716 · v1 · pith:SAIRTR45new · submitted 2026-05-26 · 💻 cs.SE

Pith reviewed 2026-06-29 15:21 UTC · model grok-4.3

classification 💻 cs.SE

keywords web accessibility · large language models · automated repair · empirical evaluation · accessibility violations · LLM agents · hybrid systems · remediation cost

The pith

LLMs achieve partial web accessibility repairs but resolve fewer than 26 percent of cases completely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests large language models for finding and fixing accessibility problems in web pages. The LLM matches rule-based tools in detection and produces valid fixes that cut violations in most files, yet complete fixes happen in under 26 percent of cases and some patches create new problems. A sympathetic reader cares because web accessibility affects millions of users, and expensive manual work could be reduced if automated methods prove reliable. The results point toward combining LLMs with traditional checks for better outcomes.

Core claim

LLM-based agents detect accessibility violations with F1 scores of around 0.65 overall and 0.83 for semantic issues, comparable to rule-based tools. Their fixes are syntactically valid in over 99.7 percent of cases and improve compliance in 80.2 percent of instances, reducing violations from 3.98 to 1.7 per file. However, full resolution occurs in fewer than 26 percent of cases, about 30 percent of patches introduce structural changes, and iterative refinement raises computational cost by 52 percent and API usage by 1.64 times without improving results.
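As a quick sanity check on these figures, the reported per-file counts imply a relative reduction of roughly 57 percent, which is consistent with most files improving while few are fully resolved. The sketch below recomputes that number and shows how an F1 score is derived from raw counts. The detection counts (`tp`, `fp`, `fn`) are hypothetical values chosen only to land near the reported 0.65, since the paper's confusion matrix is not given here.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original quantity that was removed."""
    return (before - after) / before


# Reported averages: 3.98 violations per file before repair, 1.7 after.
print(f"{relative_reduction(3.98, 1.7):.1%}")  # 57.3%

# Hypothetical detection counts (not from the paper), for illustration only.
print(f"{f1_score(tp=65, fp=35, fn=35):.2f}")  # 0.65
```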

What carries the argument

The Kimi K2.5 LLM agent applied to detection and remediation of accessibility violations in HTML files, benchmarked against rule-based tools using F1 scores, violation counts, and compliance metrics.

If this is right

LLM detection works well for semantic violations but struggles with syntactic and layout ones.
Generated fixes are reliable in syntax but often leave residual violations requiring further work.
Adding iterative refinement to the agent increases expense without better repair rates.
Scalable solutions must integrate LLM generation with rule-based validation and constraint checks.
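
One way to read the last implication is as a gate that sits between LLM generation and deployment. The sketch below is an illustration under stated assumptions, not the paper's pipeline: `scan` implements a single toy rule (an `img` element without an `alt` attribute) using Python's standard `html.parser`, and `gate_patch` with its three verdicts is a hypothetical policy. A real system would substitute a full rule-based checker.

```python
from html.parser import HTMLParser


class _Scanner(HTMLParser):
    """Collects the start-tag sequence and a toy set of rule violations."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: list[str] = []
        self.violations: list[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        attr_names = {name for name, _ in attrs}
        # Toy rule (assumption): every <img> needs an alt attribute.
        if tag == "img" and "alt" not in attr_names:
            self.violations.append("image-alt")


def scan(html: str):
    """Return (start-tag sequence, violations) for an HTML string."""
    scanner = _Scanner()
    scanner.feed(html)
    scanner.close()
    return scanner.tags, scanner.violations


def gate_patch(original: str, patched: str) -> str:
    """Hypothetical policy: reject regressions, flag structural changes."""
    tags_before, viol_before = scan(original)
    tags_after, viol_after = scan(patched)
    if len(viol_after) > len(viol_before):
        return "reject"   # the patch made things worse
    if tags_after != tags_before:
        return "review"   # element structure changed; ask a human
    return "accept"


original = '<div><img src="a.png"></div>'
print(gate_patch(original, '<div><img src="a.png" alt="Logo"></div>'))    # accept
print(gate_patch(original, '<main><img src="a.png" alt="Logo"></main>'))  # review
```

Routing structure-changing patches to review rather than rejecting them outright is a design choice: some legitimate fixes, such as adding landmark regions, necessarily alter the element tree.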

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-world testing with actual users who have disabilities would show if the partial fixes translate to better usability.
Embedding these LLM agents into existing accessibility scanning tools could lower the cost of initial repairs.
Models trained specifically on accessibility guidelines might handle layout violations better than general LLMs.

Load-bearing premise

That the chosen web files, violation types, and automated metrics accurately reflect real-world accessibility issues, and that the measured improvements translate into better usability for people with disabilities.

What would settle it

Observing full resolution of all violations in more than 50 percent of test files without structural changes or new violations introduced by the LLM would falsify the claim that LLMs are insufficient for complete remediation.
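
The criterion above can be expressed as a simple check over per-file outcomes. This is a minimal sketch: the `FileResult` fields and the sample data are hypothetical, and only the 50 percent threshold comes from the text.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    remaining_violations: int   # violations left after the LLM patch
    new_violations: int         # violations the patch introduced
    structural_change: bool     # whether the patch altered page structure


def clean_resolution_rate(results: list[FileResult]) -> float:
    """Share of files fully fixed with no side effects."""
    clean = sum(
        1
        for r in results
        if r.remaining_violations == 0
        and r.new_violations == 0
        and not r.structural_change
    )
    return clean / len(results)


def claim_falsified(results: list[FileResult], threshold: float = 0.5) -> bool:
    """True if clean resolutions exceed the threshold stated in the text."""
    return clean_resolution_rate(results) > threshold


# Hypothetical outcomes for four files (illustration only).
sample = [
    FileResult(0, 0, False),
    FileResult(2, 0, False),
    FileResult(0, 1, False),
    FileResult(0, 0, True),
]
print(clean_resolution_rate(sample))  # 0.25
print(claim_falsified(sample))        # False
```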

Figures

Figures reproduced from arXiv: 2605.27716 by Diego Elias Costa, Durjoy Dey, Ghada Abushaqra, Oluwatoyosi Oyelayo, Parham Asadi.

**Figure 1.** Overview of the dual-pipeline architecture combining rule-based and LLM-based accessibility analysis.

**Figure 3.** Agent iteration effectiveness across iterations.

**Figure 5.** Before vs. fixed counterpart on uploaded pages.

**Figure 4.** Figure 4 visualizes their distribution across the dataset. The results show that violations are dominated by a small set of recurring issues, particularly region (landmark structure), color-contrast (WCAG 1.4.3), link-name and button-name (WCAG 4.1.2), and image-alt (WCAG 1.1.1). These findings indicate that accessibility issues are primarily concentrated in semantic labeling and structural organization, rath…

**Figure 6.** Distribution of accessibility violations across

Original abstract

Ensuring web accessibility at scale remains challenging because rule-based tools provide limited coverage while manual remediation is costly and error-prone. This paper evaluates large language model based agents, specifically Kimi K2.5, for automated accessibility detection and repair compared with rule-based approaches. For detection, the LLM achieves performance comparable to rule-based tools, with F1 around 0.65, strong semantic understanding with F1 of 0.83, but lower reliability for syntactic and layout-related violations. For remediation, LLM-generated fixes are syntactically valid in over 99.7 percent of cases and improve accessibility compliance in 80.2 percent of instances, reducing violations from 3.98 to 1.7 per file. However, fewer than 26 percent of cases are fully resolved, and about 30 percent of patches introduce structural changes. We also find that iterative agent-based refinement increases computational cost by 52 percent and API usage by 1.64 times without improving remediation outcomes. These findings indicate that while LLMs are effective for partial accessibility repair, they are insufficient for complete and reliable remediation. Scalable accessibility solutions require hybrid approaches that combine LLM capabilities with rule-based validation and constraint-aware correction mechanisms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LLMs like Kimi K2.5 deliver partial accessibility fixes with concrete numbers on detection and cost, but full resolution stays low and iteration adds expense without gains.

There are two things to know about this paper. First, Kimi K2.5 achieves detection performance comparable to rule-based tools, with F1 scores of about 0.65 overall and 0.83 on semantic issues. Second, its remediation improves compliance in 80.2% of cases and reduces average violations per file from 3.98 to 1.7, but it fully resolves fewer than 26% of cases, and iteration adds cost without improving outcomes.

What is actually new here is the combined evaluation of detection and remediation using this LLM agent, along with the analysis of computational costs from iterative refinement. The paper does well by reporting specific percentages on syntactic validity at over 99.7%, the rate of structural changes at about 30%, and the cost increases of 52% compute and 1.64 times API usage without outcome gains. These measurements provide useful benchmarks that build on prior rule-based and early AI approaches with direct observations.

The soft spots center on reporting gaps and metric validity. The abstract does not specify dataset size, selection method, or any statistical tests, which leaves open the possibility of bias in the chosen files or sensitivity to prompts. The push for hybrid rule-based and LLM systems is reasonable given the partial success rates, but the stress-test note is correct that automated violation counts may not correspond to actual usability improvements for disabled users. No user study or correlation analysis is mentioned to show that the measured changes matter in practice.

This paper is for researchers focused on web accessibility automation and LLM applications in software engineering. A reader interested in empirical data on what current models can achieve in this area would get value from the reported outcomes and limitations.

I recommend sending this to peer review. The empirical results are concrete enough to merit referee feedback on the methods and generalizability.

Referee Report

2 major / 2 minor

Summary. The paper evaluates the LLM Kimi K2.5 for web accessibility violation detection and remediation on web files, reporting F1 scores of ~0.65 overall (0.83 on semantic issues) comparable to rule-based tools, with remediation fixes being syntactically valid in >99.7% of cases, improving compliance in 80.2% of instances (reducing violations from 3.98 to 1.7 per file), but fully resolving <26% of cases. Iterative refinement increases cost by 52% and API usage by 1.64x without outcome gains. The central claim is that LLMs enable partial repair but are insufficient for complete/reliable remediation, necessitating hybrid LLM + rule-based systems.

Significance. If the empirical measurements hold, the work supplies concrete before-and-after violation counts, validity rates, and cost multipliers that quantify LLM limitations in accessibility repair, supporting the case for hybrid approaches. The direct reporting of percentages (80.2% improvement, <26% full resolution) and the comparison of single vs. iterative agent use are strengths that could inform tool design, though generalizability depends on unstated dataset details.

major comments (2)

[Abstract / Methods] Abstract and Methods (dataset description): The central claim that LLMs are 'insufficient for complete and reliable remediation' rests on the observed drop to 1.7 violations/file and <26% full resolution, yet the abstract and reported experiments provide no dataset size, file selection criteria, number of violation types, or statistical tests; this leaves the 80.2% improvement rate vulnerable to selection bias and undermines the empirical grounding for requiring hybrid systems.
[Evaluation / Remediation results] Evaluation section (remediation metrics): The conclusion that automated fixes do not achieve reliable remediation relies exclusively on rule-based checker outputs (violation counts, F1); no user study, task-performance measure, or correlation analysis with actual usability for screen-reader or low-vision users is reported, so the claim that <26% full resolution indicates insufficiency for real accessibility gains lacks direct evidence.
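
To illustrate why the missing dataset size matters for the first comment, the sketch below computes a Wilson score interval for the reported 80.2% improvement rate under several hypothetical sample sizes. The values of `n` are assumptions for illustration only; the actual number of files is not stated in the abstract.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical dataset sizes; the paper's real n is not given here.
for n in (50, 500, 5000):
    low, high = wilson_interval(0.802, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as `n` grows, so the same 80.2% supports a much weaker claim on a few dozen files than on several thousand.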

minor comments (2)

[Abstract] Abstract: 'F1 around 0.65' and 'F1 of 0.83' should be replaced with exact values and the number of instances or files evaluated to improve precision.
[Methods] The paper should clarify the prompting strategy and model version details (Kimi K2.5) in the methods to allow replication of the detection and remediation agents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical evaluation of LLM-based accessibility repair. The comments highlight important areas for improving transparency and acknowledging limitations. We address each major comment below and indicate planned revisions.

Referee: [Abstract / Methods] Abstract and Methods (dataset description): The central claim that LLMs are 'insufficient for complete and reliable remediation' rests on the observed drop to 1.7 violations/file and <26% full resolution, yet the abstract and reported experiments provide no dataset size, file selection criteria, number of violation types, or statistical tests; this leaves the 80.2% improvement rate vulnerable to selection bias and undermines the empirical grounding for requiring hybrid systems.

Authors: We agree that the abstract and Methods section would benefit from more explicit reporting of dataset characteristics to allow readers to assess potential selection bias. The full manuscript describes the experimental files and violation types in the Evaluation section, but we will revise the abstract to note the dataset scale and expand Methods with a dedicated paragraph on file selection criteria, the specific violation types covered, and any statistical tests applied. These additions will be incorporated in the next version.

revision: yes

Referee: [Evaluation / Remediation results] Evaluation section (remediation metrics): The conclusion that automated fixes do not achieve reliable remediation relies exclusively on rule-based checker outputs (violation counts, F1); no user study, task-performance measure, or correlation analysis with actual usability for screen-reader or low-vision users is reported, so the claim that <26% full resolution indicates insufficiency for real accessibility gains lacks direct evidence.

Authors: We acknowledge that the evaluation relies solely on automated rule-based metrics and does not include user studies measuring actual usability for screen-reader or low-vision users. Violation counts and F1 scores are standard proxies in accessibility literature, but we agree this does not directly demonstrate real-world accessibility gains. We will add a new paragraph in the Discussion section that explicitly states this limitation, qualifies the insufficiency claim as based on automated metrics, and suggests future user studies to validate practical impact. This constitutes a partial revision as we cannot add new empirical user data at this stage.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or fitted predictions

The paper reports direct experimental outcomes from running LLMs on web files: F1 scores for detection, violation counts before/after remediation, syntactic validity rates, and cost metrics. No equations, parameters fitted to subsets then re-predicted, self-citation chains, or ansatzes appear in the abstract or described full text. All claims rest on observed counts and percentages from the chosen checkers and files, which are external to any model defined inside the paper. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical study that relies on standard evaluation practices in software engineering without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (2)

domain assumption: F1 score and per-file violation counts are valid proxies for accessibility detection and remediation quality. Central to all reported performance numbers in the abstract.
domain assumption: The evaluated web files are representative of typical accessibility issues encountered in practice. Required to generalize the 80.2% improvement and <26% full-resolution claims.

pith-pipeline@v0.9.1-grok · 5765 in / 1363 out tokens · 60514 ms · 2026-06-29T15:21:14.171042+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 6 canonical work pages · 5 internal anchors

[1] Santos, J., et al. (2024). AccessGuru: Leveraging LLMs to detect and correct web accessibility violations in HTML code. arXiv.
[2] Kumar, R., et al. (2024). AI-enhanced web form development: Tackling accessibility barriers with generative technologies. IEEE.
[3] Williams, J., et al. (2024). Improving web accessibility with an LLM-based tool: A preliminary evaluation for STEM images. ACM CHI.
[4] Zhang, Y., et al. (2024). Designing for inclusion: Human-centered multi-agent accessibility repair in modern web applications. ACM CHI.
[5] Garcia, M., et al. (2024). How an LLM can improve automatic web accessibility validation? arXiv.
[6] Chen, L., et al. (2024). An assessment of LLM-based auditing and validation for web accessibility. arXiv.
[7] García, S., et al. (2021). Wrappers for web access logs feature selection. World Wide Web, 24, 1875–1898.
[8] Wu, J., Ross, A. S., & Zettlemoyer, L. (2024). Large Language Models as Accessibility Auditors: Evaluating LLMs for WCAG Compliance Detection.
[9] Li, Y., Bigham, J. P., & Ladner, R. E. (2023). AI-Assisted Web Accessibility Repair: Generating Accessible Code with Language Models.
[10] Sunkara, S., Gurari, D., & Grauman, K. (2024). Automatic Alt-Text Generation for Web Images Using Vision-Language Models.
[11] T. B. Brown, B. Mann, N. Ryder, et al., "Language Models are Few-Shot Learners," Advances in Neural Information Processing Systems (NeurIPS), 2020. Available: https://arxiv.org/abs/2005.14165
[12] OpenAI, "GPT-4 Technical Report," 2023. Available: https://arxiv.org/abs/2303.08774
[13] M. Chen, J. Tworek, H. Jun, et al., "Evaluating Large Language Models Trained on Code," 2021. Available: https://arxiv.org/abs/2107.03374
[14] Y. Li et al., "Competition-Level Code Generation with AlphaCode," Science, vol. 378, no. 6624, pp. 1092–1097, 2022. Available: https://www.science.org/doi/10.1126/science.abq1158
[15] S. Yao, J. Zhao, D. Yu, et al., "ReAct: Synergizing Reasoning and Acting in Language Models," 2023. Available: https://arxiv.org/abs/2210.03629
[16] T. Schick, J. Dwivedi-Yu, R. Dessì, et al., "Toolformer: Language Models Can Teach Themselves to Use Tools," 2023. Available: https://arxiv.org/abs/2302.04761
