Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

Jianwei Tai

arxiv: 2605.25889 · v4 · pith:2MSX6JZ5new · submitted 2026-05-25 · 💻 cs.CR · cs.LG


Pith reviewed 2026-06-29 21:27 UTC · model grok-4.3


keywords: vision-language-action models · adversarial robustness · information-theoretic bound · data processing inequality · mutual information · capability-robustness trade-off · adversarial perturbations


The pith

For any vision-language-action policy, capability and robustness sum to at most task entropy plus adversarial channel capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that, for any VLA policy, capability plus robustness cannot exceed the entropy of the optimal action plus the mutual information between clean and perturbed inputs. Capability is defined as the mutual information between optimal and policy actions; robustness is the mutual information between clean and perturbed policy actions, minus the mutual information between the policy action and the perturbation. The proof follows from two applications of the data processing inequality to the relevant Markov chains in the information flow. This bound supplies a theoretical explanation for why small input perturbations cause large drops in success rate. The pixel-level bound is loose by about 10^3 nats; an encoder-specific corollary tightens it by over an order of magnitude, into a regime where realized capability already consumes 5-9% of the budget. The same construction produces three label-free diagnostics that apply across discrete-token, L1-regression, and flow-matching policies.

Core claim

For any VLA policy, capability I(A*;Aπ) and robustness I(Aπ;Ãπ)−I(Aπ;δ) sum to at most H(A*)+I(X;X̃). The proof reduces to two applications of the Data Processing Inequality.
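The chain of inequalities behind this claim can be written out as below. This is a sketch reconstructed from the statement above and the proof fragments quoted on this page, not the paper's verbatim proof. It assumes a discrete A⋆ (so that H(A⋆ | Aπ) ≥ 0) and the Markov chain Aπ → X → X̃ → Ãπ.

```latex
\begin{align*}
I(A^\star; A_\pi) &= H(A^\star) - H(A^\star \mid A_\pi) \le H(A^\star), \\
I(A_\pi; \tilde{A}_\pi) &\le I(X; \tilde{A}_\pi) \le I(X; \tilde{X})
  && \text{(DPI twice on } A_\pi \to X \to \tilde{X} \to \tilde{A}_\pi\text{)}, \\
I(A_\pi; \tilde{A}_\pi) - I(A_\pi; \delta) &\le I(A_\pi; \tilde{A}_\pi)
  && \text{(since } I(A_\pi; \delta) \ge 0\text{)}, \\
\underbrace{I(A^\star; A_\pi)}_{\text{capability}}
  + \underbrace{I(A_\pi; \tilde{A}_\pi) - I(A_\pi; \delta)}_{\text{robustness}}
  &\le H(A^\star) + I(X; \tilde{X}).
\end{align*}
```

The slack S is the right-hand side minus the left-hand side of the last line, so S ≥ 0 whenever the assumptions hold.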

What carries the argument

Two successive applications of the data processing inequality to the Markov chains linking optimal actions to policy actions and policy actions to perturbed actions under input perturbation.

Load-bearing premise

The information flow inside the VLA policy obeys the Markov chain conditions required for the data processing inequality to apply to the pairs of actions and perturbations.

What would settle it

A single VLA policy in which the measured sum of capability and robustness exceeds H(A*) + I(X; X̃) while the Markov conditions hold would falsify the bound.
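To make the falsification test concrete, the following sketch computes both sides of the bound exactly on a small discrete toy system and reports the slack. Everything in it is a hypothetical illustration rather than the paper's setup: the function names, the uniform A⋆, the noisy observation X, the additive perturbation δ (mod k), and the default identity policy are all assumptions chosen so that the Markov conditions hold by construction.

```python
import itertools
import math
from collections import defaultdict


def mutual_information(joint):
    """Exact I(U;V) in nats from a dict {(u, v): probability}."""
    p_u, p_v = defaultdict(float), defaultdict(float)
    for (u, v), p in joint.items():
        p_u[u] += p
        p_v[v] += p
    return sum(p * math.log(p / (p_u[u] * p_v[v]))
               for (u, v), p in joint.items() if p > 0)


def entropy(marginal):
    """Shannon entropy in nats of a dict {value: probability}."""
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)


def toy_bound_terms(k=4, obs_noise=0.1, attack_prob=0.3, policy=lambda x: x):
    """Return (lhs, rhs, slack) for a hypothetical discrete toy system."""
    joints = {name: defaultdict(float) for name in ("cap", "rob", "dep", "chan")}
    p_a_star = defaultdict(float)
    for a_star, x, delta in itertools.product(range(k), repeat=3):
        # A* is uniform; X is a noisy copy of A*; delta is independent of both.
        p_x = 1 - obs_noise if x == a_star else obs_noise / (k - 1)
        p_delta = 1 - attack_prob if delta == 0 else attack_prob / (k - 1)
        p = p_x * p_delta / k
        x_tilde = (x + delta) % k  # perturbed input
        a_pi, a_pi_tilde = policy(x), policy(x_tilde)
        p_a_star[a_star] += p
        joints["cap"][(a_star, a_pi)] += p      # for I(A*; A_pi)
        joints["rob"][(a_pi, a_pi_tilde)] += p  # for I(A_pi; A~_pi)
        joints["dep"][(a_pi, delta)] += p       # for I(A_pi; delta)
        joints["chan"][(x, x_tilde)] += p       # for I(X; X~)
    capability = mutual_information(joints["cap"])
    robustness = (mutual_information(joints["rob"])
                  - mutual_information(joints["dep"]))
    lhs = capability + robustness
    rhs = entropy(p_a_star) + mutual_information(joints["chan"])
    return lhs, rhs, rhs - lhs


if __name__ == "__main__":
    for noise in (0.0, 0.1, 0.4):
        lhs, rhs, slack = toy_bound_terms(obs_noise=noise)
        print(f"obs_noise={noise:.1f}  LHS={lhs:.4f}  RHS={rhs:.4f}  slack={slack:.4f}")
```

In this toy the slack should be non-negative up to floating-point error. A negative slack on a system where the Markov conditions demonstrably hold would be the kind of counterexample described above. For real policies the mutual-information terms have to be estimated, so estimator bias would need to be ruled out before reading a negative value as a violation.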

Figures

Figures reproduced from arXiv: 2605.25889 by Jianwei Tai.

**Figure 1.** Theorem 1 schematic: capability + robustness (LHS) is bounded by task entropy + attack channel capacity (RHS); the gap is slack S ≥ 0. (i) Capability. By definition of mutual information, I(A⋆; Aπ) = H(A⋆) − H(A⋆ | Aπ) ≤ H(A⋆), with equality iff H(A⋆ | Aπ) = 0, i.e. A⋆ is determined by Aπ (equivalently, I(A⋆; Aπ) = H(A⋆)). (ii) Coupling under attack. We claim I(Aπ; Ãπ) ≤ I(X; X̃). Two applic…

**Figure 2.** P7 Gaussian-VLA: median slack vs. ε across 252 cells; bands show IQR. Both S_a and S_m stay non-negative and decrease monotonically (Cor. 1). Result. S_a ≥ 0 in 252/252 cells and S_m ≥ 0 in 252/252 cells. After seed-averaging the 84 grouped configs, all have S_a ≥ 0. After Holm–Bonferroni correction across 84 groups, 52/84 (62%) reach significance at α = 0.05 for the alternative S > 0 (one-sided t-test on per-s…

**Figure 3.** OpenVLA-7B on LIBERO. Left: realized Rob_disc ∈ [0.88, 1.53] nats across 48 cells; lighter dots are per-seed values. Right: per-suite bound terms at ε = 8/255 with PCA-tightened RHS; RHS exceeds LHS by ∼3–4 orders of magnitude.

**Figure 4.** Cumulative bound vs. horizon T on all four LIBERO suites at ε = 8/255. Top: RHS_T matches T · S_1 within 0.03%. Bottom: cumulative LHS stays four orders below RHS at every horizon. […] where ϕ is OpenVLA's frozen DINOv2+SigLIP encoder (d = 2176 pooled features) using InfoNCE (Poole et al. 2019) after PGD on Spatial at ε = 8. By DPI, I_NCE ≤ I(ϕ(X); ϕ(X̃)) ≤ I(X; X̃), so InfoNCE is a lower bound on the encoder-cha…

**Figure 5.** Defense comparison on Spatial across ε ∈ {2, 4, 8, 16}/255, five policies. Left: realized Rob_disc. Right: encoder-PCA channel ceiling S_enc^def. The two panels are orthogonal axes (realized robustness vs. mechanism diagnostic). Spatial suite shown; Object and Goal cells in …
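The text accompanying Figure 4 uses InfoNCE as a lower bound on the mutual information between clean and perturbed encoder features. A minimal sketch of such an estimate is given below, assuming a fixed cosine-similarity critic with a hypothetical `temperature` parameter. The paper's actual critic and training procedure are not stated on this page, so the function name and all settings here are illustrative only.

```python
import math
import random


def infonce_lower_bound(z_clean, z_adv, temperature=0.1):
    """InfoNCE estimate in nats for K paired feature vectors.

    Uses a fixed cosine-similarity critic (an assumption, not the paper's).
    The estimate can never exceed log(K).
    """
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v))
        return [c / norm for c in v]

    zc = [unit(v) for v in z_clean]
    za = [unit(v) for v in z_adv]
    k = len(zc)
    total = 0.0
    for i in range(k):
        # Critic scores of clean sample i against every perturbed sample.
        scores = [sum(a * b for a, b in zip(zc[i], za[j])) / temperature
                  for j in range(k)]
        # Numerically stable log-sum-exp over the row.
        m = max(scores)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_sum  # log-softmax of the matching pair
    return total / k + math.log(k)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(32)]
    # Stand-in for perturbed features: clean features plus small noise.
    adv = [[c + 0.05 * rng.gauss(0, 1) for c in v] for v in clean]
    print(f"InfoNCE estimate: {infonce_lower_bound(clean, adv):.3f} nats "
          f"(cap log K = {math.log(len(clean)):.3f})")
```

Because the estimate saturates at log K, a batch of K pairs can only certify up to log K nats of shared information. This is one reason such an estimate serves as a lower bound and not as a measurement of the channel term.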

Original abstract

Vision-Language-Action (VLA) models reach high success rates on clean inputs but collapse under small adversarial perturbations: a $16/255$ PGD attack drops OpenVLA-7B's LIBERO success from $95\%$ to under $5\%$. Whether this trade-off has a theoretical floor was open. We prove that it does. For any VLA policy, capability $I(A^\star;A_\pi)$ and robustness $I(A_\pi;\tilde{A}_\pi)-I(A_\pi;\delta)$ sum to at most $H(A^\star)+I(X;\tilde{X})$, the task entropy plus adversarial channel capacity. The proof reduces to two applications of the Data Processing Inequality. The pixel-level bound is loose by $\sim 10^3$ nats and serves as a ceiling guarantee; an encoder-specific corollary tightens it by over an order of magnitude, into a regime where realized capability already consumes $5$--$9\%$ of the budget. We validate Theorem 1 with zero violations across $308$ cells: $252$ closed-form Gaussian-VLA, $48$ OpenVLA-7B$+$LIBERO$+$PGD ($4$ suites $\times$ $4$ $\varepsilon$ $\times$ $3$ seeds), $4$ Square-Attack, and $4$ multi-step ($T{=}10$). A complementary measurability inequality $\mathrm{Rob}_{\text{disc}} \le \mathrm{Cap}_{\text{disc}}$ further holds across $144$ cross-architecture cells spanning OpenVLA, OpenVLA-OFT (continuous-$L_1$), and SmolVLA (flow-matching). The same construction yields three label-free diagnostics: a pre-flight encoder ceiling, a defense-forensics probe that localizes input-side vs. language-model intervention, and a head-agnostic robustness ratio comparable across discrete-token, $L_1$-regression, and flow-matching policies. Together these provide the cross-setting axis defense and architecture comparisons currently lack.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a DPI-derived upper bound on VLA capability plus robustness with zero violations in 308 cells, but the Markov chain assumptions need explicit verification.


The main takeaway is a new upper bound showing that for any VLA policy, capability I(A*;Aπ) plus the robustness term I(Aπ;Ãπ)−I(Aπ;δ) is at most H(A*)+I(X;X̃). The proof uses two standard DPI steps, and they report no violations across closed-form Gaussians, real OpenVLA-7B on LIBERO under PGD and square attacks, multi-step cases, and cross-architecture checks.

What is actually new is the VLA-specific application, the encoder corollary that tightens the bound enough to be relevant (realized capability already uses 5-9% of it), and the three label-free diagnostics for comparing architectures without ground truth. The empirical volume is higher than typical for this style of result.

The soft spot is the Markov chain justification. The bound requires A*—X—Aπ (or similar) for the first DPI and an analogous structure once δ perturbs the input. In a VLA, A* is the expert action for the task while X is the observed image; nothing in the setup automatically gives the conditional independence needed. If the paper defines the exact chains and shows they hold, the claim is fine; otherwise the theoretical floor rests on that step. The pixel-level bound being loose by ~10^3 nats is already noted, so the value sits in the corollary and diagnostics.

This is for researchers working on robust VLA design or information-theoretic limits in control. A reader who wants a concrete floor to benchmark defenses or architecture choices will find usable material. The combination of a fresh bound and broad empirical support is enough to send it to peer review.

Referee Report

1 major / 1 minor

Summary. The paper claims to derive an information-theoretic upper bound on the sum of capability I(A*;Aπ) and robustness I(Aπ;Ãπ)−I(Aπ;δ) for any VLA policy, showing it is at most H(A*)+I(X;X̃). The derivation is stated to reduce to two applications of the Data Processing Inequality. The bound is presented as a loose pixel-level ceiling guarantee, with a tighter encoder-specific corollary, and is supported by zero violations across 308 empirical cells (Gaussian simulations, OpenVLA-7B on LIBERO with PGD, Square-Attack, multi-step, and cross-architecture checks) plus a measurability inequality Rob_disc ≤ Cap_disc.

Significance. If the bound and its DPI derivation hold, the result supplies a theoretical floor to the observed capability-robustness trade-off in VLAs, together with three label-free diagnostics (encoder ceiling, input-vs-LM forensics probe, head-agnostic robustness ratio) that apply across discrete, L1-regression, and flow-matching policies. The empirical coverage across real models and attack types strengthens the practical utility of the ceiling guarantee.

major comments (1)

[Theorem 1 (thm:main)] Theorem 1 (thm:main): the two DPI steps require explicit Markov chains (e.g., A*—X—Aπ for the capability term and Aπ—X—X̃—Ãπ or Aπ—(X,δ)—Ãπ for the subtracted robustness term). The manuscript does not define or verify these chains; in a VLA, A* is the expert action while X is the observed image, so no automatic conditional independence places A* upstream of Aπ given X. The same gap appears once δ perturbs X to X̃. Without these chains the DPI applications are not justified and the stated bound does not follow.

minor comments (1)

The abstract states the pixel-level bound is loose by ~10^3 nats; a short quantitative comparison of the encoder-specific corollary versus the pixel bound would help readers assess tightness.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the need to make the Markov chain assumptions explicit in the proof of Theorem 1. We address the comment below and will revise the manuscript accordingly.


Referee: Theorem 1 (thm:main): the two DPI steps require explicit Markov chains (e.g., A*—X—Aπ for the capability term and Aπ—X—X̃—Ãπ or Aπ—(X,δ)—Ãπ for the subtracted robustness term). The manuscript does not define or verify these chains; in a VLA, A* is the expert action while X is the observed image, so no automatic conditional independence places A* upstream of Aπ given X. The same gap appears once δ perturbs X to X̃. Without these chains the DPI applications are not justified and the stated bound does not follow.

Authors: We agree that the Markov chains should be stated explicitly for clarity. In the VLA setting the policy π is a (possibly stochastic) mapping from the observation X (and language instruction) to the action Aπ only. Consequently Aπ is conditionally independent of A* given X, establishing the Markov chain A* — X — Aπ. The DPI then directly yields I(A*; Aπ) ≤ I(A*; X) ≤ H(A*). For the robustness term, X̃ is obtained from X via the perturbation δ and Ãπ = π(X̃). The resulting chain Aπ — X — X̃ — Ãπ holds because both actions are generated solely from their respective inputs, so I(Aπ; Ãπ) ≤ I(X; X̃). The subtracted term I(Aπ; δ) accounts for any direct dependence on the perturbation; combining the two DPI applications produces the claimed bound. We will add a dedicated paragraph (and optional diagram) in the proof of Theorem 1 that defines these chains and verifies the conditional independences from the policy definition. revision: yes
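The conditional independence invoked in this response can be checked numerically on small examples, because U → V → W is a Markov chain exactly when the conditional mutual information I(U; W | V) is zero. The sketch below uses two hypothetical binary toy policies (not the paper's models): one that reads only the observation X, and one that has side access to A⋆. All names and the noise level are assumptions made for illustration.

```python
import math
from collections import defaultdict


def conditional_mi(joint):
    """I(U; W | V) in nats from a dict {(u, v, w): probability}.

    The value is zero exactly when U -> V -> W is a Markov chain.
    """
    p_v, p_uv, p_vw = defaultdict(float), defaultdict(float), defaultdict(float)
    for (u, v, w), p in joint.items():
        p_v[v] += p
        p_uv[(u, v)] += p
        p_vw[(v, w)] += p
    return sum(p * math.log(p * p_v[v] / (p_uv[(u, v)] * p_vw[(v, w)]))
               for (u, v, w), p in joint.items() if p > 0)


def build_joint(policy, noise=0.2):
    """Joint over (A*, X, A_pi): A* is a fair bit, X flips it w.p. `noise`."""
    joint = defaultdict(float)
    for a_star in (0, 1):
        for x in (0, 1):
            p = 0.5 * ((1 - noise) if x == a_star else noise)
            joint[(a_star, x, policy(a_star, x))] += p
    return joint


# Policy that sees only X: the chain A* -> X -> A_pi holds.
markov_ok = conditional_mi(build_joint(lambda a_star, x: x))
# Policy with side access to A*: the chain is broken.
markov_broken = conditional_mi(build_joint(lambda a_star, x: a_star))

if __name__ == "__main__":
    print(f"I(A*; A_pi | X), policy reads X only: {markov_ok:.4f} nats")
    print(f"I(A*; A_pi | X), policy reads A*:     {markov_broken:.4f} nats")
```

The second case illustrates the referee's concern: if any path from A⋆ to Aπ bypasses X (for example, shared state that the observation does not capture), the conditional mutual information is positive and the DPI step for that chain no longer applies as stated.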

Circularity Check

0 steps flagged

Standard DPI application with no self-referential reduction


The central claim is obtained by applying the Data Processing Inequality (a standard external theorem) twice under explicitly stated Markov chain assumptions on the VLA information flow. No parameters are fitted to data and renamed as predictions, no self-citation chain supports the bound, and the result is not equivalent to its inputs by definition or renaming. The derivation remains independent of the paper's own experimental validations or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the Data Processing Inequality (standard information theory) and the assumption that VLA policies induce the required Markov chains; no free parameters or new entities are introduced.

axioms (1)

standard math: the Data Processing Inequality applies to the relevant Markov chains in any VLA policy
The proof reduces to two applications of the Data Processing Inequality.

pith-pipeline@v0.9.1-grok · 5908 in / 1277 out tokens · 35545 ms · 2026-06-29T21:27:29.931894+00:00 · methodology

