GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

Hongyuan Zhang; Yuan Yuan; Yujun Li

arxiv: 2605.03403 · v2 · pith:KZPVITFKnew · submitted 2026-05-05 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-07 17:59 UTC · model grok-4.3

keywords test-time · adaptation · models · optimization · grpo · grpo-tta · policy · group

The pith

GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models such as CLIP match images to text prompts but often lose accuracy when test images come from a different distribution than the training data. Test-time adaptation tries to adjust the model using only the incoming test images, with no ground-truth labels. The authors take an existing reinforcement learning technique, Group Relative Policy Optimization (GRPO), originally used for post-training large language models and vision-language models, and modify it for this unlabeled setting. They sample the top-K most likely classes from the model's similarity scores to form a group, then tune the visual encoder with two kinds of rewards: an alignment reward that encourages the model to align better with the image content, and a dispersion reward that promotes diversity among the group's predictions. This turns adaptation into a policy optimization problem that runs at test time. Experiments on multiple benchmarks show the approach beats existing test-time adaptation techniques, with larger gains under natural distribution shifts.
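The group-construction step described above can be illustrated with a minimal sketch. It assumes pre-computed image and text embeddings and a softmax over scaled cosine similarities; the function name `topk_candidate_group` and the default values of `k` and `temperature` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def topk_candidate_group(image_feat, text_feats, k=3, temperature=0.01):
    """Return the top-k class indices and their renormalized probabilities.

    image_feat: (D,) image embedding; text_feats: (C, D) one embedding per class.
    k and temperature are assumed values chosen for illustration only.
    """
    # Cosine similarity requires unit-length embeddings.
    image_feat = image_feat / np.linalg.norm(image_feat)
    text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Scaled similarities act as logits; subtract the max for numerical stability.
    logits = text_feats @ image_feat / temperature
    logits = logits - logits.max()
    probs = np.exp(logits) / np.exp(logits).sum()

    # The k most probable classes form the group; no label is needed.
    top_idx = np.argsort(-probs)[:k]
    group_probs = probs[top_idx] / probs[top_idx].sum()
    return top_idx, group_probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    idx, p = topk_candidate_group(rng.normal(size=8), rng.normal(size=(5, 8)))
    print(idx, p)
```

Restricting the group to the top-K classes keeps the candidate set small and focused on plausible labels, which is presumably why the premise below leans so heavily on the quality of the CLIP similarity distribution.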

Core claim

GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.

Load-bearing premise

That constructing output groups by sampling top-K class candidates from CLIP similarity distributions enables effective probability-driven optimization without ground-truth labels, and that the designed alignment and dispersion rewards guide effective visual encoder tuning at test time.

Figures

Figures reproduced from arXiv: 2605.03403 by Hongyuan Zhang, Yuan Yuan, Yujun Li.

**Figure 1.** Performance comparison of different methods. Across ImageNet variants (left) and cross-domain datasets (right), GRPO-TTA consistently surpasses existing test-time adaptation methods.

**Figure 2.** The proposed GRPO-TTA framework. First, a single test image is augmented to generate multiple views, and confident views with low-entropy predictions are selected. Then, for each selected view, we sample the top-K classes and calculate their probability distributions. Finally, advantages are computed using CLIP similarity and the probability distributions generated by the similarit…

**Figure 3.** Ablation studies. (Left) Performance on three datasets with varying scale factor λ in Eq. (10); (Middle) Performance on three datasets with different sampling factors K; (Right) Performance on three datasets with different TTA steps.
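The view-selection step mentioned in the Figure 2 caption (keep the augmented views whose predictions have low entropy) can be sketched as follows. This is a generic entropy filter under stated assumptions: the function name `select_confident_views` and the `keep_ratio` parameter are hypothetical, and the paper's actual selection rule and threshold are not given in this summary.

```python
import numpy as np

def select_confident_views(view_probs, keep_ratio=0.1):
    """Pick the augmented views with the lowest predictive entropy.

    view_probs: (V, C) class probabilities, one row per augmented view.
    keep_ratio: fraction of views to keep (an assumed, illustrative value).
    """
    view_probs = np.asarray(view_probs, dtype=float)

    # Shannon entropy per view; the small constant guards against log(0).
    entropy = -(view_probs * np.log(view_probs + 1e-12)).sum(axis=1)

    # Keep at least one view, ordered from most to least confident.
    n_keep = max(1, int(round(keep_ratio * len(view_probs))))
    keep = np.argsort(entropy)[:n_keep]
    return keep, entropy

if __name__ == "__main__":
    views = [[0.98, 0.01, 0.01],
             [0.34, 0.33, 0.33],
             [0.70, 0.20, 0.10],
             [0.50, 0.50, 0.00]]
    keep, entropy = select_confident_views(views, keep_ratio=0.5)
    print(keep, entropy)
```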

Original abstract

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. This raises the question of whether GRPO can also significantly improve the test-time adaptation (TTA) of vision-language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.
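For readers unfamiliar with GRPO, its defining step is to score each member of a group relative to the group itself: rewards are centered on the group mean and scaled by the group standard deviation, so no separate value network is needed. The sketch below shows only that normalization; how GRPO-TTA computes the rewards and applies the resulting advantages to the visual encoder is not specified in this summary, and the function name is an assumption.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group, as in standard GRPO.

    rewards: (K,) one scalar reward per group member.
    eps: small constant that avoids division by zero when all rewards are equal.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

if __name__ == "__main__":
    # Members above the group mean get positive advantages, those below get negative ones.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
```

When every candidate in a group receives the same reward, all advantages are zero and the group contributes no learning signal.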

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The method depends on a small number of hyperparameters for sampling and reward balancing plus domain assumptions about the utility of CLIP similarity scores for unsupervised group construction; no new physical entities are postulated.

free parameters (2)

K (number of top class candidates)
Controls the size of output groups sampled from similarity distributions; value not specified in abstract.
Reward weighting coefficients
Balance between alignment and dispersion reward terms; values not reported in abstract.
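To make the role of a reward weighting coefficient concrete, here is a minimal sketch of a weighted sum of two per-candidate reward terms. Both reward definitions are illustrative stand-ins and not the paper's: alignment is taken as the image-to-text cosine similarity, and dispersion as one minus a candidate's mean similarity to the other candidates in the group. The name `dispersion_weight` and its default value are likewise assumptions.

```python
import numpy as np

def combined_rewards(image_feat, group_text_feats, dispersion_weight=0.5):
    """Weighted sum of a stand-in alignment reward and a stand-in dispersion reward.

    image_feat: (D,) image embedding.
    group_text_feats: (K, D) text embeddings of the K candidate classes.
    dispersion_weight: hypothetical balancing coefficient.
    """
    img = image_feat / np.linalg.norm(image_feat)
    txt = group_text_feats / np.linalg.norm(group_text_feats, axis=1, keepdims=True)

    # Stand-in alignment reward: cosine similarity between image and candidate text.
    alignment = txt @ img

    # Stand-in dispersion reward: low average similarity to the other candidates.
    pairwise = txt @ txt.T
    k = len(txt)
    # Subtract the self-similarity (1.0 after normalization) before averaging.
    mean_sim_to_others = (pairwise.sum(axis=1) - 1.0) / max(k - 1, 1)
    dispersion = 1.0 - mean_sim_to_others

    return alignment + dispersion_weight * dispersion

if __name__ == "__main__":
    print(combined_rewards(np.array([1.0, 0.0, 0.0]), np.eye(3)))
```

Setting the weight to zero recovers the alignment term alone, which is one simple way to ablate the second term.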

axioms (2)

domain assumption: CLIP similarity distributions provide useful probability signals for constructing class candidate groups without labels.
Invoked when building output groups for policy optimization.
domain assumption: Group-wise relative policy optimization can be directly applied to visual encoder tuning in the test-time setting.
Core reformulation stated in the method description.
