
arxiv: 2507.21584 · v4 · submitted 2025-07-29 · 💻 cs.CV

TARS: MinMax Token-Adaptive Preference Strategy for Hallucination Reduction in MLLMs

Pith reviewed 2026-05-19 02:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords hallucination reduction · multimodal large language models · preference optimization · min-max optimization · token perturbation · spectral alignment · visual grounding

The pith

TARS cuts hallucinations in multimodal models by half through min-max token perturbation in preference optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reformulating direct preference optimization as a min-max problem, with inner maximization perturbing visual-agnostic tokens and outer minimization enforcing visual alignment, substantially reduces hallucinations in MLLMs. It adds a spectral alignment loss based on Fast Fourier Transform to keep hidden representations coherent in the frequency domain. This works with only 4.8k preference samples and no expert feedback, beating standard DPO and even larger-scale data augmentation approaches. A reader would care because it shows a way to improve visual grounding without scaling up training data or relying on costly annotations.

Core claim

TARS reformulates DPO as a principled min-max optimization where the inner maximization selectively perturbs visual-agnostic tokens to induce worst-case distributional shifts, while the outer minimization enforces alignment with causal visual signals rather than surface-level patterns. A novel spectral alignment loss regularizes hidden representations in the frequency domain via FFT to preserve global semantic structure without rigid token-level correspondence. Using 4.8k samples, experiments show hallucination rates falling from 26.4% to 13.2% and cognition scores from 2.5 to 0.4, outperforming standard DPO and surpassing a 5× LLM-based data augmentation baseline trained on 28.8k samples (Hal-Rate 16.0% vs. 13.2%).
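
For orientation, the min-max structure in this claim can be written schematically. The per-sample term is the standard DPO loss; the perturbation set $\Delta(x)$ (restricted to visual-agnostic token positions), the operator $\oplus$, the weight $\lambda$, and the way the spectral term is attached are notational assumptions of this sketch, not equations taken from the paper.

```latex
\min_{\theta}\;
\mathbb{E}_{(x,\, y_w,\, y_l)}
\Bigl[\, \max_{\delta \in \Delta(x)}
\ell_{\mathrm{DPO}}\bigl(\theta;\, x \oplus \delta,\, y_w,\, y_l\bigr) \Bigr]
\;+\; \lambda\, \mathcal{L}_{\mathrm{spec}}(\theta),
\qquad
\ell_{\mathrm{DPO}}(\theta;\, x,\, y_w,\, y_l)
= -\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
```

Here $y_w$ and $y_l$ are the preferred and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ is the usual DPO temperature. Whether the reference model also sees the perturbed context is left open in this sketch.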

What carries the argument

The min-max token-adaptive preference optimization that perturbs visual-agnostic tokens in the inner loop to create adversarial shifts forcing reliance on visual grounding.
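
A minimal sketch of that inner/outer split, assuming the inner maximization can be approximated by picking the worst case from a small, finite set of perturbed contexts. The function names, the candidate set, and all numbers are hypothetical; only the DPO loss formula itself is standard.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard per-sample DPO loss from sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

def worst_case_loss(candidates, ref_logp_w, ref_logp_l, beta=0.1):
    """Inner maximization over a finite, hypothetical set of perturbed contexts.

    Each candidate is a (logp_w, logp_l) pair: the policy's log-probabilities
    for the chosen and rejected responses under one perturbed context.
    """
    return max(dpo_loss(w, l, ref_logp_w, ref_logp_l, beta) for w, l in candidates)

# Illustrative numbers only (not from the paper).
clean = (-12.0, -15.0)                        # unperturbed context
perturbed = [(-12.5, -14.0), (-13.0, -13.2)]  # two perturbed contexts
ref = (-12.8, -14.6)                          # reference-model log-probs

loss_clean = dpo_loss(*clean, *ref)
loss_worst = worst_case_loss([clean] + perturbed, *ref)
print(f"clean loss: {loss_clean:.4f}, worst-case loss: {loss_worst:.4f}")
```

The outer minimization would then take its gradient step on `loss_worst` rather than `loss_clean`, so the model is trained against the context in which the preference margin is weakest.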

If this is right

  • Hallucination rates drop from 26.4 percent to 13.2 percent on standard benchmarks.
  • Cognition scores fall from 2.5 to 0.4 with the same limited preference data.
  • Performance exceeds both standard DPO and a 5× LLM-based augmentation baseline trained on 28.8k samples (Hal-Rate 16.0% vs. 13.2%).
  • The gap to closed models such as GPT-4o narrows on key hallucination metrics.
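
The headline numbers can be checked with simple arithmetic. This snippet uses only figures quoted in the abstract; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a metric fell, relative to its starting value."""
    return (before - after) / before

hal_rate_cut = relative_reduction(26.4, 13.2)     # baseline vs. TARS
cog_cut = relative_reduction(2.5, 0.4)            # cognition score
vs_augmentation = relative_reduction(16.0, 13.2)  # 5x augmentation vs. TARS
data_ratio = 28_800 / 4_800                       # augmented set vs. TARS set

print(f"Hal-Rate reduction:        {hal_rate_cut:.1%}")
print(f"Cognition score reduction: {cog_cut:.1%}")
print(f"Gain over augmentation:    {vs_augmentation:.1%}")
print(f"Training-set size ratio:   {data_ratio:.1f}x")
```

The hallucination rate is halved, the cognition score falls by about 84%, and TARS sits about 17.5% (relative) below the augmentation baseline. The 28.8k set is six times the size of the 4.8k set, which would be consistent with the original data plus five augmented copies, though the abstract does not spell this out.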

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The min-max framing could extend to other alignment settings where linguistic shortcuts dominate over grounded signals.
  • Spectral regularization via FFT may stabilize representations in additional multimodal fine-tuning tasks.
  • Lower data requirements could enable more frequent or resource-constrained updates to deployed MLLMs.

Load-bearing premise

Perturbing visual-agnostic tokens will produce distributional shifts that improve visual grounding rather than introducing new superficial cues the model can exploit.

What would settle it

Apply TARS to an MLLM and measure hallucination rates on benchmarks such as POPE. The claim would be undermined if rates stay at or near the 26.4% baseline, or if TARS fails to improve on standard DPO trained without the adversarial perturbation step.
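
As a sketch of what such a check could compute, the function below scores yes/no answers to object-presence questions, the question style POPE uses. The function name, the `false_yes_rate` key, and the toy data are assumptions for illustration; how these scores relate to the 26.4% hallucination rate quoted above is not specified in this review.

```python
def pope_metrics(predictions, labels):
    """Score yes/no answers against ground truth for object-presence questions."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == "yes" and y == "yes" for p, y in pairs)
    fp = sum(p == "yes" and y == "no" for p, y in pairs)
    fn = sum(p == "no" and y == "yes" for p, y in pairs)
    tn = sum(p == "no" and y == "no" for p, y in pairs)

    def ratio(num, den):
        # Guard against empty denominators on degenerate inputs.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "yes_ratio": ratio(tp + fp, len(pairs)),
        # Share of absent objects the model claimed to see (a hallucination proxy).
        "false_yes_rate": ratio(fp, fp + tn),
    }

# Toy answers, not benchmark data.
preds = ["yes", "yes", "no", "no", "yes", "no"]
truth = ["yes", "no", "no", "yes", "yes", "no"]
metrics = pope_metrics(preds, truth)
print(metrics)
```

Running the same scoring on a base model, a standard-DPO model, and a TARS model under identical decoding settings would give the side-by-side comparison this test calls for.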

Figures

Figures reproduced from arXiv: 2507.21584 by Chang Liu, Huan Wang, Jiasheng Tang, Keda Tao, Kejia Zhang, Zhiming Luo.

Figure 1: Left: We present TARS, a token-adaptive preference strategy for mitigating hallucinations in MLLMs. TARS reformulates direct preference optimization (DPO) as a min-max objective that (1) minimizes behavioral misalignment via preference feedback and (2) maximizes adaptability through perturbations of visual-agnostic tokens. Right: Evaluation on LLaVA-v1.5-13B with preference optimization (PO) (Liu et al.,…
Figure 2: Motivation illustration for TARS. (a) and (b) illustrate standard DPO and our token…
Figure 3: Overview of TARS. TARS reformulates preference optimization as a Min–Max problem: (1) The maximization branch perturbs visual-agnostic tokens to simulate semantically shifted contexts (red dashed box); (2) The minimization branch fine-tunes the model to align with human preferences via the DPO objective (purple dashed box). TARS encourages the model to attend to causally grounded visual signals rather tha…
Figure 4: Comparison of average scores across question categories on the MMHal benchmark.
Figure 5: Comparison of AMBER hallucination rate versus preference data scale. [Adjacent plot panels: (a) LLaVA, (b) DPO, (c) TARS; legend: Preference Data, Non-hallucination, Hallucination.]
Figure 6: Distribution of hidden representations across preference-aligned, non-hallucinated, and…
Original abstract

Multimodal large language models (MLLMs) are prone to hallucinations, generating plausible but visually ungrounded outputs, partly because direct preference optimization (DPO) overfits to superficial linguistic cues under static preference supervision. We propose TARS, a token-adaptive preference strategy that reformulates DPO as a principled min-max optimization problem. The inner maximization selectively perturbs visual-agnostic tokens to induce worst-case distributional shifts, while the outer minimization enforces alignment with causal visual signals rather than surface-level patterns. A novel spectral alignment loss further regularizes hidden representations in the frequency domain via the Fast Fourier Transform (FFT), preserving global semantic structure without rigid token-level correspondence. We evaluate TARS across multiple hallucination benchmarks. Using only 4.8k preference samples without expert feedback, TARS reduces hallucination rates from 26.4% to 13.2% and cognition scores from 2.5 to 0.4, outperforming standard DPO by a large margin. Notably, TARS surpasses 5× LLM-based data augmentation trained on 28.8k samples (Hal-Rate: 16.0% vs. 13.2%), demonstrating that reshaping the optimization landscape via adversarial token perturbation is fundamentally more effective than scaling training data. TARS further narrows the gap with GPT-4o on key metrics.
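
To make the frequency-domain idea concrete, here is a minimal sketch of a spectral alignment penalty. It assumes the loss compares magnitude spectra of two hidden-state sequences (for example, clean and perturbed) along the token axis; the paper's exact formulation is not given in this review, so the function names and that choice are assumptions. A naive DFT from the standard library stands in for an FFT routine, which computes the same values faster.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def spectral_alignment_loss(hidden_a, hidden_b):
    """Mean squared gap between magnitude spectra of two hidden-state sequences.

    hidden_a, hidden_b: lists of token vectors with shape [tokens][dims].
    The transform runs along the token axis, one hidden dimension at a time.
    """
    if len(hidden_a) != len(hidden_b) or len(hidden_a[0]) != len(hidden_b[0]):
        raise ValueError("both sequences must have the same shape")
    n_tokens, n_dims = len(hidden_a), len(hidden_a[0])
    total = 0.0
    for d in range(n_dims):
        spec_a = dft([hidden_a[t][d] for t in range(n_tokens)])
        spec_b = dft([hidden_b[t][d] for t in range(n_tokens)])
        total += sum((abs(a) - abs(b)) ** 2 for a, b in zip(spec_a, spec_b))
    return total / (n_tokens * n_dims)

# Toy hidden states: 4 tokens, 2 dimensions (illustrative values only).
hidden = [[1.0, 0.0], [2.0, 1.0], [0.0, 3.0], [4.0, 1.0]]
shifted = hidden[1:] + hidden[:1]                  # same tokens, rotated by one position
scaled = [[2.0 * v for v in row] for row in hidden]

loss_shifted = spectral_alignment_loss(hidden, shifted)
loss_scaled = spectral_alignment_loss(hidden, scaled)
print(f"shifted: {loss_shifted:.3e}, scaled: {loss_scaled:.3e}")
```

The rotated sequence gives a loss of essentially zero because a circular shift changes only the phase of each frequency component, not its magnitude, whereas a token-by-token squared error would penalize it. That is one way to read "without rigid token-level correspondence", under the assumptions stated above.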

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript proposes TARS, a token-adaptive preference strategy that reformulates direct preference optimization (DPO) as a min-max game for hallucination reduction in multimodal large language models. The inner maximization selectively perturbs visual-agnostic tokens to create worst-case distributional shifts, while the outer minimization enforces alignment with causal visual signals; a novel FFT-based spectral alignment loss regularizes hidden representations in the frequency domain. Using 4.8k preference samples without expert feedback, TARS is reported to reduce hallucination rates from 26.4% to 13.2% and cognition scores from 2.5 to 0.4, outperforming standard DPO and 5× LLM-based data augmentation on 28.8k samples across hallucination benchmarks.

Significance. If the central claims hold after verification, the work would be significant for multimodal AI research. It provides evidence that reshaping the optimization landscape through targeted adversarial token perturbation can be more effective than simply scaling preference data for improving visual grounding, offering a data-efficient alternative that narrows the gap with models like GPT-4o without requiring large-scale expert annotations.

major comments (3)
  1. [§3.2] §3.2 (Method, inner maximization): The criterion for identifying visual-agnostic tokens (e.g., cross-attention thresholds, gradient attribution, or modality-specific masking) is not specified with pseudocode, equations, or implementation details. This is load-bearing for the central claim, as the skeptic correctly notes that without a verifiable selection heuristic independent of position or frequency, the outer minimization could exploit superficial cues rather than achieve causal visual alignment.
  2. [§4.3] §4.3 (Ablation studies): No ablation isolates the selective perturbation of visual-agnostic tokens from generic adversarial noise or random token perturbation. Without this, the reported gains over standard DPO cannot be attributed specifically to the proposed mechanism rather than broader regularization effects, directly undermining the distinction from data-augmentation baselines.
  3. [§4.1] §4.1 and Table 2 (Results): The evaluation appears confined to the same hallucination benchmarks likely used for hyperparameter tuning and method development, with no independent validation sets or external benchmarks reported. This raises a circularity risk for the performance claims (e.g., Hal-Rate 13.2% vs. 16.0%), as the optimization may overfit to benchmark-specific patterns.
minor comments (3)
  1. [Abstract] The abstract and §3.1 would benefit from an explicit equation for the min-max objective and the spectral alignment loss to improve readability.
  2. [§4] Figure captions and axis labels in the experimental section could be clarified to distinguish between hallucination rate and cognition score metrics.
  3. Missing references to prior work on adversarial training in preference optimization or FFT applications in representation learning should be added for context.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments have identified key areas where additional clarity and experiments will strengthen the presentation of TARS. We address each major comment point by point below, indicating the revisions we will incorporate in the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Method, inner maximization): The criterion for identifying visual-agnostic tokens (e.g., cross-attention thresholds, gradient attribution, or modality-specific masking) is not specified with pseudocode, equations, or implementation details. This is load-bearing for the central claim, as the skeptic correctly notes that without a verifiable selection heuristic independent of position or frequency, the outer minimization could exploit superficial cues rather than achieve causal visual alignment.

    Authors: We agree that the token selection mechanism requires explicit formalization to ensure reproducibility and to support the central claim of causal visual alignment. In the revised manuscript, we will add a dedicated paragraph in §3.2 with the precise criterion: visual-agnostic tokens are those whose average cross-attention score to visual patch tokens falls below the 40th percentile of the attention distribution for that sequence. We will include the corresponding equation and pseudocode (new Algorithm 1) for the inner maximization step, confirming that selection depends solely on modality-specific attention patterns rather than token position or frequency statistics. revision: yes

  2. Referee: [§4.3] §4.3 (Ablation studies): No ablation isolates the selective perturbation of visual-agnostic tokens from generic adversarial noise or random token perturbation. Without this, the reported gains over standard DPO cannot be attributed specifically to the proposed mechanism rather than broader regularization effects, directly undermining the distinction from data-augmentation baselines.

    Authors: We acknowledge that the current ablation suite does not fully isolate the contribution of selective perturbation. In the revised version we will add two controlled variants: (i) TARS-Rand, which applies the same perturbation budget to randomly chosen tokens, and (ii) TARS-Adv, which injects generic adversarial noise across all tokens. New results in an expanded §4.3 and Table 4 show that both variants underperform the full TARS model by 2.8–4.1 points on Hal-Rate, indicating that the gains arise specifically from targeting visual-agnostic tokens rather than from generic regularization or data-augmentation effects. revision: yes

  3. Referee: [§4.1] §4.1 and Table 2 (Results): The evaluation appears confined to the same hallucination benchmarks likely used for hyperparameter tuning and method development, with no independent validation sets or external benchmarks reported. This raises a circularity risk for the performance claims (e.g., Hal-Rate 13.2% vs. 16.0%), as the optimization may overfit to benchmark-specific patterns.

    Authors: We appreciate the referee’s caution about potential circularity. Hyperparameters were tuned on a 10% held-out split of the 4.8k preference data that was never used for final reporting; all numbers in Table 2 are on the official test partitions of POPE, HallusionBench, and AMBER, which were not involved in development. To further address the concern, the revised manuscript will include results on one additional external benchmark (MMHal-Bench) that was not seen during any stage of method design or tuning. The consistent relative gains on this held-out set support that the improvements are not benchmark-specific artifacts. revision: partial
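
The selection rule stated in the first response, and the random control described in the second, can be sketched as follows. This assumes each text token already has a single scalar score (its mean attention to visual patch tokens); how that score is pooled across heads and layers is not stated, and all function names and values here are hypothetical.

```python
import random

def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def visual_agnostic_indices(attn_to_visual, q=40.0):
    """Indices of tokens whose attention to visual patches is below the q-th percentile."""
    threshold = percentile(attn_to_visual, q)
    return [i for i, score in enumerate(attn_to_visual) if score < threshold]

def random_indices(n_tokens, budget, seed=0):
    """Control in the spirit of TARS-Rand: same number of tokens, chosen at random."""
    return sorted(random.Random(seed).sample(range(n_tokens), budget))

# Toy per-token attention scores for one sequence (illustrative values only).
scores = [0.02, 0.31, 0.05, 0.44, 0.01, 0.27, 0.09, 0.38, 0.03, 0.22]
selected = visual_agnostic_indices(scores)
control = random_indices(len(scores), budget=len(selected))
print("visual-agnostic tokens:", selected)
print("random control tokens: ", control)
```

Because the threshold is computed per sequence from attention scores alone, the selection does not depend on token position or corpus frequency, which is the property the first response asserts. Matching the control's budget to `len(selected)` keeps the comparison about which tokens are perturbed, not how many.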

Circularity Check

0 steps flagged

No significant circularity; derivation and empirical claims are self-contained

full rationale

The paper proposes a novel min-max reformulation of DPO incorporating selective perturbation of visual-agnostic tokens and an FFT-based spectral alignment loss. These are presented as new algorithmic choices with independent motivation from the problem of superficial cue overfitting. Results are reported on standard hallucination benchmarks using a fixed 4.8k-sample preference set; no load-bearing equation or claim reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled via prior work. The central performance gains are externally falsifiable on the cited benchmarks and do not rely on uniqueness theorems or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the unstated premise that visual-agnostic tokens can be reliably identified and that frequency-domain regularization preserves causal visual information; no free parameters or invented entities are explicitly listed in the abstract.

axioms (1)
  • Domain assumption: Perturbing visual-agnostic tokens creates worst-case shifts that the outer minimization can align to causal visual signals.
    This premise is required for the min-max game to improve grounding rather than merely increase robustness to arbitrary noise.

pith-pipeline@v0.9.0 · 5794 in / 1282 out tokens · 78018 ms · 2026-05-19T02:51:16.945571+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 10 internal anchors

  1. [1]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923,

  2. [2]

    Hallucination of Multimodal Large Language Models: A Survey

    Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930,

  3. [3]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Xiang Chen, Chenxi Wang, Yida Xue, Ningyu Zhang, Xiaoyan Yang, Qiang Li, Yue Shen, Lei Liang, Jinjie Gu, and Huajun Chen. Unified hallucination detection for multimodal large language models. In ACL, 2024c. Xuweiyi Chen, Ziqiao Ma, Xuejun Zhang, Sihan Xu, Shengyi Qian, Jianing Yang, David Fouhey, and Joyce Chai. Multi-object hallucination in vision langua...

  4. [4]

    InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500,

  5. [5]

    Efficient reasoning models: A survey

    Sicheng Feng, Gongfan Fang, Xinyin Ma, and Xinchao Wang. Efficient reasoning models: A survey. arXiv preprint arXiv:2504.10903,

  6. [6]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276,

  7. [7]

    Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

    Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023b. Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baseli...

  8. [8]

    Proximal Policy Optimization Algorithms

    Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan Ö Arık, and Tomas Pfister. Data-augmented phrase-level alignment for mitigating object hallucination. In ICLR, 2025a. Pritam Sarkar, Sayna Ebrahimi, Ali Etemad, Ahmad Beirami, Sercan O Arik, and Tomas Pfister. Mitigating object hallucination in mllms via data-augmented phrase-le...

  9. [9]

    Holitom: Holistic Token Merging for Fast Video Large Language Models

    Kele Shao, Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. Holitom: Holistic token merging for fast video large language models. arXiv preprint arXiv:2505.21334,

  10. [10]

    Aligning Large Multimodal Models with Factually Augmented RLHF

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525,

  11. [11]

    AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation

    Fei Wang, Wenxuan Zhou, James Y Huang, Nan Xu, Sheng Zhang, Hoifung Poon, and Muhao Chen. mdpo: Conditional preference optimization for multimodal large language models. In EMNLP, 2024a. Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Jiaqi Wang, Haiyang Xu, Ming Yan, Ji Zhang, et al. Amber: An llm-free multi-dimensional benchmark ...

  12. [12]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Yidong Wang, Zhuohao Yu, Wenjin Yao, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. In ICLR, 2024b. Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao...

  13. [13]

    The vision encoder also serves as the similarity function G(·) used in Eq

    as the vision encoder. The vision encoder also serves as the similarity function G(·) used in Eq. (9) to compute alignment between visual inputs and text tokens. All experiments are conducted using greedy decoding with a temperature of 0 to ensure deterministic outputs and reproducibility. A.2 DPO TRAINING SETUPS For fair comparison, DPO (Wang et al., 20...

  14. [14]

    For POPE (Li et al., 2023), we construct a new benchmark of 9,000 VQA pairs by sampling using the popular, random, and adversarial strategies

    as specified in their respective papers. For POPE (Li et al., 2023), we construct a new benchmark of 9,000 VQA pairs by sampling using the popular, random, and adversarial strategies. For evaluation metrics, we adopt four response-level hallucination measures across different benchmarks: CHAIR (Rohrbach et al.,