Towards Mitigating Modality Bias in Vision-Language Models for Temporal Action Localization

arxiv: 2601.21078 · v3 · submitted 2026-01-28 · 💻 cs.CV

Jiaqi Li, Guangming Wang, Shuntian Zheng, Minzhe Ni, Xiaoman Lu, Guanghui Ye, Yu Guan

Pith reviewed 2026-05-16 10:05 UTC · model grok-4.3

keywords: temporal action localization · modality bias · vision-language models · debiasing reweighting · residual aggregation · THUMOS14 · ActionVLM

The pith

ActionVLM reduces modality bias in vision-language models for temporal action localization by keeping vision as the primary signal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ActionVLM to address the problem of modality bias in vision-language models used for temporal action localization. Current methods tend to over-rely on language priors, which can degrade performance when visual evidence is key. By introducing a debiasing reweighting module that estimates when language adds real value over vision alone, and using residual aggregation to treat language as a refinement, the approach maintains vision dominance. This leads to improved accuracy in identifying action boundaries and categories in untrimmed videos. Experiments demonstrate gains of up to 3.2% mAP on the THUMOS14 benchmark.

Core claim

ActionVLM is a vision-language aggregation framework for temporal action localization that preserves vision as the dominant modality. It employs a debiasing reweighting module to estimate the incremental benefit of language over vision-only predictions and dynamically reweights the language input accordingly. A residual aggregation strategy further ensures language serves only as complementary refinement. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning, resulting in state-of-the-art performance improvements on standard benchmarks.

What carries the argument

The debiasing reweighting module, which calculates the language advantage as the incremental benefit over vision-only predictions and adjusts weights dynamically, paired with a residual aggregation strategy that adds language features as refinements to vision-based predictions.
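The two mechanisms described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the tanh-based mapping from advantage to weight, the `temperature` parameter, and the use of true-class probability as the measure of benefit are all assumptions made for illustration, since the exact formulas are not given in this review.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def language_advantage(vision_logits, language_logits, label):
    """Hypothetical advantage: gain in true-class probability when the
    language logits are added on top of the vision-only logits."""
    p_vision = softmax(vision_logits)[label]
    joint = [v + l for v, l in zip(vision_logits, language_logits)]
    p_joint = softmax(joint)[label]
    return p_joint - p_vision


def advantage_to_weight(advantage, temperature=0.1):
    """Assumed mapping: zero weight when language does not help,
    a weight approaching 1 as the advantage grows."""
    return max(0.0, math.tanh(advantage / temperature))


def residual_fuse(vision_logits, language_logits, weight):
    """Residual aggregation: vision is the base signal and language is
    added as a weighted refinement."""
    return [v + weight * l for v, l in zip(vision_logits, language_logits)]
```

With `weight` equal to 0 the fused output is exactly the vision-only prediction, which is the sense in which vision stays dominant in this sketch. Note that the advantage here needs the ground-truth label, so in practice it could only serve as a training signal for a learned estimator.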

Load-bearing premise

The debiasing reweighting module accurately estimates the true benefit of adding language without introducing its own biases or needing heavy parameter tuning.

What would settle it

Running the model on a dataset where language priors are particularly misleading, such as actions with visually similar but linguistically distinct categories, and observing if performance drops below a vision-only baseline would test the estimation reliability.
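The check described above amounts to a per-class comparison against the vision-only baseline. A minimal sketch follows, assuming per-class average precision is already available as dictionaries keyed by class name; the function name and the `tolerance` parameter are hypothetical.

```python
def classes_hurt_by_language(vision_only_ap, fused_ap, tolerance=0.0):
    """Return the classes whose fused AP falls below the vision-only AP
    by more than `tolerance` (hypothetical helper for illustration)."""
    hurt = []
    for name in sorted(vision_only_ap):
        if name not in fused_ap:
            continue  # skip classes that were not evaluated in both settings
        if fused_ap[name] < vision_only_ap[name] - tolerance:
            hurt.append(name)
    return hurt
```

A non-empty result on a set of visually similar but linguistically distinct classes would suggest that the advantage estimate is letting misleading language through.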

Original abstract

Temporal Action Localization (TAL) requires identifying both the boundaries and categories of actions in untrimmed videos. While vision-language models (VLMs) offer rich semantics to complement visual evidence, existing approaches tend to overemphasize linguistic priors at the expense of visual performance, leading to a pronounced modality bias. We propose ActionVLM, a vision-language aggregation framework that systematically mitigates modality bias in TAL. Our key insight is to preserve vision as the dominant signal while adaptively exploiting language only when beneficial. To this end, we introduce (i) a debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights language modality accordingly, and (ii) a residual aggregation strategy that treats language as a complementary refinement rather than the primary driver. This combination alleviates modality bias, reduces overconfidence from linguistic priors, and strengthens temporal reasoning. Experiments on THUMOS14 show that our model outperforms state-of-the-art by up to 3.2% mAP. Our code is available at https://github.com/JiaqiLi404/ActionVLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ActionVLM adds a reweighting module and residual fusion to cut language over-reliance in TAL, delivering a modest lift of up to 3.2% mAP on THUMOS14, but the core estimator lacks direct checks against true modality complementarity.

The paper's main move is practical: keep vision as the base signal and only bring in language when it actually helps, via a learned 'language advantage' score that reweights the modalities and a residual path that treats language as refinement. That combination is new enough as a packaged approach for this task, and the authors ship code, which is useful. On THUMOS14 they report up to 3.2% mAP over prior SOTA, which is a real if small win for a core video task. The idea of explicitly estimating incremental benefit rather than just concatenating or attending is sensible and addresses a known failure mode in VLMs. The residual aggregation is a clean way to avoid letting language dominate.

Soft spots are mostly around validation. The advantage estimator is trained end-to-end, so it could be picking up dataset correlations instead of genuine complementarity; without an oracle diagnostic or held-out test that measures how well the estimated scores match independent utility, the gain could partly be recalibration. The abstract gives no ablations, significance numbers, or error breakdowns, so the central claim rests on the headline number alone. If the full experiments include those controls and the gain holds across splits, the work is solid enough.

This is for people working on multimodal video models who already know the bias problem and want a drop-in module to try. It is not a foundational rethinking of TAL, but the empirical result plus code makes it worth a serious referee's time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper proposes ActionVLM, a vision-language aggregation framework for temporal action localization (TAL) that mitigates modality bias by preserving vision as the dominant signal. It introduces a debiasing reweighting module to estimate the 'language advantage' (incremental benefit of language over vision-only predictions) and dynamically reweight the language modality, along with a residual aggregation strategy that treats language as a complementary refinement. Experiments on THUMOS14 report up to 3.2% mAP improvement over state-of-the-art methods, with code released publicly.

Significance. If the central claims hold after proper validation, the work would offer a practical approach to balancing modalities in VLMs for TAL tasks, reducing over-reliance on linguistic priors while strengthening temporal reasoning. The public code release at https://github.com/JiaqiLi404/ActionVLM supports reproducibility. However, the current evidence for the 3.2% gain is limited to a headline result without supporting diagnostics, so the significance remains provisional pending stronger empirical grounding.

major comments (2)

[Method] The debiasing reweighting module is presented as estimating language advantage end-to-end to isolate genuine modality complementarity, but the manuscript provides no oracle, held-out diagnostic, or independent measure to validate that the estimator recovers true incremental utility rather than fitting to dataset-specific correlations between language priors and action labels (see method description and skeptic note on this point).
[Experiments] The headline result of up to 3.2% mAP gain on THUMOS14 is attributed to the proposed modules, yet the abstract and experiments section report no details on baselines, ablations, statistical significance testing, or error analysis, leaving the attribution to bias mitigation weakly supported.

minor comments (2)

[Method] Notation for 'language advantage' and the reweighting formula should be formalized with an equation to clarify the exact computation and any learned parameters.
[Experiments] The abstract claims 'outperforms state-of-the-art by up to 3.2% mAP' without specifying the exact SOTA methods or the mAP metric variant (e.g., average mAP at IoU thresholds); this should be explicit in the results table.
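To make the metric-variant question concrete, the sketch below shows temporal IoU between a predicted and a ground-truth segment, and averaging mAP over several thresholds. THUMOS14 results are commonly reported at tIoU thresholds from 0.3 to 0.7 in steps of 0.1; whether the paper uses that variant is not stated here, so the threshold tuple and all function names are assumptions.

```python
# Commonly used THUMOS14 thresholds; an assumption, not taken from the paper.
THRESHOLDS = (0.3, 0.4, 0.5, 0.6, 0.7)


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0


def matches_at_thresholds(pred, gt, thresholds=THRESHOLDS):
    """For each threshold, report whether the prediction counts as a hit."""
    iou = temporal_iou(pred, gt)
    return [iou >= t for t in thresholds]


def average_over_thresholds(map_by_threshold):
    """Average mAP across thresholds, given a {threshold: mAP} mapping."""
    return sum(map_by_threshold.values()) / len(map_by_threshold)
```

A gain of "up to 3.2% mAP" can mean a gain at a single threshold or in the threshold average, which is why the results table should say which one is reported.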

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We agree that stronger empirical validation and expanded experimental reporting will strengthen the manuscript. We address each major comment below and will incorporate the suggested improvements in the revised version.

Referee: [Method] The debiasing reweighting module is presented as estimating language advantage end-to-end to isolate genuine modality complementarity, but the manuscript provides no oracle, held-out diagnostic, or independent measure to validate that the estimator recovers true incremental utility rather than fitting to dataset-specific correlations between language priors and action labels (see method description and skeptic note on this point).

Authors: We acknowledge that an explicit validation of the language-advantage estimator would strengthen the claims. In the revision we will add held-out diagnostic experiments: (i) performance on action classes where language priors are known to be weak or conflicting, (ii) comparison against a simple oracle reweighting baseline that uses ground-truth per-sample gains, and (iii) correlation analysis between the estimated advantage scores and actual per-sample mAP improvements. These additions will be placed in a new subsection of the experiments.
Revision: yes

Referee: [Experiments] The headline result of up to 3.2% mAP gain on THUMOS14 is attributed to the proposed modules, yet the abstract and experiments section report no details on baselines, ablations, statistical significance testing, or error analysis, leaving the attribution to bias mitigation weakly supported.

Authors: We agree that the current experimental presentation is insufficient to fully support attribution. The revised manuscript will expand the experiments section to include: (i) a complete table of all baselines with implementation details and hyper-parameter settings, (ii) exhaustive ablation tables isolating each proposed component, (iii) statistical significance tests (paired t-tests over multiple runs with reported p-values), and (iv) qualitative error analysis highlighting cases where modality bias is reduced. These results will be added while keeping the main claims unchanged.
Revision: yes
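Two of the diagnostics named in the responses above, the correlation analysis and the paired t-test, can be sketched with the standard library alone. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, and the inputs are assumed to be matched lists (one score per run for the t-test, one value per sample for the correlation).

```python
import math
import statistics


def paired_t_statistic(baseline_scores, model_scores):
    """Paired t-statistic and degrees of freedom for matched runs.
    Converting this to a p-value requires a t-distribution CDF,
    which the standard library does not provide."""
    diffs = [m - b for b, m in zip(baseline_scores, model_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired runs")
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    t_value = statistics.mean(diffs) / (spread / math.sqrt(n))
    return t_value, n - 1


def pearson_correlation(xs, ys):
    """Pearson correlation between estimated advantage scores (xs)
    and observed per-sample gains (ys)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A correlation near zero between estimated advantage and observed gain would support the referee's concern that the estimator fits dataset correlations rather than true incremental utility.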

Circularity Check

0 steps flagged

No circularity in derivation; empirical method with experimental validation

The paper presents an empirical framework (ActionVLM) with a debiasing reweighting module and residual aggregation strategy. The 3.2% mAP gain on THUMOS14 is reported from direct experiments rather than any first-principles derivation or prediction that reduces to fitted inputs by construction. No equations, self-definitional steps, or load-bearing self-citations are present that would equate outputs to inputs tautologically. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that vision should remain the dominant modality and that language benefit can be estimated from prediction differences.

axioms (1)

Domain assumption: Vision provides the primary reliable signal for action boundaries and categories.
Stated as the key insight that language should be used only when beneficial.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: debiasing reweighting module that estimates the language advantage (the incremental benefit of language over vision-only predictions) and dynamically reweights language modality accordingly

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: residual aggregation strategy that treats language as a complementary refinement rather than the primary driver

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.