Lost in the Vibrations: Vision Language Models Fail the Dynamic Gauges Test

Elena Merino-Gómez; Francisco Javier Santos-Martín; Javier Conde; Pedro Reviriego; Tairan Fu

arxiv: 2604.22829 · v1 · submitted 2026-04-19 · 💻 cs.CV

Pith reviewed 2026-05-10 06:16 UTC · model grok-4.3

keywords: vision-language models, analog gauges, dynamic gauges, needle vibrations, metrology, industrial automation, video analysis, AI evaluation

The pith

Vision-language models fail to accurately interpret dynamic analog gauges with vibrating needles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether current vision-language models can serve as reliable instruments for reading analog gauges in industrial settings. It creates video clips of gauges where needles move and vibrate at different speeds and asks models to track positions and report readings. The results show that models like GPT-5 and Gemini 3 struggle with following fast changes and understanding scale markings, leading to unreliable measurements. A sympathetic reader would care because many factories still use old analog equipment that robots need to monitor for safety. Without this capability, full automation of legacy systems remains out of reach.

Core claim

The central claim is that VLMs have limited ability to interpret needle trajectories and scale semantics in video sequences of various gauge types under diverse motion profiles. As a result, they fail to provide the traceability and reliability needed for safety-critical monitoring, and they do not reach the performance required to qualify as trustworthy synthetic instruments under IEEE and ISO standards.

What carries the argument

A novel dataset of video sequences featuring circular, linear, and Vernier gauges with needles under varied speed profiles and vibrations, which tests models' temporal analysis and semantic understanding of scales.

Load-bearing premise

The new dataset and chosen evaluation criteria accurately capture the real demands of metrology and uncertainty quantification in industrial environments.

What would settle it

If one of the tested VLMs or a new model correctly tracks needle positions through vibrations and reports accurate readings matching human observers within metrology tolerances across all gauge types and speeds, that would show the claim does not hold.

Figures

Figures reproduced from arXiv: 2604.22829 by Elena Merino-Gómez, Francisco Javier Santos-Martín, Javier Conde, Pedro Reviriego, Tairan Fu.

**Figure 1.** Illustrations of the three gauge types in the Dynamic Gauge Dataset (DGD). From left to right: (a) uniform angular rotation, (b) linear motion, (c) Vernier. The video snapshots show the in-band digital chronometer.

**Figure 2.** Results for the six uniform angular rotation videos. Mean Absolute Error (MAE) values are included in the legend for each model.

**Figure 3.** Results for the linear movement videos. Mean Absolute Error (MAE) values are included in the legend for each model.
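The figure legends report a Mean Absolute Error (MAE) per model. As a minimal sketch of how such a value could be computed, the snippet below averages the absolute difference between model readings and reference readings. The function name and the sample values are hypothetical and are not taken from the paper; the paper's exact sampling and aggregation procedure is not described on this page.

```python
from statistics import mean


def mean_absolute_error(predicted, ground_truth):
    """Average absolute difference between model readings and reference readings."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have the same length")
    if not predicted:
        raise ValueError("at least one reading is required")
    return mean(abs(p - g) for p, g in zip(predicted, ground_truth))


# Hypothetical readings in arbitrary gauge units (illustration only).
ground_truth = [10.0, 20.0, 30.0, 40.0]
model_readings = [12.0, 19.0, 33.0, 40.0]

mae = mean_absolute_error(model_readings, ground_truth)
print(f"MAE = {mae:.2f}")  # prints: MAE = 1.50
```

MAE summarizes average accuracy but hides both the sign of the error and its scatter, which is one reason the referee report asks for an explicit uncertainty treatment.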

Original abstract

The digital transformation of industrial manufacturing increasingly relies on the ability of autonomous robots to interact with legacy infrastructure, particularly analog gauges. While Vision-Language Models (VLMs) have demonstrated potential in zero-shot instrument recognition, their deployment in measurement systems remains constrained by an inherent inability to accurately analyze high-frequency temporal events and needle vibrations. This paper evaluates state-of-the-art models, including GPT-5 and Gemini 3, against the strict requirements of metrology and uncertainty quantification. To facilitate this evaluation, we introduce a novel dataset comprising video sequences of various gauge types: circular, linear, and Vernier, under diverse motion speed profiles. Our findings indicate that current VLMs exhibit limited ability in interpreting needle trajectories and scale semantics, failing to provide the traceability and reliability needed for safety-critical monitoring. The results demonstrate that these models have not yet achieved the performance necessary to be classified as trustworthy synthetic instruments under existing IEEE and ISO standards.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper adds a video dataset for testing VLMs on moving analog gauge needles and reports that current models fall short on trajectories and vibrations, but the claimed failure against metrology standards rests on loose mapping.

The main takeaway is that VLMs still have trouble tracking needle positions and scale readings when gauges vibrate or move at varying speeds, based on a new collection of video clips. The authors built sequences covering circular, linear, and Vernier gauges under different motion profiles and ran tests on models including GPT-5 and Gemini 3, concluding the outputs lack the reliability needed for safety-critical industrial use.

Referee Report

3 major / 2 minor

Summary. The paper introduces a novel video dataset of analog gauges (circular, linear, and Vernier types) under diverse needle motion speed profiles and evaluates state-of-the-art VLMs including GPT-5 and Gemini 3 on their ability to interpret needle trajectories, vibrations, and scale semantics. It concludes that current VLMs exhibit limited performance on these dynamic tasks and therefore cannot provide the traceability and reliability required for safety-critical monitoring under IEEE and ISO metrology standards.

Significance. If the empirical findings hold after addressing the mapping to quantitative metrology, the work would usefully document a capability gap in VLMs for high-frequency temporal measurement tasks relevant to industrial robotics and legacy infrastructure. The dataset itself is a constructive contribution that could support future benchmarking, though its impact is reduced by the absence of direct quantitative linkage to regulatory uncertainty requirements.

major comments (3)

[§4 and §5] §4 (Dataset) and §5 (Evaluation): The motion speed profiles and needle vibration sequences are characterized only qualitatively; no quantitative spectral analysis or amplitude statistics are provided to confirm they represent high-frequency temporal events that would trigger metrological concerns under ISO 17025 or gauge-specific standards.
[§5.2 and §6] §5.2 (Results) and §6 (Discussion): The claim that VLMs 'fail to provide the traceability and reliability needed for safety-critical monitoring' rests on custom qualitative assessments of trajectory interpretation rather than explicit computation of combined standard uncertainty u_c, expanded uncertainty U = k·u_c, or comparison against maximum permissible errors (MPE) as defined in ISO/IEC 17025 and related gauge standards; this mapping is load-bearing for the central regulatory conclusion.
[§3 and §6] §3 (Related Work) and §6: The paper invokes IEEE and ISO standards for 'trustworthy synthetic instruments' but does not cite or apply the specific clauses on measurement traceability chains or synthetic instrument validation; without this, the assertion that the observed VLM errors imply non-compliance remains unsupported.
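The quantities named in the second major comment (combined standard uncertainty u_c, expanded uncertainty U = k·u_c, and a comparison against a maximum permissible error) can be illustrated with a short sketch. It rests on simplifying assumptions that do not come from the paper: the scatter of the reading errors is treated as a single Type A component, the reference value carries its own standard uncertainty `u_reference`, the two components are uncorrelated, and conformity is judged by the conservative rule |bias| + U ≤ MPE. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def expanded_uncertainty(errors, u_reference=0.0, k=2.0):
    """Return (bias, u_c, U) for a list of reading errors.

    Assumptions (illustrative, not the paper's procedure):
    - scatter of the errors is a Type A component (sample standard deviation);
    - u_reference is the standard uncertainty of the ground-truth value;
    - components are uncorrelated, so they combine by root-sum-of-squares.
    """
    if len(errors) < 2:
        raise ValueError("need at least two errors to estimate scatter")
    bias = mean(errors)
    u_type_a = stdev(errors)
    u_c = sqrt(u_type_a ** 2 + u_reference ** 2)
    return bias, u_c, k * u_c


def within_mpe(bias, U, mpe):
    """Conservative conformity check: |bias| + U must not exceed the MPE."""
    return abs(bias) + U <= mpe


# Hypothetical errors (model reading minus reference), in gauge units.
errors = [1.0, -1.0, 3.0, 1.0]
bias, u_c, U = expanded_uncertainty(errors, u_reference=0.0, k=2.0)

print(f"bias = {bias:.3f}, u_c = {u_c:.3f}, U (k=2) = {U:.3f}")
print("within MPE of 2.0:", within_mpe(bias, U, mpe=2.0))
```

Other decision rules are possible, for example correcting the bias first and comparing only U against the MPE. The choice of rule is itself part of the mapping to standards that the referee asks the authors to make explicit.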

minor comments (2)

[Abstract] The abstract states findings without reporting any numerical metrics, error rates, or inter-model comparisons; adding a concise quantitative summary would improve readability.
[§4] Notation for gauge types and motion profiles is introduced without a summary table; a compact table listing gauge categories, speed ranges, and number of sequences per category would aid clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have prompted us to strengthen the quantitative aspects of the dataset characterization and to make the mapping to metrological standards more explicit. We address each major comment below and describe the corresponding revisions.

Referee: [§4 and §5] §4 (Dataset) and §5 (Evaluation): The motion speed profiles and needle vibration sequences are characterized only qualitatively; no quantitative spectral analysis or amplitude statistics are provided to confirm they represent high-frequency temporal events that would trigger metrological concerns under ISO 17025 or gauge-specific standards.

Authors: We agree that quantitative characterization strengthens the work. In the revised manuscript we have added a spectral analysis section that applies FFT to all motion profiles, reports dominant frequencies and power spectral density peaks, and supplies amplitude statistics (RMS, peak-to-peak, and standard deviation) for each vibration sequence. These data confirm the presence of components above 20 Hz, directly relevant to the metrological concerns raised.

Revision: yes

Referee: [§5.2 and §6] §5.2 (Results) and §6 (Discussion): The claim that VLMs 'fail to provide the traceability and reliability needed for safety-critical monitoring' rests on custom qualitative assessments of trajectory interpretation rather than explicit computation of combined standard uncertainty u_c, expanded uncertainty U = k·u_c, or comparison against maximum permissible errors (MPE) as defined in ISO/IEC 17025 and related gauge standards; this mapping is load-bearing for the central regulatory conclusion.

Authors: We accept that the original claim required a more explicit quantitative bridge. The revised manuscript now includes a dedicated uncertainty propagation subsection that converts observed needle-position errors into estimates of standard uncertainty u_c for each gauge type, derives expanded uncertainty U at k=2, and compares these values against published MPE ranges from ISO and manufacturer specifications. We have also moderated the regulatory language to state that the observed errors would increase u_c beyond acceptable limits for safety-critical use, rather than asserting outright non-compliance.

Revision: partial

Referee: [§3 and §6] §3 (Related Work) and §6: The paper invokes IEEE and ISO standards for 'trustworthy synthetic instruments' but does not cite or apply the specific clauses on measurement traceability chains or synthetic instrument validation; without this, the assertion that the observed VLM errors imply non-compliance remains unsupported.

Authors: We appreciate the request for greater specificity. The revised version now cites ISO/IEC 17025:2017 clause 6.5 (metrological traceability) and clause 7.6 (evaluation of measurement uncertainty), together with relevant sections of IEEE Std 1241-2010 on synthetic instruments. We apply these clauses directly to the VLM results by showing how the lack of repeatable trajectory interpretation breaks the traceability chain required for a synthetic instrument.

Revision: yes
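The spectral characterization described in the first response (dominant frequency plus RMS, peak-to-peak, and standard deviation) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the needle trace is a synthetic 25 Hz sinusoid sampled at 200 Hz, a naive discrete Fourier transform stands in for an FFT so the example needs only the standard library, and all function names are hypothetical.

```python
import cmath
import math
from statistics import pstdev


def dominant_frequency(samples, sample_rate_hz):
    """Return the frequency (Hz) of the strongest non-DC bin of a naive DFT."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_value = sum(samples) / n
    centered = [s - mean_value for s in samples]  # remove the static needle position
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):  # bins up to the Nyquist frequency
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate_hz / n


def amplitude_stats(samples):
    """RMS of the raw trace, peak-to-peak range, and population standard deviation."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return {
        "rms": rms,
        "peak_to_peak": max(samples) - min(samples),
        "std": pstdev(samples),
    }


# Synthetic needle angle (degrees): 45 degree mean with a 2 degree, 25 Hz vibration.
SAMPLE_RATE_HZ = 200
trace = [45.0 + 2.0 * math.sin(2 * math.pi * 25 * i / SAMPLE_RATE_HZ)
         for i in range(200)]

freq = dominant_frequency(trace, SAMPLE_RATE_HZ)
stats = amplitude_stats(trace)
print(f"dominant frequency: {freq:.1f} Hz")
print({name: round(value, 3) for name, value in stats.items()})
```

The frequency resolution is the sample rate divided by the number of samples (1 Hz here), and nothing above half the sample rate can be resolved, so a video's frame rate bounds which vibration components such an analysis can confirm.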

Circularity Check

0 steps flagged

No circularity in empirical dataset evaluation

The paper introduces a novel video dataset of analog gauges under varying motion profiles and performs direct empirical testing of existing VLMs (GPT-5, Gemini 3, etc.) on needle trajectory and scale interpretation tasks. No equations, parameter fitting, or derivations are present; conclusions follow from observed model outputs versus dataset ground truth without any self-definitional loops, fitted inputs renamed as predictions, or load-bearing self-citations. The analysis chain is therefore self-contained and independent of its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical study introducing a dataset and evaluating models; no free parameters, axioms, or invented entities in the central claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1] M. Khanafer and S. Shirmohammadi, "Applied AI in instrumentation and measurement: The deep learning revolution," IEEE Instrumentation & Measurement Magazine, vol. 23, no. 6, pp. 10-17, Sept. 2020.

[2] M. Reitsma, J. Keller, K. Blomqvist, and R. Siegwart, "Under pressure: Learning-based analog gauge reading in the wild," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2024, pp. 16410–16416.

[4] W. Chow, J. Mao, B. Li, D. Seita, V. Guizilini, and Y. Wang, "PhysBench: Benchmarking and enhancing vision-language models for physical world understanding," arXiv preprint arXiv:2501.10674, 2025.

[5] Z. Cheng, J. Hu, Z. Liu, C. Si, W. Li, and S. Gong, "V-star: Benchmarking video-LLMs on video spatio-temporal reasoning," arXiv preprint arXiv:2503.11495, 2025.

[6] J. Leon-Alcazar, Y. Alnumay, C. Zheng, H. Trigui, S. Patel, and B. Ghanem, "Learning to read analog gauges from synthetic data," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), 2024, pp. 4821–4831.

[7] F. Lin et al., "Do vision-language models measure up? Benchmarking visual measurement reading with measurebench," arXiv preprint arXiv:2510.26865, 2025.

[8] Systems and Software Engineering — Systems and Software Quality Requirements and Evaluation (SQuaRE) — Measurement of Data Quality, ISO/IEC 25024:2015, 2015.

[9] Z. Zhong et al., "Rethinking visual chain-of-thought: Spatial grounding vs. linguistic traces," arXiv preprint arXiv:2506.09122, 2025.

Tairan Fu (tairan.fu@polimi.it) is a PhD student at Politecnico di Milano working on the use of AI in manufacturing. Francisco Javier Santos-Martín (francisco.santos@uva.es) is an associate professor at the Universidad d...