Physics-Based Benchmarking Metrics for Multimodal Synthetic Images

Fahad Rahman; Kishor Datta Gupta; Marufa Kamal; Md. Mahfuzur Rahman; Mohd Ariful Haque; Sunzida Siddique

arxiv: 2511.15204 · v3 · submitted 2025-11-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-17 21:09 UTC · model grok-4.3

keywords: PCMDE · synthetic image evaluation · physics-constrained metric · LLM reasoning · vision-language models · structural accuracy · multimodal benchmarking

The pith

A new metric called PCMDE combines vision-language models and large language model reasoning to enforce physics-based structural constraints when evaluating synthetic images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Physics-Constrained Multimodal Data Evaluation (PCMDE) as a way to score how faithfully synthetic multimodal images follow semantic and physical rules. Existing measures such as BLEU, CIDEr, VQA, SigLIP-2 and CLIPScore frequently overlook relational accuracy and domain-specific structure. PCMDE extracts spatial and semantic features, fuses them with confidence weighting, and then applies LLM reasoning to check constraints including alignment, position and consistency. A reader would care because reliable automatic checks could improve training and validation loops for image-generation systems used in simulation, design and scientific visualization. The method is presented as a three-stage pipeline that adapts component-level validation to the input scene.

Core claim

The central claim is that a three-stage architecture—multimodal feature extraction via object detection and vision-language models, confidence-weighted component fusion, and physics-guided reasoning with large language models—overcomes the inability of prior metrics to capture semantic or structural accuracy in synthetic images, particularly in context-dependent or domain-specific cases.

What carries the argument

The PCMDE pipeline, a three-stage process that extracts spatial-semantic features, performs adaptive fusion, and applies LLM-based reasoning to enforce relational constraints such as alignment, position and consistency.
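
The three stages can be sketched as a small data flow. This is only an illustrative skeleton, not the authors' implementation: the `Component` fields, the per-component product used for fusion, the plain Python predicates that stand in for LLM reasoning, and the final multiplication are all assumptions made so the example runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    """One detected part of the scene (hypothetical structure)."""
    name: str          # e.g. "wing"
    det_conf: float    # detector confidence in [0, 1]
    vlm_score: float   # VLM agreement score in [0, 1]


def extract_features(detections: List[Dict]) -> List[Component]:
    """Stage 1 stand-in: wrap precomputed detector and VLM outputs."""
    return [Component(d["name"], d["det_conf"], d["vlm_score"]) for d in detections]


def fuse(components: List[Component]) -> Dict[str, float]:
    """Stage 2 stand-in: weight each VLM score by detector confidence (assumed rule)."""
    return {c.name: c.det_conf * c.vlm_score for c in components}


def reason(fused: Dict[str, float],
           rules: List[Callable[[Dict[str, float]], bool]]) -> float:
    """Stage 3 stand-in: fraction of constraint checks that pass.

    The paper describes LLM reasoning here; predicates are used only so the
    sketch is runnable and deterministic.
    """
    if not rules:
        return 1.0
    return sum(1 for rule in rules if rule(fused)) / len(rules)


def pcmde_like_score(detections: List[Dict],
                     rules: List[Callable[[Dict[str, float]], bool]]) -> float:
    """Combine the stages; the product of the two terms is an assumption."""
    fused = fuse(extract_features(detections))
    mean_component = sum(fused.values()) / len(fused) if fused else 0.0
    return mean_component * reason(fused, rules)
```

A rule such as `lambda fused: "engine" in fused` would then lower the score of an aircraft image in which no engine was detected, which is the kind of component-level penalty the pipeline is meant to express.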

If this is right

Synthetic-image generators can receive automatic feedback on structural violations during training.
Evaluation becomes possible for scenarios where visual similarity alone does not guarantee physical plausibility.
Component-level scores allow targeted diagnosis of which parts of a generated scene fail constraints.
The same pipeline can be reused across different multimodal domains by swapping the knowledge-mapping module.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach might generalize to video or 3D scene generation if the reasoning stage is extended to temporal or volumetric constraints.
Integration with simulation engines could turn PCMDE into an online validator that flags outputs before they reach a user.
If hallucinations remain low, the method could serve as a lightweight substitute for full physics simulators in rapid prototyping loops.

Load-bearing premise

Large language models can perform reliable physics-guided reasoning on extracted image features without introducing hallucinations or domain-specific mistakes.

What would settle it

A controlled test set of synthetic images containing deliberate violations (misaligned objects, inconsistent positions, or physically impossible relations) where PCMDE scores are compared against expert human ratings of structural fidelity.
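
One way to carry out such a comparison is a rank correlation between metric scores and human ratings. The sketch below computes Spearman's rho in plain Python; the choice of Spearman correlation and the function names are assumptions for illustration, since the page does not say which agreement statistic would be used.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(metric_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))
```

A value near 1 on the deliberately corrupted images would suggest the metric orders images the way experts do; a value near 0 for a baseline on the same set would illustrate the gap the paper argues for.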

Figures

Figures reproduced from arXiv: 2511.15204 by Fahad Rahman, Kishor Datta Gupta, Marufa Kamal, Md. Mahfuzur Rahman, Mohd Ariful Haque, Sunzida Siddique.

**Figure 1.** Visually realistic but structurally incorrect images. Both examples violate fundamental aerodynamic or vehicular structure. Multiple metrics (CLIPScore, VQA Score, SigLIP-2, etc.) demonstrate the similarity of image and text features, although image and text features belong to different domains. We do not directly compare image and text; a rule-based approach is used to refine the final result using VLM wi…

**Figure 2.** Overview of the proposed hybrid evaluation pipeline combining Transformer model-based detection, vision-language models, and LLM reasoning with physical consistency rules. … component classes as C = {c1, c2, ..., cM}, where M is the total number of component types in the domain (e.g., M = 5 for aircraft: head, tail, engine, wing, tail wing; M = 13 for cars: wheels, bonnet, windshield, headlights, tailligh…

**Figure 3.** Representative aircraft images exhibiting structural consistency or physical implausibilities. … Std/Mean × 100%) normalizes variability across different scales, enabling fair comparison between metrics with different ranges (CLIP Score ∈ [0, 100] vs. SigLIP-2 ∈ [0, 1]). Following standard interpretation (Everitt & Skrondal, 2010), a moderate CV in range of 10% < CV < 20% represents healthy discriminative …

**Figure 4.** Representative car images showing structural completeness or critical component omissions. … caption. Assuming a [caption] for an image is provided to the VQA model, functioning as follows: “Does this figure show [caption]? Please answer yes or no.” It is highly effective for coarse caption alignment but unresponsive to nuanced physical implausibilities (e.g., mirrored components, impossible part numbers).…

**Figure 5.** Score distributions across 70 synthetic images per dataset. PCMDE (blue) demonstrates the widest absolute dispersion; SigLIP displays significant saturation (approaching zero variance).

**Figure 6.** Inferred images of sample aircraft from test data to illustrate the class names.

**Figure 7.** Inferred photos of sample cars from test data to illustrate the class names.
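
The coefficient of variation quoted alongside the Figure 3 caption (Std/Mean × 100%) is straightforward to compute. The sketch below is a minimal version under stated assumptions: the source does not say whether the population or sample standard deviation is used, so both are offered through a hypothetical `sample` flag, and `is_moderate` simply encodes the 10% < CV < 20% range the caption mentions.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float], sample: bool = False) -> float:
    """Return CV = std / mean * 100, in percent.

    sample=False uses the population standard deviation, sample=True the
    sample standard deviation; which one the paper uses is not stated.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    std = statistics.stdev(scores) if sample else statistics.pstdev(scores)
    return 100.0 * std / mean


def is_moderate(cv_percent: float) -> bool:
    """True when the CV lies in the 10% < CV < 20% range cited in the caption."""
    return 10.0 < cv_percent < 20.0
```

Because both the standard deviation and the mean scale linearly with the score range, the same set of scores expressed on a 0 to 1 scale or a 0 to 100 scale yields the same CV, which is why it permits comparison across metrics with different ranges.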

Original abstract

Current state of the art measures like BLEU, CIDEr, VQA score, SigLIP-2 and CLIPScore are often unable to capture semantic or structural accuracy, especially for domain-specific or context-dependent scenarios. For this, this paper proposes a Physics-Constrained Multimodal Data Evaluation (PCMDE) metric combining large language models with reasoning, knowledge based mapping and vision-language models to overcome these limitations. The architecture is comprised of three main stages: (1) feature extraction of spatial and semantic information with multimodal features through object detection and VLMs; (2) Confidence-Weighted Component Fusion for adaptive component-level validation; and (3) physics-guided reasoning using large language models for structural and relational constraints (e.g., alignment, position, consistency) enforcement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that existing metrics such as BLEU, CIDEr, VQA score, SigLIP-2, and CLIPScore are inadequate for capturing semantic and structural accuracy in multimodal synthetic images, particularly in domain-specific or context-dependent cases. It proposes the Physics-Constrained Multimodal Data Evaluation (PCMDE) metric, which combines LLMs with reasoning, knowledge-based mapping, and VLMs. The architecture consists of three stages: (1) feature extraction of spatial and semantic information via object detection and VLMs, (2) confidence-weighted component fusion for adaptive validation, and (3) physics-guided reasoning using LLMs to enforce structural and relational constraints such as alignment, position, and consistency.

Significance. If the proposed three-stage pipeline were shown to produce scores that reliably detect structural and relational violations where baselines fail, the work could offer a more grounded alternative for evaluating synthetic multimodal data. The integration of physics-guided LLM reasoning with VLM features addresses a recognized gap in current metrics. However, the manuscript supplies only an architectural sketch with no datasets, implementations, ablations, quantitative comparisons, or error analyses, so any assessment of significance remains conditional on future validation.

major comments (3)

Abstract: The central claim that PCMDE 'overcomes these limitations' of BLEU, CIDEr, VQA, SigLIP-2, and CLIPScore is presented without any experimental results, validation data, baseline comparisons, or quantitative evidence that the three-stage pipeline enforces physics constraints more reliably. This absence makes the improvement an untested assertion rather than a demonstrated result.
The manuscript describes the PCMDE architecture but provides no implementation details, dataset, or error analysis for the LLM physics-guided reasoning stage (stage 3). Without such evidence, it is impossible to assess whether the approach avoids hallucinations or domain-specific errors when enforcing constraints like alignment and consistency, which is load-bearing for the claim of superiority over existing metrics.
The evaluation relies on black-box LLMs and VLMs whose outputs define the physics-guided score. No independent benchmarks, parameter-free derivations, or ablation studies are supplied to show that the metric is grounded externally rather than reflecting prompted or fitted behavior.

minor comments (1)

The abstract and description use terms such as 'knowledge based mapping' and 'physics-guided reasoning' without defining their precise mechanisms or how they differ from standard VLM prompting.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We acknowledge that the manuscript presents PCMDE primarily as an architectural proposal without accompanying experiments, datasets, or quantitative results. This limits direct validation of its advantages over existing metrics. We will revise the manuscript to moderate claims, expand on implementation aspects, and clarify the scope as a conceptual framework with planned empirical follow-up.

Point-by-point responses

Referee: Abstract: The central claim that PCMDE 'overcomes these limitations' of BLEU, CIDEr, VQA, SigLIP-2, and CLIPScore is presented without any experimental results, validation data, baseline comparisons, or quantitative evidence that the three-stage pipeline enforces physics constraints more reliably. This absence makes the improvement an untested assertion rather than a demonstrated result.

Authors: We agree that the abstract phrasing is too assertive given the absence of empirical support. The intent was to describe the design motivation for addressing known shortcomings of reference-based and embedding-based metrics in domain-specific settings. We will revise the abstract to state that PCMDE is proposed to address these limitations through its three-stage architecture, removing any implication of demonstrated superiority. revision: yes

Referee: The manuscript describes the PCMDE architecture but provides no implementation details, dataset, or error analysis for the LLM physics-guided reasoning stage (stage 3). Without such evidence, it is impossible to assess whether the approach avoids hallucinations or domain-specific errors when enforcing constraints like alignment and consistency, which is load-bearing for the claim of superiority over existing metrics.

Authors: The referee is correct that no concrete implementation, dataset, or error analysis is provided for stage 3. The current manuscript focuses on the high-level design. In revision we will add pseudocode for the physics-guided reasoning module, explicit prompt templates used to enforce constraints such as spatial alignment and relational consistency, and a dedicated subsection discussing known LLM limitations (hallucinations, domain drift) together with mitigation strategies such as constraint verification loops and few-shot examples drawn from physics principles. revision: partial

Referee: The evaluation relies on black-box LLMs and VLMs whose outputs define the physics-guided score. No independent benchmarks, parameter-free derivations, or ablation studies are supplied to show that the metric is grounded externally rather than reflecting prompted or fitted behavior.

Authors: We accept that the current description does not include ablations or external grounding experiments. The physics constraints are meant to be injected via explicit rule-based mapping and chain-of-thought prompting rather than learned fitting, but without ablations this remains an unverified design choice. We will expand the manuscript with a section on prompt engineering for physics grounding and outline a planned ablation protocol (varying LLM backbone, removing individual stages, and comparing against purely embedding-based baselines) for future work. revision: partial
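
The ablation protocol outlined in this response can be written down as an explicit run list. The following is a hypothetical sketch: the stage names, the dictionary layout, and the decision to run the reasoning-free variant only once (because no LLM backbone is involved in it) are assumptions, not details from the manuscript.

```python
from typing import Dict, List, Optional

# Stage names paraphrase the three stages described in the abstract.
NON_LLM_STAGES = ("feature_extraction", "component_fusion")


def ablation_grid(backbones: List[str]) -> List[Dict[str, Optional[str]]]:
    """Enumerate runs: full pipeline, one-stage-removed variants, and a baseline."""
    configs: List[Dict[str, Optional[str]]] = []
    for backbone in backbones:
        # Full pipeline with this LLM backbone.
        configs.append({"kind": "full", "backbone": backbone, "removed": None})
        # Remove one non-LLM stage at a time.
        for stage in NON_LLM_STAGES:
            configs.append({"kind": "ablated", "backbone": backbone, "removed": stage})
    # Without the reasoning stage no LLM is used, so a single run suffices.
    configs.append({"kind": "ablated", "backbone": None, "removed": "physics_reasoning"})
    # Purely embedding-based comparison point.
    configs.append({"kind": "baseline", "backbone": None, "removed": None})
    return configs
```

With two placeholder backbones this produces eight runs, which keeps the cost of the protocol visible before any model is called.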

Circularity Check

0 steps flagged

No circularity: architectural proposal with no derivations or self-referential reductions

Full rationale

The paper proposes PCMDE as a three-stage pipeline (feature extraction via object detection and VLMs, confidence-weighted fusion, LLM physics-guided reasoning) to address limitations of BLEU/CIDEr/etc. No equations, predictions, or first-principles derivations are presented in the abstract or described architecture. No self-citations, uniqueness theorems, ansatzes, or fitted parameters renamed as outputs appear. The central claim is an untested architectural suggestion rather than a reduction to inputs by construction. This is a normal non-finding for a high-level proposal paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Abstract-only review limits visibility into exact parameters; the approach assumes LLMs can perform reliable constraint checking and that component fusion weights can be set meaningfully.

free parameters (1)

confidence weights in component fusion
Adaptive weights for fusing spatial, semantic, and multimodal features; likely chosen or tuned per scenario.
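
To make the role of these weights concrete, one common form of confidence-weighted fusion is a normalized weighted mean. The expression below is an assumed illustration, not the paper's stated formula: s_i denotes a per-component score, w_i its confidence weight, and M the number of component types in the domain.

```latex
S_{\text{fused}} = \frac{\sum_{i=1}^{M} w_i \, s_i}{\sum_{i=1}^{M} w_i},
\qquad w_i \ge 0, \quad \sum_{i=1}^{M} w_i > 0
```

Under this form the fused score stays within the range of the individual s_i, and any change to the w_i shifts the result, which is why the weights count as a free parameter.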

axioms (1)

Domain assumption: Large language models can accurately perform physics-guided reasoning on image features for constraints like alignment and consistency.
Invoked in stage 3 of the architecture without stated validation.

invented entities (1)

PCMDE metric · no independent evidence
purpose: To provide physics-constrained evaluation of multimodal synthetic images
Newly proposed combination; no independent evidence supplied in abstract.

pith-pipeline@v0.9.0 · 5447 in / 1322 out tokens · 72828 ms · 2026-05-17T21:09:20.664661+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

The architecture is comprised of three main stages: (1) feature extraction ... (2) Confidence-Weighted Component Fusion ... (3) physics-guided reasoning using large language models for structural and relational constraints

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

PCMDE ... provides interpretable diagnostics identifying specific violations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

[1]

Learning Transferable Visual Models From Natural Language Supervision

IEEE, 2019. Oliveira dos Santos, G., Luna Colombini, E., and Avila, S. Cider-r: Robust consensus-based image description evaluation. arXiv e-prints, pp. arXiv–2109, 2021. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models ...

[2]

doi: 10.3390/jimaging11080252

ISSN 2313-433X. doi: 10.3390/jimaging11080252. URL https://www.mdpi.com/2313-433X/11/8/252. Tamayo-Urgilés, D., Sanchez-Gordon, S., Valdivieso Caraguay, Á. L., and Hernández-Álvarez, M. Gan-based generation of synthetic data for vehicle driving events. Applied Sciences, 14(20):9269, 2024. Tian, Y., Ye, Q., and Doermann, D. Yolo12: Attention-cent...
