Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Jiankang Deng; Jifei Song; Shaogang Gong; Yu Cao; Zhensong Zhang; Ziquan Liu

arxiv: 2606.22918 · v1 · pith:OTILUYV6new · submitted 2026-06-22 · 💻 cs.CV · cs.GT

Each Judge Its Own Yardstick: Discovering Per-VLM Taxonomies for Physical Video Evaluation

Yu Cao , Ziquan Liu , Zhensong Zhang , Jiankang Deng , Shaogang Gong , Jifei Song This is my paper

Pith reviewed 2026-06-26 09:01 UTC · model grok-4.3

classification 💻 cs.CV cs.GT

keywords vision-language modelsphysical video evaluationper-VLM taxonomyiterative refinementphysical consistencymodel-specific evaluationJudgeFitcommonsense ratings

0 comments

The pith

Each vision-language model performs better as a video judge when given its own custom taxonomy of physical errors instead of a shared global one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models serve as automated judges for physical consistency in videos generated by world models, but their differing architectures mean they perceive different aspects of physics. A uniform evaluation schema forces all models to use the same dimensions regardless of their capabilities. JudgeFit iteratively builds a tailored taxonomy for each VLM by having it describe errors on videos, clustering those into dimensions, then calibrating against human ratings and fixing unreliable or redundant dimensions. This process is repeated until the taxonomy reliably matches human judgments on new videos. Across 16 VLMs the tailored versions beat the global baseline by 32 percent on average and highlight distinct blind spots per model family.

Core claim

The paper claims that an iterative procedure can discover a per-VLM evaluation taxonomy for physical video evaluation that outperforms any fixed global schema. Starting from the target VLM's own error descriptions on a small video set, dimensions are clustered and then refined by diagnosing performance against human physical-commonsense ratings and prompting repairs. When applied to models from eight families the resulting taxonomies deliver a mean 32 percent relative improvement on held-out videos and reveal model-specific reliability patterns that global rankings obscure.

What carries the argument

JudgeFit, the iterative refinement procedure that constructs and repairs per-VLM taxonomies by combining self-prompted error enumeration with human calibration diagnostics.

If this is right

Automated judges can provide more accurate reward signals for training video generators and world models.
Selection of VLMs for evaluation tasks can be guided by their specific reliability profiles rather than overall rankings.
Data filtering for training sets can avoid model-specific failure modes in physical perception.
Benchmarks can expose distinct blind spots across model families that aggregate scores hide.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on whether the discovered taxonomies can be used to fine-tune the VLMs themselves toward better physical understanding.
Similar adaptive taxonomies might improve evaluation in other modalities such as generated audio or 3D scenes if human ground truth can be collected.
The per-VLM profiles suggest that future comparisons of VLMs should report dimension-level reliability rather than single aggregate scores.

Load-bearing premise

Human physical-commonsense ratings constitute an unbiased external ground truth against which VLM dimension scores can be calibrated without the calibration step itself introducing new biases.

What would settle it

Collecting fresh human ratings on held-out videos with different prompts or raters and finding that the refined per-VLM taxonomy no longer outperforms the global schema by the reported margin would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.22918 by Jiankang Deng, Jifei Song, Shaogang Gong, Yu Cao, Zhensong Zhang, Ziquan Liu.

**Figure 1.** Figure 1: Evaluation under two taxonomies. The fixed four-domain schema (left) concentrates VLM scores on the mechanics axis with the other three axes nearly inactive, offering little resolution on cross-model differences. Our per-VLM refined taxonomy (right) renders every dimension discriminative and reveals distinct strengths across models. 2022; Rocamonde et al., 2024), train large-scale visual reward models for… view at source ↗

**Figure 2.** Figure 2: Overview of JUDGEFIT. Stage 1 (Seed): the VLM judge describes physics errors in free-form text over the annotation set Dannot, and an LLM clusters these descriptions into the initial seed taxonomy T0. Stage 2 (Refine): the VLM scores each video along the current dimensions, a ridge-regression calibrator maps these scores to human ratings y, and the resulting diagnosis R drives an LLM editor that proposes l… view at source ↗

**Figure 3.** Figure 3: Refinement converges quickly and lifts weak seeds. Selection objective on Dannot across refinement rounds, colored by developer (cumulative best per model). The mean trajectory (dashed, ± std) rises most steeply from seed to round 1 and flattens thereafter, with most models plateauing by round 2. Reka-Edge starts below zero at the seed and gains the most from a single round, while strong seeds (ByteDance)… view at source ↗

**Figure 4.** Figure 4: Cross-VLM atomic-concept coverage. Binary coverage matrix over 16 VLMs (rows, ordered by refinement performance) and atomic physical concepts (columns), grouped into canonical types by vertical separators. A cell is filled when some dimension in a VLM’s refined taxonomy references that concept. Top marginal: number of VLMs covering each concept; right marginal: number of concepts each VLM covers. A shared … view at source ↗

**Figure 5.** Figure 5: Bias-by-discrimination typology of the 16 VLMs. Each VLM is placed by its bias (x; left strict, right lenient) and discrimination ∆ (y), colored by refined Spearman ρ. The dashed line marks the median bias, splitting strict from lenient. Judges split cleanly along the bias axis instead, 8 strict and 8 lenient, differing in scale placement, not discrimination. Bias and discrimination are coupled: more len… view at source ↗

**Figure 6.** Figure 6: LLM editor prompt used in the refine stage. Fields in braces are filled per VLM and per round. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

Maintaining physical consistency in video generators and world models increasingly relies on vision-language models (VLMs) as automated judges that provide reward signals, ranking decisions, and data-filtering criteria. Yet VLMs differ substantially in training data and architecture, encoding physical phenomena through distinct internal representations. A single global evaluation schema therefore gives every VLM the same axes of competence, regardless of what each can actually perceive. We propose JudgeFit, an iterative refinement procedure that discovers a per-VLM evaluation taxonomy. An initial taxonomy is constructed by prompting the target VLM to enumerate physics errors on a small set of videos and clustering the resulting descriptions. The taxonomy is then refined through a diagnostic step: we calibrate the VLM's per-dimension scores to human physical-commonsense ratings, diagnose which dimensions it scores unreliably or redundantly, and prompt an LLM to repair them, iterating until convergence. We further instantiate this procedure as a benchmark and apply it to 16 VLMs spanning eight model families. The refined taxonomy outperforms the global-schema baseline on held-out videos for every VLM tested, with a mean relative improvement of approximately 32%. Beyond aggregate accuracy, the per-VLM profiles expose model-specific blind spots that overall rankings cannot anticipate, with reliability patterns differing markedly across model families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

JudgeFit builds per-VLM physical-error taxonomies through VLM clustering plus human calibration and LLM repair, reporting 32% relative lift over a fixed schema, but the calibration step shares the same human ratings used for evaluation.

read the letter

The main point is that the authors give a concrete iterative procedure to create a custom taxonomy for each VLM. It starts by prompting the target model to list physics errors on a small video set, clusters the descriptions, scores the resulting dimensions against human physical-commonsense ratings, flags unreliable or redundant ones from the residuals, and uses an LLM to repair them until the scores stabilize. They run this on 16 VLMs from eight families and report a mean 32% relative improvement on held-out videos versus a single global schema.

The procedure is new in its full loop and the multi-model application is useful. The per-VLM profiles do surface family-level differences in which dimensions hold up, which a shared schema would miss. That part is practical for anyone who needs a judge tuned to a specific model's actual strengths.

The soft spot is the evaluation symmetry. Calibration to human ratings happens inside the refinement loop, and the held-out test appears to use those same ratings as the target. The global baseline receives no parallel calibration, so the reported gain could reflect better fitting to the human data rather than discovery of independent axes. The abstract supplies no dataset sizes, variance numbers, significance tests, or checks for rater or prompt sensitivity, which leaves the 32% figure difficult to interpret.

This is aimed at groups that already use VLMs to filter or reward generated video. Readers who need model-specific diagnostics would find the profiles worth examining.

Send it for review. The workflow is straightforward and the experiment covers enough models to be worth referee time, but expect questions on whether the lift survives an independent rating pool or a held-out calibration set.

Referee Report

3 major / 2 minor

Summary. The paper proposes JudgeFit, an iterative procedure to discover per-VLM taxonomies for physical video evaluation. An initial taxonomy is generated by prompting the target VLM on a small video set and clustering descriptions; the taxonomy is then refined by calibrating VLM per-dimension scores to human physical-commonsense ratings, diagnosing unreliable or redundant dimensions from residuals, and prompting an LLM to repair them until convergence. Applied to 16 VLMs across eight families, the refined per-VLM taxonomies outperform a global-schema baseline on held-out videos with a mean relative improvement of ~32%, while also exposing model-specific blind spots.

Significance. If the central numerical claim survives scrutiny on statistical details and evaluation symmetry, the work is significant for automated evaluation of video generators and world models. It demonstrates that tailored, per-VLM axes can yield more reliable reward signals and data filters than a single global schema, and the multi-family evaluation provides evidence that reliability patterns vary systematically by architecture and training data.

major comments (3)

[Abstract] Abstract: The abstract states a 32% mean relative improvement on held-out videos but supplies no information on dataset size, statistical significance, variance across models, how held-out videos were chosen, or whether the improvement survives multiple-testing correction. Without these details the central numerical claim cannot be assessed.
[Abstract / §3] Abstract / §3 (procedure): Calibration of VLM scores to human ratings occurs inside the taxonomy-refinement loop, and the same ratings implicitly serve as the evaluation target on held-out videos. The global-schema baseline receives no equivalent calibration step, so the reported outperformance may reflect asymmetric fitting to rater-specific biases rather than discovery of independent per-VLM axes.
[§4] §4 (experiments): The paper does not report whether the 32% lift is robust when human ratings are collected under a different prompt template or from an independent rater pool, which is required to isolate genuine gain from circular dependence on the calibration signal.

minor comments (2)

[§2] §2: The clustering step for the initial taxonomy would benefit from an explicit description of the distance metric and linkage method used.
[Figure 3] Figure 3: Axis labels and legend entries are too small to read at standard print size; consider enlarging or splitting into multiple panels.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major point below with clarifications on the reported results and evaluation protocol, and indicate revisions where the manuscript will be updated.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states a 32% mean relative improvement on held-out videos but supplies no information on dataset size, statistical significance, variance across models, how held-out videos were chosen, or whether the improvement survives multiple-testing correction. Without these details the central numerical claim cannot be assessed.

Authors: The abstract is intentionally concise. Full details on dataset size, the held-out selection procedure (disjoint split from the calibration videos), variance across the 16 models, and statistical testing (including any corrections) appear in §4. We will revise the abstract to include the evaluation scale and a reference to the statistical analysis performed. revision: yes
Referee: [Abstract / §3] Abstract / §3 (procedure): Calibration of VLM scores to human ratings occurs inside the taxonomy-refinement loop, and the same ratings implicitly serve as the evaluation target on held-out videos. The global-schema baseline receives no equivalent calibration step, so the reported outperformance may reflect asymmetric fitting to rater-specific biases rather than discovery of independent per-VLM axes.

Authors: The calibration step is the core diagnostic mechanism of JudgeFit: human ratings on a calibration subset are used only to identify unreliable or redundant dimensions via residuals and to prompt repairs to the taxonomy. Held-out videos are disjoint from the calibration set, and both the refined per-VLM taxonomy and the global-schema baseline are evaluated against the same human ratings on the held-out set. The asymmetry is by design—the global baseline represents the standard unrefined comparator. We will add explicit wording in §3 to state the disjoint split and the purpose of calibration. revision: partial
Referee: [§4] §4 (experiments): The paper does not report whether the 32% lift is robust when human ratings are collected under a different prompt template or from an independent rater pool, which is required to isolate genuine gain from circular dependence on the calibration signal.

Authors: We acknowledge that robustness to alternative rating protocols is not reported. The current results use one rater pool and template. We will add a limitations paragraph in the revised §4 noting this and framing it as an avenue for future validation, while observing that consistent gains across eight model families provide indirect support for the method's stability. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's procedure generates an initial taxonomy via VLM prompting on a small video set, refines it by calibrating per-dimension scores against external human physical-commonsense ratings, and evaluates the refined taxonomy on held-out videos using the same human ratings as benchmark. This constitutes a standard calibration-plus-held-out-test workflow against an independent external signal rather than a self-referential reduction. No equations, self-citations, uniqueness theorems, or ansatzes are quoted that would force the reported 32% relative improvement by construction from the inputs. The global-schema baseline comparison, while asymmetric, does not create a definitional loop within the claimed derivation.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The method rests on the assumption that human physical-commonsense ratings are a stable external reference and that LLM-based dimension repair can be performed without introducing new inconsistencies. No free parameters are explicitly named in the abstract; the number of taxonomy dimensions and the clustering threshold are implicit but not quantified.

free parameters (1)

number of taxonomy dimensions after refinement
Determined by the iterative diagnosis and repair loop; the abstract does not state how many dimensions result or how the stopping criterion is set.

axioms (2)

domain assumption VLMs can produce enumerations of physics errors when prompted on video clips
This is the starting point for constructing the initial taxonomy.
domain assumption Human physical-commonsense ratings provide an unbiased calibration target
Used to diagnose unreliable or redundant dimensions.

pith-pipeline@v0.9.1-grok · 5779 in / 1571 out tokens · 19352 ms · 2026-06-26T09:01:47.993400+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

52 extracted references · 13 linked inside Pith

[1]

Advances in neural information processing systems , volume=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=
[2]

Advances in Neural Information Processing Systems , volume=

Alpacafarm: A simulation framework for methods that learn from human feedback , author=. Advances in Neural Information Processing Systems , volume=
[3]

Forty-first International Conference on Machine Learning , year=

Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark , author=. Forty-first International Conference on Machine Learning , year=
[4]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024
[5]

International Conference on Learning Representations , volume=

Vision-language models are zero-shot reward models for reinforcement learning , author=. International Conference on Learning Representations , volume=
[6]

arXiv preprint arXiv:2509.08826 , year=

Rewarddance: Reward scaling in visual generation , author=. arXiv preprint arXiv:2509.08826 , year=

arXiv
[7]

Advances in Neural Information Processing Systems , volume=

Improving video generation with human feedback , author=. Advances in Neural Information Processing Systems , volume=
[8]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Large language models are not fair evaluators , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[9]

Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024) , pages=

Calibrating llm-based evaluator , author=. Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024) , pages=

2024
[10]

International Conference on Learning Representations , volume=

Videophy: Evaluating physical commonsense for video generation , author=. International Conference on Learning Representations , volume=
[11]

arXiv preprint arXiv:2503.06800 , year=

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation , author=. arXiv preprint arXiv:2503.06800 , year=

arXiv
[12]

arXiv preprint arXiv:2410.05363 , year=

Towards world simulator: Crafting physical commonsense-based benchmark for video generation , author=. arXiv preprint arXiv:2410.05363 , year=

Pith/arXiv arXiv
[13]

International Conference on Learning Representations , volume=

Physbench: Benchmarking and enhancing vision-language models for physical world understanding , author=. International Conference on Learning Representations , volume=
[14]

Advances in Neural Information Processing Systems , volume=

Seephys: Does seeing help thinking?--benchmarking vision-based physics reasoning , author=. Advances in Neural Information Processing Systems , volume=
[15]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Do generative video models understand physical principles? , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=
[16]

International Conference on Learning Representations , volume=

Large language models as optimizers , author=. International Conference on Learning Representations , volume=
[17]

gradient descent

Automatic prompt optimization with “gradient descent” and beam search , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

2023
[18]

Nature , volume=

Optimizing generative ai by backpropagating language model feedback , author=. Nature , volume=. 2025 , publisher=

2025
[19]

arXiv preprint arXiv:2310.03714 , year=

Dspy: Compiling declarative language model calls into self-improving pipelines , author=. arXiv preprint arXiv:2310.03714 , year=

Pith/arXiv arXiv
[20]

Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology , pages=

Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences , author=. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology , pages=
[21]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Llm-rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[22]

Findings of the Association for Computational Linguistics: EACL 2026 , pages=

Learning to judge: LLMs designing and applying evaluation rubrics , author=. Findings of the Association for Computational Linguistics: EACL 2026 , pages=

2026
[23]

OpenAI Blog , volume=

Video generation models as world simulators , author=. OpenAI Blog , volume=
[24]

International Conference on Learning Representations , volume=

Cogvideox: Text-to-video diffusion models with an expert transformer , author=. International Conference on Learning Representations , volume=
[25]

arXiv preprint arXiv:2503.20314 , year=

Wan: Open and advanced large-scale video generative models , author=. arXiv preprint arXiv:2503.20314 , year=

Pith/arXiv arXiv
[26]

arXiv preprint arXiv:2501.03575 , year=

Cosmos world foundation model platform for physical ai , author=. arXiv preprint arXiv:2501.03575 , year=

Pith/arXiv arXiv
[27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vbench: Comprehensive benchmark suite for video generative models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=
[28]

Advances in Neural Information Processing Systems , volume=

Imagereward: Learning and evaluating human preferences for text-to-image generation , author=. Advances in Neural Information Processing Systems , volume=
[29]

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[30]

Judging the judges: A systematic study of position bias in llm-as-a-judge , author=. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , pages=
[31]

arXiv preprint arXiv:2106.08261 , year=

Physion: Evaluating physical prediction from vision in humans and machines , author=. arXiv preprint arXiv:2106.08261 , year=

arXiv
[32]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024
[33]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=
[34]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=
[35]

, author=

A constant error in psychological ratings. , author=. Journal of applied psychology , volume=. 1920 , publisher=

1920
[36]

2025 , publisher=

Item response theory: Foundations for psychologists and social scientists , author=. 2025 , publisher=

2025
[37]

Advances in Neural Information Processing Systems , volume=

Llm evaluators recognize and favor their own generations , author=. Advances in Neural Information Processing Systems , volume=
[38]

arXiv preprint arXiv:2410.21819 , year=

Self-preference bias in llm-as-a-judge , author=. arXiv preprint arXiv:2410.21819 , year=

Pith/arXiv arXiv
[39]

The Innovation , year=

A survey on llm-as-a-judge , author=. The Innovation , year=
[40]

arXiv preprint arXiv:2412.05579 , year=

Llms-as-judges: a comprehensive survey on llm-based evaluation methods , author=. arXiv preprint arXiv:2412.05579 , year=

Pith/arXiv arXiv
[41]

arXiv preprint arXiv:2507.06261 , year=

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

Pith/arXiv arXiv
[42]

arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv
[43]

2025 , url=

Bai, Shuai and others , journal=. 2025 , url=

2025
[44]

arXiv preprint arXiv:2404.12387 , year=

Reka core, flash, and edge: A series of powerful multimodal language models , author=. arXiv preprint arXiv:2404.12387 , year=

arXiv
[45]

arXiv preprint arXiv:2507.01006 , year=

Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning , author=. arXiv preprint arXiv:2507.01006 , year=

Pith/arXiv arXiv
[46]

arXiv preprint arXiv:2601.02780 , year=

Mimo-v2-flash technical report , author=. arXiv preprint arXiv:2601.02780 , year=

Pith/arXiv arXiv
[47]

arXiv preprint arXiv:2604.26752 , year=

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents , author=. arXiv preprint arXiv:2604.26752 , year=

Pith/arXiv arXiv
[48]

arXiv preprint arXiv:2408.03326 , year=

Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

Pith/arXiv arXiv
[49]

arXiv preprint arXiv:2504.10479 , year=

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author=. arXiv preprint arXiv:2504.10479 , year=

Pith/arXiv arXiv
[50]

2025 , institution=

Amazon Nova 2: Multimodal reasoning and generation models , author=. 2025 , institution=

2025
[51]

0 model card: Towards intelligence frontier for real-world complexity , author=

Seed2. 0 model card: Towards intelligence frontier for real-world complexity , author=. 2026 , institution=

2026
[52]

2026 , howpublished =

2026

[1] [1]

Advances in neural information processing systems , volume=

Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=

[2] [2]

Advances in Neural Information Processing Systems , volume=

Alpacafarm: A simulation framework for methods that learn from human feedback , author=. Advances in Neural Information Processing Systems , volume=

[3] [3]

Forty-first International Conference on Machine Learning , year=

Mllm-as-a-judge: Assessing multimodal llm-as-a-judge with vision-language benchmark , author=. Forty-first International Conference on Machine Learning , year=

[4] [4]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

Videoscore: Building automatic metrics to simulate fine-grained human feedback for video generation , author=. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages=

2024

[5] [5]

International Conference on Learning Representations , volume=

Vision-language models are zero-shot reward models for reinforcement learning , author=. International Conference on Learning Representations , volume=

[6] [6]

arXiv preprint arXiv:2509.08826 , year=

Rewarddance: Reward scaling in visual generation , author=. arXiv preprint arXiv:2509.08826 , year=

arXiv

[7] [7]

Advances in Neural Information Processing Systems , volume=

Improving video generation with human feedback , author=. Advances in Neural Information Processing Systems , volume=

[8] [8]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Large language models are not fair evaluators , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[9] [9]

Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024) , pages=

Calibrating llm-based evaluator , author=. Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (lrec-coling 2024) , pages=

2024

[10] [10]

International Conference on Learning Representations , volume=

Videophy: Evaluating physical commonsense for video generation , author=. International Conference on Learning Representations , volume=

[11] [11]

arXiv preprint arXiv:2503.06800 , year=

Videophy-2: A challenging action-centric physical commonsense evaluation in video generation , author=. arXiv preprint arXiv:2503.06800 , year=

arXiv

[12] [12]

arXiv preprint arXiv:2410.05363 , year=

Towards world simulator: Crafting physical commonsense-based benchmark for video generation , author=. arXiv preprint arXiv:2410.05363 , year=

Pith/arXiv arXiv

[13] [13]

International Conference on Learning Representations , volume=

Physbench: Benchmarking and enhancing vision-language models for physical world understanding , author=. International Conference on Learning Representations , volume=

[14] [14]

Advances in Neural Information Processing Systems , volume=

Seephys: Does seeing help thinking?--benchmarking vision-based physics reasoning , author=. Advances in Neural Information Processing Systems , volume=

[15] [15]

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

Do generative video models understand physical principles? , author=. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision , pages=

[16] [16]

International Conference on Learning Representations , volume=

Large language models as optimizers , author=. International Conference on Learning Representations , volume=

[17] [17]

gradient descent

Automatic prompt optimization with “gradient descent” and beam search , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

2023

[18] [18]

Nature , volume=

Optimizing generative ai by backpropagating language model feedback , author=. Nature , volume=. 2025 , publisher=

2025

[19] [19]

arXiv preprint arXiv:2310.03714 , year=

Dspy: Compiling declarative language model calls into self-improving pipelines , author=. arXiv preprint arXiv:2310.03714 , year=

Pith/arXiv arXiv

[20] [20]

Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology , pages=

Who validates the validators? aligning llm-assisted evaluation of llm outputs with human preferences , author=. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology , pages=

[21] [21]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Llm-rubric: A multidimensional, calibrated approach to automated evaluation of natural language texts , author=. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[22] [22]

Findings of the Association for Computational Linguistics: EACL 2026 , pages=

Learning to judge: LLMs designing and applying evaluation rubrics , author=. Findings of the Association for Computational Linguistics: EACL 2026 , pages=

2026

[23] [23]

OpenAI Blog , volume=

Video generation models as world simulators , author=. OpenAI Blog , volume=

[24] [24]

International Conference on Learning Representations , volume=

Cogvideox: Text-to-video diffusion models with an expert transformer , author=. International Conference on Learning Representations , volume=

[25] [25]

arXiv preprint arXiv:2503.20314 , year=

Wan: Open and advanced large-scale video generative models , author=. arXiv preprint arXiv:2503.20314 , year=

Pith/arXiv arXiv

[26] [26]

arXiv preprint arXiv:2501.03575 , year=

Cosmos world foundation model platform for physical ai , author=. arXiv preprint arXiv:2501.03575 , year=

Pith/arXiv arXiv

[27] [27]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vbench: Comprehensive benchmark suite for video generative models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[28] [28]

Advances in Neural Information Processing Systems , volume=

Imagereward: Learning and evaluating human preferences for text-to-image generation , author=. Advances in Neural Information Processing Systems , volume=

[29] [29]

International Conference on Machine Learning , pages=

Scaling laws for reward model overoptimization , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[30] [30]

Judging the judges: A systematic study of position bias in llm-as-a-judge , author=. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics , pages=

[31] [31]

arXiv preprint arXiv:2106.08261 , year=

Physion: Evaluating physical prediction from vision in humans and machines , author=. arXiv preprint arXiv:2106.08261 , year=

arXiv

[32] [32]

National Science Review , volume=

A survey on multimodal large language models , author=. National Science Review , volume=. 2024 , publisher=

2024

[33] [33]

Advances in neural information processing systems , volume=

Deep reinforcement learning from human preferences , author=. Advances in neural information processing systems , volume=

[34] [34]

Advances in neural information processing systems , volume=

Training language models to follow instructions with human feedback , author=. Advances in neural information processing systems , volume=

[35] [35]

, author=

A constant error in psychological ratings. , author=. Journal of applied psychology , volume=. 1920 , publisher=

1920

[36] [36]

2025 , publisher=

Item response theory: Foundations for psychologists and social scientists , author=. 2025 , publisher=

2025

[37] [37]

Advances in Neural Information Processing Systems , volume=

Llm evaluators recognize and favor their own generations , author=. Advances in Neural Information Processing Systems , volume=

[38] [38]

arXiv preprint arXiv:2410.21819 , year=

Self-preference bias in llm-as-a-judge , author=. arXiv preprint arXiv:2410.21819 , year=

Pith/arXiv arXiv

[39] [39]

The Innovation , year=

A survey on llm-as-a-judge , author=. The Innovation , year=

[40] [40]

arXiv preprint arXiv:2412.05579 , year=

Llms-as-judges: a comprehensive survey on llm-based evaluation methods , author=. arXiv preprint arXiv:2412.05579 , year=

Pith/arXiv arXiv

[41] [41]

arXiv preprint arXiv:2507.06261 , year=

Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities , author=. arXiv preprint arXiv:2507.06261 , year=

Pith/arXiv arXiv

[42] [42]

arXiv preprint arXiv:2502.13923 , year=

Pith/arXiv arXiv

[43] [43]

2025 , url=

Bai, Shuai and others , journal=. 2025 , url=

2025

[44] [44]

arXiv preprint arXiv:2404.12387 , year=

Reka core, flash, and edge: A series of powerful multimodal language models , author=. arXiv preprint arXiv:2404.12387 , year=

arXiv

[45] [45]

arXiv preprint arXiv:2507.01006 , year=

Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning , author=. arXiv preprint arXiv:2507.01006 , year=

Pith/arXiv arXiv

[46] [46]

arXiv preprint arXiv:2601.02780 , year=

Mimo-v2-flash technical report , author=. arXiv preprint arXiv:2601.02780 , year=

Pith/arXiv arXiv

[47] [47]

arXiv preprint arXiv:2604.26752 , year=

GLM-5V-Turbo: Toward a Native Foundation Model for Multimodal Agents , author=. arXiv preprint arXiv:2604.26752 , year=

Pith/arXiv arXiv

[48] [48]

arXiv preprint arXiv:2408.03326 , year=

Llava-onevision: Easy visual task transfer , author=. arXiv preprint arXiv:2408.03326 , year=

Pith/arXiv arXiv

[49] [49]

arXiv preprint arXiv:2504.10479 , year=

Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models , author=. arXiv preprint arXiv:2504.10479 , year=

Pith/arXiv arXiv

[50] [50]

2025 , institution=

Amazon Nova 2: Multimodal reasoning and generation models , author=. 2025 , institution=

2025

[51] [51]

0 model card: Towards intelligence frontier for real-world complexity , author=

Seed2. 0 model card: Towards intelligence frontier for real-world complexity , author=. 2026 , institution=

2026

[52] [52]

2026 , howpublished =

2026