
arxiv: 2606.28322 · v1 · pith:EOEAECDEnew · submitted 2026-06-26 · 💻 cs.CV

PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

Pith reviewed 2026-06-29 03:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords: multimodal evaluation · perception rubrics · gated scoring · image understanding · human alignment · benchmark brittleness · consensus captions

The pith

PerceptionRubrics uses gated scoring on atomic rubrics to align multimodal evaluation more closely with human perception than conventional benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PerceptionRubrics as a framework that replaces holistic scoring with detailed, instance-specific rubrics for 1,038 images. It derives these rubrics from consensus golden captions and divides them into Must-Right facts and Easy-Wrong details. A gated mechanism then applies sharp penalties when mandatory facts are missed, rather than allowing errors to average out. This approach exposes cases where models handle isolated elements yet fail overall perceptual requirements. The authors report that the resulting scores track human judgments better than standard metrics.

Core claim

PerceptionRubrics shows that models frequently succeed on fragmented elements but fail strict conjunctive constraints, that an 8 percent perception gap persists between open-source and proprietary systems, and that gated metrics achieve stronger human alignment by treating strict perceptual fidelity as a prerequisite for reliable generation.

What carries the argument

The Gated Scoring mechanism, which imposes binary penalties on failures in Must-Right rubrics while allowing separate assessment of Easy-Wrong details.
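
To make the contrast with linear averaging concrete, here is a minimal sketch of one possible gate. The summary above does not give the paper's exact formula, so the rule used here (any Must-Right failure zeroes the score; otherwise the score is the Easy-Wrong pass rate) is an assumption, and `RubricResult`, `linear_score`, and `gated_score` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Binary pass/fail verdicts for one image-response pair (hypothetical container)."""
    must_right: list[bool]   # essential facts
    easy_wrong: list[bool]   # fine-grained, error-prone details


def linear_score(r: RubricResult) -> float:
    """Plain average over all rubrics: a missed fact can be averaged out."""
    verdicts = r.must_right + r.easy_wrong
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def gated_score(r: RubricResult) -> float:
    """Assumed gate, not the paper's confirmed formula:
    any Must-Right failure zeroes the score; otherwise
    the score is the Easy-Wrong pass rate."""
    if not all(r.must_right):
        return 0.0
    if not r.easy_wrong:
        return 1.0
    return sum(r.easy_wrong) / len(r.easy_wrong)


# One missed essential fact, every fine-grained detail correct.
example = RubricResult(must_right=[True, True, False], easy_wrong=[True] * 7)
print(linear_score(example))  # 0.9
print(gated_score(example))   # 0.0
```

Under this assumed rule, a response that gets nine of ten rubrics right still scores zero when the missing one is a Must-Right fact, which is the conjunctive behaviour the summary describes.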

If this is right

  • Models must satisfy every essential visual fact to receive high scores rather than compensating through partial credit.
  • A consistent perception deficit of roughly 8 percent separates open-source models from proprietary ones even on dense scenes.
  • Reliable generation requires strict perceptual fidelity as a foundation rather than relying on average-case semantic overlap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training pipelines could incorporate similar conjunctive constraints as auxiliary losses to reduce brittleness on complex images.
  • The same gating structure might transfer to video or 3D evaluation where temporal or spatial conjunctions matter.
  • Benchmark designers in other modalities could test whether replacing linear averages with mandatory-fact gates improves correlation with human preference data.

Load-bearing premise

The rubrics produced by the circular peer-review pipeline on golden captions accurately capture human perception without systematic bias.

What would settle it

A controlled study in which human raters assign substantially different overall quality rankings to model outputs than the gated rubric scores would falsify the claim of superior alignment.
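
A check of this kind comes down to comparing two rankings of the same models. The sketch below computes Pearson and Spearman correlation with the standard library only. The numbers are made up for illustration and are not results from the paper, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative, made-up values for five models (not taken from the paper).
human_preference = [1210.0, 1185.0, 1150.0, 1120.0, 1090.0]
gated_scores = [0.66, 0.61, 0.60, 0.52, 0.47]

print("pearson ", round(pearson(human_preference, gated_scores), 3))
print("spearman", round(spearman(human_preference, gated_scores), 3))
```

A low or negative Spearman value on real human rankings would be the kind of evidence that counts against the alignment claim. With only a handful of models, a single swapped pair moves the coefficient substantially, so the sample size matters when reading such a number.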

Figures

Figures reproduced from arXiv: 2606.28322 by Daxin Jiang, En Yu, Hangyu Guo, Han Zhou, Haodong Li, Hongbo Peng, Jianjian Sun, Kangheng Lin, Keyu Lv, Liang Zhao, Mitt Huang, Vishal M. Patel, Xiangyu Zhang, Yana Wei, Yanlin Lai, Yin Tang, Zheng Ge.

Figure 1. Motivation of PerceptionRubrics. Top: An existing benchmark favors GPT-4o despite key omissions, while humans prefer responses that capture more perceptually important details. Bottom: Compared with DetailCaps and DOCCI, PerceptionRubrics more clearly distinguishes model capabilities.
Figure 2. Rubric Demonstration of PerceptionRubrics. Representative examples are selected for each task, highlighting “Must Right” (essential features) and “Easy Wrong” pitfalls (error-prone fine-grained details).
Figure 3. Benchmark Statistics of PerceptionRubrics: The distribution of tasks across 7 main categories.
Figure 4. The PerceptionRubrics Construction Pipeline. Adopting a caption-centric approach, we first synthesize golden captions via circular peer-review (Top). These captions then serve as anchors to generate Must-Right and Easy-Wrong rubrics through domain-specific prompting (Bottom).
Figure 5. Distribution of golden caption lengths in our benchmark. The histogram shows the word count frequency across the dataset.
Figure 8. Rubric Coverage vs. Evaluation Stability. As the sampled rubric ratio increases from 20% to 80%, the standard deviation of model scores decreases.
Figure 9. Alignment with Human Preference. We compare benchmark scores from DOCCI (Onoe et al., 2024), DetailCaps (Dong et al., 2024), and PerceptionRubrics against human preference scores from Vision Arena for the five overlapping models. Each point denotes one model. PerceptionRubrics shows the strongest correlation with Vision Arena, achieving Pearson 0.916 and Spearman 1.000.
Figure 10. (a-b) Length Bias. The two figures examine the correlation between response length (word count) and benchmark scores. (c) Evaluation Robustness. Results obtained with different judges exhibit consistent and stable performance trends.
Figure 11. Distribution analysis of rubrics. (a) Frequency distribution of the total rubrics count across the dataset. (b) Probability density comparison of rubrics count between Must-Right and Easy-Wrong categories.
Figure 12. Qualitative examples of the fine-grained rubrics across four categories: Natural Scene, Document & OCR, Digital UI & UX, and Structured Data. Each example consists of an image and two tiers of rubrics: Must-Right (top group) focusing on core facts, and Easy-Wrong (bottom group) focusing on challenging details, negative constraints, and logical reasoning.
Figure 13. Qualitative examples of the fine-grained rubrics across three additional categories: Logic & Puzzle, STEM & Expert, and Creative & Cultural. Each example consists of an image and two tiers of rubrics: Must-Right (top group) focusing on core facts, and Easy-Wrong (bottom group) focusing on challenging details, negative constraints, and logical reasoning.
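
The rubric-subsampling check named in the Figure 8 caption can be sketched as follows. This is a simplified illustration under stated assumptions: it draws from a single pool of synthetic pass/fail verdicts and scores with a plain pass rate rather than the paper's gated metric, and `subsample_std` is a hypothetical helper. The three runs per ratio follow the setup the page describes.

```python
import random
from statistics import pstdev


def pass_rate(verdicts):
    """Fraction of rubrics passed."""
    return sum(verdicts) / len(verdicts)


def subsample_std(verdicts, ratio, runs=3, seed=0):
    """Standard deviation of the pass rate over `runs` random
    subsets, each containing `ratio` of the rubrics."""
    rng = random.Random(seed)
    k = max(1, round(ratio * len(verdicts)))
    scores = [pass_rate(rng.sample(verdicts, k)) for _ in range(runs)]
    return pstdev(scores)


# Synthetic verdicts (assumption): 200 rubrics, 70% passed.
verdicts = [True] * 140 + [False] * 60

for ratio in (0.2, 0.4, 0.6, 0.8):
    print(f"ratio={ratio:.1f}  std={subsample_std(verdicts, ratio):.4f}")
```

With only three runs per ratio the estimate is noisy, so a single execution need not show a strictly decreasing trend. Averaging over more seeds gives a steadier picture.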
Original abstract

We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PerceptionRubrics, a rubric-based evaluation framework for multimodal models that shifts from holistic semantic matching to atomic auditing. It pairs 1,038 information-dense images with over 12,000 instance-specific rubrics derived from golden captions via a novel Circular Peer-Review consensus pipeline; these are distilled into Must-Right (essential facts) and Easy-Wrong (fine-grained details) criteria. The framework uses a Gated Scoring mechanism that applies sharp binary penalties on failure of mandatory visual facts. The authors report three main findings: a reliability gap where models succeed on fragments but fail conjunctive constraints; a persistent 8% perception deficit between open-source and proprietary models; and superior human alignment of the gated metrics relative to conventional benchmarks, validating strict perceptual fidelity as a prerequisite for reliable generation.

Significance. If the results hold after proper validation, the work could meaningfully advance multimodal evaluation by exposing brittleness hidden by saturated benchmarks and by supplying a more perceptually grounded alternative that emphasizes conjunctive constraints over linear averages.

major comments (2)
  1. [Abstract] The abstract asserts strong conclusions (an 8% open-closed deficit, reliability gaps, and superior human alignment of gated metrics) yet supplies no experimental details, statistical tests, model lists, or validation of the consensus pipeline, leaving the central claims unsupported in the available text.
  2. [Methods: Circular Peer-Review consensus pipeline] The pipeline for constructing golden captions is presented as model-driven and iterative. Without an explicit independent human validation step that directly compares the resulting Must-Right/Easy-Wrong rubrics against fresh human perceptual judgments on the 1,038 images, the reported human-alignment advantage risks circular reinforcement and cannot be treated as independent evidence.
minor comments (1)
  1. [Abstract] The abstract refers to 'extensive evaluation' and 'conventional benchmarks' without naming the specific models, datasets, or baseline metrics used for the alignment comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below.

  1. Referee: [Abstract] The abstract asserts strong conclusions (an 8% open-closed deficit, reliability gaps, and superior human alignment of gated metrics) yet supplies no experimental details, statistical tests, model lists, or validation of the consensus pipeline, leaving the central claims unsupported in the available text.

    Authors: The abstract is intended as a concise summary of the key contributions and findings. The experimental details, including model lists, statistical tests, and validation of the consensus pipeline, are fully described in the Methods and Results sections of the manuscript. To better support the claims, we will revise the abstract to include brief references to the evaluation scale, the models evaluated, and the nature of the statistical validation.
    Revision: yes

  2. Referee: [Methods: Circular Peer-Review consensus pipeline] The pipeline for constructing golden captions is presented as model-driven and iterative. Without an explicit independent human validation step that directly compares the resulting Must-Right/Easy-Wrong rubrics against fresh human perceptual judgments on the 1,038 images, the reported human-alignment advantage risks circular reinforcement and cannot be treated as independent evidence.

    Authors: We acknowledge the importance of ensuring that the human-alignment validation is independent of the model-driven pipeline. The Circular Peer-Review is used for scalable rubric generation, but human alignment is assessed via separate experiments involving human judges. To mitigate any perception of circularity, we will revise the Methods section to explicitly detail the independent human validation steps used to verify the rubrics against fresh human perceptual judgments.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper introduces PerceptionRubrics as a new framework whose rubrics are derived from golden captions via the described Circular Peer-Review consensus pipeline. No equations, fitted parameters, or self-citations are shown that reduce the central claims (gated metrics out-aligning benchmarks, perceptual fidelity as a prerequisite) to their inputs by construction. The pipeline is presented as a novel method for creating evaluation criteria, and the alignment claims rest on the resulting rubrics rather than on a self-referential loop or a renamed known result. The derivation is therefore self-contained with respect to external benchmarks: no load-bearing step collapses, by definition, to a fit or to the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that human perception can be decomposed into atomic, instance-specific rubrics that a consensus process can reliably extract; no free parameters or invented physical entities are described.

axioms (1)
  • domain assumption: Human perception of dense visual scenes can be faithfully represented by instance-specific Must-Right and Easy-Wrong rubrics derived from consensus captions.
    Invoked to support the claim of human-aligned rigor and the superiority of gated metrics.

pith-pipeline@v0.9.1-grok · 5781 in / 1374 out tokens · 59588 ms · 2026-06-29T03:51:32.847916+00:00 · methodology

