Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation
Pith reviewed 2026-06-26 18:23 UTC · model grok-4.3
The pith
Under a hardened VLM-as-3D-judge protocol, the best adaptation reaches parity with the base generator, and no adaptation clears the 65 percent win-rate target.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that turning the de-biased VLM-as-3D-judge from a ranking tool into an optimization signal requires explicit hardening against circularity and saturation. Once the judge is hardened, lightweight parameter-efficient adaptations on public data match but do not surpass the strong base generator. The mechanistic limit is that independent base samples carry essentially no learnable preference.
What carries the argument
The hardened VLM-as-3D-judge that separates training and evaluation models, corrects position bias, and repairs three failure modes (image overload, geometry-hiding splat renders, reference-free judging) to supply an independent optimization signal.
If this is right
- Independent base samples carry essentially no learnable preference, requiring quality-contrastive construction for any signal.
- Conditioning repair under severe degradation is the only locus that moves geometry; other adaptations wash out through the sampler.
- Matching a strong public-data base with cheap adaptation shows that exceeding it requires more than lightweight PEFT on public data.
- The hardened judge protocol functions as a reusable independent evaluator for 3D generation quality.
Where Pith is reading between the lines
- The saturation on clean inputs implies the judge may be most useful when inputs are deliberately degraded or paired with lower-quality contrasts.
- The result could extend to testing whether heavier adaptation techniques or private data sources would be needed to surpass the base.
- The protocol may apply to other single-image generation domains where cheap proxies fail but a hardened VLM can supply directional preference.
Load-bearing premise
The judge supplies an independent, non-saturated optimization signal that can be used to specialize the generator without the signal being washed out by the sampler or already maximized on clean base outputs.
What would settle it
An adaptation method that produces a win-rate of 65 percent or higher against the base generator on the n=8 test objects would falsify the result that no method clears the target.
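To make the falsification condition concrete at n=8, the following minimal sketch works out how many object-level wins a 65 percent target implies and how often a method with no real advantage would reach that count by chance. It assumes each object yields a whole win or loss (no ties or half credit) and that objects are independent; neither assumption is confirmed by the paper, and the function names are illustrative.

```python
from math import comb


def min_wins_for_target(n: int, target: float) -> int:
    """Smallest number of wins k such that k / n meets the target win-rate."""
    for k in range(n + 1):
        if k / n >= target:
            return k
    raise ValueError("target win-rate must be at most 1.0")


def tail_probability(n: int, k: int, p: float = 0.5) -> float:
    """P(at least k wins in n independent trials) when each win has probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


needed = min_wins_for_target(8, 0.65)   # wins required out of 8 objects
chance = tail_probability(8, needed)    # chance of that many wins at true parity
print(needed, round(chance, 4))
```

Under these assumptions the target requires 6 of 8 wins, and a method at true parity would reach that count in roughly 14 percent of runs, so a single result above the target at this sample size would also remain directional.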
Original abstract
A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.
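The position-bias correction and order-flip rate mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the judge signature, the choice to score an order-dependent verdict as a tie worth 0.5, and the definition of order-flip rate as the fraction of pairs whose verdict reverses when the presentation order is swapped are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical judge interface: given the candidate shown first and the one
# shown second, it returns the string "first" or "second".
Judge = Callable[[str, str], str]


def debiased_verdict(judge: Judge, a: str, b: str) -> Optional[str]:
    """Query the judge in both presentation orders.

    Returns "a" or "b" when both orders agree, or None when the verdict
    flips with the order (treated here as a tie).
    """
    forward = judge(a, b)   # a is shown first
    reverse = judge(b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_reverse = "b" if reverse == "first" else "a"
    return winner_forward if winner_forward == winner_reverse else None


def summarize(judge: Judge, pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (win-rate of the first element of each pair, order-flip rate)."""
    verdicts = [debiased_verdict(judge, a, b) for a, b in pairs]
    flips = sum(v is None for v in verdicts)
    # Assumption: an order-dependent verdict counts as half a win.
    score = sum(1.0 if v == "a" else 0.5 if v is None else 0.0 for v in verdicts)
    return score / len(verdicts), flips / len(verdicts)
```

With this scoring, a judge that always prefers whichever candidate is shown first produces a flip rate of 1.0 and a win-rate of exactly 0.5, which is the behavior one would want from a calibration check on indistinguishable samples.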
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a hardened de-biased VLM-as-3D-judge protocol (distinct training judge Qwen2.5-VL-7B and evaluation judge InternVL3-8B, with position-bias correction and fixes for image overload, splat rendering, and reference-free judging) to optimize the TRELLIS single-image-to-3D generator on furniture assets via six lightweight PEFT adaptation methods. It reports that independent base samples show no learnable preference (0.94 order-flip rate), that the most effective method (conditioner repair under severe degradation) reaches only parity (0.50 win-rate) with the base, and that no method exceeds the >=65% target; the result is attributed to judge saturation on clean inputs and signal washout through the sampler. Calibration evidence is provided (clear-gap win-rates 0.83-1.0; base-vs-base ~0.5), and win-rates are described as directional at n=8 objects.
Significance. If the empirical findings hold after addressing sample-size limitations, the work would usefully document the practical barriers to turning a ranking-grade VLM judge into an optimization signal for 3D generation: clean base outputs already saturate the judge, flow-DiT fine-tuning erases preference information, and only targeted conditioning repair moves geometry. The separation of judges and the calibration protocol constitute reusable methodological contributions that future work can adopt. The negative result on public-data lightweight adaptation also supplies a concrete baseline indicating that exceeding strong open generators will require either larger-scale data, architectural changes, or stronger preference signals.
Major comments (2)
- [Results / abstract] Results section (and abstract): All win-rate claims rest on n=8 objects. For a binomial proportion, the 95% CI around an observed 0.50 is approximately [0.22, 0.78] (Wilson score interval); an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.
- [Methods / evaluation protocol] § on adaptation methods and evaluation protocol: The abstract does not detail the exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.
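The interval width behind the first major comment can be checked with a short sketch. The report does not say which interval method it used; the code below uses the Wilson score interval, one common choice for small n, and the function name is illustrative.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= wins <= n:
        raise ValueError("need n > 0 and 0 <= wins <= n")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for wins in (4, 5):
    lo, hi = wilson_interval(wins, 8)
    print(f"{wins}/8 wins: [{lo:.2f}, {hi:.2f}]")
```

For 4 of 8 wins this gives roughly [0.22, 0.78], and for 5 of 8 the interval still contains 0.50, consistent with the comment that the reported win-rates cannot be separated from parity at this sample size. Other methods, such as the exact Clopper-Pearson interval, give somewhat wider bounds.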
Minor comments (2)
- [Results] The manuscript should explicitly state the exact number of objects, prompts, and renderings used for each win-rate comparison and whether the same 8 objects were used across all conditions.
- [Calibration] Clarify whether the 0.94 order-flip rate on base samples was measured on the same n=8 objects or a larger held-out set.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the methodological contributions of the de-biased judge protocol. We address the two major comments below.
Point-by-point responses
- Referee: [Results / abstract] All win-rate claims rest on n=8 objects. For a binomial proportion, the 95% CI around an observed 0.50 is approximately [0.22, 0.78] (Wilson score interval); an observed 0.625 still overlaps substantially with 0.50. The manuscript reports no standard errors, p-values, multiple-comparison corrections, or power analysis, yet concludes that 'no method clears the >=65% win-rate target' and that conditioner repair 'reaches parity.' This sample size is load-bearing for the central empirical claim.
  Authors: We agree the sample size limits statistical power and will add 95% binomial confidence intervals, standard errors, and an explicit discussion of overlap with 0.5 to the results section and abstract. The manuscript already qualifies results as directional at n=8; we will further temper language around the >=65% target and parity claim to reflect uncertainty. A note on the absence of formal hypothesis testing will be included. We cannot expand to larger n within this study.
  Revision: partial
- Referee: [Methods / evaluation protocol] The abstract does not detail the exact adaptation implementations, dataset sizes, or statistical tests; without these, it is impossible to assess whether the six methods were implemented comparably or whether the reported directional win-rates could be reproduced. This directly affects verifiability of the claim that the judge supplies an independent optimization signal.
  Authors: The full manuscript already describes the six PEFT methods, input regimes, and evaluation protocol in the methods section. To improve verifiability we will expand the methods and supplementary material with exact hyperparameters, dataset sizes, and any statistical considerations used. The abstract will be revised to indicate that full implementation details are provided in the paper.
  Revision: yes
- Increasing the evaluation set beyond n=8 objects is not feasible due to the computational cost of 3D generation and VLM judging.
Circularity Check
No significant circularity; empirical adaptation results are independent of judge construction
Full rationale
The paper's central finding—that no adaptation method exceeds the 65% win-rate target and the best reaches only parity—is an empirical outcome measured on n=8 objects using a hardened judge protocol. The protocol explicitly separates the training judge (Qwen2.5-VL-7B) from the evaluation judge (InternVL3-8B) and reports base-vs-base win-rates near 0.5 as calibration. These steps prevent the optimization signal from being self-referential. No derivation reduces a claimed prediction to a fitted parameter by construction, no uniqueness theorem is imported from overlapping-author prior work, and the negative result is not forced by the inputs. The work is self-contained against the reported public-model benchmarks and internal calibration checks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Ali Asaria, Tony Salomone, and Deep Gandhi. A Cross-Model VLM-Judge Protocol for Single-Image 3D Mesh Quality (and Why Cheap Proxies Fall Short). arXiv:2606.18451 [cs.LG], 2026. URL https://arxiv.org/abs/2606.18451. Companion work; introduces the cross-model VLM-as-3D-judge evaluation protocol adopted here.
- [2] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong, and Jiaolong Yang. Structured 3D Latents for Scalable and Versatile 3D Generation. URL https://arxiv.org/abs/2412.01506.
- [3] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [4] Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. Diffusion Model Alignment Using Direct Preference Optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. URL https://arxiv.org/abs/2311.12908.
- [5] Ruining Li, Chuanxia Zheng, Christian Rupprecht, and Andrea Vedaldi. DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness. 2025. URL https://arxiv.org/abs/2503.22677.
- [6] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2209.03003.
- [7] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow Matching for Generative Modeling. In International Conference on Learning Representations (ICLR), 2023. URL https://arxiv.org/abs/2210.02747.
- [8] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Lab...
- [9] Zhenglin Zhou, Xiaobo Xia, Fan Ma, Hehe Fan, Yi Yang, and Tat-Seng Chua. DreamDPO: Aligning Text-to-3D Generation with Human Preferences via Direct Preference Optimization. URL https://arxiv.org/abs/2502.04370.
- [10] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR), 2022. URL https://arxiv.org/abs/2106.09685.
- [11] Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic Preference Optimization without Reference Model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. URL https://arxiv.org/abs/2403.07691.
- [12] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track, 2023. URL https://arxiv.org...
- [13] Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. Large Language Models are not Fair Evaluators. 2023. URL https://arxiv.org/abs/2305.17926.
- [14] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-VL Technical Report. a...
- [15] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, Zhangwei Gao, Erfei Cui, Xuehui Wang, Yue Cao, Yangzhou Liu, Xingguang Wei, Hongjie Zhang, Haomin Wang, Weiye Xu, Hao Li, Jiahao Wang, Nianchen Deng, Songze Li, Yinan He, Tan Jiang, Jiapeng Luo, Yi Wang, Conghui He, Botian Shi, Xingcheng Zh...
- [16] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, and Jifeng Dai. InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [17] Mark Boss, Zixuan Huang, Aaryaman Vasishta, and Varun Jampani. SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement. arXiv:2408.00653 [cs.CV], 2024. URL https://arxiv.org/abs/2408.00653.
- [18] Dmitry Tochilkin, David Pankratz, Zexiang Liu, Zixuan Huang, Adam Letts, Yangguang Li, Ding Liang, Christian Laforte, Varun Jampani, and Yan-Pei Cao. TripoSR: Fast 3D Object Reconstruction from a Single Image. arXiv:2403.02151 [cs.CV], 2024. URL https://arxiv.org/abs/2403.02151.
- [19] Huan Fu, Rongfei Jia, Lin Gao, Mingming Gong, Binqiang Zhao, Steve Maybank, and Dacheng Tao. 3D-FUTURE: 3D Furniture Shape with TextURE. In International Journal of Computer Vision (IJCV), 2021. URL https://arxiv.org/abs/2009.09633.
- [20] Tianchang Shen, Jacob Munkberg, Jon Hasselgren, Kangxue Yin, Zian Wang, Wenzheng Chen, Zan Gojcic, Sanja Fidler, Nicholas Sharp, and Jun Gao. Flexible Isosurface Extraction for Gradient-Based Mesh Optimization. ACM Transactions on Graphics (TOG), 42(4), 2023. URL https://arxiv.org/abs/2308.05371.