Aggregating LLM-Based Weak Verifiers for Spatial Layout Generation

Jiajun Wu; Maneesh Agrawala; R. Kenny Jones; Sharon Zhang

arxiv: 2606.05268 · v1 · pith:VR7VPYHGnew · submitted 2026-06-03 · 💻 cs.GR · cs.LG


Sharon Zhang, R. Kenny Jones, Jiajun Wu, Maneesh Agrawala

Pith reviewed 2026-06-28 03:06 UTC · model grok-4.3

classification 💻 cs.GR cs.LG

keywords: LLM verifiers · spatial layout generation · weak learning · layout verification DSL · 3D room layout · 2D poster design · aggregating weak verifiers · natural language feedback


The pith

Aggregating LLM-generated weak verifiers in a layout DSL yields a strong verifier that raises F1-scores by up to 7X over direct LLM judges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a pipeline that prompts an LLM to synthesize multiple imperfect verifier programs in a layout verification DSL for checking if a spatial layout matches a task description. Each verifier provides only a partial check, but techniques from weak learning combine their outputs into one stronger verifier. The combination weights are learned from roughly ten human-labeled example layouts. This aggregated verifier outperforms the standard approach of using LLMs as direct judges, with F1-scores improving by as much as seven times on 3D room layout and 2D poster design tasks. The same strong verifier also supplies natural language feedback that raises the quality of layouts produced by a base generator by up to 66.2 percent according to human evaluation.

Core claim

The paper establishes that synthesizing a collection of verifier programs in a layout verification DSL with an LLM, then aggregating their responses through weak learning on a small set of human examples, yields a strong verifier. This verifier outperforms direct LLM judges on matching layouts to task descriptions, as measured by higher F1 scores, and supports better layout generation via natural language feedback.

What carries the argument

The pipeline that asks an LLM to synthesize verifier programs in a layout verification DSL and learns aggregation weights via weak learning from approximately 10 labeled examples.
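A minimal sketch of the aggregation step, under loudly stated assumptions: the three verifier functions, the layout format (object name mapped to an (x, y) position), and the ten labeled layouts are all hypothetical stand-ins for LLM-synthesized DSL programs and human labels. The simulated rebuttal further down names regularized logistic regression as the aggregation method, so that is what is sketched here, written in plain Python so it runs without extra libraries.

```python
import math

# Hypothetical weak verifiers for the made-up task
# "a bed with a nightstand next to it". Each one is only a partial check.
def has_bed(layout):
    return "bed" in layout

def has_nightstand(layout):
    return "nightstand" in layout

def nightstand_near_bed(layout, max_dist=1.0):
    if "bed" not in layout or "nightstand" not in layout:
        return False
    (bx, by), (nx, ny) = layout["bed"], layout["nightstand"]
    return math.hypot(bx - nx, by - ny) <= max_dist

VERIFIERS = [has_bed, has_nightstand, nightstand_near_bed]

def votes(layout):
    """Encode each verifier's True/False response as +1/-1."""
    return [1.0 if v(layout) else -1.0 for v in VERIFIERS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(vote_rows, labels, lr=0.5, l2=0.1, steps=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    w, b = [0.0] * len(vote_rows[0]), 0.0
    n = len(vote_rows)
    for _ in range(steps):
        grad_w, grad_b = [l2 * wi for wi in w], 0.0
        for x, y in zip(vote_rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi / n for g, xi in zip(grad_w, x)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def strong_verify(layout, w, b):
    """Aggregated verdict: weighted vote passed through a 0.5 threshold."""
    score = sum(wi * xi for wi, xi in zip(w, votes(layout))) + b
    return sigmoid(score) >= 0.5

# Ten made-up labeled layouts (1 = matches the task, 0 = does not).
LABELED = [
    ({"bed": (0.0, 0.0), "nightstand": (0.6, 0.0)}, 1),
    ({"bed": (1.0, 1.0), "nightstand": (1.5, 1.2)}, 1),
    ({"bed": (2.0, 0.0), "nightstand": (2.0, 0.8)}, 1),
    ({"bed": (0.0, 2.0), "nightstand": (0.5, 2.5), "desk": (3.0, 3.0)}, 1),
    ({"bed": (3.0, 3.0), "nightstand": (3.7, 3.0)}, 1),
    ({"bed": (0.0, 0.0)}, 0),
    ({"nightstand": (1.0, 1.0)}, 0),
    ({"bed": (0.0, 0.0), "nightstand": (3.0, 3.0)}, 0),
    ({"bed": (0.0, 0.0), "nightstand": (0.0, 2.5)}, 0),
    ({"desk": (1.0, 1.0)}, 0),
]

weights, bias = fit_weights([votes(l) for l, _ in LABELED],
                            [y for _, y in LABELED])
print(weights, bias)
print(strong_verify({"bed": (4.0, 4.0), "nightstand": (4.5, 4.0)}, weights, bias))
print(strong_verify({"bed": (4.0, 4.0), "nightstand": (0.0, 0.0)}, weights, bias))
```

In this toy data the proximity check is the most informative verifier, so the fit should give it the largest weight. Real LLM-generated verifiers would presumably be noisier and more numerous, which is where the choice of regularization starts to matter.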

If this is right

The strong verifier improves layout generation quality by up to 66.2% when used to supply natural language feedback to a base generator.
The approach applies across both 3D room layout tasks and 2D poster design tasks.
Aggregation weights learned from about 10 examples suffice to outperform direct LLM judges.
F1-scores increase by up to 7 times relative to the status quo of using LLM judges directly.
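To make the F1 comparison concrete, here is a small sketch of how the metric is computed for a binary accept/reject verifier. The labels and predictions are invented for illustration and are not taken from the paper. One thing the arithmetic makes visible: because F1 is capped at 1, a 7X gain is only possible when the baseline F1 is at most 1/7 (about 0.14).

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 means 'layout matches the task'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented example: 5 positive and 5 negative layouts.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
# An overly negative judge that accepts only one of the positive layouts.
judge_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# An aggregated verifier that accepts four positives and one negative.
strong_pred = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

f1_judge = f1_score(y_true, judge_pred)
f1_strong = f1_score(y_true, strong_pred)
print(f1_judge, f1_strong, f1_strong / f1_judge)

# Largest baseline F1 that still leaves room for a 7X improvement.
max_baseline_for_7x = 1.0 / 7.0
print(max_baseline_for_7x)
```

In this invented case the judge has perfect precision but recall of 0.2, which is the "overly negative" failure pattern several of the figure captions describe.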

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could extend to other structured generation tasks if similar domain-specific verification languages are defined.
The reduced label requirement may make high-quality verification practical for new layout domains without large annotation efforts.
Iterative use of the verifier feedback could be combined with optimization loops in existing layout systems.

Load-bearing premise

The LLM-generated verifiers supply sufficiently diverse checks so that weak learning can learn reliable aggregation weights from only about 10 human-labeled examples.
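Why diversity matters can be illustrated with a textbook calculation rather than the paper's own method: the accuracy of an unweighted majority vote over n independent verifiers, each correct with probability p. The independence assumption and the numbers below are illustrative only. If the verifiers were instead perfectly correlated, they would all give the same answer and the majority would be no better than a single verifier.

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent verifiers is correct.

    Assumes n is odd and every verifier is correct with the same probability p.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority exists")
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

p = 0.6  # illustrative accuracy of one weak verifier
for n in (1, 3, 11, 51):
    print(n, round(majority_accuracy(n, p), 4))

# Perfectly correlated verifiers always agree, so the vote inherits the
# accuracy of a single verifier no matter how many are added.
correlated_accuracy = p
print(correlated_accuracy)
```

With p = 0.6, three independent verifiers already reach 0.648, and the figure keeps rising with n. The learned weighting in the paper is a different aggregation rule, but the same dependence on uncorrelated errors applies.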

What would settle it

A test showing that the F1 score of the aggregated verifier does not exceed the F1 score of a set of direct LLM judges on held-out 3D room layout or 2D poster examples.

Figures

Figures reproduced from arXiv: 2606.05268 by Jiajun Wu, Maneesh Agrawala, R. Kenny Jones, Sharon Zhang.

**Figure 1.** We introduce a pipeline to verify the outputs of spatial layout generators against specific task descriptions. Our approach builds a collection of …

**Figure 2.** The four stages of our verification pipeline.

**Figure 3.** Verifier-guided layout generation with Detailed feedback. A single layout example is generated by iteratively sampling a layout, verifying the output with our strong Weaver verifier, and re-generating using the verifier feedback (red boxes) if the verifier response is False. We repeat this until the layout passes or until we reach a maximum number of iterations

**Figure 4.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the first negative layout, the …

**Figure 5.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In both negative layouts there …

**Figure 6.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the first negative layout, the …

**Figure 7.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the first negative layout the …

**Figure 8.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the first negative layout, the …

**Figure 1.** Holodeck generations for the task description in …

**Figure 2.** Holodeck generations for the task description in …

**Figure 3.** Holodeck generations for the task description in …

**Figure 4.** Holodeck generations for the task description in …

**Figure 5.** Holodeck generations for the task description in …

**Figure 6.** FlairGPT floor plan generations of five task descriptions in the 3D Rooms domain (…

**Figure 7.** The LLM judges incorrectly reject the three positive layouts on the left and accept the negative layouts on the right. In negative …

**Figure 8.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 9.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 10.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In …

**Figure 11.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 12.** The LLM judges incorrectly reject the three positive layouts on the left and accept the negative layout on the right. In the …

**Figure 13.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 14.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 15.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 16.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In both …

**Figure 17.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 18.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 19.** The LLM judges incorrectly reject all three positive layouts on the left, even though they clearly have desk and chairs set up.

**Figure 20.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 21.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 22.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In …

**Figure 23.** The LLM judges incorrectly reject the three positive layouts on the left, even though they satisfy all layout criteria.

**Figure 24.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In …

**Figure 25.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 26.** The LLM judges incorrectly reject the two positive layouts on the left and incorrectly accept the two negative layouts on the …

**Figure 27.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In …

**Figure 28.** The LLM judges incorrectly reject the two positive layouts on the left and accept the two negative layouts on the right. In the …

**Figure 29.** The LLM judges vote overly negative, incorrectly rejecting the two positive layouts on the left, even though they satisfy all …

**Figure 30.** The LLM judges vote overly negative, incorrectly rejecting the two positive layouts on the left, even though they satisfy all …

**Figure 31.** The LLM judges vote overly negative, incorrectly rejecting the two positive layouts on the left, even though they satisfy all …

**Figure 32.** The LLM judges vote overly negative, incorrectly rejecting the two positive layouts on the left, even though they satisfy all …

**Figure 33.** Performance of Logistic Regression (blue) and Top-1 (purple) as dev set size increases. In most cases, increasing the dev …

**Figure 34.** Comparison between Naive Majority and Weaver on tasks with low vs. high recall. For each of the 26 tasks, we plot the …

**Figure 35.** 3D layouts generated by our detailed feedback generator for five different task descriptions.

Original abstract

We present a pipeline for building and aggregating task-specific, LLM-generated weak (imperfect) verifiers into a strong verifier for spatial layout domains. Given a task description, our pipeline asks an LLM to synthesize a collection of verifier programs using a layout verification DSL. Each individual LLM-generated verifier usually provides an imperfect check for a match between the layout and the corresponding task description. We show that by aggregating the responses of many such verifiers we can produce a stronger verifier. Moreover, by applying techniques from weak learning, our pipeline can learn how to aggregate the weak verifiers from a very sparse set of human labeled example layouts (about 10). We find that the strong verifiers produced by our pipeline outperform the status-quo approach of using a set of LLM judges to directly check whether a layout matches a task description, raising F1-scores by up to 7X across a variety of 3D room layout and 2D poster design tasks. We also demonstrate that verifier-guided layout generation using natural language feedback from our strong verifiers improves layout quality of a base layout generator by up to 66.2% according to a human evaluator.
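The verifier-guided generation loop mentioned in the abstract, and described in the Figure 3 caption as sample, verify, and regenerate with feedback until the layout passes or an iteration cap is reached, can be sketched as follows. The `generate` and `verify` callables, their signatures, and the toy implementations are hypothetical; the paper's actual generator and verifier interfaces are not shown on this page.

```python
from typing import Callable, Optional, Tuple

Layout = dict
# Hypothetical interfaces, not the paper's API:
#   generate(task, feedback) -> layout
#   verify(layout) -> (passed, natural-language feedback)
Generator = Callable[[str, Optional[str]], Layout]
Verifier = Callable[[Layout], Tuple[bool, str]]

def generate_with_feedback(task: str, generate: Generator, verify: Verifier,
                           max_iters: int = 5):
    """Regenerate until the verifier accepts or max_iters is reached.

    Returns (last layout, number of attempts used, whether it passed).
    """
    feedback: Optional[str] = None
    layout: Layout = {}
    for attempt in range(1, max_iters + 1):
        layout = generate(task, feedback)
        passed, feedback = verify(layout)
        if passed:
            return layout, attempt, True
    return layout, max_iters, False

# Toy stand-ins so the loop can be run end to end.
def toy_generate(task, feedback):
    layout = {"bed": (0.0, 0.0)}
    if feedback and "nightstand" in feedback:
        layout["nightstand"] = (0.6, 0.0)  # react to the verifier's feedback
    return layout

def toy_verify(layout):
    if "nightstand" not in layout:
        return False, "Add a nightstand next to the bed."
    return True, ""

result, attempts, ok = generate_with_feedback(
    "a bed with a nightstand next to it", toy_generate, toy_verify)
print(result, attempts, ok)
```

The toy generator ignores the task string and only reacts to one keyword in the feedback; a real base generator would presumably receive the feedback as part of its prompt.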

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core idea—LLM-synthesized task-specific verifiers in a DSL, aggregated via weak learning on ~10 labels—looks like a practical step for layout verification, but the abstract leaves the experimental claims uncheckable.


The new piece is the pipeline that prompts an LLM to emit multiple verifier programs in a layout DSL, then applies weak learning to combine their outputs into a stronger judge trained on roughly 10 human-labeled examples. It reports that this beats direct LLM judging by large F1 margins on room layout and poster design tasks, and that the resulting feedback improves a base generator by up to 66.2% in human ratings.

That combination of DSL verifier synthesis plus sparse-label aggregation is not just another LLM-as-judge trick, and the weak-learning angle is a reasonable way to get more signal without heavy annotation. If the verifiers turn out diverse enough, the approach could be useful for graphics tools that need reliable spatial checks.

The main soft spot is the complete absence of experimental detail: no count of verifiers, no description of the exact aggregation method, no baselines, no significance tests, no checks on prompt sensitivity or data splits. The stress-test concern about learning stable weights from 10 examples is real—if the verifiers are correlated, the small sample will not separate signal from noise. Without those numbers the 7X F1 claim cannot be evaluated.

This is for researchers building LLM feedback loops for design or layout tasks. The idea is distinct enough and the claims are concrete enough that it deserves a serious referee, even though the current version is too thin to judge soundness.

Referee Report

2 major / 2 minor

Summary. The paper introduces a pipeline that prompts an LLM to synthesize multiple weak verifiers as programs in a layout verification DSL for spatial tasks (3D room layouts, 2D poster design). These verifiers are aggregated via weak-learning techniques trained on roughly 10 human-labeled examples to produce a strong verifier. The resulting verifier is claimed to outperform direct LLM judges (F1 gains up to 7X) and, when used for natural-language feedback, to improve base layout generators by up to 66.2% per human evaluation.

Significance. If the empirical claims hold under scrutiny, the work offers a practical route to task-specific verification with minimal labeled data by exploiting LLM-generated DSL programs and weak learning. The multi-task evaluation and the use of a DSL for verifiable checks are positive elements; reproducible code or explicit aggregation procedures would further strengthen it.

major comments (2)

[Abstract / §3] Abstract and §3 (method): The central claim that aggregation weights can be learned reliably from ~10 human-labeled layouts rests on the unstated assumptions that the LLM-generated verifiers are sufficiently diverse and that their errors are not highly correlated. No count of verifiers, no description of the weak-learning procedure (boosting, weighted voting, etc.), and no cross-validation or stability results for the 10-example regime are supplied; with more than a handful of verifiers this sample size supplies too few degrees of freedom for stable estimation.
[Abstract / Experiments] Abstract and experimental section: The reported F1 gains of up to 7X and the 66.2% human-evaluated improvement lack any mention of baseline implementations, statistical significance tests, prompt-sensitivity controls, or data-split details. These omissions make it impossible to verify whether the gains are robust or sensitive to unstated experimental choices.
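One way the significance check requested above could look is a paired bootstrap over held-out layouts, resampling the same indices for both verifiers and reading off a percentile interval for the F1 gap. This is a sketch under assumptions: the labels and predictions are invented, and nothing here reflects how the paper itself tests significance.

```python
import random

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Equivalent to the harmonic mean of precision and recall.
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def paired_bootstrap_f1_gap(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """95% percentile interval for F1(pred_a) - F1(pred_b).

    The same resampled indices are used for both predictors, so the
    comparison is paired on the held-out examples.
    """
    rng = random.Random(seed)
    n = len(y_true)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        gap = (f1_score(truth, [pred_a[i] for i in idx])
               - f1_score(truth, [pred_b[i] for i in idx]))
        gaps.append(gap)
    gaps.sort()
    return gaps[int(0.025 * n_resamples)], gaps[int(0.975 * n_resamples) - 1]

# Invented held-out set: 10 positive and 10 negative layouts.
y_true = [1] * 10 + [0] * 10
strong_pred = [1] * 8 + [0] * 2 + [1] * 2 + [0] * 8
judge_pred = [1] * 2 + [0] * 8 + [1] * 1 + [0] * 9

observed_gap = f1_score(y_true, strong_pred) - f1_score(y_true, judge_pred)
low, high = paired_bootstrap_f1_gap(y_true, strong_pred, judge_pred)
print(observed_gap, low, high)
```

If the interval excludes zero, the gap is unlikely to be a resampling artifact on this particular held-out set; it says nothing about prompt sensitivity, which would need separate runs.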

minor comments (2)

[§2] The DSL definition and the exact syntax of the generated verifier programs should be presented with at least one concrete example to allow replication.
[Related Work] Standard weak-learning references (e.g., boosting literature) are missing; adding them would clarify the aggregation technique.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional clarity on assumptions, procedures, and experimental rigor would strengthen the manuscript. We address each point below and will revise accordingly.


Referee: [Abstract / §3] Abstract and §3 (method): The central claim that aggregation weights can be learned reliably from ~10 human-labeled layouts rests on the unstated assumptions that the LLM-generated verifiers are sufficiently diverse and that their errors are not highly correlated. No count of verifiers, no description of the weak-learning procedure (boosting, weighted voting, etc.), and no cross-validation or stability results for the 10-example regime are supplied; with more than a handful of verifiers this sample size supplies too few degrees of freedom for stable estimation.

Authors: We agree the assumptions should be stated explicitly and the procedure detailed. The revised manuscript will report the number of verifiers synthesized per task, describe the aggregation method (weighted combination of verifier outputs learned via regularized logistic regression on the sparse labels), and include leave-one-out cross-validation results demonstrating weight stability in the 10-example setting. We will also discuss the diversity of LLM-generated verifiers and note the risk of correlated errors as a limitation.

revision: yes

Referee: [Abstract / Experiments] Abstract and experimental section: The reported F1 gains of up to 7X and the 66.2% human-evaluated improvement lack any mention of baseline implementations, statistical significance tests, prompt-sensitivity controls, or data-split details. These omissions make it impossible to verify whether the gains are robust or sensitive to unstated experimental choices.

Authors: We will expand the experimental section to specify the LLM judge baselines (same model with varied prompts), report statistical significance via bootstrap resampling or paired tests on the F1 scores, include prompt-sensitivity analysis with variance across prompt variants, and detail the train/test splits (10 labeled examples for aggregation learning, separate held-out sets for evaluation). These additions will allow readers to assess robustness.

revision: yes
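The leave-one-out check promised in the first response could be sketched roughly as below. Everything here is an assumption made for illustration: the vote matrix (three hypothetical verifiers, responses encoded as +1/-1), the ten labels, and the hyperparameters. The sketch refits a regularized logistic regression ten times, each time holding out one labeled layout, and reports held-out accuracy plus how much each weight moves across folds.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weights(rows, labels, lr=0.5, l2=0.1, steps=500):
    """L2-regularized logistic regression fitted by plain gradient descent."""
    w, b, n = [0.0] * len(rows[0]), 0.0, len(rows)
    for _ in range(steps):
        grad_w, grad_b = [l2 * wi for wi in w], 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi / n for g, xi in zip(grad_w, x)]
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0.0 else 0

def leave_one_out(rows, labels):
    """Return (held-out accuracy, list of weight vectors, one per fold)."""
    correct, fold_weights = 0, []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        w, b = fit_weights(train_rows, train_labels)
        fold_weights.append(w)
        correct += int(predict(rows[i], w, b) == labels[i])
    return correct / len(rows), fold_weights

# Invented votes of three hypothetical verifiers on ten labeled layouts.
VOTES = [[1, 1, 1]] * 5 + [[1, -1, -1], [-1, 1, -1],
                           [1, 1, -1], [1, 1, -1], [-1, -1, -1]]
LABELS = [1] * 5 + [0] * 5

accuracy, fold_weights = leave_one_out(VOTES, LABELS)
# Spread of each verifier's weight across folds: max minus min.
spread = [max(w[j] for w in fold_weights) - min(w[j] for w in fold_weights)
          for j in range(3)]
print(accuracy, spread)
```

A small spread relative to the weights themselves would suggest the fit is not hinging on any single labeled layout. With many more verifiers than labels the spread would be expected to grow, which is the concern the referee raises.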

Circularity Check

0 steps flagged

No circularity; purely empirical pipeline with no derivations or self-referential fits


The paper describes an empirical method for synthesizing LLM-generated verifiers in a DSL, then aggregating them via weak learning on ~10 human labels to produce a stronger verifier. No equations, first-principles derivations, or fitted quantities are presented that reduce to their own inputs by construction. Results rest on F1-score comparisons and human evaluations against baselines, with no self-citation chains, uniqueness theorems, or ansatzes invoked as load-bearing. The weak-learning step is a standard application of existing techniques and does not redefine its own aggregation weights as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly relies on LLM synthesis capabilities and standard weak learning theory assumed from prior work.



[54]

2026 , eprint=

SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes , author=. 2026 , eprint=

2026
[55]

2026 , eprint=

Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning , author=. 2026 , eprint=

2026
