Picasso: Holistic Scene Reconstruction with Physics-Constrained Sampling

Lorenzo Shaikewitz; Luca Carlone; Rajat Talak; Xihang Yu

arXiv: 2602.08058 · v3 · pith:OD5MJS3Anew · submitted 2026-02-08 · 💻 cs.CV · cs.AI · cs.RO · cs.SY · eess.SY

Xihang Yu, Rajat Talak, Lorenzo Shaikewitz, Luca Carlone

Pith reviewed 2026-05-16 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO · cs.SY · eess.SY

keywords scene reconstruction · physics-constrained sampling · object contact graph · rejection sampling · pose and shape estimation · physical plausibility · multi-object interactions · occlusion handling

The pith

Picasso reconstructs multi-object scenes by jointly enforcing geometry, non-penetration, and physics through contact-graph-guided rejection sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that separate per-object pose and shape estimation produces geometrically faithful but physically invalid scenes, such as interpenetrating or unstable object arrangements, especially under occlusion and sensor noise. It argues that reliable reconstruction instead requires holistic reasoning that accounts for object interactions and physical plausibility so the resulting models can be used directly in simulators for planning and control. Picasso implements this idea with a fast rejection sampler that first infers an object contact graph and then uses it to bias sample generation toward valid configurations. The authors release a new benchmark of ten real-world contact-rich scenes together with a physical-plausibility metric and demonstrate that the method yields more stable and human-aligned results than prior techniques on both this dataset and YCB-V. If correct, the work shows that physics constraints can be folded into the core estimation loop without sacrificing speed or accuracy.

Core claim

Picasso is a reconstruction pipeline that builds multi-object scenes by considering geometry, non-penetration, and physics together. It relies on a fast rejection sampling method that reasons over multi-object interactions by leveraging an inferred object contact graph to guide samples. The resulting estimates are both geometrically consistent with sensor data and physically plausible, allowing direct import into simulators without manual correction.

What carries the argument

The central mechanism is physics-constrained rejection sampling guided by an inferred object contact graph that directs the sampler toward non-penetrating and stable configurations.
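The idea of letting a contact graph steer a rejection sampler can be illustrated with a toy 2D sketch. Everything here is an assumption made for illustration, not the paper's sampler: objects are axis-aligned boxes, the contact graph is a hypothetical `parent_of` map from each object to its support, and the names `propose`, `is_valid`, `acceptance_rate`, and `TOL` are invented. With guidance on, each box is placed exactly on its support before the physics test; with guidance off, poses are drawn freely around the noisy measurements.

```python
import random
from dataclasses import dataclass

@dataclass
class Box:
    name: str
    width: float
    height: float

@dataclass
class Pose:
    x: float  # horizontal center
    z: float  # vertical center

TOL = 1e-3  # contact / penetration tolerance in meters (assumed)

def top_of(parent, boxes, poses):
    """Height of the supporting surface; parent None means the ground at z = 0."""
    return 0.0 if parent is None else poses[parent].z + boxes[parent].height / 2

def penetrates(a, pa, b, pb):
    """True if two axis-aligned boxes overlap by more than TOL on both axes."""
    dx = (a.width + b.width) / 2 - abs(pa.x - pb.x)
    dz = (a.height + b.height) / 2 - abs(pa.z - pb.z)
    return dx > TOL and dz > TOL

def is_valid(boxes, poses, parent_of):
    """Physics test: no penetration, and every box rests stably on its support."""
    names = list(boxes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if penetrates(boxes[a], poses[a], boxes[b], poses[b]):
                return False
    for name, parent in parent_of.items():
        gap = poses[name].z - boxes[name].height / 2 - top_of(parent, boxes, poses)
        if abs(gap) > TOL:
            return False  # floating above or sunk into the support
        if parent is not None and abs(poses[name].x - poses[parent].x) > boxes[parent].width / 2:
            return False  # center of mass is not over the support
    return True

def propose(boxes, measured, parent_of, sigma, rng, guided):
    """Draw one candidate scene around the noisy measurements.

    boxes must be ordered bottom-up so a support is posed before what rests on it.
    """
    poses = {}
    for name, box in boxes.items():
        x = rng.gauss(measured[name].x, sigma)
        z = rng.gauss(measured[name].z, sigma)
        if guided:  # use the contact graph: place the box exactly on its support
            z = top_of(parent_of[name], boxes, poses) + box.height / 2
        poses[name] = Pose(x, z)
    return poses

def acceptance_rate(boxes, measured, parent_of, guided, n=2000, sigma=0.01, seed=0):
    """Fraction of proposals that pass the physics test."""
    rng = random.Random(seed)
    hits = sum(is_valid(boxes, propose(boxes, measured, parent_of, sigma, rng, guided), parent_of)
               for _ in range(n))
    return hits / n

boxes = {b.name: b for b in (Box("table", 0.40, 0.10), Box("book", 0.20, 0.04), Box("mug", 0.08, 0.10))}
parent_of = {"table": None, "book": "table", "mug": "book"}  # contact graph, assumed already inferred
measured = {"table": Pose(0.0, 0.052), "book": Pose(0.0, 0.118), "mug": Pose(0.0, 0.195)}

guided_rate = acceptance_rate(boxes, measured, parent_of, guided=True)
unguided_rate = acceptance_rate(boxes, measured, parent_of, guided=False)
print(f"guided: {guided_rate:.3f}  unguided: {unguided_rate:.3f}")
```

In this toy setting the unguided sampler almost never lands on exact contact, so nearly all of its draws are rejected, while the guided sampler accepts almost everything. The same structure also shows the risk named under the load-bearing premise: a wrong entry in `parent_of` would force every guided proposal onto the wrong support.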

If this is right

Reconstructed scenes can be imported directly into simulators to predict dynamic behavior without corrective post-processing.
Performance gains appear in contact-rich environments where inter-object constraints dominate the solution space.
The same pipeline improves results on established benchmarks such as YCB-V while adding physical validity guarantees.
Digital twins built from these reconstructions support more reliable simulation-based planning for contact-rich robotic tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Jointly optimizing the contact graph together with the pose estimates rather than inferring it first could further reduce rejection rates on ambiguous scenes.
Extending the sampler to incorporate temporal consistency across video frames would allow reconstruction of moving scenes without separate tracking.
The physical-plausibility metric introduced in the benchmark could serve as a training signal for learning-based reconstructors that currently optimize only geometric error.
Scaling the approach to scenes with dozens of objects will likely require more efficient graph inference or learned proposal distributions to keep the rejection sampler tractable.

Load-bearing premise

The inferred object contact graph is accurate enough to steer sampling toward valid solutions without excluding good configurations or requiring an impractical number of rejections.

What would settle it

A controlled experiment in which the contact-graph inference is deliberately corrupted on an otherwise solvable scene and the sampler either fails to return any valid configuration within a fixed budget or returns only interpenetrating or unstable arrangements.

Figures

Figures reproduced from arXiv: 2602.08058 by Lorenzo Shaikewitz, Luca Carlone, Rajat Talak, Xihang Yu.

**Figure 1.** We propose Picasso, an approach to build multi-object scene reconstructions by accounting for object geometry, nonpenetration, and physics (i.e., objects should be in a stable equilibrium for the scene to be static). We also release the Picasso dataset: a collection of 10 contact-rich real-world scenes we use to test physical plausibility of scene reconstructions. The figure shows the digital twins genera…

**Figure 2.** An example illustrating that a 3D scene reconstruction…

**Figure 4.** Sample image and corresponding contact scene graph…

**Figure 5.** Conceptual illustration of the loss landscape on the…

**Figure 6.** Visualization of 3D scene reconstruction from the Pi…

**Figure 7.** VLM prompt for contact scene graph generation.

**Figure 8.** Examples of Picasso Dataset. Left: RGB images. Middle: Depth maps. Right: Reconstructed 3D models.

TABLE VI: Evaluation results on the ADD-S (×10⁻³ m) and ADD-S (AUC %) metrics for the YCB-V dataset. CRISP-Syn+Picasso w/o phy: CRISP-Syn+Picasso with physics constraints turned off.

| Method | ADD-S ↓ Mean | ADD-S ↓ Median | ADD-S (AUC %) ↑ 1 cm | ADD-S (AUC %) ↑ 2 cm | ADD-S (AUC %) ↑ 3 cm |
|---|---|---|---|---|---|
| CRISP-Syn+Picasso w/o phy | 8.35 | 3.04 | 51.8 | 68.9 | 77.1 |
| CRISP-Syn+Picas… | … | … | … | … | … |

**Figure 10.** A failure case due to noisy and partial depth point…

**Figure 11.** Approximation on contact scene graph inference can…

**Figure 12.** Human evaluation and SPS comparison between SAM3D and SAM3D+Picasso. Top: SAM3D. Bottom: SAM3D+Picasso. For both Experts and Public, SAM3D+Picasso achieves higher physics plausibility.

TABLE X: Human evaluation of physics plausibility and SPS across 12 YCB-Video trajectories. Human scores are on a 1-7 scale (higher is better) across 83 participants. SPS is averaged across 3 frames per trajectory. S: SAM…
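Table VI, quoted in the Figure 8 caption, reports ADD-S. As commonly defined for YCB-V evaluation, ADD-S is the mean distance from each model point under the ground-truth pose to its nearest model point under the estimated pose, which makes it insensitive to object symmetries. The sketch below is a minimal brute-force version under that definition; the function names and the `plate` model are hypothetical, and the paper's evaluation code may differ (for example in point sampling or in how the AUC variant is computed).

```python
import math

def transform(points, rotation, translation):
    """Apply x -> R x + t to each 3D point (rotation is a 3x3 nested list)."""
    return [
        tuple(sum(rotation[i][j] * p[j] for j in range(3)) + translation[i] for i in range(3))
        for p in points
    ]

def add_s(model_points, pose_gt, pose_est):
    """Mean distance from each ground-truth-posed point to its nearest estimated-posed point."""
    gt = transform(model_points, *pose_gt)
    est = transform(model_points, *pose_est)
    return sum(min(math.dist(g, e) for e in est) for g in gt) / len(gt)

IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
HALF_TURN_Z = [[-1, 0, 0], [0, -1, 0], [0, 0, 1]]  # 180 degree rotation about z

# Hypothetical model: corners of a square plate, symmetric under a half turn about z.
plate = [(1, 1, 0), (1, -1, 0), (-1, 1, 0), (-1, -1, 0)]

shifted = add_s(plate, (IDENTITY, (0, 0, 0)), (IDENTITY, (0.01, 0, 0)))
flipped = add_s(plate, (IDENTITY, (0, 0, 0)), (HALF_TURN_Z, (0, 0, 0)))
print(shifted, flipped)
```

A small translation shows up directly in the score, while the half turn of a symmetric plate scores zero because every point still has an exact nearest neighbor. A pose can therefore score well on ADD-S and still interpenetrate a neighboring object, which is the gap the physical-plausibility metric is meant to cover.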

Original abstract

In the presence of occlusions and measurement noise, geometrically accurate scene reconstructions -- which fit the sensor data -- can still be physically incorrect. For instance, when estimating the poses and shapes of objects in the scene and importing the resulting estimates into a simulator, small errors might translate to implausible configurations including object interpenetration or unstable equilibrium. This makes it difficult to predict the dynamic behavior of the scene using a digital twin, an important step in simulation-based planning and control of contact-rich behaviors. In this paper, we posit that object pose and shape estimation requires reasoning holistically over the scene (instead of reasoning about each object in isolation), accounting for object interactions and physical plausibility. Towards this goal, our first contribution is Picasso, a physics-constrained reconstruction pipeline that builds multi-object scene reconstructions by considering geometry, non-penetration, and physics. Picasso relies on a fast rejection sampling method that reasons over multi-object interactions, leveraging an inferred object contact graph to guide samples. Second, we propose the Picasso dataset, a collection of 10 contact-rich real-world scenes with ground truth annotations, as well as a metric to quantify physical plausibility, which we open-source as part of our benchmark. Finally, we provide an extensive evaluation of Picasso on our newly introduced dataset and on the YCB-V dataset, and show it largely outperforms the state of the art while providing reconstructions that are both physically plausible and more aligned with human intuition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Picasso adds contact-graph guided rejection sampling to enforce physical plausibility in multi-object scene reconstruction and ships a 10-scene dataset plus metric, but the small scale and missing ablations leave the gains hard to judge.

The main point is that Picasso reconstructs scenes by sampling poses and shapes while using an inferred contact graph to reject physically invalid configurations like interpenetrations or unstable stacks. This holistic step replaces isolated per-object fitting and produces outputs that work better when dropped into a simulator. They also release a new 10-scene real-world dataset with ground-truth annotations and a physical-plausibility metric, plus results on YCB-V that beat prior methods on their metric and look more intuitive to people.

Referee Report

3 major / 2 minor

Summary. The manuscript claims that Picasso, a physics-constrained scene reconstruction pipeline, produces physically plausible multi-object reconstructions by using fast rejection sampling guided by an inferred object contact graph. It introduces a new 10-scene real-world dataset with ground-truth annotations and a physical plausibility metric, demonstrating outperformance over prior methods on this dataset and on YCB-V while yielding results more aligned with human intuition.

Significance. If the results hold, the work could advance simulation-based planning and control by enabling more reliable digital twins for contact-rich scenes. The new dataset and plausibility metric are valuable open contributions that address a gap in evaluating physical correctness beyond geometric fit. The holistic treatment of object interactions via the contact graph is a promising direction, though its robustness remains to be fully substantiated.

major comments (3)

§5 (Experiments): No ablation study isolates the contribution of the inferred contact graph to sampling efficiency or reconstruction quality. Without removing or replacing this component, it is impossible to determine whether the reported gains in physical plausibility derive from the graph-guided rejection sampling or from other elements of the pipeline.
§5.2 and Table 2: The evaluation provides no quantitative analysis of contact-graph inference accuracy, rejection rates, or failure cases in contact-rich scenes. This leaves the central assumption (that the graph inferred from noisy geometry reliably guides sampling without excessive rejections or exclusion of valid configurations) unsupported by direct evidence.
§5.1: Baseline comparisons lack full details on implementation, hyper-parameter tuning, and error bars on the new plausibility metric. The claim of outperformance is therefore only moderately supported, as variance and reproducibility cannot be assessed.

minor comments (2)

Figure 3 and §4.2: The contact-graph visualization would benefit from explicit annotation of false-positive/negative edges to illustrate inference errors on real data.
§3: Notation for the rejection-sampling acceptance probability could be clarified with a short pseudocode block to avoid ambiguity in the multi-object interaction term.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that additional ablations, quantitative analyses of the contact graph, and greater transparency in baseline comparisons will strengthen the paper. We will incorporate these elements in the revised version. Below we address each major comment point by point.

Referee: §5 (Experiments): No ablation study isolates the contribution of the inferred contact graph to sampling efficiency or reconstruction quality. Without removing or replacing this component, it is impossible to determine whether the reported gains in physical plausibility derive from the graph-guided rejection sampling or from other elements of the pipeline.

Authors: We agree that an ablation isolating the contact graph's contribution is valuable. In the revised manuscript, we will add an ablation comparing the full Picasso pipeline to a variant using rejection sampling without contact-graph guidance. We will report differences in sampling efficiency (rejection rates and runtime) and reconstruction quality (geometric accuracy and physical plausibility metrics) on the Picasso dataset and YCB-V to clarify the graph's role. revision: yes
Referee: §5.2 and Table 2: The evaluation provides no quantitative analysis of contact-graph inference accuracy, rejection rates, or failure cases in contact-rich scenes. This leaves the central assumption (that the graph inferred from noisy geometry reliably guides sampling without excessive rejections or exclusion of valid configurations) unsupported by direct evidence.

Authors: We will add a new analysis subsection in the revision. This will include quantitative metrics on contact-graph inference accuracy (precision/recall against ground-truth contacts from our dataset annotations), average rejection rates during sampling, and a discussion of observed failure cases in contact-rich scenes. These results will directly support the reliability of the graph-guided approach. revision: yes
Referee: §5.1: Baseline comparisons lack full details on implementation, hyper-parameter tuning, and error bars on the new plausibility metric. The claim of outperformance is therefore only moderately supported, as variance and reproducibility cannot be assessed.

Authors: We acknowledge the need for greater reproducibility. In the revised manuscript, we will expand the baseline section with full implementation details, specific hyper-parameter values and tuning procedures for each method, and error bars (standard deviations over multiple runs) for the physical plausibility metric on both datasets. This will allow proper assessment of variance and strengthen the outperformance claims. revision: yes
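The precision/recall analysis promised in the second response could be computed along the following lines. This is a sketch under assumptions: contacts are treated as undirected object pairs, the function name and the example edges are hypothetical, and returning 1.0 for an empty edge set is a convention chosen here, not something the rebuttal specifies.

```python
def contact_graph_precision_recall(predicted_edges, true_edges):
    """Precision and recall of predicted contacts, treating each contact as an undirected pair."""
    predicted = {frozenset(edge) for edge in predicted_edges}
    truth = {frozenset(edge) for edge in true_edges}
    true_positives = len(predicted & truth)
    # Convention (assumed): an empty set of predictions or annotations scores 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Hypothetical scene: the inferred graph adds a spurious mug-table contact.
predicted = [("mug", "book"), ("book", "table"), ("mug", "table")]
annotated = [("book", "mug"), ("table", "book")]

precision, recall = contact_graph_precision_recall(predicted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the spurious edge lowers precision to 2/3 while recall stays at 1.0. Reporting both matters for the referee's concern: false-positive edges can force samples into contacts that do not exist, whereas missed edges only remove guidance.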

Circularity Check

0 steps flagged

New rejection sampling and dataset avoid circular derivation

The paper introduces Picasso as a novel physics-constrained pipeline relying on rejection sampling guided by an inferred contact graph, plus a new 10-scene dataset and physical plausibility metric. No equations or claims reduce by construction to prior fitted parameters; evaluations on the new dataset and YCB-V provide independent content. Minor self-citations may exist for background but are not load-bearing for the central reconstruction claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard rigid-body physics and contact assumptions plus the claim that rejection sampling guided by an inferred graph can efficiently locate valid configurations; no new free parameters or invented entities are introduced beyond conventional sampling hyperparameters.

axioms (1)

Domain assumption: Rigid-body non-penetration and equilibrium constraints are sufficient to define physical plausibility for the target scenes.
Invoked to justify rejection of samples that interpenetrate or are unstable.
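For reference, the two constraints named in this axiom are often written as follows in rigid-body statics. The symbols are illustrative assumptions and not the paper's notation: $\mathcal{O}_i$ is the shape of object $i$, $T_i$ its pose, $\mathcal{C}_i$ its set of contacts, $f_k$ the force at contact point $p_k$, $\mathcal{K}_k$ the friction cone at that contact, $m_i$ the mass, $c_i$ the center of mass, and $g$ gravity. The paper's own formulation, including how stability of the equilibrium is tested, may differ.

```latex
% Non-penetration: posed objects share no interior points.
\operatorname{int}\big(T_i \mathcal{O}_i\big) \cap \operatorname{int}\big(T_j \mathcal{O}_j\big) = \emptyset
\quad \forall\, i \neq j

% Static equilibrium of object i: contact forces inside their friction cones
% balance gravity in both force and torque about the center of mass.
\sum_{k \in \mathcal{C}_i} f_k + m_i g = 0, \qquad
\sum_{k \in \mathcal{C}_i} (p_k - c_i) \times f_k = 0, \qquad
f_k \in \mathcal{K}_k
```

A sample is rejected when the first condition fails for any object pair, or when no set of admissible contact forces satisfies the second for some object.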

pith-pipeline@v0.9.0 · 5577 in / 1305 out tokens · 72756 ms · 2026-05-16T05:52:47.524285+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reconstruction by Generation: 3D Multi-Object Scene Reconstruction from Sparse Observations
cs.CV · 2026-04 · unverdicted · novelty 6.0

RecGen achieves state-of-the-art 3D multi-object scene reconstruction from sparse RGB-D views by combining compositional synthetic scene generation with strong 3D shape priors, outperforming SAM3D by 30%+ in shape qua...