Hylos: Operability Contracts for Model-Native Spatial Intelligence

Christopher Da Silva

arxiv: 2605.24728 · v1 · pith:G45QIAJOnew · submitted 2026-05-23 · 💻 cs.AI

Pith reviewed 2026-06-30 13:04 UTC · model grok-4.3

classification 💻 cs.AI

keywords operability contracts · spatial transactions · 3D scene management · model-native spatial intelligence · causal repair · foundation models for 3D · contract-bounded AI · spatial AI evaluation

The pith

Spatial AI outputs become usable only when edits route through SpatialTransactions that enforce scene invariants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that foundation models now generate visually plausible 3D objects and scenes, yet this output cannot serve as reliable input for CAD, robotics, or simulation because it lacks structured knowledge of entities, constraints, actions, and effects. Hylos supplies that structure by keeping scene-scale operability state and forcing every durable change to pass through a SpatialTransaction that resolves references, checks admissibility, protects invariants, and projects consequences before any commit occurs. A focused study on causal repair illustrates the difference: a visible misalignment is traced upstream to its controlling placement structure, a supported action is chosen, and the change is validated rather than applied directly to the symptom. If the approach holds, spatial AI evaluation moves from image metrics to whether generated 3D can act as stable substrate for manufacturing, inspection, and interactive authoring.

Core claim

Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The causal-repair study shows a visible misalignment on a dependent component resolved by selecting and validating an upstream placement action instead of editing the visible geometry directly.

What carries the argument

SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes.
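A minimal sketch of what such a commit boundary could look like, written in Python for illustration only. The paper supplies no pseudocode, so every name here (`Proposal`, `Scene`, `run_transaction`, `review_threshold`) is hypothetical, and the mapping from a failed check to a particular outcome is an assumption rather than the author's design.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple

Attrs = Dict[str, float]
Entities = Dict[str, Attrs]
Diff = Dict[Tuple[str, str], Tuple[Optional[float], float]]


class Outcome(Enum):
    # The five outcomes named in the paper.
    COMMIT = "commit"
    REVIEW = "review"
    ROLLBACK = "rollback"
    DEFERRAL = "deferral"
    CAPABILITY_GAP = "capability-gap"


@dataclass
class Proposal:
    target: str   # entity reference (hypothetical naming)
    action: str   # requested action name (hypothetical)
    params: Attrs


@dataclass
class Scene:
    entities: Entities                                   # stand-in for operability state
    actions: Dict[str, Callable[[Attrs, Attrs], Attrs]]  # supported actions
    invariants: List[Callable[[Entities], bool]] = field(default_factory=list)


def run_transaction(scene: Scene, proposal: Proposal,
                    review_threshold: float = 1.0) -> Tuple[Outcome, Diff]:
    # 1. Resolve references: an unknown target is deferred (assumed policy).
    if proposal.target not in scene.entities:
        return Outcome.DEFERRAL, {}
    # 2. Check admissibility: an unsupported action is a capability gap.
    if proposal.action not in scene.actions:
        return Outcome.CAPABILITY_GAP, {}
    # 3. Project effects on a copy, never on the committed state.
    projected = {eid: dict(attrs) for eid, attrs in scene.entities.items()}
    projected[proposal.target] = scene.actions[proposal.action](
        projected[proposal.target], proposal.params)
    # 4. Protect invariants: discard the staged change if any predicate fails.
    if not all(check(projected) for check in scene.invariants):
        return Outcome.ROLLBACK, {}
    # 5. Compute the effect diff between committed and projected state.
    diff: Diff = {}
    for eid, attrs in projected.items():
        for key, new in attrs.items():
            old = scene.entities[eid].get(key)
            if old != new:
                diff[(eid, key)] = (old, new)
    # 6. Large changes go to review instead of committing (assumed policy).
    if any(abs(new - (old or 0.0)) > review_threshold for old, new in diff.values()):
        return Outcome.REVIEW, diff
    scene.entities = projected
    return Outcome.COMMIT, diff


scene = Scene(
    entities={"rail": {"x": 0.0}},
    actions={"translate_x": lambda a, p: {**a, "x": a["x"] + p["dx"]}},
    invariants=[lambda e: e["rail"]["x"] <= 5.0],
)
outcome, diff = run_transaction(scene, Proposal("rail", "translate_x", {"dx": 0.5}))
print(outcome.value, diff)  # commit {('rail', 'x'): (0.0, 0.5)}
```

The design point the sketch tries to capture is that the committed state is only replaced in the final step; every earlier exit leaves the scene untouched and returns a typed outcome.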

If this is right

Generated 3D can serve as reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.
Causal repair becomes possible by tracing visible symptoms through scene dependencies to upstream supported actions.
Systems return explicit outcomes such as capability-gap when an attempted action lacks support.
Evaluation of spatial AI shifts from visual quality alone to whether output supports validated downstream operations.
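The causal-repair consequence above can be illustrated with a small dependency walk. This is a sketch under stated assumptions, not the paper's algorithm: the graph encoding (`driven_by`), the table of declared actions (`supported`), and all entity and action names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def trace_repair_target(symptom: str,
                        driven_by: Dict[str, List[str]],
                        supported: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Breadth-first walk upstream from a visible symptom.

    Returns the dependency path from the symptom to the nearest entity that
    declares at least one supported action, or None when no such entity is
    reachable (which a runtime might report as a capability gap).
    """
    queue = deque([[symptom]])
    seen = {symptom}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if supported.get(node):
            return path
        for parent in driven_by.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(path + [parent])
    return None


# Hypothetical scene: the receiver looks misaligned, but only the upstream
# rail offset exposes a supported placement action.
driven_by = {"receiver": ["mount_slot"], "mount_slot": ["rail_offset"]}
supported = {"rail_offset": {"set_lateral_offset"}}
print(trace_repair_target("receiver", driven_by, supported))
# ['receiver', 'mount_slot', 'rail_offset']
```

Returning the whole path rather than only the target keeps the causal chain available for audit, which matches the paper's emphasis on tracing the symptom rather than patching it.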

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same contract mechanism could support multi-user editing by serializing all changes through shared SpatialTransactions.
New benchmarks could measure success by the fraction of generated scenes that survive export to simulation without manual cleanup.
The approach implies that provenance and capability-gap tracking should be first-class outputs of any 3D foundation model.

Load-bearing premise

It is feasible to maintain comprehensive scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs while SpatialTransactions reliably resolve references, check admissibility, protect invariants, and project effects.

What would settle it

A generated scene in which a SpatialTransaction accepts a change that later produces an invariant violation detectable in a downstream CAD export or robotics simulation.
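One way to search for such a counterexample is to replay committed states against an independent downstream check. The sketch below assumes a log of `(transaction_id, scene_state)` pairs and a caller-supplied checker; both the log format and the clearance rule are hypothetical illustrations, not anything specified in the paper.

```python
from typing import Callable, Dict, List, Tuple

SceneState = Dict[str, Dict[str, float]]


def find_counterexamples(committed: List[Tuple[str, SceneState]],
                         downstream_check: Callable[[SceneState], bool]) -> List[str]:
    """Return IDs of committed transactions whose resulting state fails the
    downstream check. Any hit would count against the transaction boundary."""
    return [tx_id for tx_id, state in committed if not downstream_check(state)]


def no_negative_clearance(state: SceneState) -> bool:
    # Hypothetical downstream rule: exported parts must not interpenetrate.
    return all(attrs.get("clearance", 0.0) >= 0.0 for attrs in state.values())


log = [
    ("tx-1", {"door": {"clearance": 0.02}}),
    ("tx-2", {"door": {"clearance": -0.01}}),
]
print(find_counterexamples(log, no_negative_clearance))  # ['tx-2']
```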

Figures

Figures reproduced from arXiv: 2605.24728 by Christopher Da Silva.

**Figure 1.** Operability contract continuum. Current models select bounded graph operations; transitional systems wrap neural assets with recovered structure; future systems emit candidate model-native artifacts that still pass through runtime ingestion and validation. […] dependency structure rather than applying a local visual edit. The implemented prototype substrate is broader than that single public anchor: it include…

**Figure 2.** Spatial transaction boundary. Proposed changes from models, users, imports, or backends are not treated as spatial truth until they pass reference resolution, admissibility checks, invariant checks, realization/projection, and effect-diff evaluation. The same boundary can commit, defer for review, roll back, or produce a typed capability gap.

**Figure 3.** Causal repair chain. The key empirical stress test is not whether a visible component can be moved, but whether the agent routes a visible symptom through candidate causal interpretations to the supported upstream driver and rejects weaker local edits.

**Figure 4.** Dual-stream token-level spatial validity guardrail. A model-native spatial generator can emit geometric and symbolic-operability streams while a synchronous linter or invariant checker masks invalid next-token candidates before broken spatial artifacts are rendered, exported, simulated, or committed. More formally, let a generated spatial artifact be $A_{\mathrm{spatial}} = (G, S, C, H, P, U)$, where $G$ is the geometric …

**Figure 5.** Latent execution auditing and test-time repair. Candidate artifacts that fail execution audits can be repaired by optimizing against task preservation, violation penalties, and proximity constraints, then re-decoded and re-audited before runtime ingestion.

**Figure 6.** Parametric neural edit handles. Instead of relying only on classical CAD feature trees, a model-native artifact may expose embedded handles whose local deformations are predicted through a neural Jacobian or equivalent sensitivity field and checked against constraints before commit.

**Figure 7.** Self-supervised operability loop. Generated artifacts are ingested, projected into downstream environments, scored as functional successes or violations, and converted into structured signals that improve future model behavior and runtime coverage.
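The guardrail described in the Figure 4 caption resembles constrained decoding: a checker masks next-token candidates that would make the partial artifact invalid. The toy sketch below shows only that masking step over a single symbolic stream; it does not model the dual geometric and symbolic streams, and the token vocabulary, the `anchor_before_attach` rule, and the function names are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence


def anchor_before_attach(tokens: Sequence[str]) -> bool:
    # Hypothetical linter rule: "attach" is valid only after an "anchor".
    for i, tok in enumerate(tokens):
        if tok == "attach" and "anchor" not in tokens[:i]:
            return False
    return True


def mask_invalid(logits: Dict[str, float], prefix: Sequence[str],
                 is_valid: Callable[[Sequence[str]], bool]) -> Dict[str, float]:
    """Set the score of every candidate that breaks the rule to -inf."""
    return {tok: (score if is_valid(list(prefix) + [tok]) else -math.inf)
            for tok, score in logits.items()}


def greedy_decode(step_logits: List[Dict[str, float]],
                  is_valid: Callable[[Sequence[str]], bool]) -> List[str]:
    prefix: List[str] = []
    for logits in step_logits:
        masked = mask_invalid(logits, prefix, is_valid)
        best = max(masked, key=masked.get)
        if masked[best] == -math.inf:
            break  # no valid continuation; stop instead of emitting a broken artifact
        prefix.append(best)
    return prefix


steps = [{"attach": 2.0, "anchor": 1.0}, {"attach": 1.5, "box": 0.2}]
print(greedy_decode(steps, anchor_before_attach))  # ['anchor', 'attach']
```

Without the mask, greedy decoding on the same scores would emit `attach` first and produce a sequence the rule rejects.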

Original abstract

Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Hylos proposes operability contracts and SpatialTransactions to make generated 3D usable beyond visuals, but the core mechanisms for state and transactions remain undemonstrated.

The main takeaway is that this paper frames a systems architecture called Hylos around operability contracts, scene-scale state tracking, and SpatialTransactions as commit boundaries for spatial AI. Those specific elements are presented as new and aimed at turning visually plausible 3D into reliable input for CAD, robotics, and simulation.

The work does a clear job naming the gap. Foundation models produce geometry that looks right but lacks the tracked entities, constraints, provenance, and effect projections needed for downstream engineering tasks. The causal repair example illustrates the point by tracing a visible misalignment back to an upstream placement rather than patching the symptom directly.

The soft spot is the lack of concrete mechanisms. The paper assumes it is feasible to maintain comprehensive operability state over objects, assertions, solver jobs, effect diffs and the rest, and that SpatialTransactions can resolve references, protect invariants, and project effects reliably. The artifact study stays at the level of symptom tracing without showing how state gets populated, how admissibility is checked under uncertainty, or what the invariant definitions actually are. That assumption carries the central claim.

This is for researchers thinking about systems architectures that connect 3D generation to real engineering workflows. It could prompt useful discussion in a reading group on what operability criteria should look like. I would not cite it yet because there are no results or implementations to build on. A serious editor should send it to peer review so the authors can develop the transaction logic and state management details.

Referee Report

2 major / 1 minor

Summary. The paper claims that visually plausible 3D output from foundation models is not yet operable for downstream tasks such as CAD, robotics, and simulation, and introduces Hylos as a systems architecture that maintains scene-scale operability state over objects, assemblies, surface anchors, assertions, solver jobs, and effect diffs. Changes are routed through SpatialTransactions that resolve references, check admissibility, protect invariants, and project effects, returning commit, review, rollback, deferral, or capability-gap outcomes. The claim is illustrated by a focused artifact study on causal repair, in which a visible misalignment symptom is traced to an upstream placement structure rather than edited directly.

Significance. If the proposed mechanisms for maintaining comprehensive operability state and executing reliable SpatialTransactions can be realized, the work would provide a concrete architectural path for converting model-generated spatial content into reliable substrate for engineering and interactive applications, shifting evaluation criteria from visual quality alone to operational contract satisfaction. The position/preprint framing with one illustrative study means the contribution is primarily conceptual rather than a validated implementation.

major comments (2)

[Abstract / focused artifact study] Abstract and artifact study description: the central claim that SpatialTransactions can reliably resolve references, check admissibility, protect invariants, and project effects at scene scale rests on the feasibility of maintaining operability state over the listed elements (objects, assemblies, assertions, solver jobs, effect diffs, etc.). The causal-repair illustration traces a symptom to an upstream placement but supplies no specification of state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic, leaving the load-bearing systems assumption unverified.
[Abstract] The manuscript frames itself as a systems/position preprint rather than a broad benchmark, yet the broader claim that generated 3D can become reliable substrate for CAD/robotics/etc. requires at least a minimal concrete mechanism or pseudocode for the transaction boundary to be defensible; the current description remains at the level of named entities without reduction to implementable rules.

minor comments (1)

The invented terms SpatialTransaction and operability state are introduced without accompanying formal notation, state-transition diagram, or pseudocode, which reduces clarity for readers attempting to assess the proposal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address the two major comments point by point below, maintaining the position-preprint framing of the work.

Referee: [Abstract / focused artifact study] Abstract and artifact study description: the central claim that SpatialTransactions can reliably resolve references, check admissibility, protect invariants, and project effects at scene scale rests on the feasibility of maintaining operability state over the listed elements (objects, assemblies, assertions, solver jobs, effect diffs, etc.). The causal-repair illustration traces a symptom to an upstream placement but supplies no specification of state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic, leaving the load-bearing systems assumption unverified.

Authors: We agree that the focused artifact study is illustrative and does not supply specifications for state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic. As explicitly framed in the manuscript, this is a systems/position preprint whose contribution is the architectural outline of operability contracts and SpatialTransactions rather than a verified implementation. The causal-repair example demonstrates dependency tracing and upstream repair at the conceptual level only; full verification of the load-bearing mechanisms would require a separate prototype paper. revision: no
Referee: [Abstract] The manuscript frames itself as a systems/position preprint rather than a broad benchmark, yet the broader claim that generated 3D can become reliable substrate for CAD/robotics/etc. requires at least a minimal concrete mechanism or pseudocode for the transaction boundary to be defensible; the current description remains at the level of named entities without reduction to implementable rules.

Authors: The manuscript intentionally adopts the position-preprint framing with one illustrative study. Reducing the transaction boundary to pseudocode or implementable rules would convert the work into a systems-implementation contribution, which lies outside the stated scope. The named entities and enumerated transaction outcomes (commit/review/rollback/deferral/capability-gap) are presented at the architectural level to define the necessary contract boundaries; we do not claim they constitute a complete mechanism. revision: no

Circularity Check

0 steps flagged

No circularity: architectural proposal with no equations or self-referential reductions

The paper presents Hylos as a systems/position preprint introducing an architectural framework for operability contracts. It describes scene-scale state maintenance and SpatialTransactions at a conceptual level without any equations, fitted parameters, quantitative predictions, or load-bearing self-citations. The causal repair study is an illustrative example rather than a derivation that reduces to its own inputs. No steps match the enumerated circularity patterns; the central claims rest on the proposed design itself rather than reducing by construction to prior fitted quantities or author citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger reflects high-level concepts introduced without implementation details. No free parameters are mentioned. The architecture introduces new entities without independent evidence or falsifiable handles provided.

axioms (1)

domain assumption: Foundation models can describe, reconstruct, and generate 3D objects and environments but require an additional operability layer to become useful to agents.
Stated as the opening premise in the abstract.

invented entities (2)

SpatialTransaction · no independent evidence
purpose: Commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit/review/rollback/deferral/capability-gap outcomes.
New mechanism introduced to handle durable spatial changes.

operability state · no independent evidence
purpose: Maintains scene-scale state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs.
New state representation introduced for contract-bounded spatial intelligence.

pith-pipeline@v0.9.1-grok · 5788 in / 1361 out tokens · 32372 ms · 2026-06-30T13:04:54.490626+00:00 · methodology

Reference graph

Works this paper leans on

39 extracted references · 11 canonical work pages · 5 internal anchors

[1]

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Introduction Spatial AI is often framed as a generation problem: can a model produce a convincing object, product assembly, mesh, room, video-consistent world, or neural field? That framing is necessary but incomplete. Agents do not merely look at space. They must inspect objects, reason over parts and assemblies, route through environments, simulate cons...

[2]

The receiving assembly looks laterally wrong relative to the body. Fix the physical placement

The Operability Gap In Generated 3D Generated 3D systems increasingly produce visually rich assets: meshes, neural radiance fields, Gaussian splats, embodied environments, and video-consistent worlds. These outputs expand the perceptual and creative surface of spatial AI. However, a spatial artifact can be visually impressive while remaining operationally...
[3]

Its claim is not that any one area is missing entirely

Related Work And Positioning This work sits at the intersection of semantic scene representation, embodied-agent environments, tool-using language models, programmatic geometry, and generated 3D objects, assemblies, and worlds. Its claim is not that any one area is missing entirely. The claim is that agents need an operability layer between semantic perc...
[4]

The graph records what the system believes about the scene

Hylos Runtime Contract Hylos treats spatial state as an operable graph plus a transaction runtime. The graph records what the system believes about the scene. The transaction runtime governs how that belief may change. The core invariant is: No spatial output becomes scene truth until Hylos can type it, reference it, validate it, diff it, and commit it. T...
[5]

Evidence-Grounded Interaction Today The current prototype implements a scene-scale operability substrate, not only a local interaction representation. The substrate includes scene assets, entity hypotheses, surface anchors, spatial assertions, action candidates, solver jobs, shared actuator invocations, spatial marks, work artifacts, capability gaps, and ...
[6]

A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome

Evaluation Method: Repair As Causal Stress Test The evaluation uses blind forward replay over a repair task. A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome. This tests whether the agent can reason from evidence and contracts rather than from hidden tas...
[7]

Can the agent identify that the visible symptom is not necessarily the correct edit target?
[8]

Can it select an upstream causal interaction when the scene dependencies support that interpretation?
[9]

Does validation prevent unsupported geometry changes and force deferral when support is missing?
[10]

These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader benchmark over the existing substrate

Can a new generic spatial alternative resolve an ambiguity without becoming a product-specific rule? 6.2 Baselines And Conditions The public evaluation is organized around conceptual controls rather than a large benchmark suite. These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader b...
[11]

It is that the model did not directly move the visible dependent component

Result: Causal Repair Through The Operability Contract The successful replay followed this causal chain: visual observation -> diagnostic evidence for lateral placement mismatch -> dependency structure identifies an upstream placement driver -> declared interaction space permits changing that driver -> supported geometric alternative is selected -> valida...
[12]

It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts

From Wrapped Neural Assets To Model-Native Spatial Artifacts The current transaction architecture is not the end state. It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts. 8.1 Stage 1: Transaction-Safe Explicit Lowering In the current regime, the model selects or proposes bounded graph o...
[13]

Scientific Evaluation Program The causal repair study is a minimal public empirical anchor, not a complete validation program. The current prototype already exercises more than the repair family through internal fixtures for mutation, frame transforms, support-region changes, multi-region consequence reasoning, and variant generation. The evaluation progr...
[14]

center this particular component

Discussion 10.1 Architecture Is Not The Final Product The current Hylos runtime is already a working substrate for reliable spatial interaction. The larger thesis is broader: spatial intelligence should produce operable artifacts by default. The transaction layer is therefore not a retreat from model-native generation. It is the reliability boundary that ...
[15]

The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth

Current Boundaries And Evaluation Scope This work is an architecture and artifact-study contribution built on an implemented prototype substrate. The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth. The paper reports a focused causal repair artifact because it makes the abstra...
[16]

The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage

Scaling And Public Evaluation Roadmap The next research step is to turn the existing Hylos substrate into a broad public evaluation program for operable physical 3D. The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage. Hylos already exercises the core pattern: scene-scale operability state, action candidates...
[17]

Relation graph coverage at object and environment scale: expand reporting over existing relation and assertion structures to cover containment, adjacency, attachment, articulation, support, clearance, flow, actuation, part-level dependencies, assembly constraints, and environment-level causal links
[18]

Constructive authoring benchmarks: package existing authoring, placement, mutation, resizing, and variant-generation scenarios into public tests for intent-to-topology conversion across surfaces, openings, attachments, object features, constraints, and realization outcomes
[19]

Causal and goal graph evaluations: report how the assertion/action/solver substrate links observed issues and desired outcomes to plausible drivers, requirements, validators, and admissible interactions across repair, authoring, inspection, optimization, and routing tasks
[20]

Transaction graph standardization: formalize the current action and transaction substrate into standardized preconditions, protected invariants, effect assertions, rollback semantics, audit records, and backend realization contracts across object-level and scene-level operations
[21]

Evidence acquisition benchmarks: measure when bounded visual or geometric evidence improves spatial reasoning under controlled ambiguity, and when the correct behavior is review, deferral, or additional acquisition
[22]

Uncertainty, review, and capability-gap reporting: standardize runtime uncertainty, review, deferral, unresolved assertions, solver status, and capability-gap outputs so they can be scored consistently across transaction families
[23]

Cross-representation adapter suites: extend existing realization and preview projection paths into a formal suite for display, CAD/export, simulation, robotics, manufacturing, inspection, sales visualization, training environments, and audit views
[24]

Structure recovery benchmarks: evaluate recovery over imported meshes, splats, scans, collider-backed substrates, generated assets, and neural representations, measuring whether recovered entities, frames, surfaces, relationships, uncertainty states, and action candidates support downstream operation
[25]

Model-native artifact contracts: define output formats, training objectives, and ingestion checks for artifacts that jointly expose geometry, topology, constraints, handles, provenance, uncertainty, and audit hooks
[26]

Human-in-the-loop operation studies: evaluate whether candidate interpretations, effect diffs, review states, and capability-gap explanations improve trust, correction speed, and repeated spatial-operation success
[27]

Visual 3D generation is not enough

Conclusion Hylos reframes spatial foundation-model interaction as an operability problem. Visual 3D generation is not enough. A spatial artifact - whether an object, assembly, route, scene, or environment - becomes useful to agents only when it can be inspected, modified, validated, projected, audited, and committed through a reliable runtime contract. Th...
[28]

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese. “3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera.” IEEE/CVF International Conference on Computer Vision (ICCV), 2019. https://arxiv.org/abs/1910.02527
[29]

Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone. “Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs.” International Journal of Robotics Research, 40(12-14):1510-1546, 2021. https://arxiv.org/abs/2101.06894
[30]

ReAct: Synergizing Reasoning and Acting in Language Models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. “ReAct: Synergizing Reasoning and Acting in Language Models.” International Conference on Learning Representations (ICLR), 2023. https://arxiv.org/abs/2210.03629
[31]

Toolformer: Language Models Can Teach Themselves to Use Tools

T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. “Toolformer: Language Models Can Teach Themselves to Use Tools.” Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2302.04761
[32]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

M. Ahn et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” Conference on Robot Learning (CoRL), 2022. https://arxiv.org/abs/2204.01691
[33]

Code as Policies: Language Model Programs for Embodied Control

J. Liang et al. “Code as Policies: Language Model Programs for Embodied Control.” IEEE International Conference on Robotics and Automation (ICRA), 2023. https://arxiv.org/abs/2209.07753
[34]

ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis

R. K. Jones, T. Barton, X. Xu, K. Wang, E. Jiang, P. Guerrero, N. J. Mitra, and D. Ritchie. “ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis.” ACM Transactions on Graphics, 39(6), 2020. https://arxiv.org/abs/2009.08026
[35]

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

M. Deitke et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” Advances in Neural Information Processing Systems (NeurIPS), 2022. https://arxiv.org/abs/2206.06994
[36]

Objaverse: A Universe of Annotated 3D Objects

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. “Objaverse: A Universe of Annotated 3D Objects.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. https://arxiv.org/abs/2212.08051
[37]

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Y. Yang et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. https://arxiv.org/abs/2312.09067
[38]

Marble: A Multimodal World Model

World Labs. “Marble: A Multimodal World Model.” Product and technical overview, 2025. https://www.worldlabs.ai/blog/marble-world-model
[39]

Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System

C. DaSilva. “Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System.” Internal technical report, 2026.

[1] [1]

Hylos: Operability Contracts for Model-Native Spatial Intelligence

Introduction Spatial AI is often framed as a generation problem: can a model produce a convincing object, product assembly, mesh, room, video-consistent world, or neural field? That framing is necessary but incomplete. Agents do not merely look at space. They must inspect objects, reason over parts and assemblies, route through environments, simulate cons...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

The receiving assembly looks laterally wrong relative to the body. Fix the physical placement

The Operability Gap In Generated 3D. Generated 3D systems increasingly produce visually rich assets: meshes, neural radiance fields, Gaussian splats, embodied environments, and video-consistent worlds. These outputs expand the perceptual and creative surface of spatial AI. However, a spatial artifact can be visually impressive while remaining operationally...

[3] [3]

Its claim is not that any one area is missing entirely

Related Work And Positioning. This work sits at the intersection of semantic scene representation, embodied-agent environments, tool-using language models, programmatic geometry, and generated 3D objects, assemblies, and worlds. Its claim is not that any one area is missing entirely. The claim is that agents need an operability layer between semantic perc...

[4] [4]

The graph records what the system believes about the scene

Hylos Runtime Contract. Hylos treats spatial state as an operable graph plus a transaction runtime. The graph records what the system believes about the scene. The transaction runtime governs how that belief may change. The core invariant is: No spatial output becomes scene truth until Hylos can type it, reference it, validate it, diff it, and commit it. T...
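The commit boundary described in this excerpt can be illustrated with a small sketch. It is not the Hylos implementation: the `Scene` dictionary, the `Edit` record, the `SpatialTransaction.run` method, and the mapping of an unresolved reference to a deferral outcome are all assumptions made for illustration. Only the ordering of steps (resolve, check admissibility, project effects, protect invariants, commit) and the outcome names follow the paper's description. The review outcome is listed but not exercised here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional, Set, Tuple

# Hypothetical scene model: entity id -> named numeric properties.
Scene = Dict[str, Dict[str, float]]
# Effect diff: (entity, property) -> (old value, new value).
Diff = Dict[Tuple[str, str], Tuple[Optional[float], float]]


class Outcome(Enum):
    COMMIT = "commit"
    REVIEW = "review"  # named in the paper; not exercised in this sketch
    ROLLBACK = "rollback"
    DEFERRAL = "deferral"
    CAPABILITY_GAP = "capability_gap"


@dataclass
class Edit:
    entity: str
    prop: str
    value: float


@dataclass
class SpatialTransaction:
    scene: Scene
    supported_props: Set[str]
    invariants: List[Callable[[Scene], bool]] = field(default_factory=list)

    def run(self, edit: Edit) -> Tuple[Outcome, Diff]:
        # 1. Resolve the reference (assumption: unresolved -> deferral).
        if edit.entity not in self.scene:
            return Outcome.DEFERRAL, {}
        # 2. Check admissibility against the declared capabilities.
        if edit.prop not in self.supported_props:
            return Outcome.CAPABILITY_GAP, {}
        # 3. Project effects on a copy, so scene truth is untouched so far.
        projected = {k: dict(v) for k, v in self.scene.items()}
        old = projected[edit.entity].get(edit.prop)
        projected[edit.entity][edit.prop] = edit.value
        diff: Diff = {(edit.entity, edit.prop): (old, edit.value)}
        # 4. Protect invariants: any violation discards the projection.
        if not all(inv(projected) for inv in self.invariants):
            return Outcome.ROLLBACK, diff
        # 5. Commit: only now does the edit become scene truth.
        self.scene[edit.entity][edit.prop] = edit.value
        return Outcome.COMMIT, diff


# Usage with hypothetical entities and a hypothetical lateral-offset invariant.
scene: Scene = {"body": {"x": 0.0}, "bracket": {"x": 0.5}}
tx = SpatialTransaction(
    scene,
    supported_props={"x"},
    invariants=[lambda s: abs(s["bracket"]["x"] - s["body"]["x"]) <= 1.0],
)
ok, ok_diff = tx.run(Edit("bracket", "x", 0.2))
bad, _ = tx.run(Edit("bracket", "x", 5.0))
```

Projecting onto a copy keeps the committed scene unchanged whenever an invariant fails, which is the property the core invariant asks for.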

[5] [5]

Evidence-Grounded Interaction Today. The current prototype implements a scene-scale operability substrate, not only a local interaction representation. The substrate includes scene assets, entity hypotheses, surface anchors, spatial assertions, action candidates, solver jobs, shared actuator invocations, spatial marks, work artifacts, capability gaps, and ...

[6] [6]

A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome

Evaluation Method: Repair As Causal Stress Test. The evaluation uses blind forward replay over a repair task. A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome. This tests whether the agent can reason from evidence and contracts rather than from hidden tas...

[7] [7]

Can the agent identify that the visible symptom is not necessarily the correct edit target?

[8] [8]

Can it select an upstream causal interaction when the scene dependencies support that interpretation?

[9] [9]

Does validation prevent unsupported geometry changes and force deferral when support is missing?

[10] [10]

These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader benchmark over the existing substrate

Can a new generic spatial alternative resolve an ambiguity without becoming a product-specific rule? 6.2 Baselines And Conditions. The public evaluation is organized around conceptual controls rather than a large benchmark suite. These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader b...

[11] [11]

It is that the model did not directly move the visible dependent component

Result: Causal Repair Through The Operability Contract. The successful replay followed this causal chain: visual observation -> diagnostic evidence for lateral placement mismatch -> dependency structure identifies an upstream placement driver -> declared interaction space permits changing that driver -> supported geometric alternative is selected -> valida...
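The step in this chain where the dependency structure identifies an upstream placement driver can be sketched as a walk over a dependency map. This is a minimal illustration under stated assumptions, not the paper's algorithm: the `driven_by` map, the `editable` set standing in for the declared interaction space, the function name, and the component names are all hypothetical. The sketch also assumes each component has at most one driver and never returns the symptom itself.

```python
from typing import Dict, Optional, Set


def find_upstream_driver(
    symptom: str,
    driven_by: Dict[str, str],
    editable: Set[str],
) -> Optional[str]:
    """Return the nearest upstream driver that may be changed, else None.

    Assumptions (hypothetical): `driven_by` maps a component to the single
    component controlling its placement, and `editable` is the declared
    interaction space. None signals that the caller should defer.
    """
    seen: Set[str] = set()
    node = driven_by.get(symptom)
    while node is not None and node not in seen:
        if node in editable:
            return node
        seen.add(node)  # guard against cyclic dependency records
        node = driven_by.get(node)
    return None


# Hypothetical scene: the visible symptom is on the receiving assembly.
driven_by = {
    "receiving_assembly": "mount_rail",
    "mount_rail": "placement_anchor",
}
target = find_upstream_driver("receiving_assembly", driven_by, {"placement_anchor"})
no_target = find_upstream_driver("receiving_assembly", driven_by, set())
```

Returning `None` when no permitted driver exists mirrors the deferral behavior the evaluation asks about, rather than falling back to editing the visible geometry.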

[12] [12]

It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts

From Wrapped Neural Assets To Model-Native Spatial Artifacts. The current transaction architecture is not the end state. It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts. 8.1 Stage 1: Transaction-Safe Explicit Lowering. In the current regime, the model selects or proposes bounded graph o...

[13] [13]

Scientific Evaluation Program. The causal repair study is a minimal public empirical anchor, not a complete validation program. The current prototype already exercises more than the repair family through internal fixtures for mutation, frame transforms, support-region changes, multi-region consequence reasoning, and variant generation. The evaluation progr...

[14] [14]

center this particular component

Discussion. 10.1 Architecture Is Not The Final Product. The current Hylos runtime is already a working substrate for reliable spatial interaction. The larger thesis is broader: spatial intelligence should produce operable artifacts by default. The transaction layer is therefore not a retreat from model-native generation. It is the reliability boundary that ...

[15] [15]

The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth

Current Boundaries And Evaluation Scope. This work is an architecture and artifact-study contribution built on an implemented prototype substrate. The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth. The paper reports a focused causal repair artifact because it makes the abstra...

[16] [16]

The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage

Scaling And Public Evaluation Roadmap. The next research step is to turn the existing Hylos substrate into a broad public evaluation program for operable physical 3D. The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage. Hylos already exercises the core pattern: scene-scale operability state, action candidates...

[17] [17]

Relation graph coverage at object and environment scale: expand reporting over existing relation and assertion structures to cover containment, adjacency, attachment, articulation, support, clearance, flow, actuation, part-level dependencies, assembly constraints, and environment-level causal links

[18] [18]

Constructive authoring benchmarks: package existing authoring, placement, mutation, resizing, and variant-generation scenarios into public tests for intent-to-topology conversion across surfaces, openings, attachments, object features, constraints, and realization outcomes

[19] [19]

Causal and goal graph evaluations: report how the assertion/action/solver substrate links observed issues and desired outcomes to plausible drivers, requirements, validators, and admissible interactions across repair, authoring, inspection, optimization, and routing tasks

[20] [20]

Transaction graph standardization: formalize the current action and transaction substrate into standardized preconditions, protected invariants, effect assertions, rollback semantics, audit records, and backend realization contracts across object-level and scene-level operations

[21] [21]

Evidence acquisition benchmarks: measure when bounded visual or geometric evidence improves spatial reasoning under controlled ambiguity, and when the correct behavior is review, deferral, or additional acquisition

[22] [22]

Uncertainty, review, and capability-gap reporting: standardize runtime uncertainty, review, deferral, unresolved assertions, solver status, and capability-gap outputs so they can be scored consistently across transaction families

[23] [23]

Cross-representation adapter suites: extend existing realization and preview projection paths into a formal suite for display, CAD/export, simulation, robotics, manufacturing, inspection, sales visualization, training environments, and audit views

[24] [24]

Structure recovery benchmarks: evaluate recovery over imported meshes, splats, scans, collider-backed substrates, generated assets, and neural representations, measuring whether recovered entities, frames, surfaces, relationships, uncertainty states, and action candidates support downstream operation

[25] [25]

Model-native artifact contracts: define output formats, training objectives, and ingestion checks for artifacts that jointly expose geometry, topology, constraints, handles, provenance, uncertainty, and audit hooks

[26] [26]

Human-in-the-loop operation studies: evaluate whether candidate interpretations, effect diffs, review states, and capability-gap explanations improve trust, correction speed, and repeated spatial-operation success

[27] [27]

Visual 3D generation is not enough

Conclusion. Hylos reframes spatial foundation-model interaction as an operability problem. Visual 3D generation is not enough. A spatial artifact - whether an object, assembly, route, scene, or environment - becomes useful to agents only when it can be inspected, modified, validated, projected, audited, and committed through a reliable runtime contract. Th...

[28] [28]

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese. “3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera.” IEEE/CVF International Conference on Computer Vision (ICCV), 2019. https://arxiv.org/abs/1910.02527

[29] [29]

Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs

A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone. “Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs.” International Journal of Robotics Research, 40(12-14):1510-1546, 2021. https://arxiv.org/abs/2101.06894

[30] [30]

ReAct: Synergizing Reasoning and Acting in Language Models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. “ReAct: Synergizing Reasoning and Acting in Language Models.” International Conference on Learning Representations (ICLR), 2023. https://arxiv.org/abs/2210.03629

[31] [31]

Toolformer: Language Models Can Teach Themselves to Use Tools

T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. “Toolformer: Language Models Can Teach Themselves to Use Tools.” Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2302.04761

[32] [32]

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

M. Ahn et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” Conference on Robot Learning (CoRL), 2022. https://arxiv.org/abs/2204.01691

[33] [33]

Code as Policies: Language Model Programs for Embodied Control

J. Liang et al. “Code as Policies: Language Model Programs for Embodied Control.” IEEE International Conference on Robotics and Automation (ICRA), 2023. https://arxiv.org/abs/2209.07753

[34] [34]

ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis

R. K. Jones, T. Barton, X. Xu, K. Wang, E. Jiang, P. Guerrero, N. J. Mitra, and D. Ritchie. “ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis.” ACM Transactions on Graphics, 39(6), 2020. https://arxiv.org/abs/2009.08026

[35] [35]

ProcTHOR: Large-Scale Embodied AI Using Procedural Generation

M. Deitke et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” Advances in Neural Information Processing Systems (NeurIPS), 2022. https://arxiv.org/abs/2206.06994

[36] [36]

Objaverse: A Universe of Annotated 3D Objects

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. “Objaverse: A Universe of Annotated 3D Objects.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. https://arxiv.org/abs/2212.08051

[37] [37]

Holodeck: Language Guided Generation of 3D Embodied AI Environments

Y. Yang et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. https://arxiv.org/abs/2312.09067

[38] [38]

Marble: A Multimodal World Model

World Labs. “Marble: A Multimodal World Model.” Product and technical overview, 2025. https://www.worldlabs.ai/blog/marble-world-model

[39] [39]

Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System

C. DaSilva. “Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System.” Internal technical report, 2026.