Hylos: Operability Contracts for Model-Native Spatial Intelligence
Pith reviewed 2026-06-30 13:04 UTC · model grok-4.3
The pith
Spatial AI outputs become usable only when edits route through SpatialTransactions that enforce scene invariants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. In the causal-repair study, a visible misalignment on a dependent component is resolved by selecting and validating an upstream placement action rather than by editing the visible geometry directly.
What carries the argument
SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes.
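A minimal Python sketch may make the five outcomes concrete. Everything in it is hypothetical: the paper publishes no API, so `Outcome`, `Edit`, `Scene`, and `run_transaction` are illustrative names, the scene is reduced to one coordinate per entity, and the mapping of failure modes to outcomes (unresolved reference to deferral, invariant violation to rollback, low confidence to review) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Outcome(Enum):
    COMMIT = "commit"
    REVIEW = "review"
    ROLLBACK = "rollback"
    DEFERRAL = "deferral"
    CAPABILITY_GAP = "capability-gap"


@dataclass
class Edit:
    target: str              # entity reference (hypothetical naming)
    action: str              # requested action, e.g. "translate_x"
    value: float             # amount to add to the target's position
    confidence: float = 1.0  # assumed evidence confidence in [0, 1]


@dataclass
class Scene:
    positions: Dict[str, float]      # entity -> x position (1D toy scene)
    supported: Dict[str, List[str]]  # entity -> admissible action names
    invariants: List[Callable[[Dict[str, float]], bool]] = field(default_factory=list)


def run_transaction(
    scene: Scene, edit: Edit, review_below: float = 0.5
) -> Tuple[Outcome, Dict[str, Tuple[float, float]]]:
    """Return an outcome plus an effect diff of (old, new) values."""
    # 1. Resolve references: an unknown target cannot be edited yet.
    if edit.target not in scene.positions:
        return Outcome.DEFERRAL, {}
    # 2. Check admissibility: the target must declare the action.
    if edit.action not in scene.supported.get(edit.target, []):
        return Outcome.CAPABILITY_GAP, {}
    # 3. Project effects on a copy so the live scene stays untouched.
    projected = dict(scene.positions)
    projected[edit.target] += edit.value
    diff = {
        name: (scene.positions[name], new)
        for name, new in projected.items()
        if new != scene.positions[name]
    }
    # 4. Protect invariants: discard the projection if any check fails.
    if not all(check(projected) for check in scene.invariants):
        return Outcome.ROLLBACK, diff
    # 5. Low-confidence edits are surfaced for review instead of committed.
    if edit.confidence < review_below:
        return Outcome.REVIEW, diff
    scene.positions = projected
    return Outcome.COMMIT, diff
```

Only the final branch mutates the scene, which is the property the commit boundary is meant to guarantee; every other branch returns a typed outcome and leaves scene state as it was.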
If this is right
- Generated 3D can serve as reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.
- Causal repair becomes possible by tracing visible symptoms through scene dependencies to upstream supported actions.
- Systems return explicit outcomes such as capability-gap when an attempted action lacks support.
- Evaluation of spatial AI shifts from visual quality alone to whether output supports validated downstream operations.
Where Pith is reading between the lines
- The same contract mechanism could support multi-user editing by serializing all changes through shared SpatialTransactions.
- New benchmarks could measure success by the fraction of generated scenes that survive export to simulation without manual cleanup.
- The approach implies that provenance and capability-gap tracking should be first-class outputs of any 3D foundation model.
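The export-survival benchmark suggested in the list above reduces to a simple ratio. The sketch below is an assumption-laden illustration: `export_survival_rate` and the `exports_cleanly` predicate are hypothetical names, and what counts as "surviving export without manual cleanup" would have to be defined by a real exporter or simulator check.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def export_survival_rate(
    scenes: Iterable[S], exports_cleanly: Callable[[S], bool]
) -> float:
    """Fraction of scenes for which the (hypothetical) export check passes."""
    results = [bool(exports_cleanly(scene)) for scene in scenes]
    if not results:
        # An empty benchmark has no meaningful score.
        raise ValueError("no scenes to score")
    return sum(results) / len(results)
```

Keeping the predicate as a parameter lets the same score be reported per target (CAD export, physics simulation, robotics planner) without changing the metric.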
Load-bearing premise
It is feasible to maintain comprehensive scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs while SpatialTransactions reliably resolve references, check admissibility, protect invariants, and project effects.
What would settle it
A generated scene in which a SpatialTransaction accepts a change that later produces an invariant violation detectable in a downstream CAD export or robotics simulation.
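One way to make this test operational is a replay harness that flags every edit the transaction layer accepted but a downstream checker rejects. The sketch is hypothetical throughout: `commit_fn` stands in for a SpatialTransaction, `downstream_check` stands in for a CAD-export or simulation validator, and neither corresponds to a published Hylos interface.

```python
from typing import Callable, Dict, Iterable, List

State = Dict[str, float]


def find_counterexamples(
    initial: State,
    edits: Iterable[State],
    commit_fn: Callable[[State, State], bool],
    downstream_check: Callable[[State], bool],
) -> List[int]:
    """Return indices of edits that were committed yet fail downstream."""
    state = dict(initial)
    failures: List[int] = []
    for index, edit in enumerate(edits):
        if not commit_fn(state, edit):
            continue  # rejected edits never reach the scene
        state = {**state, **edit}  # apply the committed edit
        if not downstream_check(state):
            failures.append(index)  # accepted upstream, broken downstream
    return failures
```

A non-empty result is exactly the kind of evidence described above: a committed change whose invariant violation only surfaces after export.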
Original abstract
Foundation models can increasingly describe, reconstruct, and generate 3D objects, assemblies, scenes, and environments, but visually plausible spatial output is not yet operable 3D. A generated object or environment becomes useful to an agent only when the system can identify its entities, frames, surfaces, constraints, provenance, admissible actions, expected effects, and validation failures. This paper introduces Hylos, a systems architecture for contract-bounded spatial intelligence. Hylos maintains scene-scale operability state over objects, assemblies, assets, surface anchors, assertions, action candidates, solver jobs, shared actuator invocations, capability gaps, and effect diffs. Durable spatial changes are routed through a SpatialTransaction: a commit boundary that resolves references, checks admissibility, protects invariants, projects effects, and returns commit, review, rollback, deferral, or capability-gap outcomes. The paper is framed as a systems/position preprint with a focused artifact study rather than a broad benchmark. The study examines causal repair: a visible misalignment appears on a dependent component, while the supported repair lies upstream in the placement structure that controls it. The successful interaction traces the symptom through scene dependencies, selects a supported upstream interaction, and applies a validated change instead of directly editing visible geometry. The broader claim is that spatial AI should be evaluated not only by visual quality, but by whether generated or edited 3D can become reliable substrate for CAD, robotics, simulation, inspection, manufacturing, and interactive world authoring.
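The repair pattern in the abstract can be sketched as a walk up a dependency graph. This is an illustrative toy, not the paper's algorithm: `depends_on`, `supported_actions`, and `find_repair_target` are hypothetical names, and the scene is reduced to a map from each entity to the single entity that controls its placement.

```python
from typing import Dict, List, Optional


def find_repair_target(
    symptom: str,
    depends_on: Dict[str, str],
    supported_actions: Dict[str, List[str]],
    action: str,
) -> Optional[str]:
    """Walk upstream from the entity showing the symptom and return the
    nearest entity that declares `action` as supported, or None."""
    seen = set()
    node: Optional[str] = symptom
    while node is not None and node not in seen:
        seen.add(node)  # guard against dependency cycles
        if action in supported_actions.get(node, []):
            return node
        node = depends_on.get(node)  # step to the controlling entity
    return None
```

For example, with `depends_on = {"receiver": "mount", "mount": "rail"}` and only `"rail"` declaring a lateral-offset action, the search skips the visibly misaligned receiver and returns the rail. A `None` result corresponds to the case where no supported repair exists and the system should defer or report a capability gap.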
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that visually plausible 3D output from foundation models is not yet operable for downstream tasks such as CAD, robotics, and simulation, and introduces Hylos as a systems architecture that maintains scene-scale operability state over objects, assemblies, surface anchors, assertions, solver jobs, and effect diffs. Changes are routed through SpatialTransactions that resolve references, check admissibility, protect invariants, and project effects, returning commit, review, rollback, deferral, or capability-gap outcomes. The claim is illustrated by a focused artifact study on causal repair, in which a visible misalignment symptom is traced to an upstream placement structure rather than edited directly.
Significance. If the proposed mechanisms for maintaining comprehensive operability state and executing reliable SpatialTransactions can be realized, the work would provide a concrete architectural path for converting model-generated spatial content into reliable substrate for engineering and interactive applications, shifting evaluation criteria from visual quality alone to operational contract satisfaction. The position/preprint framing with one illustrative study means the contribution is primarily conceptual rather than a validated implementation.
major comments (2)
- [Abstract / focused artifact study] Abstract and artifact study description: the central claim that SpatialTransactions can reliably resolve references, check admissibility, protect invariants, and project effects at scene scale rests on the feasibility of maintaining operability state over the listed elements (objects, assemblies, assertions, solver jobs, effect diffs, etc.). The causal-repair illustration traces a symptom to an upstream placement but supplies no specification of state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic, leaving the load-bearing systems assumption unverified.
- [Abstract] The manuscript frames itself as a systems/position preprint rather than a broad benchmark, yet the broader claim that generated 3D can become reliable substrate for CAD/robotics/etc. requires at least a minimal concrete mechanism or pseudocode for the transaction boundary to be defensible; the current description remains at the level of named entities without reduction to implementable rules.
minor comments (1)
- The invented terms SpatialTransaction and operability state are introduced without accompanying formal notation, state-transition diagram, or pseudocode, which reduces clarity for readers attempting to assess the proposal.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address the two major comments point by point below, maintaining the position-preprint framing of the work.
-
Referee: [Abstract / focused artifact study] Abstract and artifact study description: the central claim that SpatialTransactions can reliably resolve references, check admissibility, protect invariants, and project effects at scene scale rests on the feasibility of maintaining operability state over the listed elements (objects, assemblies, assertions, solver jobs, effect diffs, etc.). The causal-repair illustration traces a symptom to an upstream placement but supplies no specification of state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic, leaving the load-bearing systems assumption unverified.
Authors: We agree that the focused artifact study is illustrative and does not supply specifications for state population, reference resolution under geometric uncertainty, invariant definitions, or effect-projection logic. As explicitly framed in the manuscript, this is a systems/position preprint whose contribution is the architectural outline of operability contracts and SpatialTransactions rather than a verified implementation. The causal-repair example demonstrates dependency tracing and upstream repair at the conceptual level only; full verification of the load-bearing mechanisms would require a separate prototype paper.
Revision: no
-
Referee: [Abstract] The manuscript frames itself as a systems/position preprint rather than a broad benchmark, yet the broader claim that generated 3D can become reliable substrate for CAD/robotics/etc. requires at least a minimal concrete mechanism or pseudocode for the transaction boundary to be defensible; the current description remains at the level of named entities without reduction to implementable rules.
Authors: The manuscript intentionally adopts the position-preprint framing with one illustrative study. Reducing the transaction boundary to pseudocode or implementable rules would convert the work into a systems-implementation contribution, which lies outside the stated scope. The named entities and enumerated transaction outcomes (commit/review/rollback/deferral/capability-gap) are presented at the architectural level to define the necessary contract boundaries; we do not claim they constitute a complete mechanism.
Revision: no
Circularity Check
No circularity: architectural proposal with no equations or self-referential reductions
The paper presents Hylos as a systems/position preprint introducing an architectural framework for operability contracts. It describes scene-scale state maintenance and SpatialTransactions at a conceptual level without any equations, fitted parameters, quantitative predictions, or load-bearing self-citations. The causal repair study is an illustrative example rather than a derivation that reduces to its own inputs. No steps match the enumerated circularity patterns; the central claims rest on the proposed design itself rather than reducing by construction to prior fitted quantities or author citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Foundation models can describe, reconstruct, and generate 3D objects and environments but require an additional operability layer to become useful to agents.
invented entities (2)
-
SpatialTransaction
no independent evidence
-
operability state
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Hylos: Operability Contracts for Model-Native Spatial Intelligence
Introduction Spatial AI is often framed as a generation problem: can a model produce a convincing object, product assembly, mesh, room, video-consistent world, or neural field? That framing is necessary but incomplete. Agents do not merely look at space. They must inspect objects, reason over parts and assemblies, route through environments, simulate cons...
-
[2]
The receiving assembly looks laterally wrong relative to the body. Fix the physical placement
The Operability Gap In Generated 3D Generated 3D systems increasingly produce visually rich assets: meshes, neural radiance fields, Gaussian splats, embodied environments, and video-consistent worlds. These outputs expand the perceptual and creative surface of spatial AI. However, a spatial artifact can be visually impressive while remaining operationally...
-
[3]
Its claim is not that any one area is missing entirely
Related Work And Positioning This work sits at the intersection of semantic scene representation, embodied-agent environments, tool-using language models, programmatic geometry, and generated 3D objects, assemblies, and worlds. Its claim is not that any one area is missing entirely. The claim is that agents need an operability layer between semantic perc...
-
[4]
The graph records what the system believes about the scene
Hylos Runtime Contract Hylos treats spatial state as an operable graph plus a transaction runtime. The graph records what the system believes about the scene. The transaction runtime governs how that belief may change. The core invariant is: No spatial output becomes scene truth until Hylos can type it, reference it, validate it, diff it, and commit it. T...
-
[5]
Evidence-Grounded Interaction Today The current prototype implements a scene-scale operability substrate, not only a local interaction representation. The substrate includes scene assets, entity hypotheses, surface anchors, spatial assertions, action candidates, solver jobs, shared actuator invocations, spatial marks, work artifacts, capability gaps, and ...
-
[6]
A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome
Evaluation Method: Repair As Causal Stress Test The evaluation uses blind forward replay over a repair task. A canonical scene is perturbed, the system is asked to repair it, and the resulting scene is compared against the expected causal and geometric outcome. This tests whether the agent can reason from evidence and contracts rather than from hidden tas...
-
[7]
Can the agent identify that the visible symptom is not necessarily the correct edit target?
-
[8]
Can it select an upstream causal interaction when the scene dependencies support that interpretation?
-
[9]
Does validation prevent unsupported geometry changes and force deferral when support is missing?
-
[10]
These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader benchmark over the existing substrate
Can a new generic spatial alternative resolve an ambiguity without becoming a product-specific rule? 6.2 Baselines And Conditions The public evaluation is organized around conceptual controls rather than a large benchmark suite. These controls isolate the design choices needed for the causal repair proof and define the comparison structure for a broader b...
-
[11]
It is that the model did not directly move the visible dependent component
Result: Causal Repair Through The Operability Contract The successful replay followed this causal chain: visual observation -> diagnostic evidence for lateral placement mismatch -> dependency structure identifies an upstream placement driver -> declared interaction space permits changing that driver -> supported geometric alternative is selected -> valida...
-
[12]
It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts
From Wrapped Neural Assets To Model-Native Spatial Artifacts The current transaction architecture is not the end state. It is a reliability scaffold for the transition from explicit graph operations to future model-native spatial artifacts. 8.1 Stage 1: Transaction-Safe Explicit Lowering In the current regime, the model selects or proposes bounded graph o...
-
[13]
Scientific Evaluation Program The causal repair study is a minimal public empirical anchor, not a complete validation program. The current prototype already exercises more than the repair family through internal fixtures for mutation, frame transforms, support-region changes, multi-region consequence reasoning, and variant generation. The evaluation progr...
-
[14]
center this particular component
Discussion 10.1 Architecture Is Not The Final Product The current Hylos runtime is already a working substrate for reliable spatial interaction. The larger thesis is broader: spatial intelligence should produce operable artifacts by default. The transaction layer is therefore not a retreat from model-native generation. It is the reliability boundary that ...
-
[15]
The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth
Current Boundaries And Evaluation Scope This work is an architecture and artifact-study contribution built on an implemented prototype substrate. The main boundary is not the absence of scene-scale operability machinery; it is the current public packaging and benchmark breadth. The paper reports a focused causal repair artifact because it makes the abstra...
-
[16]
The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage
Scaling And Public Evaluation Roadmap The next research step is to turn the existing Hylos substrate into a broad public evaluation program for operable physical 3D. The emphasis is scale, formalization, standardization, benchmark release, and cross-domain coverage. Hylos already exercises the core pattern: scene-scale operability state, action candidates...
-
[17]
Relation graph coverage at object and environment scale: expand reporting over existing relation and assertion structures to cover containment, adjacency, attachment, articulation, support, clearance, flow, actuation, part-level dependencies, assembly constraints, and environment-level causal links
-
[18]
Constructive authoring benchmarks: package existing authoring, placement, mutation, resizing, and variant-generation scenarios into public tests for intent-to-topology conversion across surfaces, openings, attachments, object features, constraints, and realization outcomes
-
[19]
Causal and goal graph evaluations: report how the assertion/action/solver substrate links observed issues and desired outcomes to plausible drivers, requirements, validators, and admissible interactions across repair, authoring, inspection, optimization, and routing tasks
-
[20]
Transaction graph standardization: formalize the current action and transaction substrate into standardized preconditions, protected invariants, effect assertions, rollback semantics, audit records, and backend realization contracts across object-level and scene-level operations
-
[21]
Evidence acquisition benchmarks: measure when bounded visual or geometric evidence improves spatial reasoning under controlled ambiguity, and when the correct behavior is review, deferral, or additional acquisition
-
[22]
Uncertainty, review, and capability-gap reporting: standardize runtime uncertainty, review, deferral, unresolved assertions, solver status, and capability-gap outputs so they can be scored consistently across transaction families
-
[23]
Cross-representation adapter suites: extend existing realization and preview projection paths into a formal suite for display, CAD/export, simulation, robotics, manufacturing, inspection, sales visualization, training environments, and audit views
-
[24]
Structure recovery benchmarks: evaluate recovery over imported meshes, splats, scans, collider-backed substrates, generated assets, and neural representations, measuring whether recovered entities, frames, surfaces, relationships, uncertainty states, and action candidates support downstream operation
-
[25]
Model-native artifact contracts: define output formats, training objectives, and ingestion checks for artifacts that jointly expose geometry, topology, constraints, handles, provenance, uncertainty, and audit hooks
-
[26]
Human-in-the-loop operation studies: evaluate whether candidate interpretations, effect diffs, review states, and capability-gap explanations improve trust, correction speed, and repeated spatial-operation success
-
[27]
Visual 3D generation is not enough
Conclusion Hylos reframes spatial foundation-model interaction as an operability problem. Visual 3D generation is not enough. A spatial artifact - whether an object, assembly, route, scene, or environment - becomes useful to agents only when it can be inspected, modified, validated, projected, audited, and committed through a reliable runtime contract. Th...
-
[28]
3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera
I. Armeni, Z.-Y. He, J. Gwak, A. R. Zamir, M. Fischer, J. Malik, and S. Savarese. “3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera.” IEEE/CVF International Conference on Computer Vision (ICCV), 2019. https://arxiv.org/abs/1910.02527
-
[29]
Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs
A. Rosinol, A. Violette, M. Abate, N. Hughes, Y. Chang, J. Shi, A. Gupta, and L. Carlone. “Kimera: From SLAM to Spatial Perception with 3D Dynamic Scene Graphs.” International Journal of Robotics Research, 40(12-14):1510-1546, 2021. https://arxiv.org/abs/2101.06894
-
[30]
ReAct: Synergizing Reasoning and Acting in Language Models
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. “ReAct: Synergizing Reasoning and Acting in Language Models.” International Conference on Learning Representations (ICLR), 2023. https://arxiv.org/abs/2210.03629
-
[31]
Toolformer: Language Models Can Teach Themselves to Use Tools
T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. “Toolformer: Language Models Can Teach Themselves to Use Tools.” Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2302.04761
-
[32]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
M. Ahn et al. “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances.” Conference on Robot Learning (CoRL), 2022. https://arxiv.org/abs/2204.01691
-
[33]
Code as Policies: Language Model Programs for Embodied Control
J. Liang et al. “Code as Policies: Language Model Programs for Embodied Control.” IEEE International Conference on Robotics and Automation (ICRA), 2023. https://arxiv.org/abs/2209.07753
-
[34]
R. K. Jones, T. Barton, X. Xu, K. Wang, E. Jiang, P. Guerrero, N. J. Mitra, and D. Ritchie. “ShapeAssembly: Learning to Generate Programs for 3D Shape Structure Synthesis.” ACM Transactions on Graphics, 39(6), 2020. https://arxiv.org/abs/2009.08026
-
[35]
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation
M. Deitke et al. “ProcTHOR: Large-Scale Embodied AI Using Procedural Generation.” Advances in Neural Information Processing Systems (NeurIPS), 2022. https://arxiv.org/abs/2206.06994
-
[36]
Objaverse: A Universe of Annotated 3D Objects
M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. “Objaverse: A Universe of Annotated 3D Objects.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. https://arxiv.org/abs/2212.08051
-
[37]
Holodeck: Language Guided Generation of 3D Embodied AI Environments
Y. Yang et al. “Holodeck: Language Guided Generation of 3D Embodied AI Environments.” IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. https://arxiv.org/abs/2312.09067
-
[38]
Marble: A Multimodal World Model
World Labs. “Marble: A Multimodal World Model.” Product and technical overview, 2025. https://www.worldlabs.ai/blog/marble-world-model
-
[39]
Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System
C. DaSilva. “Evidence-Grounded Spatial Reasoning with a Prototype Semantic-Spatial Research System.” Internal technical report, 2026