
arxiv: 2604.22160 · v1 · submitted 2026-04-24 · 💻 cs.CV · cs.AI

GenMatter: Perceiving Physical Objects with Generative Matter Models

Pith reviewed 2026-05-08 12:48 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords generative model · motion perception · object segmentation · structure from motion · particle clustering · human vision · scene understanding · Gibbs sampling

The pith

A generative model groups motion cues and appearance features into particles and clusters to perceive physical objects across dots, camouflage, and video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a generative model that hierarchically groups low-level motion cues and high-level appearance features first into particles, defined as small Gaussians for local matter, and then into clusters that represent coherently and independently moveable physical entities. Inference relies on a hardware-accelerated parallelized block Gibbs sampling algorithm to recover stable particle motions and groupings. The model is shown to operate on random dot kinematograms, stylized textured scenes, and naturalistic RGB videos. Validation demonstrates it reproduces human object perception with graded uncertainty, recovers 3D structure from motion for segmentation in camouflaged cases, and tracks deforming objects by their underlying 3D matter.

Core claim

The central claim is that a unified generative model representing local matter as particles and whole entities as clusters, inferred via parallelized block Gibbs sampling, recovers human-like detection of independently moveable physical objects from motion. This holds when tested on 2D random dot kinematograms matching human uncertainty, on Gestalt-style camouflaged rotating objects yielding correct 3D structure and 2D segmentation, and on naturalistic RGB videos tracking the 3D matter of deforming objects.

What carries the argument

Hierarchical grouping of motion cues and appearance features into particles (small Gaussians representing local matter) and clusters (for coherently and independently moveable physical entities), inferred by parallelized block Gibbs sampling.
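
To make the inference idea concrete, here is a minimal sketch of block Gibbs sampling on a toy grouping problem. It is not the paper's model or code: the function name `gibbs_group`, the fixed isotropic noise `sigma`, the zero-mean Normal prior, the evenly spaced initialisation, and the demo data are all assumptions chosen for illustration. Each sweep resamples every assignment given the Gaussian means, then every mean given the assignments. Within a block the updates are conditionally independent, which is the property a parallel implementation can exploit.

```python
import math
import random


def gibbs_group(points, k, sigma=0.3, prior_sigma=10.0, sweeps=50, seed=0):
    """Toy block Gibbs sampler: assign feature vectors to k isotropic Gaussians.

    Assumptions (not from the paper): known noise scale `sigma`, uniform
    mixture weights, and a zero-mean Normal prior on each Gaussian mean.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    # Hypothetical initialisation: means start at evenly spaced data points.
    means = [list(points[i * len(points) // k]) for i in range(k)]
    assign = [0] * len(points)
    for _ in range(sweeps):
        # Block 1: assignments are conditionally independent given the means,
        # so this loop could run in parallel across points.
        for i, p in enumerate(points):
            logw = [-sum((a - b) ** 2 for a, b in zip(p, m)) / (2 * sigma ** 2)
                    for m in means]
            top = max(logw)  # subtract the max for numerical stability
            weights = [math.exp(lw - top) for lw in logw]
            assign[i] = rng.choices(range(k), weights=weights)[0]
        # Block 2: means are conditionally independent given the assignments
        # (conjugate Normal update, one coordinate at a time).
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            precision = len(members) / sigma ** 2 + 1 / prior_sigma ** 2
            for d in range(dim):
                total = sum(p[d] for p in members)
                post_mean = (total / sigma ** 2) / precision
                means[j][d] = rng.gauss(post_mean, math.sqrt(1 / precision))
    return assign, means


# Made-up demo: two dot populations that overlap in space and differ only in
# velocity. Each feature vector is (x, y, vx, vy).
demo_rng = random.Random(1)
dots = ([(demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3), 1.0, 0.0) for _ in range(20)]
        + [(demo_rng.gauss(0, 0.3), demo_rng.gauss(0, 0.3), -1.0, 0.0) for _ in range(20)])
labels, centres = gibbs_group(dots, k=2)
```

Because the two populations occupy the same spatial region, any grouping the sampler finds in this toy run has to come from the motion features rather than from position.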

If this is right

  • The model captures graded uncertainty in human object perception on 2D random dot kinematograms under ambiguous conditions.
  • It recovers correct 3D structure from motion to produce accurate 2D object segmentation on camouflaged rotating objects.
  • It tracks the moving 3D matter that makes up deforming objects in naturalistic RGB videos for robust scene understanding.
  • The same framework applies uniformly to sparse dots, stylized textures, and real video inputs.
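
The segmentation claim in the list above is the kind of result usually scored with intersection-over-union (IoU), the metric the simulated rebuttal further down also names. A minimal sketch of that score follows; the function name `mask_iou` and the toy masks are hypothetical, and the paper's actual evaluation protocol is not visible from this page.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Made-up 3x3 masks: 3 pixels overlap, 5 pixels are covered by either mask.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
truth = [[0, 0, 1],
         [0, 1, 1],
         [0, 0, 1]]
score = mask_iou(pred, truth)  # 3 / 5
```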

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The particle representation of local matter could support extensions that predict how clusters interact under basic physical rules.
  • Success on both sparse and rich inputs suggests the grouping approach might transfer to other sensory grouping tasks driven by motion coherence.
  • Explicit modeling of uncertainty through the generative process may improve handling of noisy or partial observations compared to deterministic alternatives.

Load-bearing premise

That the hierarchical grouping of motion cues and appearance features into particles and clusters will reliably identify coherently and independently moveable physical entities, and that the sampling will produce stable motion and grouping estimates.

What would settle it

A new set of random dot kinematogram stimuli with known human uncertainty levels where the model's inferred object boundaries or uncertainty estimates deviate substantially from psychophysical measurements.
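
One way to quantify such a deviation is the correlation between per-stimulus model accuracy and human accuracy, the kind of statistic the Figure 4 caption reports as a squared correlation. The sketch below computes Pearson's r from scratch; the function name `pearson_r` and the five accuracy values are made up for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-stimulus accuracies (%), one entry per stimulus.
human = [95.0, 80.0, 62.0, 55.0, 90.0]
model = [92.0, 85.0, 58.0, 60.0, 88.0]
r = pearson_r(human, model)
r_squared = r ** 2  # share of variance in one series explained by the other
```

A low `r_squared` on a fresh stimulus set would be the kind of evidence this section describes, though a single correlation does not say whether the model's graded uncertainty matches human uncertainty stimulus by stimulus.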

Figures

Figures reproduced from arXiv: 2604.22160 by Arijit Dasgupta, Eric Li, Joshua B. Tenenbaum, Mathieu Huot, Thomas O'Connell, Vikash Mansinghka, William T. Freeman, Yoni Friedman.

Figure 1: GenMatter is a generative model of moving matter. Conditioned on motion and appearance features extracted from RGB video, inference inverts this hierarchical generative model to group observations into particles (small Gaussians representing local regions of matter), themselves grouped into clusters (coherently and independently moveable physical entities). A hardware-accelerated inference algorithm based …
Figure 2: GenMatter inference pipeline. RGB video is preprocessed (gray arrows) to extract dense depth and optical flow, lifting each pixel to a 3D point tagged with its velocity. The blue box depicts the GenMatter generative model (Sec. 3), which represents a scene as a hierarchy of clusters and particles that emit moving 3D points. Black arrows indicate the generative direction (clusters generate particles, which …
Figure 3: ARAP vs. GenMatter on a two-object scene. ARAP …
Figure 4: GenMatter closely tracks human perceptual judgments on random dot kinematograms. (a) GenMatter accuracy (%) vs. participant accuracy (%) across 27 stimuli (r² = 0.86). Each blue circle represents a same-object stimulus and each orange triangle a different-object stimulus. (b) An internal view of the inferred posterior: red points belong to the moving object, blue points to the background, and black points…
Figure 5: Qualitative comparison on camouflaged stimuli. Probe point segmentation on scene 16, texture 01. The depth estimate is uninformative, and the flow estimate shows that on-axis rotation causes opposing motion at top vs. bottom (blue vs. red). GenMatter correctly segments the moving object, while FlowSAM segments the initial frame correctly but degrades over time. SegAnyMo fails to detect any object in the sc…
Figure 6: Per-point particle assignment visualization. Each data point is colored by its assigned particle’s RGB color (left), motion direction (middle), and appearance features (right). Gaussian particles are shown in the second row. Distinct patterns across motion and appearance demonstrate that GenMatter integrates complementary information sources for faithful matter representation.
Figure 7: Instructions shown to all participants. This task was allowed to be conducted on either a desktop or a tablet. The 11 videos …
Figure 8: Visual descriptions of the familiarization trials, shown to all 150 participants.
Figure 9: Example Gestalt Stimuli with Different Textures. First frame of scene 00000 rendered with seven different texture patterns. The Gestalt experiment uses 20 scenes (00000–00019), each rendered with these seven textures (00, 07, 13, 16, 21, 22, 25), yielding 140 total stimuli to evaluate structure-from-motion segmentation. In the paper, these textures are referenced sequentially as (00, 01, 02, 03, 04, 05, 06…
Figure 10: First frame visualizations of RGB 3D inference (Part 1). Each image shows the initial particle distribution and segmentation for …
Figure 11: First frame visualizations of RGB 3D inference (Part 2). Each image shows the initial particle distribution and segmentation for …
Figure 12: First frame visualizations of RGB 3D inference (Part 3). Each image shows the initial particle distribution and segmentation for …
Original abstract

Human visual perception offers valuable insights for understanding computational principles of motion-based scene interpretation. Humans robustly detect and segment moving entities that constitute independently moveable chunks of matter, whether observing sparse moving dots, textured surfaces, or naturalistic scenes. In contrast, existing computer vision systems lack a unified approach that works across these diverse settings. Inspired by principles of human perception, we propose a generative model that hierarchically groups low-level motion cues and high-level appearance features into particles (small Gaussians representing local matter), and groups particles into clusters capturing coherently and independently moveable physical entities. We develop a hardware-accelerated inference algorithm based on parallelized block Gibbs sampling to recover stable particle motion and groupings. Our model operates on different kinds of inputs (random dots, stylized textures, or naturalistic RGB video), enabling it to work across settings where biological vision succeeds but existing computer vision approaches do not. We validate this unified framework across three domains: on 2D random dot kinematograms, our approach captures human object perception including graded uncertainty across ambiguous conditions; on a Gestalt-inspired dataset of camouflaged rotating objects, our approach recovers correct 3D structure from motion and thereby accurate 2D object segmentation; and on naturalistic RGB videos, our model tracks the moving 3D matter that makes up deforming objects, enabling robust object-level scene understanding. This work thus establishes a general framework for motion-based perception grounded in principles of human vision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes GenMatter, a generative hierarchical model for motion-based object perception inspired by human vision. Low-level motion cues and high-level appearance features are grouped into particles (small Gaussians representing local matter), which are then clustered into entities that move coherently and independently. Inference uses a hardware-accelerated parallelized block Gibbs sampler. The model is claimed to handle inputs ranging from random dots to RGB video and is validated in three domains: capturing graded human-like uncertainty on 2D random dot kinematograms, recovering 3D structure from motion for segmentation on camouflaged rotating objects, and tracking 3D matter in deforming objects from naturalistic videos.

Significance. If the quantitative validations and sampler reliability hold, the work could provide a unified generative framework bridging perceptual psychology and computer vision for handling ambiguity, camouflage, and deformation across input densities. The particle-cluster representation and parallel sampling approach are conceptually promising strengths.

major comments (2)
  1. [Abstract] Abstract: the central validation claims (human-like graded uncertainty on dot kinematograms, correct 3D structure recovery on camouflaged objects, and 3D matter tracking on RGB video) are asserted without any quantitative metrics, error bars, baselines, ablation results, or dataset details. These claims are load-bearing for the paper's contribution yet cannot be assessed from the provided text.
  2. [Inference Algorithm] Inference section: the parallelized block Gibbs sampling is presented as recovering stable particle motion and groupings, but no convergence diagnostics, multiple-chain variance, initialization sensitivity tests, or hyperparameter ablations are reported. All three domain validations depend on the sampler outputs, so this is a load-bearing concern for result reliability.
minor comments (1)
  1. [Model] Model description: the exact generative process, including how particles are defined as Gaussians, the hierarchical clustering objective, and any free parameters, would benefit from explicit equations or pseudocode for reproducibility.
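
As a rough illustration of the kind of pseudocode the minor comment asks for, the sketch below samples a scene top-down in the direction the Figure 2 caption describes: clusters generate particles, which emit moving 3D points. Everything specific here is an assumption and not the paper's generative process: the function name `sample_scene`, every distribution and scale, and the simplification that all points in a cluster share one translational velocity (no rotation, deformation, or appearance features).

```python
import random


def sample_scene(n_clusters=2, particles_per_cluster=3, points_per_particle=5, seed=0):
    """Ancestral sampling through a hypothetical cluster -> particle -> point hierarchy."""
    rng = random.Random(seed)
    points = []
    for c in range(n_clusters):
        # Cluster level: a 3D centre and one shared velocity (assumed priors).
        centre = (rng.uniform(-5, 5), rng.uniform(-5, 5), rng.uniform(2, 8))
        velocity = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        for p in range(particles_per_cluster):
            # Particle level: a small Gaussian whose mean is scattered
            # around the cluster centre.
            particle_mean = tuple(m + rng.gauss(0, 1.0) for m in centre)
            for _ in range(points_per_particle):
                # Observation level: a 3D point tagged with a noisy velocity.
                pos = tuple(m + rng.gauss(0, 0.1) for m in particle_mean)
                vel = tuple(v + rng.gauss(0, 0.05) for v in velocity)
                points.append({"cluster": c, "particle": p, "pos": pos, "vel": vel})
    return points


scene = sample_scene()
```

Inference would run in the opposite direction: given only the `pos` and `vel` fields, recover the hidden `particle` and `cluster` labels.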

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications from the full paper and indicating revisions where the concerns identify opportunities to strengthen the presentation of results and methods.

  1. Referee: [Abstract] Abstract: the central validation claims (human-like graded uncertainty on dot kinematograms, correct 3D structure recovery on camouflaged objects, and 3D matter tracking on RGB video) are asserted without any quantitative metrics, error bars, baselines, ablation results, or dataset details. These claims are load-bearing for the paper's contribution yet cannot be assessed from the provided text.

    Authors: We agree that the abstract, as written, presents the validation domains at a high level without numerical results. The full manuscript contains the requested details in the Experiments section and supplementary material: for dot kinematograms we report Pearson correlations with human uncertainty ratings across ambiguity levels; for camouflaged objects we provide 3D structure recovery errors and 2D segmentation IoU scores against ground truth and baselines; for RGB videos we include 3D tracking errors and object segmentation metrics on deforming objects. Dataset sizes, splits, and preprocessing are specified in each subsection. To make these claims more immediately assessable, we have revised the abstract to incorporate concise quantitative highlights (e.g., correlation values and IoU ranges) while preserving brevity.

    revision: yes

  2. Referee: [Inference Algorithm] Inference section: the parallelized block Gibbs sampling is presented as recovering stable particle motion and groupings, but no convergence diagnostics, multiple-chain variance, initialization sensitivity tests, or hyperparameter ablations are reported. All three domain validations depend on the sampler outputs, so this is a load-bearing concern for result reliability.

    Authors: The original manuscript described the parallelized block Gibbs sampler and its hardware implementation but did not report explicit convergence diagnostics. We acknowledge this as a valid concern given the dependence of all results on the inferred particles and clusters. In the revised manuscript we have added a dedicated subsection on inference reliability that includes: (i) Gelman-Rubin potential scale reduction factors computed across four independent chains for representative runs in each domain, (ii) trace plots and autocorrelation diagnostics for particle positions and cluster assignments, (iii) sensitivity results to different random initializations, and (iv) hyperparameter ablations on temperature, block size, and prior strengths. These additions confirm stable recovery of motion and groupings and are now referenced from the main validation sections.

    revision: yes
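
For readers unfamiliar with the diagnostic named in the second response, the sketch below computes the classic Gelman-Rubin potential scale reduction factor for a scalar quantity tracked across several chains. It is a textbook version written for illustration, not code from the paper or the rebuttal; the function name `gelman_rubin` and the synthetic chains are assumptions. Cluster assignments are discrete, so in practice the statistic would be applied to a scalar summary such as a particle coordinate.

```python
import random


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for equal-length scalar chains."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [sum(c) / n for c in chains]
    grand_mean = sum(chain_means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in chain_means)
    within = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
                 for c, mu in zip(chains, chain_means)) / m
    # Pooled estimate of the posterior variance.
    var_plus = (n - 1) / n * within + between / n
    return (var_plus / within) ** 0.5


# Synthetic chains: four that sample the same distribution, and four where
# two chains are stuck around a different value.
rng = random.Random(0)
mixed = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(4)]
stuck = [[rng.gauss(mu, 1) for _ in range(500)] for mu in (0, 0, 5, 5)]
rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
```

Values close to 1 are consistent with the chains agreeing, while values well above 1, as in the `stuck` case, indicate that the chains have not mixed.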

Circularity Check

0 steps flagged

No circularity: derivation and validation are independent of inputs by construction


The paper introduces a generative hierarchical model that groups motion cues and appearance features into particles (Gaussians) and clusters, using parallelized block Gibbs sampling for inference, then validates it empirically on three distinct domains (random dot kinematograms, camouflaged objects, RGB video) against human perception benchmarks and tracking tasks. No equations, fitted parameters, or self-citations are presented in the provided text as load-bearing steps that reduce predictions or outputs to the model's own definitions or inputs. The framework is described as a new unified approach inspired by human vision principles, with the inference algorithm and validations standing as separate contributions rather than tautological renamings or self-referential fits. This is the expected non-finding for a proposal paper whose central claims rest on external empirical validation rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities with independent evidence are detailed. Particles and clusters are model constructs defined to represent matter but lack external falsifiable handles in the text.

pith-pipeline@v0.9.0 · 5584 in / 1325 out tokens · 91087 ms · 2026-05-08T12:48:22.259367+00:00 · methodology


Reference graph

Works this paper leans on

72 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    On seeing stuff: the perception of materi- als by humans and machines

    Edward H Adelson. On seeing stuff: the perception of materi- als by humans and machines. InHuman vision and electronic imaging VI, pages 1–12. spie, 2001. 3

  2. [2]

    Particle markov chain monte carlo methods.Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010

    Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods.Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010. 5

  3. [3]

    k-means++: The ad- vantages of careful seeding

    David Arthur and Sergei Vassilvitskii. k-means++: The ad- vantages of careful seeding. Technical report, Stanford, 2006. 5

  4. [4]

    Dreamweaver: Learning compositional world models from pixels

    Junyeob Baek, Yi-Fu Wu, Gautam Singh, and Sungjin Ahn. Dreamweaver: Learning compositional world models from pixels. InThe Thirteenth International Conference on Learn- ing Representations. 3

  5. [5]

    Object discovery from motion- guided tokens

    Zhipeng Bao, Pavel Tokmakov, Yu-Xiong Wang, Adrien Gaidon, and Martial Hebert. Object discovery from motion- guided tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22972– 22981, 2023. 3

  6. [6]

    Simulation as an engine of physical scene understand- ing.Proceedings of the National Academy of Sciences, 110 (45):18327–18332, 2013

    Peter W Battaglia, Jessica B Hamrick, and Joshua B Tenen- baum. Simulation as an engine of physical scene understand- ing.Proceedings of the National Academy of Sciences, 110 (45):18327–18332, 2013. 2

  7. [7]

    GenJAX: Probabilistic Programming with Gen, built on top of JAX

    McCoy Becker, Mathieu Huot, Sam Ritchie, and Colin Smith. GenJAX: Probabilistic Programming with Gen, built on top of JAX. 4

  8. [8]

    Probabilistic programming with programmable variational inference.Proceedings of the ACM on Program- ming Languages, 8(PLDI):2123–2147, 2024

    McCoy R Becker, Alexander K Lew, Xiaoyan Wang, Matin Ghavami, Mathieu Huot, Martin C Rinard, and Vikash K Mansinghka. Probabilistic programming with programmable variational inference.Proceedings of the ACM on Program- ming Languages, 8(PLDI):2123–2147, 2024. 2

  9. [9]

    Probabilistic programming with programmable variational inference.Proceedings of the ACM on Program- ming Languages, 8(PLDI):2123–2147, 2024

    McCoy R Becker, Alexander K Lew, Xiaoyan Wang, Matin Ghavami, Mathieu Huot, Martin C Rinard, and Vikash K Mansinghka. Probabilistic programming with programmable variational inference.Proceedings of the ACM on Program- ming Languages, 8(PLDI):2123–2147, 2024. 4

  10. [10]

    Hierarchical structure is employed by humans during visual motion perception.Proceedings of the National Academy of Sciences, 117(39):24581–24589, 2020

    Johannes Bill, Hrag Pailian, Samuel J Gershman, and Jan Drugowitsch. Hierarchical structure is employed by humans during visual motion perception.Proceedings of the National Academy of Sciences, 117(39):24581–24589, 2020. 2

  11. [11]

    Pymunk, 2024

    Victor Blomqvist. Pymunk, 2024. An easy-to-use pythonic rigid body 2D physics library. 5

  12. [12]

    JAX: composable transformations of Python+NumPy programs, 2018

    James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs, 2018. 4

  13. [13]

    Mad- bayes: Map-based asymptotic derivations from bayes

    Tamara Broderick, Brian Kulis, and Michael Jordan. Mad- bayes: Map-based asymptotic derivations from bayes. In International Conference on Machine Learning, pages 226–

  14. [14]

    Video depth anything: Consistent depth estimation for super-long videos,

    Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video depth anything: Consistent depth estimation for super-long videos.arXiv preprint arXiv:2501.12375, 2025. 6, 7

  15. [15]

    Seurat: From moving points to depth.arXiv preprint arXiv:2504.14687, 2025

    Seokju Cho, Jiahui Huang, Seungryong Kim, and Joon-Young Lee. Seurat: From moving points to depth.arXiv preprint arXiv:2504.14687, 2025. 3

  16. [16]

    To- wards segmenting anything that moves

    Achal Dave, Pavel Tokmakov, and Deva Ramanan. To- wards segmenting anything that moves. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019. 3

  17. [17]

    Tapir: Tracking any point with per-frame initialization and temporal refinement

    Carl Doersch, Yi Yang, Mel Vecerik, Dilara Gokay, Ankush Gupta, Yusuf Aytar, Joao Carreira, and Andrew Zisserman. Tapir: Tracking any point with per-frame initialization and temporal refinement. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, pages 10061– 10072, 2023. 2, 7

  18. [18]

    On sequential monte carlo sampling methods for bayesian filter- ing.Statistics and computing, 10:197–208, 2000

    Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential monte carlo sampling methods for bayesian filter- ing.Statistics and computing, 10:197–208, 2000. 5

  19. [19]

    Visual shape perception as bayesian inference of 3d object-centered shape representa- tions.Psychological review, 124(6):740, 2017

    Goker Erdogan and Robert A Jacobs. Visual shape perception as bayesian inference of 3d object-centered shape representa- tions.Psychological review, 124(6):740, 2017. 2

  20. [20]

    Attend, infer, repeat: Fast scene understanding with generative models

    SM Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al. Attend, infer, repeat: Fast scene understanding with generative models. Advances in neural information processing systems, 29, 2016. 2

  21. [21]

    Benchmarking human mid-level scene understanding.Journal of Vision, 23 (9):5798–5798, 2023

    Yoni Friedman, Thomas O’Connell, Daniel Bear, Jiajun Wu, Judy Fan, Josh Tenenbaum, and Dan Yamins. Benchmarking human mid-level scene understanding.Journal of Vision, 23 (9):5798–5798, 2023. 2, 6

  22. [22]

    Chapman and Hall/CRC,

    Andrew Gelman, John B Carlin, Hal S Stern, and Donald B Rubin.Bayesian data analysis. Chapman and Hall/CRC,

  23. [23]

    CRC Press, 2013

    Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin.Bayesian Data Analysis. CRC Press, 2013. 4

  24. [24]

    3dp3: 3d scene perception via probabilistic programming.Advances in Neural Information Processing Systems, 34:9600–9612,

    Nishad Gothoskar, Marco Cusumano-Towner, Ben Zinberg, Matin Ghavamizadeh, Falk Pollok, Austin Garrett, Josh Tenenbaum, Dan Gutfreund, and Vikash Mansinghka. 3dp3: 3d scene perception via probabilistic programming.Advances in Neural Information Processing Systems, 34:9600–9612,

  25. [25]

    Bayes3d: fast learning and inference in structured generative models of 3d objects and scenes

    Nishad Gothoskar, Matin Ghavami, Eric Li, Aidan Curtis, Michael Noseworthy, Karen Chung, Brian Patton, William T Freeman, Joshua B Tenenbaum, Mirko Klukas, and Vikash K Mansinghka. Bayes3d: fast learning and inference in struc- 9 tured generative models of 3d objects and scenes.arXiv preprint arXiv:2312.08715, 2023. 2

  26. [26]

    Seeing faces in things: A model and dataset for pareidolia

    Mark Hamilton, Simon Stent, Vasha DuTell, Anne Harrington, Jennifer Corbett, Ruth Rosenholtz, and William T Freeman. Seeing faces in things: A model and dataset for pareidolia. In European Conference on Computer Vision, pages 377–395. Springer, 2024. 2

  27. [27]

    Particle video revisited: Tracking through occlusions using point trajectories

    Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. InEuropean Conference on Computer Vision, pages 59–75. Springer, 2022. 2

  28. [28]

    Segment any motion in videos

    Nan Huang, Wenzhao Zheng, Chenfeng Xu, Kurt Keutzer, Shanghang Zhang, Angjoo Kanazawa, and Qianqian Wang. Segment any motion in videos. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 3406– 3416, 2025. 6

  29. [29]

    A split-merge markov chain monte carlo procedure for the dirichlet process mixture model

    Sonia Jain and Radford M Neal. A split-merge markov chain monte carlo procedure for the dirichlet process mixture model. Journal of computational and Graphical Statistics, 13(1):158– 182, 2004. 4

  30. [30]

    A solution for the best rotation to relate two sets of vectors.Foundations of Crystallography, 32(5): 922–923, 1976

    Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors.Foundations of Crystallography, 32(5): 922–923, 1976. 5

  31. [31]

    Cotracker3: Simpler and better point tracking by pseudo-labelling real videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo- labelling real videos.arXiv preprint arXiv:2410.11831, 2024. 2, 7

  32. [32]

    Co- tracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. InEuropean Conference on Computer Vision, pages 18–35. Springer, 2024. 2

  33. [33]

    Tapvid-3d: A benchmark for tracking any point in 3d

    Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, Jo˜ao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Do- ersch. Tapvid-3d: A benchmark for tracking any point in 3d. arXiv preprint arXiv:2407.05921, 2024. 2

  34. [34]

    Brian Kulis and Michael I. Jordan. Revisiting k-means: new algorithms via bayesian nonparametrics. InProceedings of the 29th International Coference on International Conference on Machine Learning, page 1131–1138, Madison, WI, USA,

  35. [35]

    Picture: A probabilistic programming language for scene perception

    Tejas D Kulkarni, Pushmeet Kohli, Joshua B Tenenbaum, and Vikash Mansinghka. Picture: A probabilistic programming language for scene perception. InProceedings of the ieee conference on computer vision and pattern recognition, pages 4390–4399, 2015. 2

  36. [36]

    Smcp3: Sequential monte carlo with probabilistic program proposals

    Alexander K Lew, George Matheos, Tan Zhi-Xuan, Matin Ghavamizadeh, Nishad Gothoskar, Stuart Russell, and Vikash K Mansinghka. Smcp3: Sequential monte carlo with probabilistic program proposals. InInternational confer- ence on artificial intelligence and statistics, pages 7061–7088. PMLR, 2023. 2

  37. [37]

    Dessie: Disentanglement for artic- ulated 3d horse shape and pose estimation from images

    Ci Li, Yi Yang, Zehang Weng, Elin Hernlund, Silvia Zuffi, and Hedvig Kjellstr ¨om. Dessie: Disentanglement for artic- ulated 3d horse shape and pose estimation from images. In Proceedings of the Asian Conference on Computer Vision, pages 764–783, 2024. 3

  38. [38]

    Learning the 3d fauna of the web

    Zizhang Li, Dor Litvak, Ruining Li, Yunzhi Zhang, Tomas Jakab, Christian Rupprecht, Shangzhe Wu, Andrea Vedaldi, and Jiajun Wu. Learning the 3d fauna of the web. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9752–9762, 2024. 3

  39. [39]

    Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. In2024 International Conference on 3D Vision (3DV), pages 800–809. IEEE, 2024. 3

  40. [40]

    Approximate bayesian image interpretation using generative probabilistic graphics programs.Advances in neural information processing systems, 26, 2013

    Vikash K Mansinghka, Tejas D Kulkarni, Yura N Perov, and Josh Tenenbaum. Approximate bayesian image interpretation using generative probabilistic graphics programs.Advances in neural information processing systems, 26, 2013. 2

  41. [41]

    Conjugate bayesian analysis of the gaussian distribution.def, 1(2σ2):16, 2007

    Kevin P Murphy. Conjugate bayesian analysis of the gaussian distribution.def, 1(2σ2):16, 2007. 4

  42. [42]

    Motion perception: From detection to interpretation.Annual review of vision science, 4(1):501–523,

    Shin’ya Nishida, Takahiro Kawabe, Masataka Sawayama, and Taiki Fukiage. Motion perception: From detection to interpretation.Annual review of vision science, 4(1):501–523,

  43. [43]

    Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects

    Atsuhiro Noguchi, Umar Iqbal, Jonathan Tremblay, Tatsuya Harada, and Orazio Gallo. Watch it move: Unsupervised discovery of 3d joints for re-posing of articulated objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3677–3687, 2022. 3

  44. [44]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 7

  45. [45]

    Prolific participant recruitment platform

    Prolific. Prolific participant recruitment platform. https: //www.prolific.com, 2024. Version used: May 2024. London, UK. 5

  46. [46]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment any- thing in images and videos.arXiv preprint arXiv:2408.00714,

  47. [47]

    Humor: 3d human motion model for robust pose estimation

    Davis Rempe, Tolga Birdal, Aaron Hertzmann, Jimei Yang, Srinath Sridhar, and Leonidas J Guibas. Humor: 3d human motion model for robust pose estimation. InProceedings of the IEEE/CVF international conference on computer vision, pages 11488–11499, 2021. 2

  48. [48]

    Disentangling object category representations driven by dynamic and static visual input.Journal of Neuro- science, 43(4):621–634, 2023

    Sophia Robert, Leslie G Ungerleider, and Maryam Vaziri- Pashkam. Disentangling object category representations driven by dynamic and static visual input.Journal of Neuro- science, 43(4):621–634, 2023. 2, 5

  49. [49]

    Particle video: Long-range motion estimation using point trajectories.International journal of computer vision, 80:72–91, 2008

    Peter Sand and Seth Teller. Particle video: Long-range motion estimation using point trajectories.International journal of computer vision, 80:72–91, 2008. 2

  50. [50]

    Modeling ex- pectation violation in intuitive physics with coarse probabilis- tic object representations.Advances in neural information processing systems, 32, 2019

    Kevin Smith, Lingjie Mei, Shunyu Yao, Jiajun Wu, Elizabeth Spelke, Josh Tenenbaum, and Tomer Ullman. Modeling ex- pectation violation in intuitive physics with coarse probabilis- tic object representations.Advances in neural information processing systems, 32, 2019. 2

  51. [51]

    Principles of object perception

    Elizabeth S Spelke. Principles of object perception. Cognitive Science, 14(1):29–56, 1990.

  52. [52]

    Machine learning modelling for multi-order human visual motion processing

    Zitang Sun, Yen-Ju Chen, Yung-Hao Yang, Yuan Li, and Shin’ya Nishida. Machine learning modelling for multi-order human visual motion processing. Nature Machine Intelligence, 7(7):1037–1052, 2025.

  53. [53]

    Matthias Tangemann, Matthias Kümmerer, and Matthias Bethge. Object segmentation from common fate: Motion energy processing enables human-like zero-shot generalization to random dot stimuli. Advances in Neural Information Processing Systems, 37:137135–137160, 2024.

  54. [54]

    RAFT: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.

  55. [55]

    Diffusion with forward models: Solving stochastic inverse problems without direct supervision

    Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Josh Tenenbaum, Frédo Durand, Bill Freeman, and Vincent Sitzmann. Diffusion with forward models: Solving stochastic inverse problems without direct supervision. Advances in Neural Information Processing Systems, 36:12349–12362, 2023.

  56. [56]

    Mind games: Game engines as an architecture for intuitive physics

    Tomer D Ullman, Elizabeth Spelke, Peter Battaglia, and Joshua B Tenenbaum. Mind games: Game engines as an architecture for intuitive physics. Trends in Cognitive Sciences, 21(9):649–665, 2017.

  57. [57]

    Discovering and using Spelke segments

    Rahul Venkatesh, Klemen Kotar, Lilian Naing Chen, Seungwoo Kim, Luca Thomas Wheeler, Jared Watrous, Ashley Xu, Gia Ancone, Wanhee Lee, Honglin Chen, et al. Discovering and using Spelke segments. arXiv preprint arXiv:2507.16038.

  58. [58]

    Template-free articulated Gaussian splatting for real-time reposable dynamic view synthesis

    Diwen Wan, Yuxiang Wang, Ruijie Lu, and Gang Zeng. Template-free articulated Gaussian splatting for real-time reposable dynamic view synthesis. arXiv preprint arXiv:2412.05570, 2024.

  59. [59]

    Probabilistic simulation supports generalizable intuitive physics

    Haoliang Wang, Khaled Jedoui, Rahul Venkatesh, Felix Jedidja Binder, Josh Tenenbaum, Judith E Fan, Daniel Yamins, and Kevin A Smith. Probabilistic simulation supports generalizable intuitive physics. In Proceedings of the Annual Meeting of the Cognitive Science Society, 2024.

  60. [60]

    Tracking everything everywhere all at once

    Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, and Noah Snavely. Tracking everything everywhere all at once. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19795–19806, 2023.

  61. [61]

    Galileo: Perceiving physical object properties by integrating a physics engine with deep learning

    Jiajun Wu, Ilker Yildirim, Joseph J Lim, Bill Freeman, and Josh Tenenbaum. Galileo: Perceiving physical object properties by integrating a physics engine with deep learning. Advances in Neural Information Processing Systems, 28, 2015.

  62. [62]

    SpatialTracker: Tracking any 2D pixels in 3D space

    Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. SpatialTracker: Tracking any 2D pixels in 3D space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20406–20417, 2024.

  63. [63]

    Moving object segmentation: All you need is sam (and flow)

    Junyu Xie, Charig Yang, Weidi Xie, and Andrew Zisserman. Moving object segmentation: All you need is SAM (and flow). In Proceedings of the Asian Conference on Computer Vision, pages 162–178, 2024.

  64. [64]

    Efficient inverse graphics in biological face processing

    Ilker Yildirim, Mario Belledonne, Winrich Freiwald, and Josh Tenenbaum. Efficient inverse graphics in biological face processing. Science Advances, 6(10):eaax5979, 2020.

  65. [65]

    Perception of 3D shape integrates intuitive physics and analysis-by-synthesis

    Ilker Yildirim, Max H Siegel, Amir A Soltani, Shraman Ray Chaudhuri, and Joshua B Tenenbaum. Perception of 3D shape integrates intuitive physics and analysis-by-synthesis. Nature Human Behaviour, 8(2):320–335, 2024.

  66. [66]

    3D neural embedding likelihood: Probabilistic inverse graphics for robust 6D pose estimation

    Guangyao Zhou, Nishad Gothoskar, Lirui Wang, Joshua B Tenenbaum, Dan Gutfreund, Miguel Lázaro-Gredilla, Dileep George, and Vikash K Mansinghka. 3D neural embedding likelihood: Probabilistic inverse graphics for robust 6D pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21625–21636, 2023.

    A. Overview o...

  67. [67]

    Cluster-level variables: $\{\mu^H_k, \Sigma^H_k, R_k, t_k, \pi^H_k\}_{k=1}^{K}$

  68. [68]

    Particle-level variables: $\{\mu^B_\ell, \Sigma^B_\ell, v_\ell, \Sigma^V_\ell, z^H_\ell, \pi^B_\ell\}_{\ell=1}^{L}$

    Algorithm 2: Clustering Algorithm for GenMatter via Small-Variance Asymptotics

    1: Input:
    2:   Number of clusters and particles $K, L$
    3:   Data point positions $\{x_n, v_n\}_{n=1}^{N}$
    4: Initialize: assign data points to particles $z^B_n$, particles to clusters $z^H_\ell$
    5: repeat
    6:   for each particle $\ell = 1, \dots, L$ do
    7:     Compute particle mean: $\mu^B_\ell \leftarrow \frac{1}{|B_\ell|} \sum_{n : z^B_n = \ell} x_n$
    8:   end for
    9:   for each cluster $k = 1,$ ...

  69. [69]

    Data point-level variables: $\{z^B_n\}_{n=1}^{N}$

    For each of these variables, we independently describe each of the Gibbs updates.

    C.1.1. Data point-to-Particle Assignments ($z^B_{1:N}$)

    We update each data point's particle assignment $z^B_n$ for $n = 1, \dots, N$, using the conditional:

    $$p(z^B_n = \ell \mid x_n, v_n, \text{rest}) \propto \pi^B(\ell) \cdot \mathcal{N}(x_n \mid \mu^B_\ell, \Sigma^B_\ell) \cdot \mathcal{N}(v_n \mid v_\ell, \Sigma^V_\ell)$$

    The prior i...
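The data point-to-particle conditional in C.1.1 can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes 2D positions and velocities, handles a single data point serially rather than in a parallelized block, and all names (`log_gauss2d`, `assignment_probs`, `sample_assignment`, the dictionary keys) are hypothetical. Probabilities are computed in log space and normalized with a max-shift for numerical stability.

```python
import math
import random


def log_gauss2d(x, mean, cov):
    """Log density of a 2D Gaussian with a full 2x2 covariance (nested lists)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)^T cov^{-1} (x - mean) via the closed-form 2x2 inverse.
    quad = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (quad + math.log(det)) - math.log(2.0 * math.pi)


def assignment_probs(x_n, v_n, particles):
    """Normalized p(z_n = l | x_n, v_n, rest) over all particles l."""
    logits = [
        math.log(p["pi"])                              # prior weight pi^B(l)
        + log_gauss2d(x_n, p["mu"], p["sigma_pos"])    # N(x_n | mu_l, Sigma^B_l)
        + log_gauss2d(v_n, p["vel"], p["sigma_vel"])   # N(v_n | v_l, Sigma^V_l)
        for p in particles
    ]
    shift = max(logits)  # subtract the max before exponentiating
    weights = [math.exp(value - shift) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_assignment(x_n, v_n, particles, rng):
    """Draw one particle index from the conditional (a single Gibbs step)."""
    probs = assignment_probs(x_n, v_n, particles)
    return rng.choices(range(len(particles)), weights=probs, k=1)[0]


# Illustrative values only (assumed, not taken from the paper).
particles = [
    {"pi": 0.5, "mu": (0.0, 0.0), "sigma_pos": [[1.0, 0.0], [0.0, 1.0]],
     "vel": (1.0, 0.0), "sigma_vel": [[0.1, 0.0], [0.0, 0.1]]},
    {"pi": 0.5, "mu": (5.0, 5.0), "sigma_pos": [[1.0, 0.0], [0.0, 1.0]],
     "vel": (-1.0, 0.0), "sigma_vel": [[0.1, 0.0], [0.0, 0.1]]},
]
probs = assignment_probs((0.2, -0.1), (0.9, 0.0), particles)
z = sample_assignment((0.2, -0.1), (0.9, 0.0), particles, random.Random(0))
```

In this toy setting the data point sits close to the first particle in both position and velocity, so nearly all of the probability mass falls on index 0. A data point whose position matches one particle but whose velocity matches another would receive a more graded distribution, since the two Gaussian terms multiply.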

  70. [70]

    Fluent in English as the study is conducted in English

  71. [71]

    Explicitly declared to not have color-blindness, as this study requires each participant to distinguish the red and green probes clearly from the rest of the points in the stimuli

  72. [72]

    Has normal or corrected-to-normal vision, as this study requires clear vision of the stimuli.

    The instructions as viewed on Prolific for this study can be seen in Figure 7. We used Google Forms to conduct the data collection. The instructions were repeated in the Google Form, and each participant saw two familiarization trials with feedback on the correct answer. Fi...