Understanding Human Actions through the Lens of Executable Models

Manisha Dubey; N. Siddharth; Rimvydas Rubavicius; Subramanian Ramamoorthy

arxiv: 2604.18064 · v1 · submitted 2026-04-20 · 💻 cs.AI

Rimvydas Rubavicius, Manisha Dubey, N. Siddharth, Subramanian Ramamoorthy

Pith reviewed 2026-05-10 04:51 UTC · model grok-4.3

classification 💻 cs.AI

keywords: human action understanding, executable models, domain-specific language, neuro-symbolic modeling, action segmentation, anomaly detection, motion capture, zero-shot inference

The pith

Representing human actions as executable motion programs improves data efficiency and reveals intuitive relationships in action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that human actions can be understood more effectively by expressing them as structured, underspecified programs rather than as flat categories or end-to-end learned patterns. This matters for human-centred systems because recognising what someone is doing also requires judging how they are doing it and how one action relates to others. The proposed approach turns these programs into reward signals that support zero-shot policy inference, then composes the resulting policies into a single neuro-symbolic model that respects the original program structure. Experiments on motion-capture recordings demonstrate that the executable models need less data to segment action sequences and to flag anomalies than conventional task-specific networks. A reader who accepts the premise would therefore expect future action-understanding systems to generalise more readily across people, environments, and tasks.

Core claim

The authors introduce EXACT, a domain-specific language that encodes human motions as underspecified motion programs. These programs are interpreted as reward-generating functions that enable zero-shot policy inference via forward-backwards representations. Because the programs are compositional, individual policies can be combined into an executable neuro-symbolic model whose structure mirrors the program. When this pipeline is applied to motion-capture data, it performs human action segmentation and anomaly detection with greater data efficiency and produces more intuitive relationships among actions than monolithic, task-specific baselines.
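To make the idea of a program acting as a reward-generating function concrete, here is a minimal sketch. It is not EXACT syntax, which this summary does not show: the primitives `near` and `both`, the pose features, and the numeric targets are all hypothetical, chosen only to illustrate underspecification (unmentioned features are left free) and composition (larger rewards built from smaller ones).

```python
from typing import Callable, Dict

# A pose maps feature names to values. The feature names are hypothetical.
Pose = Dict[str, float]
Reward = Callable[[Pose], float]


def near(feature: str, target: float, tol: float) -> Reward:
    """Primitive: reward 1.0 at the target value, falling linearly to 0.0 at distance tol."""
    def reward(pose: Pose) -> float:
        return max(0.0, 1.0 - abs(pose[feature] - target) / tol)
    return reward


def both(*parts: Reward) -> Reward:
    """Conjunction: all constraints must hold, so the weakest one sets the reward."""
    def reward(pose: Pose) -> float:
        return min(part(pose) for part in parts)
    return reward


# Underspecified: the program constrains two features and leaves the rest free.
raise_hand = near("right_hand_height", target=1.75, tol=0.5)
stand = near("pelvis_height", target=1.0, tol=0.25)
raise_hand_standing = both(raise_hand, stand)

pose = {"right_hand_height": 1.75, "pelvis_height": 0.875, "left_knee_angle": 0.3}
print(raise_hand_standing(pose))  # 0.5: the hand is on target, the pelvis is half a tolerance off
```

Taking the minimum for a conjunction is one possible design choice; a product or weighted sum would behave differently and may be closer to what the paper actually does.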

What carries the argument

EXACT, a domain-specific language whose underspecified motion programs act as reward-generating functions for zero-shot policy inference and whose compositional structure directly supports neuro-symbolic model construction.

If this is right

Action segmentation can exploit program structure to identify boundaries and sub-parts without additional supervision.
Anomaly detection becomes possible by comparing observed motion against the reward landscape defined by the program.
New actions can be handled by composing existing program elements rather than retraining an entire model.
Overall learning requires fewer examples because the model reuses modular policies across tasks.
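The first two consequences can be sketched in a few lines, assuming per-frame rewards in [0, 1] have already been computed for each candidate program. The summary does not describe the paper's actual segmentation or anomaly-scoring procedure, so the functions `segment` and `anomaly_score`, the action labels, and the reward values below are illustrative assumptions only.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def segment(rewards: Dict[str, List[float]]) -> List[Tuple[str, int, int]]:
    """Label each frame with the best-scoring program, then merge equal neighbours.

    Returns (label, start, end) triples with end exclusive.
    """
    labels = list(rewards)
    n_frames = len(rewards[labels[0]])
    per_frame = [max(labels, key=lambda name: rewards[name][t]) for t in range(n_frames)]
    segments, start = [], 0
    for label, run in groupby(per_frame):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


def anomaly_score(target_rewards: List[float]) -> float:
    """One minus the mean reward under the target program; higher means more anomalous."""
    return 1.0 - sum(target_rewards) / len(target_rewards)


# Made-up per-frame rewards for two hypothetical programs.
rewards = {"reach": [0.75, 1.0, 0.25, 0.0], "grasp": [0.0, 0.25, 0.75, 1.0]}
print(segment(rewards))                 # [('reach', 0, 2), ('grasp', 2, 4)]
print(anomaly_score(rewards["reach"]))  # 0.5
```

A per-frame argmax ignores temporal smoothness; a practical system would likely add a smoothing or ordering constraint derived from the program structure.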

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Robots could acquire new skills by inferring the underlying motion program from a single human demonstration instead of collecting large trajectory datasets.
The same program representation might extend to language-conditioned action planning, where instructions are parsed into executable reward functions.
Real-time monitoring systems could flag unsafe or inefficient executions by running the inferred program forward and measuring reward deviation.

Load-bearing premise

Human motions can be represented as underspecified motion programs that function as reward-generating functions for zero-shot policy inference and whose compositional structure supports effective neuro-symbolic modeling.

What would settle it

If, on a held-out collection of motion-capture sequences, the EXACT-based executable models show no gain in data efficiency for segmentation or anomaly detection relative to standard neural-network baselines, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2604.18064 by Manisha Dubey, N. Siddharth, Rimvydas Rubavicius, Subramanian Ramamoorthy.

**Figure 2.** Architecture of parser. We follow an encoder-decoder setup where we …

**Figure 3.** AUROC matrices for the 5 actions from each dataset with target action …

Original abstract

Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action's execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backwards representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EXACT gives a workable DSL for turning motions into executable reward programs, with solid data-efficiency gains on segmentation and anomaly detection versus monolithic baselines.

The main thing to know is that this paper introduces EXACT, a domain-specific language that represents human motions as underspecified programs. Those programs are interpreted as reward functions for zero-shot policy inference via forward-backwards representations, then composed into neuro-symbolic models that exploit the program structure. On motion-capture data the approach improves data efficiency and surfaces intuitive action relationships compared with task-specific monolithic models for segmentation and anomaly detection. The pipeline is presented with internal consistency and the efficiency claims rest on explicit baseline comparisons rather than vague assertions.

What stands out is the executable framing: actions are not just classified but can be run as policies, and the compositional layer lets smaller programs build larger ones without full retraining. The evaluation stays grounded in concrete tasks and data, which makes the contribution easier to assess.

The soft spots are limited. The core assumption that motions can be usefully captured by underspecified programs holds up in the reported experiments, but real-world sensor noise or greater variability in execution could expose gaps in the reward generation step. The work is also tied to motion-capture data, so broader generalization to other modalities is not yet shown. No load-bearing circularity or internal contradiction appears in the central argument.

This is for people working on neuro-symbolic methods or robotics who need interpretable action models that support composition. A reader already interested in executable representations or data-efficient policy learning will find usable ideas and results here. The thinking is clear and the evidence is proportionate to the claims, so the paper deserves a serious referee. I would send it out for peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces the EXACT domain-specific language for representing human motions as underspecified motion programs, which are interpreted as reward-generating functions to enable zero-shot policy inference via forward-backwards representations. It leverages the compositional structure of these programs to construct executable neuro-symbolic models and evaluates the approach on motion-capture data for human action segmentation and anomaly detection, claiming improved data efficiency and more intuitive capture of action relationships relative to monolithic, task-specific baselines.

Significance. If the empirical results hold, the work provides a structured, interpretable alternative to black-box action recognition by combining symbolic program structure with neural policy inference. This could improve data efficiency in human-centered AI systems and support applications requiring assessment of action quality or compositional generalization, such as robotics and rehabilitation monitoring. The explicit use of executable models as reward functions and the neuro-symbolic composition are notable strengths.

major comments (1)

[Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.

minor comments (2)

[Introduction] The abstract and introduction use the term 'underspecified motion programs' without a concise definition or example until later sections; adding a brief illustrative program in the introduction would improve accessibility.
[Methods] Notation for forward-backwards representations and reward functions could be standardized with a single table or appendix to avoid scattered definitions across sections.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and constructive suggestion for minor revision. We address the single major comment below.

Referee: [Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.

Authors: We agree that explicit statistical analysis strengthens the empirical claims. Our original experiments were run with 5 independent random seeds per condition across the motion-capture datasets, and the reported improvements were consistent; however, we did not include formal significance tests or effect-size reporting in the submitted manuscript. In the revised version we will add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) with p-values, together with Cohen’s d effect sizes, for the key data-efficiency metrics. These results will be inserted into the results section and the corresponding tables/figures, confirming that the observed gains exceed what would be expected from variability alone.

Revision: yes
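The kind of paired analysis promised in the rebuttal could look roughly like the sketch below. It uses only the Python standard library, so the p-value comes from an exact sign-flip permutation test, which is swapped in here in place of the t-test or Wilcoxon test named above. The scores are made-up placeholders, not results from the paper, and the function name `paired_summary` is hypothetical.

```python
import math
from itertools import product
from statistics import mean, stdev
from typing import List, Tuple


def paired_summary(ours: List[float], baseline: List[float]) -> Tuple[float, float, float]:
    """Return (paired t statistic, Cohen's d_z, exact two-sided sign-flip p-value)."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    d_z = mean(diffs) / stdev(diffs)   # effect size computed on the paired differences
    t_stat = d_z * math.sqrt(n)        # paired t statistic, n - 1 degrees of freedom
    observed = abs(sum(diffs))
    # Under the null hypothesis each difference is equally likely to be positive or negative.
    flips = list(product((1, -1), repeat=n))
    extreme = sum(
        1 for signs in flips
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed
    )
    return t_stat, d_z, extreme / len(flips)


# Placeholder scores for five seeds (illustrative only).
ours = [0.5, 0.75, 0.5, 1.0, 0.75]
baseline = [0.25, 0.25, 0.25, 0.5, 0.5]
t_stat, d_z, p_value = paired_summary(ours, baseline)
print(round(t_stat, 2), round(d_z, 2), p_value)
```

With only five paired observations, an exact sign-flip or Wilcoxon signed-rank test cannot produce a two-sided p-value below 2/32 = 0.0625, so pooling across datasets or adding seeds may be needed before a conventional 0.05 threshold is even reachable.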

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces the novel EXACT domain-specific language to represent motions as underspecified programs interpreted as reward functions, then applies forward-backwards zero-shot inference and neuro-symbolic composition for segmentation and anomaly detection tasks. These steps rely on explicit new definitions and empirical comparisons to monolithic baselines rather than reducing any claimed result to a fitted parameter, self-definition, or self-citation chain by construction. Existing policy-inference concepts are referenced as background but do not bear the load of the central claims, which remain independently supported by the reported data-efficiency gains and compositional modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about representing motions as programs and the feasibility of zero-shot inference; no free parameters are mentioned and the only invented element is the new language itself.

axioms (2)

Domain assumption: Human motions can be represented as underspecified motion programs interpreted as reward-generating functions.
This is the foundational representation step for policy inference.
Domain assumption: Forward-backwards representations enable zero-shot policy inference from such programs.
This enables the executable aspect without task-specific training.
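For readers unfamiliar with the second assumption, the sketch below shows the zero-shot step as it is usually presented in the forward-backward literature: a task embedding z is estimated as the average of r(s)·B(s) over sampled states, and the policy acts greedily on F(s, a, z)·z. Whether EXACT uses exactly this estimator is not stated in this summary. The 4-state line world, the one-hot `backward` map, and the one-step `forward` map are toy stand-ins, not learned networks.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def infer_task_embedding(states: Sequence[int],
                         reward: Callable[[int], float],
                         backward: Callable[[int], Vector]) -> Vector:
    """Estimate z = E_s[r(s) * B(s)] by averaging over sampled states."""
    dim = len(backward(states[0]))
    z = [0.0] * dim
    for s in states:
        r, b = reward(s), backward(s)
        for i in range(dim):
            z[i] += r * b[i] / len(states)
    return z


def greedy_action(state: int, actions: Sequence[int], z: Vector,
                  forward: Callable[[int, int, Vector], Vector]) -> int:
    """Pick argmax_a F(state, a, z) . z, the FB estimate of the action value."""
    def q(action: int) -> float:
        return sum(f * zi for f, zi in zip(forward(state, action, z), z))
    return max(actions, key=q)


# Toy stand-ins on a 4-state line world (hypothetical, not learned networks).
N = 4


def one_hot(i: int) -> Vector:
    return [1.0 if j == i else 0.0 for j in range(N)]


backward = one_hot  # plays the role of B(s)


def forward(s: int, a: int, z: Vector) -> Vector:
    """Plays the role of F(s, a, z); here it only looks one step ahead."""
    return one_hot(min(N - 1, max(0, s + a)))


def goal_reward(s: int) -> float:
    """Stands in for the reward generated by a motion program."""
    return 1.0 if s == 3 else 0.0


z = infer_task_embedding(range(N), goal_reward, backward)
print(z, greedy_action(2, [-1, 1], z, forward))  # [0.0, 0.0, 0.0, 0.25] 1
```

The point of the construction is that no policy training happens at task time: changing the reward only changes z, and the same F and B are reused.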

invented entities (1)

EXACT domain-specific language (no independent evidence)
Purpose: To represent human motions as underspecified executable motion programs.
Newly introduced construct in the paper.

[1] [1]

Cognition113(3), 329–349 (2009), reinforcement learning and higher cognition

Baker, C.L., Saxe, R., Tenenbaum, J.B.: Action understanding as inverse planning. Cognition113(3), 329–349 (2009), reinforcement learning and higher cognition

work page 2009

[2] [2]

arXiv preprint arXiv:2101.07123 , year=

Blier, L., Tallec, C., Ollivier, Y.: Learning successor states and goal-dependent values: A mathematical viewpoint. CoRRabs/2101.07123(2021)

work page arXiv 2021

[3] [3]

Bonnetto, H

Bonnetto, A., Qi, H., Leong, F., Tashkovska, M., Rad, M., Shokur, S., Hummel, F., Micera, S., Pollefeys, M., Mathis, A.: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models. CoRR abs/2506.01608(2025)

work page arXiv 2025

[4] [4]

Bowers, M., Olausson, T.X., Wong, L., Grand, G., Tenenbaum, J.B., Ellis, K., Solar-Lezama, A.: Top-down synthesis for library learning. Proc. ACM Program. Lang.7, 1182–1213 (2023)

work page 2023

[5] [5]

Na- ture Reviews Cancer20, 343 – 354 (2020)

Clarke, M.A., Fisher, J.: Executable cancer models: successes and challenges. Na- ture Reviews Cancer20, 343 – 354 (2020)

work page 2020

[6] [6]

Cambridge University Press (1967)

Craik, K.J.W.: The Nature of Explanation. Cambridge University Press (1967)

work page 1967

[7] [7]

Davidson, G., Todd, G., Togelius, J., Gureckis, T.M., Lake, B.M.: Goals as reward- producing programs. Nat. Mac. Intell.7(2), 205–220 (2025)

work page 2025

[8] [8]

Neural Comput.5(4), 613–624 (1993)

Dayan, P.: Improving generalization for temporal difference learning: The successor representation. Neural Comput.5(4), 613–624 (1993)

work page 1993

[9] [9]

IEEE Trans

Ding, G., Sener, F., Yao, A.: Temporal action segmentation: An analysis of modern techniques. IEEE Trans. Pattern Anal. Mach. Intell.46(2), 1011–1030 (2024)

work page 2024

[10] [10]

In: International Conference on Machine Learning, ICML (2024)

Du, Y., Kaelbling, L.P.: Compositional generative modeling: A single model is not all you need. In: International Conference on Machine Learning, ICML (2024)

work page 2024

[11] [11]

Nature Communications13(1), 5024 (8 2022)

Ellis, K., Albright, A., Solar-Lezama, A., Tenenbaum, J.B., O’Donnell, T.J.: Syn- thesizing theories of human language with bayesian program induction. Nature Communications13(1), 5024 (8 2022)

work page 2022

[12] [12]

Nature Biotechnology25, 1239–1249 (2007)

Fisher, J., Henzinger, T.A.: Executable cell biology. Nature Biotechnology25, 1239–1249 (2007)

work page 2007

[13] [13]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2021)

Georgescu, M., Barbalau, A., Ionescu, R.T., Khan, F.S., Popescu, M., Shah, M.: Anomaly detection in video via self-supervised and multi-task learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2021)

work page 2021

[14] [14]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2024)

Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3D human motions. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2024)

work page 2024

[15] [15]

In: IEEE/CVF Conference on Computer Vision and Pattern Recognition,CVPR (2022)

Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3d human motions from text. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition,CVPR (2022)

work page 2022

[16] [16]

In: ACM International Conference on Multimedia, MM (2020)

Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Ac- tion2motion: Conditioned generation of 3D human motions. In: ACM International Conference on Multimedia, MM (2020)

work page 2020

[17] [17]

In: International Conference on Artificial Intelligence and Statistics, AISTATS (2010) Executable Action Models 15

Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation prin- ciple for unnormalized statistical models. In: International Conference on Artificial Intelligence and Statistics, AISTATS (2010) Executable Action Models 15

work page 2010

[18] [18]

ICT Express12(1), 32–49 (2026)

Heo, S., Moon, J., Jung, S.K.: Action recognition: A comprehensive survey of tasks, methods, and challenges. ICT Express12(1), 32–49 (2026)

work page 2026

[19] [19]

In: IEEE/CVF International Conference on Computer Vision, ICCV (2023)

Hirschorn, O., Avidan, S.: Normalizing flows for human pose anomaly detection. In: IEEE/CVF International Conference on Computer Vision, ICCV (2023)

work page 2023

[20] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations, ICLR (2022)

[21] Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Dang, K., Yang, A., Men, R., Huang, F., Ren, X., Ren, X., Zhou, J., Lin, J.: Qwen2.5-Coder technical report. CoRR abs/2409.12186 (2024)

[22] Johnson-Laird, P.N.: Mental models and human reasoning. Proceedings of the National Academy of Sciences of the United States of America 107(43) (2010)

[23] Kozlova, E., Bonnetto, A., Mathis, A.: DLC2Action: A deep learning-based toolbox for automated behavior segmentation. bioRxiv (2025)

[24] Kulal, S., Mao, J., Aiken, A., Wu, J.: Hierarchical motion understanding via motion programs. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2021)

[25] Kulal, S., Mao, J., Aiken, A., Wu, J.: Programmatic concept learning for human motion description and synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022)

[26] Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2017)

[27] Li, J., Cao, J., Zhang, H., Rempe, D., Kautz, J., Iqbal, U., Yuan, Y.: GENMO: A generalist model for human motion. In: IEEE/CVF International Conference on Computer Vision, ICCV (2025)

[28] Li, S., Farha, Y.A., Liu, Y., Cheng, M., Gall, J.: MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 6647–6658 (2023)

[29] Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Conference on Empirical Methods in Natural Language Processing, EMNLP (2024)

[30] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Conference on Neural Information Processing Systems, NeurIPS (2023)

[31] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Trans. Graph. 34(6) (2015)

[32] Luo, Z., Cao, J., Merel, J., Winkler, A., Huang, J., Kitani, K.M., Xu, W.: Universal humanoid motion representations for physics-based control. In: International Conference on Learning Representations, ICLR (2024)

[33] Luo, Z., Cao, J., Winkler, A.W., Kitani, K., Xu, W.: Perpetual humanoid control for real-time simulated avatars. In: International Conference on Computer Vision, ICCV (2023)

[34] Mirsky, R., Keren, S., Geib, C.W.: Introduction to Symbolic Plan and Goal Recognition. Morgan & Claypool Publishers (2021)

[35] van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)

[36] Perrett, T., Darkhalil, A., Sinha, S., Emara, O., Pollard, S., Parida, K.K., Liu, K., Gatti, P., Bansal, S., Flanagan, K., Chalk, J., Zhu, Z., Guerrier, R., Abdelazim, F., Zhu, B., Moltisanti, D., Wray, M., Doughty, H., Damen, D.: HD-EPIC: A highly-detailed egocentric video dataset. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2025)

[37] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: International Conference on Computer Vision, ICCV (2021)

[38] Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: TimeChat: A time-sensitive multimodal large language model for long video understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2024)

[39] Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., Yao, A.: Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022)

[40] Sharma, A., Kosasih, E., Zhang, J., Brintrup, A., Calinescu, A.: Digital twins: State of the art theory and practice, challenges, and open research questions. Journal of Industrial Information Integration 30 (2022)

[41] Shin, J., Hassan, N., Miah, A.S.M., Nishimura, S.: A comprehensive methodological survey of human activity recognition across diverse data modalities. Sensors 25(13) (2025)

[42] Singhania, D., Rahaman, R., Yao, A.: Iterative contrast-classify for semi-supervised temporal action segmentation. In: AAAI Conference on Artificial Intelligence. vol. 36, pp. 2262–2270 (2022)

[43] Tenenbaum, J.B., Kemp, C., Griffiths, T.L., Goodman, N.D.: Bayesian Models of Cognition: Reverse Engineering the Mind. The MIT Press (2024)

[44] Tirinzoni, A., Touati, A., Farebrother, J., Guzek, M., Kanervisto, A., Xu, Y., Lazaric, A., Pirotta, M.: Zero-shot whole-body humanoid control via behavioral foundation models. In: International Conference on Learning Representations, ICLR (2025)

[45] Ugare, S., Suresh, T., Kang, H., Misailovic, S., Singh, G.: SynCode: LLM generation with grammar augmentation. Transactions of Machine Learning Research (2025)

[46] Van-Horenbeke, F.A., Peer, A.: Activity, plan, and goal recognition: A review. Frontiers Robotics AI 8, 643010 (2021)

[47] Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods. Frontiers Robotics AI 2, 28 (2015)

[48] Wong, L., Collins, K.M., Ying, L., Zhang, C.E., Weller, A., Gerstenberg, T., O’Donnell, T., Lew, A.K., Andreas, J., Tenenbaum, J.B., Brooke-Wilson, T.: Modeling open-world cognition as on-demand synthesis of probabilistic models. CoRR abs/2507.12547 (2025)

[49] Wu, S.A., Wang, R.E., Evans, J.A., Tenenbaum, J.B., Parkes, D.C., Kleiman-Weiner, M.: Too many cooks: Bayesian inference for coordinating multi-agent collaboration. Top. Cogn. Sci. 13(2), 414–432 (2021)

[50] Xu, Z., Lu, Z., Ding, Y., Tian, L., Liu, S.: Adaptive temporal action localization in video. Electronics 14(13) (2025)

[51] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence, AAAI (2018)

[52] Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2023)

[53] Zhang, K., Shasha, D.E.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)