Understanding Human Actions through the Lens of Executable Models
Pith reviewed 2026-05-10 04:51 UTC · model grok-4.3
The pith
Representing human actions as executable motion programs improves data efficiency and reveals intuitive relationships in action recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce EXACT, a domain-specific language that encodes human motions as underspecified motion programs. These programs are interpreted as reward-generating functions that enable zero-shot policy inference via forward-backwards representations. Because the programs are compositional, individual policies can be combined into an executable neuro-symbolic model whose structure mirrors the program. When this pipeline is applied to motion-capture data, it performs human action segmentation and anomaly detection with greater data efficiency and produces more intuitive relationships among actions than monolithic, task-specific baselines.
What carries the argument
EXACT, a domain-specific language whose underspecified motion programs act as reward-generating functions for zero-shot policy inference and whose compositional structure directly supports neuro-symbolic model construction.
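To make "programs as reward-generating functions" concrete, here is a minimal Python sketch. It is not EXACT syntax, which this page does not show. The names `raise_joint`, `together`, and `Sequence`, and the pose format (joint name mapped to height in metres), are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Pose = Dict[str, float]            # hypothetical pose: joint name -> height in metres
Reward = Callable[[Pose], float]   # a reward-generating function over poses


def raise_joint(joint: str, target: float) -> Reward:
    """Primitive: reward is highest (0) when `joint` reaches `target` height.

    Only one joint is constrained; every other joint is left free, which is
    the sense in which this toy program is "underspecified".
    """
    return lambda pose: -abs(pose[joint] - target)


def together(*parts: Reward) -> Reward:
    """Parallel composition: sum the rewards of the sub-programs."""
    return lambda pose: sum(part(pose) for part in parts)


@dataclass
class Sequence:
    """Sequential composition: one reward function per stage."""
    stages: List[Reward]

    def reward(self, stage: int, pose: Pose) -> float:
        return self.stages[stage](pose)


# "Raise both hands, then lower them", written as a composed program.
hands_up = together(raise_joint("left_hand", 1.8), raise_joint("right_hand", 1.8))
hands_down = together(raise_joint("left_hand", 0.8), raise_joint("right_hand", 0.8))
program = Sequence([hands_up, hands_down])

pose = {"left_hand": 1.8, "right_hand": 1.6, "head": 1.7}
print(program.reward(0, pose))  # near zero: the pose almost satisfies stage 0
print(program.reward(1, pose))  # more negative: the pose is far from stage 1
```

The point of the sketch is structural: each sub-program yields its own reward, so a policy inferred per sub-program can be slotted into a model that has the same shape as the program.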
If this is right
- Action segmentation can exploit program structure to identify boundaries and sub-parts without additional supervision.
- Anomaly detection becomes possible by comparing observed motion against the reward landscape defined by the program.
- New actions can be handled by composing existing program elements rather than retraining an entire model.
- Overall learning requires fewer examples because the model reuses modular policies across tasks.
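The first two implications can be illustrated with a deliberately simplified Python sketch. It scores each frame directly against per-stage reward functions, whereas the paper works through inferred policies, so treat it as an analogy and not the authors' method. The 1-D "hand height" frames, the two stage rewards, and the threshold are all made-up assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Reward = Callable[[float], float]  # hypothetical: reward over a 1-D hand height


def label_frames(
    frames: Sequence[float],
    stages: List[Reward],
    threshold: float,
) -> List[Tuple[int, bool]]:
    """Return (best_stage, is_anomalous) for every frame.

    best_stage   : index of the sub-program whose reward is highest.
    is_anomalous : True when even the best reward falls below `threshold`.
    """
    labels = []
    for height in frames:
        rewards = [stage(height) for stage in stages]
        best = max(range(len(stages)), key=rewards.__getitem__)
        labels.append((best, rewards[best] < threshold))
    return labels


# Two toy stages: hand near 1.8 m ("raised") and hand near 0.8 m ("lowered").
stages = [lambda h: -abs(h - 1.8), lambda h: -abs(h - 0.8)]
frames = [1.75, 1.8, 1.2, 0.85, 0.8]

labels = label_frames(frames, stages, threshold=-0.3)
# A segment boundary is any frame whose best stage differs from the previous one.
boundaries = [i for i in range(1, len(labels)) if labels[i][0] != labels[i - 1][0]]
print(labels)
print(boundaries)
```

Here the middle frame fits neither stage well, so it is both the segment boundary and the only frame flagged as anomalous.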
Where Pith is reading between the lines
- Robots could acquire new skills by inferring the underlying motion program from a single human demonstration instead of collecting large trajectory datasets.
- The same program representation might extend to language-conditioned action planning, where instructions are parsed into executable reward functions.
- Real-time monitoring systems could flag unsafe or inefficient executions by running the inferred program forward and measuring reward deviation.
Load-bearing premise
Human motions can be represented as underspecified motion programs that function as reward-generating functions for zero-shot policy inference and whose compositional structure supports effective neuro-symbolic modeling.
What would settle it
If, on a held-out collection of motion-capture sequences, the EXACT-based executable models show no gain in data efficiency for segmentation or anomaly detection relative to standard neural-network baselines, the central claim would be falsified.
Original abstract
Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action's execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backwards representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the EXACT domain-specific language for representing human motions as underspecified motion programs, which are interpreted as reward-generating functions to enable zero-shot policy inference via forward-backwards representations. It leverages the compositional structure of these programs to construct executable neuro-symbolic models and evaluates the approach on motion-capture data for human action segmentation and anomaly detection, claiming improved data efficiency and more intuitive capture of action relationships relative to monolithic, task-specific baselines.
Significance. If the empirical results hold, the work provides a structured, interpretable alternative to black-box action recognition by combining symbolic program structure with neural policy inference. This could improve data efficiency in human-centered AI systems and support applications requiring assessment of action quality or compositional generalization, such as robotics and rehabilitation monitoring. The explicit use of executable models as reward functions and the neuro-symbolic composition are notable strengths.
major comments (1)
- [Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.
minor comments (2)
- [Introduction] The abstract and introduction use the term 'underspecified motion programs' without a concise definition or example until later sections; adding a brief illustrative program in the introduction would improve accessibility.
- [Methods] Notation for forward-backwards representations and reward functions could be standardized with a single table or appendix to avoid scattered definitions across sections.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation, the constructive suggestions, and the recommendation of minor revision. We address the single major comment below.
- Referee: [Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.
Authors: We agree that explicit statistical analysis strengthens the empirical claims. Our original experiments were run with 5 independent random seeds per condition across the motion-capture datasets, and the reported improvements were consistent; however, we did not include formal significance tests or effect-size reporting in the submitted manuscript. In the revised version we will add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) with p-values, together with Cohen’s d effect sizes, for the key data-efficiency metrics. These results will be inserted into the results section and the corresponding tables/figures, confirming that the observed gains exceed what would be expected from variability alone.
revision: yes
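The paired statistics named in the response can be sketched in a few lines of standard-library Python. The scores below are invented placeholders for 5 seeds, not results from the paper, and the helper name `paired_t_and_cohens_d` is hypothetical. Cohen's d has several variants; this sketch uses the paired-samples form (often written d_z), which is an assumption about what the authors intend.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_and_cohens_d(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Paired t statistic and Cohen's d (d_z variant) for matched samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)      # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    d_z = mean_diff / sd_diff
    t_stat = d_z * math.sqrt(len(diffs))   # equals mean_diff / (sd_diff / sqrt(n))
    return t_stat, d_z


# Made-up scores for 5 seeds; NOT results from the paper.
executable = [0.71, 0.74, 0.69, 0.73, 0.72]
baseline = [0.66, 0.70, 0.67, 0.68, 0.69]
t_stat, d_z = paired_t_and_cohens_d(executable, baseline)
print(f"t = {t_stat:.2f}, d_z = {d_z:.2f}, df = {len(executable) - 1}")
```

The p-value then comes from a t distribution with n - 1 degrees of freedom. In practice one would likely call `scipy.stats.ttest_rel` for that, and `scipy.stats.wilcoxon` for the non-parametric fallback, instead of hand-rolling the test.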
Circularity Check
No significant circularity in derivation chain
The paper introduces the novel EXACT domain-specific language to represent motions as underspecified programs interpreted as reward functions, then applies forward-backwards zero-shot inference and neuro-symbolic composition for segmentation and anomaly detection tasks. These steps rely on explicit new definitions and empirical comparisons to monolithic baselines rather than reducing any claimed result to a fitted parameter, self-definition, or self-citation chain by construction. Existing policy-inference concepts are referenced as background but do not bear the load of the central claims, which remain independently supported by the reported data-efficiency gains and compositional modeling.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Human motions can be represented as underspecified motion programs interpreted as reward-generating functions.
- domain assumption: Forward-backwards representations enable zero-shot policy inference from such programs.
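For readers unfamiliar with the second assumption, the sketch below shows the usual forward-backward recipe: a task embedding is estimated as z = E[r(s) B(s)] over sampled states, and the policy acts greedily on F(s, a, z) · z. The `forward` and `backward` maps here are hand-written stand-ins on a 1-D state, not trained networks, and nothing in this block is taken from the paper's implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def infer_task_embedding(
    states: Sequence[float],
    reward: Callable[[float], float],
    backward: Callable[[float], Vector],
) -> Vector:
    """Monte Carlo estimate of z = E[r(s) * B(s)] over sampled states."""
    dim = len(backward(states[0]))
    z = [0.0] * dim
    for s in states:
        r, b = reward(s), backward(s)
        for i in range(dim):
            z[i] += r * b[i] / len(states)
    return z


def greedy_action(
    state: float,
    actions: Sequence[float],
    forward: Callable[[float, float, Vector], Vector],
    z: Vector,
) -> float:
    """pi_z(s) = argmax_a F(s, a, z) . z"""
    def q_value(a: float) -> float:
        return sum(f * zi for f, zi in zip(forward(state, a, z), z))
    return max(actions, key=q_value)


# Stand-in maps on a 1-D state; a trained model would use neural networks.
backward = lambda s: [s, 1.0]
forward = lambda s, a, z: [s + a, 1.0]

states = [0.0, 0.5, 1.0, 1.5, 2.0]
reward = lambda s: s - 1.0          # toy reward: "prefer larger s"
z = infer_task_embedding(states, reward, backward)
print(z, greedy_action(1.0, [-0.1, 0.0, 0.1], forward, z))
```

No training happens at inference time: swapping in a different reward function changes only z, which is why a reward-generating program can select a policy zero-shot.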
invented entities (1)
- EXACT domain-specific language: no independent evidence