pith. machine review for the scientific record.

arxiv: 2604.18064 · v1 · submitted 2026-04-20 · 💻 cs.AI

Recognition: unknown

Understanding Human Actions through the Lens of Executable Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 04:51 UTC · model grok-4.3

classification 💻 cs.AI
keywords: human action understanding · executable models · domain-specific language · neuro-symbolic modeling · action segmentation · anomaly detection · motion capture · zero-shot inference

The pith

Representing human actions as executable motion programs improves data efficiency and reveals intuitive relationships in action recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that human actions can be understood more effectively by expressing them as structured, underspecified programs rather than as flat categories or end-to-end learned patterns. This matters for human-centred systems because recognising what someone is doing also requires judging how they are doing it and how one action relates to others. The proposed approach turns these programs into reward signals that support zero-shot policy inference, then composes the resulting policies into a single neuro-symbolic model that respects the original program structure. Experiments on motion-capture recordings demonstrate that the executable models need less data to segment action sequences and to flag anomalies than conventional task-specific networks. A reader who accepts the premise would therefore expect future action-understanding systems to generalise more readily across people, environments, and tasks.

Core claim

The authors introduce EXACT, a domain-specific language that encodes human motions as underspecified motion programs. These programs are interpreted as reward-generating functions that enable zero-shot policy inference via forward-backwards representations. Because the programs are compositional, individual policies can be combined into an executable neuro-symbolic model whose structure mirrors the program. When this pipeline is applied to motion-capture data, it performs human action segmentation and anomaly detection with greater data efficiency and produces more intuitive relationships among actions than monolithic, task-specific baselines.
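To make the core claim concrete, here is a minimal sketch of a motion program interpreted as a reward-generating function. The review does not reproduce EXACT's actual syntax, so `Primitive`, `reach`, and `Program` are hypothetical stand-ins, and splitting the trajectory evenly across steps is a deliberate simplification of whatever alignment the real interpreter performs.

```python
# Hypothetical sketch of an underspecified motion program as a
# reward-generating function. Not the paper's EXACT syntax.
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

State = np.ndarray  # e.g. a joint-position vector from motion capture

@dataclass
class Primitive:
    """One underspecified motion step: a soft predicate over states."""
    name: str
    reward: Callable[[State], float]

def reach(target: State, tol: float = 0.1) -> Primitive:
    # High reward near `target`; the path taken is left unspecified,
    # which is what "underspecified" buys the representation.
    return Primitive("reach",
                     lambda s: float(np.exp(-np.linalg.norm(s - target) / tol)))

@dataclass
class Program:
    """Sequential composition; structure stays explicit so a
    neuro-symbolic model can mirror it later."""
    steps: Sequence[Primitive]

    def reward(self, trajectory: Sequence[State]) -> float:
        # Naive interpretation: split the trajectory evenly across steps
        # and score each step at its final state.
        chunks = np.array_split(np.asarray(trajectory), len(self.steps))
        return sum(p.reward(chunk[-1]) for p, chunk in zip(self.steps, chunks))

# Usage: a two-step "lift" program scored against an observed trajectory.
program = Program([reach(np.array([0.5, 0.2, 0.9])),
                   reach(np.array([0.5, 0.2, 1.2]))])
trajectory = [np.random.rand(3) for _ in range(20)]
print(program.reward(trajectory))
```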

What carries the argument

EXACT, a domain-specific language whose underspecified motion programs act as reward-generating functions for zero-shot policy inference and whose compositional structure directly supports neuro-symbolic model construction.

If this is right

  • Action segmentation can exploit program structure to identify boundaries and sub-parts without additional supervision.
  • Anomaly detection becomes possible by comparing observed motion against the reward landscape defined by the program (a sketch follows this list).
  • New actions can be handled by composing existing program elements rather than retraining an entire model.
  • Overall learning requires fewer examples because the model reuses modular policies across tasks.
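The anomaly-detection route in the list above can be phrased as a reward-deviation test. This hedged sketch reuses the hypothetical `Program` interpreter from the earlier sketch; the mean-plus-k-sigma threshold fitted on nominal executions is our assumption, not a detail given in the review.

```python
# Hedged sketch: flag executions whose program reward falls far below
# what known-good executions achieve. Builds on the hypothetical
# Program class from the previous sketch.
import numpy as np

def anomaly_score(program, trajectory) -> float:
    # Lower program reward means more anomalous.
    return -program.reward(trajectory)

def fit_threshold(program, nominal_trajectories, k: float = 3.0) -> float:
    # Mean + k standard deviations over scores of nominal executions.
    scores = np.array([anomaly_score(program, t) for t in nominal_trajectories])
    return float(scores.mean() + k * scores.std())

def is_anomalous(program, trajectory, threshold: float) -> bool:
    return anomaly_score(program, trajectory) > threshold
```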

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Robots could acquire new skills by inferring the underlying motion program from a single human demonstration instead of collecting large trajectory datasets.
  • The same program representation might extend to language-conditioned action planning, where instructions are parsed into executable reward functions.
  • Real-time monitoring systems could flag unsafe or inefficient executions by running the inferred program forward and measuring reward deviation.

Load-bearing premise

Human motions can be represented as underspecified motion programs that function as reward-generating functions for zero-shot policy inference and whose compositional structure supports effective neuro-symbolic modeling.
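Since the premise leans on forward-backwards (FB) representations, a minimal sketch of FB-style zero-shot inference may help. The embedding functions `F(s, a, z)` and `B(s)` are assumed pretrained and are stand-ins here; the recipe (project the program's reward through B to get a task embedding z, then act greedily on F·z) follows the general FB literature rather than the paper's exact construction.

```python
# Minimal sketch of zero-shot policy inference with forward-backwards
# representations. F and B are assumed pretrained embedding functions.
import numpy as np

def infer_task_embedding(B, reward_fn, states):
    # z = E_s[ B(s) * r(s) ]: project the program-defined reward onto
    # the backward representation over a sample of states.
    return np.mean([B(s) * reward_fn(s) for s in states], axis=0)

def greedy_policy(F, z, actions):
    # pi_z(s) = argmax_a F(s, a, z) . z, with no task-specific training.
    def act(state):
        scores = [F(state, a, z) @ z for a in actions]
        return actions[int(np.argmax(scores))]
    return act
```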

What would settle it

If, on a held-out collection of motion-capture sequences, the EXACT-based executable models show no gain in data efficiency for segmentation or anomaly detection relative to standard neural-network baselines, the central claim would be falsified.
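Operationally that test is a learning-curve comparison: train both pipelines on shrinking fractions of the data and check whether the executable models stay ahead at the small-data end. In this sketch, `train_exact`, `train_baseline`, and `evaluate` are placeholders for the study's actual training and scoring routines.

```python
# Sketch of the falsification test: held-out performance as a function
# of training-set size, for the EXACT pipeline vs. a baseline network.
def data_efficiency_curve(train_fn, evaluate, train_set, test_set,
                          fractions=(0.1, 0.25, 0.5, 1.0)):
    curve = []
    for frac in fractions:
        subset = train_set[: max(1, int(frac * len(train_set)))]
        model = train_fn(subset)
        curve.append((frac, evaluate(model, test_set)))
    return curve

# The claim fails if the EXACT curve does not dominate at small fractions:
# exact_curve    = data_efficiency_curve(train_exact, evaluate, train, test)
# baseline_curve = data_efficiency_curve(train_baseline, evaluate, train, test)
```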

Figures

Figures reproduced from arXiv: 2604.18064 by Manisha Dubey, N. Siddharth, Rimvydas Rubavicius, Subramanian Ramamoorthy.

Figure 1. Overview of executable action modelling for human action understanding.
Figure 2. Architecture of parser. We follow an encoder-decoder setup where we […]
Figure 3. AUROC matrices for the 5 actions from each dataset with target action […]
Original abstract

Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action's execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backwards representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces the EXACT domain-specific language for representing human motions as underspecified motion programs, which are interpreted as reward-generating functions to enable zero-shot policy inference via forward-backwards representations. It leverages the compositional structure of these programs to construct executable neuro-symbolic models and evaluates the approach on motion-capture data for human action segmentation and anomaly detection, claiming improved data efficiency and more intuitive capture of action relationships relative to monolithic, task-specific baselines.

Significance. If the empirical results hold, the work provides a structured, interpretable alternative to black-box action recognition by combining symbolic program structure with neural policy inference. This could improve data efficiency in human-centered AI systems and support applications requiring assessment of action quality or compositional generalization, such as robotics and rehabilitation monitoring. The explicit use of executable models as reward functions and the neuro-symbolic composition are notable strengths.

major comments (1)
  1. [Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.
minor comments (2)
  1. [Introduction] The abstract and introduction use the term 'underspecified motion programs' without a concise definition or example until later sections; adding a brief illustrative program in the introduction would improve accessibility.
  2. [Methods] Notation for forward-backwards representations and reward functions could be standardized with a single table or appendix to avoid scattered definitions across sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and constructive suggestion for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Evaluation] The central empirical claim of improved data efficiency rests on comparisons to monolithic baselines in the evaluation; however, the manuscript should clarify in the results section whether the gains are statistically significant across multiple runs and datasets, as small effect sizes could undermine the practical advantage of the EXACT-based pipeline.

    Authors: We agree that explicit statistical analysis strengthens the empirical claims. Our original experiments were run with 5 independent random seeds per condition across the motion-capture datasets, and the reported improvements were consistent; however, we did not include formal significance tests or effect-size reporting in the submitted manuscript. In the revised version we will add paired t-tests (or Wilcoxon signed-rank tests where normality assumptions fail) with p-values, together with Cohen’s d effect sizes, for the key data-efficiency metrics. These results will be inserted into the results section and the corresponding tables/figures, confirming that the observed gains exceed what would be expected from variability alone. revision: yes
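For readers who want the mechanics of the promised analysis, a small sketch using scipy; the per-seed numbers below are illustrative placeholders, not the paper's results.

```python
# Paired significance tests and effect size across seeds, as the
# rebuttal proposes. Metric values here are made up for illustration.
import numpy as np
from scipy import stats

exact    = np.array([0.81, 0.79, 0.83, 0.80, 0.82])  # e.g. per-seed F1
baseline = np.array([0.74, 0.76, 0.73, 0.75, 0.74])

t, p = stats.ttest_rel(exact, baseline)        # paired t-test
w, p_w = stats.wilcoxon(exact - baseline)      # if normality fails
diff = exact - baseline
cohens_d = diff.mean() / diff.std(ddof=1)      # paired effect size (d_z)

print(f"t={t:.2f} p={p:.4f}  wilcoxon p={p_w:.4f}  d={cohens_d:.2f}")
```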

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces the novel EXACT domain-specific language to represent motions as underspecified programs interpreted as reward functions, then applies forward-backwards zero-shot inference and neuro-symbolic composition for segmentation and anomaly detection tasks. These steps rely on explicit new definitions and empirical comparisons to monolithic baselines rather than reducing any claimed result to a fitted parameter, self-definition, or self-citation chain by construction. Existing policy-inference concepts are referenced as background but do not bear the load of the central claims, which remain independently supported by the reported data-efficiency gains and compositional modeling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on domain assumptions about representing motions as programs and the feasibility of zero-shot inference; no free parameters are mentioned and the only invented element is the new language itself.

axioms (2)
  • domain assumption Human motions can be represented as underspecified motion programs interpreted as reward-generating functions.
    This is the foundational representation step for policy inference.
  • domain assumption Forward-backwards representations enable zero-shot policy inference from such programs.
    This enables the executable aspect without task-specific training.
invented entities (1)
  • EXACT domain-specific language (no independent evidence)
    purpose: To represent human motions as underspecified executable motion programs.
    Newly introduced construct in the paper.

pith-pipeline@v0.9.0 · 5489 in / 1403 out tokens · 54038 ms · 2026-05-10T04:51:27.405351+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 5 canonical work pages · 2 internal anchors

  1. Baker, C.L., Saxe, R., Tenenbaum, J.B.: Action understanding as inverse planning. Cognition 113(3), 329–349 (2009) (special issue: reinforcement learning and higher cognition)
  2. Blier, L., Tallec, C., Ollivier, Y.: Learning successor states and goal-dependent values: A mathematical viewpoint. CoRR abs/2101.07123 (2021)
  3. Bonnetto, A., Qi, H., Leong, F., Tashkovska, M., Rad, M., Shokur, S., Hummel, F., Micera, S., Pollefeys, M., Mathis, A.: EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models. CoRR abs/2506.01608 (2025)
  4. Bowers, M., Olausson, T.X., Wong, L., Grand, G., Tenenbaum, J.B., Ellis, K., Solar-Lezama, A.: Top-down synthesis for library learning. Proc. ACM Program. Lang. 7, 1182–1213 (2023)
  5. Clarke, M.A., Fisher, J.: Executable cancer models: successes and challenges. Nature Reviews Cancer 20, 343–354 (2020)
  6. Craik, K.J.W.: The Nature of Explanation. Cambridge University Press (1967)
  7. Davidson, G., Todd, G., Togelius, J., Gureckis, T.M., Lake, B.M.: Goals as reward-producing programs. Nat. Mach. Intell. 7(2), 205–220 (2025)
  8. Dayan, P.: Improving generalization for temporal difference learning: The successor representation. Neural Comput. 5(4), 613–624 (1993)
  9. Ding, G., Sener, F., Yao, A.: Temporal action segmentation: An analysis of modern techniques. IEEE Trans. Pattern Anal. Mach. Intell. 46(2), 1011–1030 (2024)
  10. Du, Y., Kaelbling, L.P.: Compositional generative modeling: A single model is not all you need. In: International Conference on Machine Learning, ICML (2024)
  11. Ellis, K., Albright, A., Solar-Lezama, A., Tenenbaum, J.B., O'Donnell, T.J.: Synthesizing theories of human language with Bayesian program induction. Nature Communications 13(1), 5024 (2022)
  12. Fisher, J., Henzinger, T.A.: Executable cell biology. Nature Biotechnology 25, 1239–1249 (2007)
  13. Georgescu, M., Barbalau, A., Ionescu, R.T., Khan, F.S., Popescu, M., Shah, M.: Anomaly detection in video via self-supervised and multi-task learning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2021)
  14. Guo, C., Mu, Y., Javed, M.G., Wang, S., Cheng, L.: MoMask: Generative masked modeling of 3D human motions. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2024)
  15. Guo, C., Zou, S., Zuo, X., Wang, S., Ji, W., Li, X., Cheng, L.: Generating diverse and natural 3D human motions from text. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022)
  16. Guo, C., Zuo, X., Wang, S., Zou, S., Sun, Q., Deng, A., Gong, M., Cheng, L.: Action2Motion: Conditioned generation of 3D human motions. In: ACM International Conference on Multimedia, MM (2020)
  17. Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: International Conference on Artificial Intelligence and Statistics, AISTATS (2010)
  18. Heo, S., Moon, J., Jung, S.K.: Action recognition: A comprehensive survey of tasks, methods, and challenges. ICT Express 12(1), 32–49 (2026)
  19. Hirschorn, O., Avidan, S.: Normalizing flows for human pose anomaly detection. In: IEEE/CVF International Conference on Computer Vision, ICCV (2023)
  20. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations, ICLR (2022)
  21. Hui, B., Yang, J., Cui, Z., Yang, J., Liu, D., Zhang, L., Liu, T., Zhang, J., Yu, B., Dang, K., Yang, A., Men, R., Huang, F., Ren, X., Ren, X., Zhou, J., Lin, J.: Qwen2.5-Coder technical report. CoRR abs/2409.12186 (2024)
  22. Johnson-Laird, P.N.: Mental models and human reasoning. Proceedings of the National Academy of Sciences of the United States of America 107(43) (2010)
  23. Kozlova, E., Bonnetto, A., Mathis, A.: DLC2Action: A deep learning-based toolbox for automated behavior segmentation. bioRxiv (2025)
  24. Kulal, S., Mao, J., Aiken, A., Wu, J.: Hierarchical motion understanding via motion programs. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2021)
  25. Kulal, S., Mao, J., Aiken, A., Wu, J.: Programmatic concept learning for human motion description and synthesis. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022)
  26. Lea, C., Flynn, M.D., Vidal, R., Reiter, A., Hager, G.D.: Temporal convolutional networks for action segmentation and detection. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2017)
  27. Li, J., Cao, J., Zhang, H., Rempe, D., Kautz, J., Iqbal, U., Yuan, Y.: GenMo: A generalist model for human motion. In: IEEE/CVF International Conference on Computer Vision, ICCV (2025)
  28. Li, S., Farha, Y.A., Liu, Y., Cheng, M., Gall, J.: MS-TCN++: Multi-stage temporal convolutional network for action segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(6), 6647–6658 (2023)
  29. Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-LLaVA: Learning united visual representation by alignment before projection. In: Conference on Empirical Methods in Natural Language Processing, EMNLP (2024)
  30. Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Conference on Neural Information Processing Systems, NeurIPS (2023)
  31. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Trans. Graph. 34(6) (2015)
  32. Luo, Z., Cao, J., Merel, J., Winkler, A., Huang, J., Kitani, K.M., Xu, W.: Universal humanoid motion representations for physics-based control. In: International Conference on Learning Representations, ICLR (2024)
  33. Luo, Z., Cao, J., Winkler, A.W., Kitani, K., Xu, W.: Perpetual humanoid control for real-time simulated avatars. In: International Conference on Computer Vision, ICCV (2023)
  34. Mirsky, R., Keren, S., Geib, C.W.: Introduction to Symbolic Plan and Goal Recognition. Morgan & Claypool Publishers (2021)
  35. van den Oord, A., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. CoRR abs/1807.03748 (2018)
  36. Perrett, T., Darkhalil, A., Sinha, S., Emara, O., Pollard, S., Parida, K.K., Liu, K., Gatti, P., Bansal, S., Flanagan, K., Chalk, J., Zhu, Z., Guerrier, R., Abdelazim, F., Zhu, B., Moltisanti, D., Wray, M., Doughty, H., Damen, D.: HD-EPIC: A highly-detailed egocentric video dataset. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2025)
  37. Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: International Conference on Computer Vision, ICCV (2021)
  38. Ren, S., Yao, L., Li, S., Sun, X., Hou, L.: TimeChat: A time-sensitive multimodal large language model for long video understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2024)
  39. Sener, F., Chatterjee, D., Shelepov, D., He, K., Singhania, D., Wang, R., Yao, A.: Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2022)
  40. Sharma, A., Kosasih, E., Zhang, J., Brintrup, A., Calinescu, A.: Digital twins: State of the art theory and practice, challenges, and open research questions. Journal of Industrial Information Integration 30 (2022)
  41. Shin, J., Hassan, N., Miah, A.S.M., Nishimura, S.: A comprehensive methodological survey of human activity recognition across diverse data modalities. Sensors 25(13) (2025)
  42. Singhania, D., Rahaman, R., Yao, A.: Iterative contrast-classify for semi-supervised temporal action segmentation. In: AAAI Conference on Artificial Intelligence, vol. 36, pp. 2262–2270 (2022)
  43. Tenenbaum, J.B., Kemp, C., Griffiths, T.L., Goodman, N.D.: Bayesian Models of Cognition: Reverse Engineering the Mind. The MIT Press (2024)
  44. Tirinzoni, A., Touati, A., Farebrother, J., Guzek, M., Kanervisto, A., Xu, Y., Lazaric, A., Pirotta, M.: Zero-shot whole-body humanoid control via behavioral foundation models. In: International Conference on Learning Representations, ICLR (2025)
  45. Ugare, S., Suresh, T., Kang, H., Misailovic, S., Singh, G.: SynCode: LLM generation with grammar augmentation. Transactions on Machine Learning Research (2025)
  46. Van-Horenbeke, F.A., Peer, A.: Activity, plan, and goal recognition: A review. Frontiers Robotics AI 8, 643010 (2021)
  47. Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods. Frontiers Robotics AI 2, 28 (2015)
  48. Wong, L., Collins, K.M., Ying, L., Zhang, C.E., Weller, A., Gerstenberg, T., O'Donnell, T., Lew, A.K., Andreas, J., Tenenbaum, J.B., Brooke-Wilson, T.: Modeling open-world cognition as on-demand synthesis of probabilistic models. CoRR abs/2507.12547 (2025)
  49. Wu, S.A., Wang, R.E., Evans, J.A., Tenenbaum, J.B., Parkes, D.C., Kleiman-Weiner, M.: Too many cooks: Bayesian inference for coordinating multi-agent collaboration. Top. Cogn. Sci. 13(2), 414–432 (2021)
  50. Xu, Z., Lu, Z., Ding, Y., Tian, L., Liu, S.: Adaptive temporal action localization in video. Electronics 14(13) (2025)
  51. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: AAAI Conference on Artificial Intelligence, AAAI (2018)
  52. Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., Shen, X.: T2M-GPT: Generating human motion from textual descriptions with discrete representations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR (2023)
  53. Zhang, K., Shasha, D.E.: Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput. 18(6), 1245–1262 (1989)