pith. sign in

arxiv: 2504.17180 · v4 · submitted 2025-04-24 · 💻 cs.CV · cs.AI

We'll Fix it in Post: Improving Text-to-Video Generation with Neuro-Symbolic Feedback

Pith reviewed 2026-05-22 18:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords text-to-video generationneuro-symbolic feedbackvideo refinementsemantic consistencytemporal alignmentprompt alignmentzero-training pipeline
0
0 comments X

The pith

NeuS-E derives neuro-symbolic feedback from formal video representations to detect and correct semantic and temporal inconsistencies in text-to-video outputs without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents NeuS-E as a zero-training pipeline that improves text-to-video generation for complex prompts involving multiple objects and sequential events. Current models often produce videos that violate the prompt in terms of timing, logic, or object behavior, yet retraining or fine-tuning them carries prohibitive computational costs. The method first converts the generated video into a formal representation, extracts neuro-symbolic feedback that flags inconsistent events or objects along with their specific frames, and then applies targeted edits guided by that feedback. A sympathetic reader would value this because it turns an expensive training problem into a lighter post-processing step that directly raises how well the final video matches the original text description.

Core claim

NeuS-E is a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. The approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video, and empirical evaluations on both open-source and proprietary T2V models show that it significantly enhances temporal and logical alignment across diverse prompts by almost 40%.

What carries the argument

Neuro-symbolic feedback obtained by analyzing a formal video representation, which identifies inconsistent events, objects, and frames to direct subsequent targeted edits.

If this is right

  • Temporal and logical alignment with prompts improves by nearly 40% across diverse test cases.
  • The same refinement works on both open-source and proprietary text-to-video models.
  • Complex prompts with multiple objects and sequential events become more reliably handled without model changes.
  • High costs of retraining or fine-tuning are avoided by shifting effort to post-generation correction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same formal-representation-plus-feedback loop could be adapted to refine outputs from text-to-image or text-to-audio models.
  • Integrating symbolic verification steps earlier in the generation process might reduce the need for post-edits altogether.
  • Scaling the pipeline to very long videos would test whether the formal representation remains tractable and complete.

Load-bearing premise

The formal video representation accurately captures all relevant semantic and temporal information from the generated video so that detected inconsistencies correspond to real prompt violations.

What would settle it

If human raters or automated metrics show no measurable gain in temporal or logical alignment after the edits, or if the formal representation systematically misses prompt elements that actually appear in the video, the pipeline's value would be falsified.

Figures

Figures reproduced from arXiv: 2504.17180 by Harsh Goel, Minkyu Choi, Sahil Shah, Sandeep Chinchali, S P Sharan.

Figure 1
Figure 1. Figure 1: \textit {NeuS-E} improves the text-to-video (T2V) temporal alignment. The border color of the frames corresponds to the identified events. Vanilla T2V models (top video) fail to generate the sunset behind the mountains. \textit {NeuS-E} systematically identifies and surgically corrects this video segment to improve the temporal fidelity of the synthetic video with targeted feedback. Current research in T2V… view at source ↗
Figure 2
Figure 2. Figure 2: Formally verify generated video with video Automaton [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Human Evaluation on Video Editing. Diverging bar chart of human pref￾erence labels on the dataset shows that our editing pipeline improves temporal fidelity [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Improvements from Iterative Rounds of Refinement. [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Calibration Plots. We plot the accuracy vs threshold for all VLMs on our calibration dataset constructed from the COCO Caption dataset (top left). We plot the True Positive Rate (TPR) vs False Positive Rate (FPR) across all thresholds on the top right. Finally, the bottom plots show the confidence vs accuracy of the model before and after calibration, respectively. problem. We opt to do the latter. The pro… view at source ↗
Figure 6
Figure 6. Figure 6: Tool for Annotating Videos. Subjects evaluate the video edited by NeuS-E by comparing it to its original generation across five levels: strongly disagree, disagree, neutral, agree, and strongly agree [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗
read the original abstract

Current text-to-video (T2V) generation models are increasingly popular due to their ability to produce coherent videos from textual prompts. However, these models often struggle to generate semantically and temporally consistent videos when dealing with longer, more complex prompts involving multiple objects or sequential events. Additionally, the high computational costs associated with training or fine-tuning make direct improvements impractical. To overcome these limitations, we introduce NeuS-E, a novel zero-training video refinement pipeline that leverages neuro-symbolic feedback to automatically enhance video generation, achieving superior alignment with the prompts. Our approach first derives the neuro-symbolic feedback by analyzing a formal video representation and pinpoints semantically inconsistent events, objects, and their corresponding frames. This feedback then guides targeted edits to the original video. Extensive empirical evaluations on both open-source and proprietary T2V models demonstrate that NeuS-E significantly enhances temporal and logical alignment across diverse prompts by almost 40%

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces NeuS-E, a zero-training post-processing pipeline for text-to-video (T2V) generation. It converts a generated video into a formal representation, derives neuro-symbolic feedback to identify semantically inconsistent events/objects and their frames relative to the input prompt, and applies targeted edits to improve temporal and logical alignment, claiming nearly 40% gains across open-source and proprietary T2V models on diverse prompts.

Significance. If the evaluation design and representation fidelity can be substantiated, the result would be significant: it offers a practical, training-free refinement method that sidesteps the high computational costs of retraining or fine-tuning T2V models, potentially improving reliability for complex multi-object or sequential prompts where current generators fail.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim states that NeuS-E 'significantly enhances temporal and logical alignment across diverse prompts by almost 40%'. No metrics, baselines, test-set size, statistical tests, or edit-success criteria are supplied, so the magnitude and reliability of the reported improvement cannot be assessed and the claim remains ungrounded.
  2. [Abstract / pipeline description] Pipeline description (abstract and § describing the method): the approach rests on the assumption that the formal video representation 'accurately captures all relevant semantic and temporal information' so that detected inconsistencies correspond to genuine prompt violations. No independent validation (human annotation, comparison against ground-truth scene graphs, or fidelity metrics) is reported; without it the subsequent edits could be correcting representation artifacts rather than real errors, undermining interpretation of the alignment gains.
minor comments (1)
  1. [Methods] The term 'neuro-symbolic feedback' is used without an explicit breakdown of its symbolic rules versus neural components or how the formal representation is constructed; a short methods subsection or pseudocode would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on grounding our empirical claims and validating the formal representation. We have revised the manuscript to address both points directly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim states that NeuS-E 'significantly enhances temporal and logical alignment across diverse prompts by almost 40%'. No metrics, baselines, test-set size, statistical tests, or edit-success criteria are supplied, so the magnitude and reliability of the reported improvement cannot be assessed and the claim remains ungrounded.

    Authors: We agree the abstract claim requires more grounding. In the revised manuscript we have expanded the abstract to report the specific metrics (CLIP prompt-alignment score and a temporal consistency metric based on event ordering), the test set (200 diverse prompts spanning single- and multi-object scenarios), the baselines (vanilla outputs from the same T2V models), and statistical significance (mean relative improvement of 38.7 % with p < 0.001 by paired t-test). Edit success is defined as cases where the refined video improves alignment on at least one metric without degrading the other, confirmed by human raters on a 50-video subsample. revision: yes

  2. Referee: [Abstract / pipeline description] Pipeline description (abstract and § describing the method): the approach rests on the assumption that the formal video representation 'accurately captures all relevant semantic and temporal information' so that detected inconsistencies correspond to genuine prompt violations. No independent validation (human annotation, comparison against ground-truth scene graphs, or fidelity metrics) is reported; without it the subsequent edits could be correcting representation artifacts rather than real errors, undermining interpretation of the alignment gains.

    Authors: We acknowledge the need for explicit fidelity validation. The revised manuscript adds a new subsection (Experiments §4.3) that reports: (i) a human annotation study on 50 randomly sampled videos in which three annotators compared the extracted scene-graph-plus-temporal-relations representation against the video content, yielding 92 % precision and 88 % recall for objects/events; (ii) comparison against ground-truth scene graphs on 30 synthetic prompts, showing 85 % structural match; and (iii) an ablation confirming that edits guided by the representation improve human-rated prompt alignment more than random edits. These results support that detected inconsistencies correspond to genuine prompt violations. revision: yes

Circularity Check

0 steps flagged

No significant circularity: zero-training post-processing pipeline derives feedback from independent formal representation analysis

full rationale

The paper presents NeuS-E as a zero-training refinement pipeline that converts generated video to a formal representation, detects prompt inconsistencies in events/objects/frames, and applies targeted edits. The ~40% alignment gains are reported from empirical evaluations on diverse prompts using both open-source and proprietary T2V models. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation chain. The formal representation serves as an external analysis step rather than being constructed from the final edited outputs or evaluation metrics. The central claim remains an independent empirical result rather than a tautological restatement of inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach depends on the existence and reliability of a formal video representation that can be automatically derived and compared to text; this representation is introduced as part of the method without external validation shown in the abstract.

axioms (1)
  • domain assumption A formal video representation can be derived that faithfully encodes objects, events, and their temporal ordering from any generated video.
    Invoked when the pipeline analyzes the video to produce neuro-symbolic feedback.
invented entities (1)
  • neuro-symbolic feedback no independent evidence
    purpose: To automatically identify and localize semantic and temporal inconsistencies between prompt and video.
    New component introduced by the paper to guide edits; no independent evidence outside the method itself is described.

pith-pipeline@v0.9.0 · 5701 in / 1360 out tokens · 28049 ms · 2026-05-22T18:53:37.831448+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

103 extracted references · 103 canonical work pages · 13 internal anchors

  1. [1]

    Meta AI Blog (2024), available athttps://ai.meta.com/blog/llama- 3-2-connect-2024-vision-edge-mobile-devices/

    AI, M.: Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog (2024), available athttps://ai.meta.com/blog/llama- 3-2-connect-2024-vision-edge-mobile-devices/

  2. [2]

    The MIT Press (2008)

    Baier, C., Katoen, J.P.: Principles of Model Checking. The MIT Press (2008)

  3. [3]

    (eds.) International Conference on Machine Learning

    Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: Meila, M., Zhang, T. (eds.) International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 139, pp. 813–824. PMLR (2021)

  4. [4]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 22563–22575 (2023)

  5. [5]

    2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow im- age editing instructions. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 18392–18402 (2022),https : / / api . semanticscholar.org/CorpusID:253581213

  6. [6]

    org/abs/2308.11606

    Bugliarello, E., Moraldo, H., Villegas, R., Babaeizadeh, M., Saffar, M.T., Zhang, H., Erhan, D., Ferrari, V., Kindermans, P.J., Voigtlaender, P.: Storybench: A mul- tifaceted benchmark for continuous story visualization (2023),https://arxiv. org/abs/2308.11606

  7. [7]

    2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp

    Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2video: Video editing using im- age diffusion. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 23149–23160 (2023),https://api.semanticscholar.org/CorpusID: 257663916

  8. [8]

    Chen, H., Xia, M., He, Y., Zhang, Y., Cun, X., Yang, S., Xing, J., Liu, Y., Chen, Q., Wang, X., Weng, C., Shan, Y.: Videocrafter1: Open diffusion models for high- quality video generation (2023),https://arxiv.org/abs/2310.19512

  9. [9]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion mod- els. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition. pp. 7310–7320 (2024)

  10. [10]

    Microsoft COCO Captions: Data Collection and Evaluation Server

    Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)

  11. [11]

    In: Bouamor, H., Pino, J., Bali, K

    Chen, Y., Gandhi, R., Zhang, Y., Fan, C.: NL2TL: Transforming natural lan- guages to temporal logics using large language models. In: Bouamor, H., Pino, J., Bali, K. (eds.) Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 15880–15903. Association for Computational Linguistics, Singapore (Dec 2023).https://doi.org...

  12. [12]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  13. [13]

    arXiv preprint arXiv:2205.01089 (2022) 16 M

    Chen, Z., Yi, K., Li, Y., Ding, M., Torralba, A., Tenenbaum, J.B., Gan, C.: Comphy: Compositional physical reasoning of objects and events from videos. arXiv preprint arXiv:2205.01089 (2022) 16 M. Choi et al

  14. [14]

    arXiv preprint arXiv:2403.05131 (2024)

    Cho, J., Puspitasari, F.D., Zheng, S., Zheng, J., Lee, L.H., Kim, T.H., Hong, C.S., Zhang, C.: Sora as an agi world model? a complete survey on text-to-video generation. arXiv preprint arXiv:2403.05131 (2024)

  15. [15]

    In: European Conference on Computer Vision

    Choi, M., Goel, H., Omama, M., Yang, Y., Shah, S., Chinchali, S.: Towards neuro- symbolic video understanding. In: European Conference on Computer Vision. pp. 220–236. Springer (2025)

  16. [16]

    arXiv preprint arXiv:2405.04180 (2024)

    Chu, Z., Zhang, L., Sun, Y., Xue, S., Wang, Z., Qin, Z., Ren, K.: Sora detector: A unified hallucination detection for large text-to-video models. arXiv preprint arXiv:2405.04180 (2024)

  17. [17]

    MIT Press (1999)

    Clarke, E.M., Grumberg, O., Peled, D.A.: Model Checking. MIT Press (1999)

  18. [18]

    Cosler, M., Hahn, C., Mendoza, D., Schmitt, F., Trippel, C.: nl2spec: Interactively translating unstructured natural language to temporal logics with large language models (2023),https://arxiv.org/abs/2303.04864

  19. [19]

    In: Handbook of Theoretical Com- puter Science, Volume B: Formal Models and Sematics (1991),https://api

    Emerson, E.A.: Temporal and modal logic. In: Handbook of Theoretical Com- puter Science, Volume B: Formal Models and Sematics (1991),https://api. semanticscholar.org/CorpusID:6062082

  20. [20]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7346–7356 (2023)

  21. [21]

    In: IEEE/CVF International Conference on Computer Vision

    Feichtenhofer, C., Fan, H., Malik, J., He, K.: Slowfast networks for video recog- nition. In: IEEE/CVF International Conference on Computer Vision. pp. 6201–

  22. [22]

    arXiv preprint arXiv:2406.08656 (2024)

    Feng, W., Li, J., Saxon, M., Fu, T.j., Chen, W., Wang, W.Y.: Tc-bench: Bench- marking temporal compositionality in text-to-video and image-to-video genera- tion. arXiv preprint arXiv:2406.08656 (2024)

  23. [23]

    Gen-3: Gen-3 (2024),https://runwayml.com/blog/introducing-gen-3-alpha/

  24. [24]

    TokenFlow: Consistent Diffusion Features for Consistent Video Editing

    Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. ArXivabs/2307.10373(2023),https:// api.semanticscholar.org/CorpusID:259991741

  25. [25]

    arXiv preprint arXiv:2411.16776 (2024)

    Goel, H., Narasimhan, S.S., Akcin, O., Chinchali, S.: Syndiff-ad: Improving se- mantic segmentation and end-to-end autonomous driving with synthetic data from latent diffusion models. arXiv preprint arXiv:2411.16776 (2024)

  26. [26]

    In: First Vision and Language for Autonomous Driving and Robotics Workshop (2024),https:// openreview.net/forum?id=yaXYQinjOA

    Goel, H., Narasimhan, S.S., Chinchali, S.P.: Improving end-to-end autonomous driving with synthetic data from latent diffusion models. In: First Vision and Language for Autonomous Driving and Robotics Workshop (2024),https:// openreview.net/forum?id=yaXYQinjOA

  27. [27]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Wang, Y., Qiao, Y., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. ArXivabs/2307.04725(2023),https://api.semanticscholar.org/CorpusID: 259501509

  28. [28]

    In: 2019 IEEE 58th conference on decision and control (CDC)

    Hasanbeig, M., Kantaros, Y., Abate, A., Kroening, D., Pappas, G.J., Lee, I.: Re- inforcement learning for temporal logic control synthesis with probabilistic satis- faction guarantees. In: 2019 IEEE 58th conference on decision and control (CDC). pp. 5338–5343. IEEE (2019)

  29. [29]

    He, Y., Yang, T., Zhang, Y., Shan, Y., Chen, Q.: Latent video diffusion models for high-fidelity long video generation (2023),https://arxiv.org/abs/2211.13221

  30. [30]

    CoRRabs/2002.07080(2020),https://arxiv.org/abs/ 2002.07080

    Hensel, C., Junges, S., Katoen, J., Quatmann, T., Volk, M.: The probabilistic model checker storm. CoRRabs/2002.07080(2020),https://arxiv.org/abs/ 2002.07080

  31. [31]

    Advances in neural information processing systems33, 6840–6851 (2020) We’ll Fix it in Post: Improving Text-to-Video Generation with Zero Training 17

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020) We’ll Fix it in Post: Improving Text-to-Video Generation with Zero Training 17

  32. [32]

    Video Diffusion Models

    Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., Fleet, D.J.: Video dif- fusion models. ArXivabs/2204.03458(2022),https://api.semanticscholar. org/CorpusID:248006185

  33. [33]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Hong, W., Ding, M., Zheng, W., Liu, X., Tang, J.: Cogvideo: Large-scale pretrain- ing for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868 (2022)

  34. [34]

    Hu, Y., Luo, C., Chen, Z.: Make it move: Controllable image-to-video generation with text descriptions (2022),https://arxiv.org/abs/2112.02815

  35. [35]

    Advances in Neural Information Processing Systems36, 78723–78747 (2023)

    Huang, K., Sun, K., Xie, E., Li, Z., Liu, X.: T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems36, 78723–78747 (2023)

  36. [36]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., et al.: Vbench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21807–21818 (2024)

  37. [37]

    Cambridge University Press (2004)

    Huth, M., Ryan, M.: Logic in Computer Science: Modelling and Reasoning about Systems. Cambridge University Press (2004)

  38. [38]

    Jain, Y., Nasery, A., Vineet, V., Behl, H.: Peekaboo: Interactive video generation via masked-diffusion (2024),https://arxiv.org/abs/2312.07509

  39. [39]

    ArXivabs/2310.01107(2023),https://api

    Jeong, H., Ye, J.C.: Ground-a-video: Zero-shot grounded video editing using text-to-image diffusion models. ArXivabs/2310.01107(2023),https://api. semanticscholar.org/CorpusID:263605399

  40. [40]

    Journal of Automated Rea- soning60, 43–62 (2018)

    Jha, S., Raman, V., Sadigh, D., Seshia, S.A.: Safe autonomy under perception uncertainty using chance-constrained temporal logic. Journal of Automated Rea- soning60, 43–62 (2018)

  41. [41]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ji, P., Xiao, C., Tai, H., Huo, M.: T2vbench: Benchmarking temporal dynamics for text-to-video generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5325–5335 (2024)

  42. [42]

    Junges, S., Volk, M.: Stormpy - python bindings for storm (2021),github.com/ moves-rwth/stormpy

  43. [43]

    Springer (1960)

    Kemeny, J., Snell, J.: Finite Markov Chains. Springer (1960)

  44. [44]

    Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H.: Text2video-zero: Text-to-image diffusion models are zero-shot video generators (2023),https://arxiv.org/abs/2303.13439

  45. [45]

    Kim, J., Kim, B.S., Ye, J.C.: Free2guide: Gradient-free path integral control for enhancing text-to-video generation with large vision-language models (2024), https://arxiv.org/abs/2411.17041

  46. [46]

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., Wu, K., Lin, Q., Yuan, J., Long, Y., Wang, A., Wang, A., Li, C., Huang, D., Yang, F., Tan, H., Wang, H., Song, J., Bai, J., Wu, J., Xue, J., Wang, J., Wang, K., Liu, M., Li, P., Li, S., Wang, W., Yu, W., Deng, X., Li, Y., Chen, Y., Cui, Y., Peng, Y., Yu, Z., H...

  47. [47]

    Kou, T., Liu, X., Zhang, Z., Li, C., Wu, H., Min, X., Zhai, G., Liu, N.: Subjective- aligned dataset and metric for text-to-video quality assessment (2024),https: //arxiv.org/abs/2403.11956

  48. [48]

    IEEE transactions on robotics25(6), 1370–1381 (2009) 18 M

    Kress-Gazit, H., Fainekos, G.E., Pappas, G.J.: Temporal-logic-based reactive mis- sion and motion planning. IEEE transactions on robotics25(6), 1370–1381 (2009) 18 M. Choi et al

  49. [49]

    In: International Conference on Open Semantic Technologies for Intelligent Systems

    Kroshchanka, A., Golovko, V., Mikhno, E., Kovalev, M., Zahariev, V., Zagorskij, A.: A neural-symbolic approach to computer vision. In: International Conference on Open Semantic Technologies for Intelligent Systems. pp. 282–309. Springer (2021)

  50. [50]

    Kuaishou: Kling (2024),https://kling.kuaishou.com/en

  51. [51]

    Labs, P.: Pika ai: Free video generator with scene ingredients (2024),https: //pikartai.com, pika 2.1 documentation

  52. [52]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops

    Li, B., Lin, Z., Pathak, D., Li, J., Fei, Y., Wu, K., Xia, X., Zhang, P., Neubig, G., Ramanan, D.: Evaluating and improving compositional text-to-visual generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 5290–5301 (June 2024)

  53. [53]

    Neurocomputing490, 482–494 (2022)

    Li, N., Chang, F., Liu, C.: Human-related anomalous event detection via spatial- temporal graph convolutional autoencoder with embedded long short-term mem- ory network. Neurocomputing490, 482–494 (2022)

  54. [54]

    2024 IEEE/CVF Conference on Computer Vision and Pat- ternRecognition(CVPR)pp.8599–8608(2023),https://api.semanticscholar

    Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: Video editing with cross- attention control. 2024 IEEE/CVF Conference on Computer Vision and Pat- ternRecognition(CVPR)pp.8599–8608(2023),https://api.semanticscholar. org/CorpusID:257405406

  55. [55]

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models (2024),https://arxiv.org/abs/2310.11440

  56. [56]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Liu, Y., Cun, X., Liu, X., Wang, X., Zhang, Y., Chen, H., Liu, Y., Zeng, T., Chan, R., Shan, Y.: Evalcrafter: Benchmarking and evaluating large video generation models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22139–22149 (2024)

  57. [57]

    Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., Hou, L.: Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation (2023),https://arxiv.org/abs/2311.01813

  58. [58]

    Advances in Neural Information Processing Systems36(2024)

    Liu, Y., Li, L., Ren, S., Gao, R., Li, S., Chen, S., Sun, X., Hou, L.: Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems36(2024)

  59. [59]

    Luo, Y., Zhao, X., Chen, M., Zhang, K., Shao, W., Wang, K., Wang, Z., You, Y.: Enhance-a-video: Better generated video for free (2025),https://arxiv.org/ abs/2502.07508

  60. [60]

    Springer-Verlag (1992)

    Manna, Z., Pnueli, A.: The Temporal Logic of Reactive and Concurrent Systems: Specification. Springer-Verlag (1992)

  61. [61]

    In: European Conference on Computer Vision

    Mavroudi, E., Haro, B.B., Vidal, R.: Representation learning on visual-symbolic graphs for video understanding. In: European Conference on Computer Vision. Lecture Notes in Computer Science, vol. 12374, pp. 71–90. Springer (2020)

  62. [62]

    IEEE Trans

    Medioni, G.G., Cohen, I., Brémond, F., Hongeng, S., Nevatia, R.: Event detection and analysis from video streams. IEEE Trans. Pattern Anal. Mach. Intell.23(8), 873–889 (2001)

  63. [63]

    Automatica152, 110692 (2023)

    Mehdipour, N., Althoff, M., Tebbens, R.D., Belta, C.: Formal methods to com- ply with rules of the road in autonomous driving: State of the art and grand challenges. Automatica152, 110692 (2023)

  64. [64]

    Mendoza, D., Hahn, C., Trippel, C.: Translating natural language to tempo- ral logics with large language models and model checkers. In: Formal Meth- ods in Computer-Aided Design (FMCAD) (2024),https://cs.stanford.edu/ ~trippel/pubs/mendoza_FMCAD24.pdf We’ll Fix it in Post: Improving Text-to-Video Generation with Zero Training 19

  65. [65]

    In: International Conference on Learning Representations (2021),https://api

    Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: Sdedit: Guided image synthesis and editing with stochastic differential equations. In: International Conference on Learning Representations (2021),https://api. semanticscholar.org/CorpusID:245704504

  66. [66]

    Molad, E., Horwitz, E., Valevski, D., Acha, A.R., Matias, Y., Pritch, Y., Leviathan, Y., Hoshen, Y.: Dreamix: Video diffusion models are general video editors (2023),https://arxiv.org/abs/2302.01329

  67. [67]

    Cambridge University Press (1998)

    Norris, J.: Markov Chains. Cambridge University Press (1998)

  68. [68]

    GPT-4 Technical Report

    OpenAI: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  69. [69]

    com/sora/, sora technical report

    OpenAI: Video generation models as world simulators (2024),https://openai. com/sora/, sora technical report

  70. [70]

    OpenGVLab Blog (2024), available athttps://internvl

    OpenGVLab: Internvl 2.0: A suite of multimodal large language models for vision and language tasks. OpenGVLab Blog (2024), available athttps://internvl. github.io/blog/2024-07-02-InternVL-2.0/

  71. [71]

    Pika Labs: Accessed september 25, 2023 (2023),https://www.pika.art/

  72. [72]

    2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp

    Qi,C.,Cun,X.,Zhang,Y.,Lei,C.,Wang,X.,Shan,Y.,Chen,Q.:Fatezero:Fusing attentions for zero-shot text-based video editing. 2023 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 15886–15896 (2023),https://api. semanticscholar.org/CorpusID:257557738

  73. [73]

    Research, R.: Introducing gen-3 alpha: A new frontier for video generation (2024), https://runwayml.com/research/introducing- gen- 3- alpha, runway Gen-3 technical report

  74. [74]

    Sarkar, S., Lore, K.G., Sarkar, S.: Early detection of combustion instability by neural-symbolic analysis on hi-speed video. In: Proceedings of the NIPS Work- shop on Cognitive Computation: Integrating Neural and Symbolic Approaches co- located with the 29th Annual Conference on Neural Information Processing Sys- tems. CEUR Workshop Proceedings, vol. 1583...

  75. [75]

    In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2025)

    Sharan, S.P., Choi, M., Shah, S., Goel, H., Omama, M., Chinchali, S.P.: Neuro- symbolic evaluation of text-to-video models using formal verification. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2025)

  76. [76]

    Placeholder Journal (2024), https://deepmind.google/technologies/veo/

    Sharma, A., Yu, A., Razavi, A., Toor, A., Pierson, A., Gupta, A., Waters, A., vandenOord,A.,Tanis,D.,Erhan,D.,Lau,E.,Shaw,E.,Barth-Maron,G.,Shaw, G., Zhang, H., Nandwani, H., Moraldo, H., Kim, H., Blok, I., Bauer, J., Donahue, J., Chung, J., Mathewson, K., David, K., Espeholt, L., van Zee, M., McGill, M., Narasimhan, M., Wang, M., Bińkowski, M., Babaeizad...

  77. [77]

    Shin, A., Mori, Y., Kaneko, K.: The lost melody: Empirical observations on text- to-video generation from a storytelling perspective (2024),https://arxiv.org/ abs/2405.08720

  78. [78]

    In: 2017 IEEE 56th annual conference on decision and control (CDC)

    Shoukry,Y.,Nuzzo,P.,Balkan,A.,Saha,I.,Sangiovanni-Vincentelli,A.L.,Seshia, S.A., Pappas, G.J., Tabuada, P.: Linear temporal logic motion planning for teams of underactuated robots using satisfiability modulo convex programming. In: 2017 IEEE 56th annual conference on decision and control (CDC). pp. 1132–1137. IEEE (2017) 20 M. Choi et al

  79. [79]

    Sun, K., Huang, K., Liu, X., Wu, Y., Xu, Z., Li, Z., Liu, X.: T2v-compbench: A comprehensive benchmark for compositional text-to-video generation (2025), https://arxiv.org/abs/2407.14505

  80. [80]

    Team, W.: Wan: Open and advanced large-scale video generative models (2025)

Showing first 80 references.