Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:59 UTC · model grok-4.3
The pith
VANGUARD unifies video anomaly classification, spatial grounding, and chain-of-thought reasoning in a single vision-language model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VANGUARD is a single-VLM framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning through a three-stage curriculum: classifier warmup on frozen backbone features, LoRA-adapted spatial grounding, and chain-of-thought generation. It is trained on structured reasoning trajectories from a Qwen3-VL-4B teacher and bounding-box supervision from GroundingDINO, attains 94% ROC-AUC and 84% F1 on UCF-Crime, and enables zero-shot cross-domain generalization.
What carries the argument
The three-stage curriculum that progressively layers classifier warmup, LoRA-adapted spatial grounding, and chain-of-thought generation on a vision-language model, supported by teacher-student annotation for reasoning trajectories.
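The staged unfreezing described above can be sketched as a training schedule. This is an illustrative sketch only: the stage names follow the paper, but the parameter-group names, per-stage objectives, and the strict-widening check are assumptions, not the authors' implementation.

```python
# Illustrative sketch of the three-stage curriculum (stage names from the
# paper; parameter groups and objectives here are hypothetical).
CURRICULUM = [
    {"stage": 1, "name": "classifier_warmup",
     "trainable": {"classifier_head"},  # backbone features stay frozen
     "objective": "binary cross-entropy on anomaly labels"},
    {"stage": 2, "name": "lora_spatial_grounding",
     "trainable": {"classifier_head", "lora_adapters"},
     "objective": "box regression against GroundingDINO supervision"},
    {"stage": 3, "name": "cot_generation",
     "trainable": {"classifier_head", "lora_adapters", "lm_head"},
     "objective": "next-token loss on teacher reasoning trajectories"},
]

def trainable_params(stage: int) -> set:
    """Parameter groups unfrozen at a given 1-indexed stage."""
    return CURRICULUM[stage - 1]["trainable"]

# Each stage strictly widens the trainable set; earlier parts are never re-frozen.
assert trainable_params(1) < trainable_params(2) < trainable_params(3)
```

The progressive layering, rather than joint optimization of all three objectives, is what the paper's ablations credit for the gains over monolithic training.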
Load-bearing premise
The teacher-student pipeline using Qwen3-VL-4B generates sufficiently accurate and unbiased structured reasoning trajectories and bounding-box supervision to train the student model without introducing systematic errors.
What would settle it
A controlled experiment on UCF-Crime would settle it: if VANGUARD produced geometrically invalid bounding boxes at high rates, or if its ROC-AUC dropped below 85% once the reasoning stage was removed, that would falsify the claim that the unified staged training reliably delivers accurate grounding and classification together.
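Both halves of this test are mechanical to run once per-clip scores and predicted boxes are available. A minimal sketch using scikit-learn, with toy scores and a hypothetical frame size standing in for real outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

def box_is_valid(box, width, height):
    """Geometric validity of an (x1, y1, x2, y2) box in pixel coordinates."""
    x1, y1, x2, y2 = box
    return 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height

# Toy per-clip anomaly scores standing in for real model outputs.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.9, 0.6, 0.2])
auc = roc_auc_score(y_true, y_score)    # the paper's headline metric
f1 = f1_score(y_true, y_score >= 0.5)   # with an assumed 0.5 threshold

# Rate of geometrically invalid predicted boxes (320x240 frame is hypothetical).
boxes = [(10, 20, 110, 220), (50, 50, 40, 200)]  # second box is degenerate
invalid_rate = sum(not box_is_valid(b, 320, 240) for b in boxes) / len(boxes)
```

Running the same script on the reasoning-ablated model and comparing `auc` against the 85% threshold is exactly the falsification test described above.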
Original abstract
Video Anomaly Detection (VAD) has traditionally been framed as binary classification or outlier detection, providing neither interpretable reasoning nor precise spatial localization of anomalous events. While Vision-Language Models (VLMs) offer rich scene understanding, they struggle with reliable spatial grounding - often producing hallucinated or geometrically invalid bounding boxes when asked to localize objects. We propose VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that unifies anomaly classification, spatial grounding, and chain-of-thought reasoning within a single VLM. VANGUARD introduces a three-stage curriculum that progressively layers training objectives: (1) classifier warmup on frozen backbone features, (2) LoRA-adapted spatial grounding, and (3) chain-of-thought generation. To overcome the sparse annotation typical of VAD benchmarks, we employ a teacher-student annotation pipeline in which a VLM (Qwen3-VL-4B) generates structured per-subclip reasoning trajectories based on manual annotations available from the UCA Dataset. Further, GroundingDINO provides bounding box supervision. On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects - capabilities absent from prior VAD methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions than classification-only fine-tuning. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VANGUARD, a VLM-based framework for video anomaly detection (VAD) that unifies binary classification, spatial grounding of anomalous objects, and chain-of-thought (CoT) reasoning. It introduces a three-stage curriculum (classifier warmup on frozen features, LoRA-adapted grounding, and CoT generation) and a teacher-student annotation pipeline that uses Qwen3-VL-4B to generate structured per-subclip reasoning trajectories plus GroundingDINO for bounding-box labels, derived from UCA Dataset manual annotations. On UCF-Crime the method reports 94% ROC-AUC and 84% F1 while producing interpretable explanations and localizations; ablations are said to show that staged training outperforms monolithic optimization and that reasoning acts as an implicit regularizer. Zero-shot transfer to XD-Violence and ShanghaiTech is also claimed.
Significance. If the reported metrics and the fidelity of the generated supervision hold, the work meaningfully extends VAD beyond binary classification by adding spatial grounding and interpretable reasoning—capabilities absent from prior methods. The curriculum design and the claim that structured CoT serves as regularization are potentially influential for future multimodal VAD research, and the cross-domain zero-shot results suggest practical generalization.
major comments (2)
- [§3 (Method), teacher-student annotation pipeline] The 94% ROC-AUC / 84% F1 claims on UCF-Crime and the assertion that reasoning acts as an implicit regularizer both depend on the quality of the Qwen3-VL-4B-generated reasoning trajectories and GroundingDINO bounding boxes used as supervision. No quantitative validation (inter-annotator agreement, error rates, hallucination analysis, or human ratings of the generated labels) is reported; systematic errors in the teacher outputs would be reproduced by the student, directly undermining both the performance numbers and the regularization claim.
- [§4 (Experiments), ablation and baseline tables] The statement that staged training outperforms monolithic optimization and yields more balanced predictions is load-bearing for the curriculum contribution, yet the ablations lack controls that isolate the effect of teacher-label noise (e.g., comparison against human-annotated subsets or noise-injection experiments). Without such controls it is impossible to attribute gains to genuine anomaly understanding rather than artifacts in the generated supervision.
minor comments (3)
- The abstract and results section should explicitly reference the specific tables or figures that report the 94% ROC-AUC / 84% F1 numbers, the ablation comparisons, and the zero-shot transfer metrics.
- Include error bars, standard deviations across runs, or statistical significance tests for all reported metrics to support the performance claims.
- [§3] Notation for the three-stage curriculum (e.g., loss terms for each stage) should be defined consistently between the method description and the experimental implementation details.
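The request for error bars in the second minor comment is typically met with a percentile bootstrap over clips. A sketch, assuming scikit-learn is available and that resampling is done at the clip level (function name and defaults are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for ROC-AUC over clips."""
    rng = np.random.default_rng(seed)
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, len(y_true), size=len(y_true))
        if y_true[idx].min() == y_true[idx].max():
            continue  # a resample must contain both classes for AUC
        stats.append(roc_auc_score(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

Reporting the resulting interval alongside the 94% ROC-AUC point estimate would make the headline claim verifiable at a glance.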
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments identify important gaps in validation and experimental controls that we agree warrant strengthening. We respond to each major comment below and commit to the corresponding revisions.
read point-by-point responses
-
Referee: [§3 (Method), teacher-student annotation pipeline] The 94% ROC-AUC / 84% F1 claims on UCF-Crime and the assertion that reasoning acts as an implicit regularizer both depend on the quality of the Qwen3-VL-4B-generated reasoning trajectories and GroundingDINO bounding boxes used as supervision. No quantitative validation (inter-annotator agreement, error rates, hallucination analysis, or human ratings of the generated labels) is reported; systematic errors in the teacher outputs would be reproduced by the student, directly undermining both the performance numbers and the regularization claim.
Authors: We agree that quantitative validation of the teacher-generated supervision is essential to support the reported performance and the regularization claim. In the revised manuscript we will add a dedicated human evaluation subsection. We will sample 300 sub-clips from the UCF-Crime training set and obtain independent ratings from three human annotators on factual accuracy, logical coherence of the reasoning trajectories, and spatial precision of the GroundingDINO boxes relative to the original UCA manual annotations. We will report inter-annotator agreement (Cohen's kappa), error rates, and a hallucination analysis. These results will be used to qualify the reliability of the supervision and to strengthen the interpretation of the CoT regularization effect. Revision: yes.
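The inter-annotator agreement the authors commit to is a one-line computation once ratings are collected. A sketch with hypothetical ratings (the annotator data below is invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary ratings from two annotators on 10 sampled sub-clips
# (1 = reasoning trajectory judged factually accurate, 0 = not).
rater_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
rater_b = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]
kappa = cohen_kappa_score(rater_a, rater_b)  # chance-corrected agreement
```

With three annotators, pairwise kappas can be averaged, or Fleiss' kappa used instead; either way the statistic directly quantifies how trustworthy the teacher-generated labels are.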
-
Referee: [§4 (Experiments), ablation and baseline tables] The statement that staged training outperforms monolithic optimization and yields more balanced predictions is load-bearing for the curriculum contribution, yet the ablations lack controls that isolate the effect of teacher-label noise (e.g., comparison against human-annotated subsets or noise-injection experiments). Without such controls it is impossible to attribute gains to genuine anomaly understanding rather than artifacts in the generated supervision.
Authors: We acknowledge that the existing ablations do not isolate the potential influence of label noise from the teacher pipeline. In the revision we will add two new controlled experiments: (1) systematic noise-injection ablations in which we deliberately corrupt a controlled fraction of the reasoning steps and bounding-box labels and measure the resulting degradation in ROC-AUC and F1; (2) a comparison, on any available human-annotated subset, of models trained with teacher-generated versus human labels. These results will be incorporated into the ablation tables and discussion to demonstrate that the benefits of staged training and CoT are not attributable solely to supervision artifacts. Revision: yes.
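The noise-injection control the authors promise needs only a corruption utility with a controlled flip rate. A minimal sketch for the binary-label case (function name and seeding scheme are illustrative; box and reasoning-step corruption would follow the same pattern):

```python
import random

def corrupt_labels(labels, noise_fraction, seed=0):
    """Flip a controlled fraction of binary labels to simulate teacher noise."""
    rng = random.Random(seed)
    labels = list(labels)
    n_flip = int(round(noise_fraction * len(labels)))
    for i in rng.sample(range(len(labels)), n_flip):
        labels[i] = 1 - labels[i]
    return labels
```

Sweeping `noise_fraction` and plotting the resulting ROC-AUC degradation would show how sensitive the curriculum's gains are to teacher-label quality.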
Circularity Check
No significant circularity; purely empirical framework with no derivations
full rationale
The paper presents an empirical VLM-based framework for video anomaly detection using a three-stage curriculum and a teacher-student pipeline that generates labels via Qwen3-VL-4B and GroundingDINO from UCA Dataset annotations. No mathematical equations, derivations, predictions, or first-principles results appear in the provided text. Performance claims (e.g., 94% ROC-AUC on UCF-Crime) are experimental outcomes on standard benchmarks, not reductions of outputs to fitted inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing elements. The method is self-contained as an applied ML contribution without circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard machine-learning assumptions that the training distribution is representative and that LoRA adaptation preserves useful features from the frozen backbone.
invented entities (1)
- VANGUARD three-stage curriculum (no independent evidence).
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean : reality_from_one_distinction (tagged: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
On UCF-Crime, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously producing interpretable chain-of-thought explanations and spatial grounding of anomalous objects
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.