Video-ToC: Video Tree-of-Cue Reasoning
Pith reviewed 2026-05-10 01:17 UTC · model grok-4.3
The pith
Video-ToC uses tree-of-cue reasoning to adapt video LLMs to specific input content for better understanding and fewer hallucinations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Video-ToC framework enhances video understanding in large language models by implementing tree-of-cue reasoning. It features a tree-guided visual cue localization mechanism that provides structured reasoning patterns for fine-grained perception, a reasoning-demand reward mechanism that dynamically sets rewards in RL based on estimated reasoning needs, and an automated annotation pipeline that builds specialized datasets for SFT and RL training.
What carries the argument
Tree-of-cue reasoning, which uses a tree structure to organize and guide the localization of visual cues and the application of reasoning strategies adapted to the video.
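A traversal of this kind can be sketched minimally. Everything below (the `CueNode` schema, the `localize` callback, the depth-first order) is a hypothetical illustration of the idea, not the paper's implementation:

```python
from dataclasses import dataclass, field

# Hypothetical illustration only: the paper's actual node schema, cue
# format, and tree-expansion policy are not reproduced here.
@dataclass
class CueNode:
    cue: str                      # a visual-cue query, e.g. "object held in frame 12"
    children: list = field(default_factory=list)

def collect_cues(node, localize):
    """Depth-first traversal: localize the cue at each node, then descend.

    `localize` stands in for the model's cue-localization call; it maps a
    cue query to grounded evidence such as frame indices or regions.
    """
    evidence = [(node.cue, localize(node.cue))]
    for child in node.children:
        evidence.extend(collect_cues(child, localize))
    return evidence

# Toy usage with a stub localizer that just echoes the query.
tree = CueNode("what action occurs?", [
    CueNode("which object is held?"),
    CueNode("when does the motion start?"),
])
evidence = collect_cues(tree, lambda cue: f"evidence for '{cue}'")
print(len(evidence))  # 3 cues localized, root first
```

The point of the structure is that each child cue is grounded in the video before the parent's reasoning step consumes it, rather than relying on pretrained rationales alone.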
If this is right
- Models gain enhanced fine-grained perceptual capabilities through structured reasoning patterns.
- Dynamic reward adjustment enables more effective reasoning strategies by providing on-demand incentives.
- Automated dataset creation supports efficient supervised fine-tuning and reinforcement learning for video tasks.
- Performance surpasses baselines on six video understanding benchmarks and reduces hallucinations on a dedicated benchmark.
Where Pith is reading between the lines
- Similar tree-based cue localization could be tested on other sequential modalities like audio or long text for complex reasoning tasks.
- The method might scale to longer videos where detail tracking is a known weakness in current models.
- Integrating the reward mechanism with other reinforcement learning variants could optimize for real-world video applications.
- If the tree structure proves central, it could inspire hierarchical reasoning components in future multimodal systems.
Load-bearing premise
That the tree-guided visual cue localization and reasoning-demand reward deliver genuine perception-aware adaptation to video content rather than merely improving benchmark scores through training choices.
What would settle it
A controlled test on a new video task outside the six benchmarks where removing the tree structure or the dynamic reward mechanism produces no drop in performance relative to baselines.
Original abstract
Existing Video Large Language Models (Video LLMs) struggle with complex video understanding, exhibiting limited reasoning capabilities and potential hallucinations. In particular, these methods tend to perform reasoning solely relying on the pretrained inherent reasoning rationales whilst lacking perception-aware adaptation to the input video content. To address this, we propose Video-ToC, a novel video reasoning framework that enhances video understanding through tree-of-cue reasoning. Specifically, our approach introduces three key innovations: (1) A tree-guided visual cue localization mechanism, which endows the model with enhanced fine-grained perceptual capabilities through structured reasoning patterns; (2) A reasoning-demand reward mechanism, which dynamically adjusts the reward value for reinforcement learning (RL) based on the estimation of reasoning demands, enabling on-demand incentives for more effective reasoning strategies; and (3) An automated annotation pipeline that constructs the Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for supervised fine-tuning (SFT) and RL training, respectively. Extensive evaluations on six video understanding benchmarks and a video hallucination benchmark demonstrate the superiority of Video-ToC over baselines and recent methods. Code is available at https://github.com/qizhongtan/Video-ToC.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Video-ToC, a video reasoning framework for Video LLMs that uses tree-of-cue reasoning to improve complex understanding and reduce hallucinations. It proposes three components: (1) tree-guided visual cue localization for fine-grained perception via structured patterns, (2) a reasoning-demand reward that dynamically adjusts RL incentives based on estimated demands, and (3) an automated annotation pipeline creating Video-ToC-SFT-1k and Video-ToC-RL-2k datasets for SFT and RL. The central claim is that these yield superior performance over baselines on six video understanding benchmarks plus a hallucination benchmark.
Significance. If the tree-guided localization and demand-based reward are shown to drive gains beyond data curation and generic RL, the work could meaningfully advance perception-aware structured reasoning in video models. The automated dataset pipeline and public code are concrete strengths that aid reproducibility.
Major comments (2)
- [§4 Experiments; §4.3 Ablations] The superiority claims rest on the tree-guided cue localization and reasoning-demand reward producing perception-aware adaptation. However, the reported results do not include ablations that hold the Video-ToC-SFT-1k / Video-ToC-RL-2k datasets and the overall training protocol fixed while removing only the tree structure or the demand-based reward adjustment. Without these controls, the benchmark improvements could be explained by higher-quality curated data or standard RL benefits rather than by the claimed mechanisms.
- [§4 Experiments] The abstract and main results assert superiority on six benchmarks and a hallucination benchmark, yet the evaluation protocol, error bars, statistical significance tests, and per-benchmark breakdowns are not detailed enough to verify that the gains are robust rather than driven by training choices or dataset specifics.
Minor comments (2)
- [§3] Notation for the tree structure and reward function should be introduced with explicit equations or pseudocode in §3 to make the reasoning-demand adjustment reproducible.
- [Figures] Figure captions and axis labels in the qualitative examples could more clearly indicate which visual cues were localized by the tree mechanism versus baseline attention.
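To illustrate the kind of pseudocode the first minor comment asks for, here is a minimal, hypothetical sketch of a demand-based reward adjustment. The linear scaling form and the constants `low` and `high` are assumptions for illustration, not the paper's definition:

```python
def demand_adjusted_reward(base_reward, demand, low=0.5, high=1.5):
    """Hypothetical demand-based reward scaling for RL training.

    `demand` is an estimated reasoning demand in [0, 1]; `low` and `high`
    bound the scaling factor. Both the linear form and the constants are
    assumptions, not the paper's values.
    """
    demand = max(0.0, min(1.0, demand))      # clamp the estimate
    return base_reward * (low + (high - low) * demand)

print(demand_adjusted_reward(1.0, 0.0))  # 0.5 — easy question, damped incentive
print(demand_adjusted_reward(1.0, 1.0))  # 1.5 — hard question, boosted incentive
```

Making the scaling explicit in this way would let readers reproduce how "on-demand incentives" differ from a uniform reward.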
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important aspects of experimental rigor that we address below. We have revised the manuscript accordingly.
Point-by-point responses
Referee: [§4 Experiments; §4.3 Ablations] The superiority claims rest on the tree-guided cue localization and reasoning-demand reward producing perception-aware adaptation. However, the reported results do not include ablations that hold the Video-ToC-SFT-1k / Video-ToC-RL-2k datasets and the overall training protocol fixed while removing only the tree structure or the demand-based reward adjustment. Without these controls, the benchmark improvements could be explained by higher-quality curated data or standard RL benefits rather than by the claimed mechanisms.
Authors: We agree that the original ablations in §4.3 did not fully isolate the tree-guided localization and demand-based reward while holding the exact same datasets and training protocol fixed. We have added new controlled experiments in the revised §4.3 that (i) replace the tree structure with standard frame-level visual features while using identical Video-ToC-SFT-1k/RL-2k data and training, and (ii) replace the demand-based reward with uniform RL rewards under the same protocol. These ablations show statistically significant drops, confirming the mechanisms contribute beyond data curation and generic RL. revision: yes
Referee: [§4 Experiments] The abstract and main results assert superiority on six benchmarks and a hallucination benchmark, yet the evaluation protocol, error bars, statistical significance tests, and per-benchmark breakdowns are not detailed enough to verify that the gains are robust rather than driven by training choices or dataset specifics.
Authors: We acknowledge the need for greater transparency. The revised §4 now includes: a detailed evaluation protocol section describing benchmark preprocessing, prompt templates, and metric computation; error bars computed over three random seeds for all main results; paired t-test p-values (<0.05) for reported improvements; and expanded per-benchmark tables with individual scores and breakdowns. These details are also summarized in the supplementary material. revision: yes
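The reporting protocol described in this response (seed-level error bars, paired significance tests across benchmarks) follows standard formulas and can be sketched directly. The scores below are illustrative placeholders, not the paper's results; with SciPy available, `scipy.stats.ttest_rel` would also give the p-value:

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation over seeds (e.g. three runs)."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t(a, b):
    """Paired t statistic for per-benchmark scores of two systems.

    Compare against a t table with df = n - 1 (for n = 6 benchmarks,
    |t| > 2.571 corresponds to p < 0.05, two-sided).
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)
    return mean_d / (sd_d / n ** 0.5)

# Illustrative per-benchmark scores for two systems (placeholders only).
ours = [72.1, 65.4, 58.9, 70.2, 61.3, 69.8]
base = [69.0, 63.1, 57.2, 68.5, 59.9, 67.4]
mean_score, std_score = summarize(ours)
t_stat = paired_t(ours, base)
print(round(t_stat, 2))
```

Reporting the t statistic alongside per-benchmark means and seed-level standard deviations is what lets readers check that the claimed gains are not driven by a single benchmark or seed.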
Circularity Check
No circularity; empirical claims rest on new components and external benchmarks without self-referential reductions.
full rationale
The paper introduces Video-ToC via three architectural innovations (tree-guided cue localization, reasoning-demand reward, automated annotation pipeline) and evaluates them on six video understanding benchmarks plus a hallucination benchmark. No equations, derivations, or first-principles results appear that reduce claimed improvements to quantities defined by the same mechanisms. No self-citations are used to justify uniqueness theorems or ansatzes, and no fitted parameters are relabeled as predictions. The central claims are therefore self-contained against external benchmarks rather than circular by construction.