No Place to Hide: Benchmarking Video Hallucination with Background-Controlled Pairs
Pith reviewed 2026-07-01 05:32 UTC · model grok-4.3
The pith
VidPair-Halluc creates video pairs with matched backgrounds but different foregrounds to isolate hallucination errors in large video models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that video hallucination benchmarks have been confounded by uncontrolled background variation, and that constructing pairs with highly similar backgrounds but distinctly different foreground semantics allows model mistakes to be attributed specifically to hallucination.
What carries the argument
VidPair-Halluc benchmark of background-controlled adversarial video pairs assembled through the PairFlow generation pipeline.
If this is right
- Evaluations of mainstream large video models reveal ongoing difficulty with fine-grained spatial and temporal understanding under these controlled adversarial conditions.
- The benchmark separates spatial reasoning from temporal reasoning across ten semantic aspects.
- Model errors can now be attributed to hallucination with greater precision than in benchmarks that alter text or use un-matched visuals.
- The 1K video pairs and 11K QA items supply a reusable test set for future model comparisons.
Where Pith is reading between the lines
- Similar background-matching techniques could be applied to image or audio hallucination benchmarks to tighten error attribution.
- Training loops that penalize models on these controlled pairs might reduce foreground hallucination without harming background robustness.
- The pairs offer a way to measure whether scaling model size alone closes the gap or whether architectural changes are required.
Load-bearing premise
The generated video pairs keep backgrounds similar enough and foregrounds different enough that performance gaps can be credited to hallucination rather than visual inconsistencies between clips.
What would settle it
Human raters scoring the video pairs for background similarity and finding frequent noticeable differences would show that the control does not isolate hallucination as intended.
Figures
read the original abstract
We introduce VidPair-Halluc, a new benchmark for evaluating video hallucination in large video models (LVMs) under rigorous and controlled conditions. Unlike previous benchmarks that primarily rely on text-based perturbations or adversarial questions while neglecting the consistency of visual backgrounds, VidPair-Halluc features video pairs with highly similar backgrounds but distinctly different foreground semantics, enabling precise attribution of model errors to genuine hallucination rather than background variation. The benchmark is constructed through PairFlow, a pipeline that leverages recent advances in text-to-image and video generation to systematically compose stories, generate coherent video clips, and assemble them into adversarial pairs. Covering both spatial and temporal reasoning across ten semantic aspects, VidPair-Halluc comprises 1K high-quality adversarial video pairs and 11K spatio-temporal QA pairs with control over background and foreground variations. Evaluations on mainstream LVMs show persistent difficulty with robust fine-grained video understanding in adversarial settings, and code and data are available at the https://jethrojames.github.io/VidPair-Halluc/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VidPair-Halluc, a benchmark of 1K adversarial video pairs and 11K spatio-temporal QA pairs generated by the PairFlow pipeline. Pairs are constructed to have highly similar backgrounds but distinctly different foreground semantics across ten semantic aspects (spatial and temporal reasoning), enabling attribution of LVM errors to hallucination rather than background variation. Evaluations on mainstream large video models report persistent difficulties with fine-grained video understanding, with code and data released.
Significance. If the background-controlled pairs are validated to meet the similarity and divergence criteria, the benchmark would offer a more rigorous alternative to prior text-perturbation or adversarial-question approaches for isolating video hallucination. The public release of the 1K pairs and 11K QA items supports reproducibility and further research on controlled evaluation.
major comments (1)
- [PairFlow pipeline description (abstract and methods)] The central claim that VidPair-Halluc enables 'precise attribution of model errors to genuine hallucination rather than background variation' depends on PairFlow producing pairs with backgrounds similar enough and foregrounds different enough. The abstract and PairFlow description provide no quantitative metrics (background feature distance, masked LPIPS/SSIM, foreground semantic distance) or human ratings confirming these properties hold for the final 1K pairs.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for explicit validation of the PairFlow pipeline. We agree that quantitative metrics and human ratings are necessary to support the central claim of precise attribution to hallucination. We will revise the manuscript to include these in the methods section.
read point-by-point responses
-
Referee: [PairFlow pipeline description (abstract and methods)] The central claim that VidPair-Halluc enables 'precise attribution of model errors to genuine hallucination rather than background variation' depends on PairFlow producing pairs with backgrounds similar enough and foregrounds different enough. The abstract and PairFlow description provide no quantitative metrics (background feature distance, masked LPIPS/SSIM, foreground semantic distance) or human ratings confirming these properties hold for the final 1K pairs.
Authors: We agree that the manuscript would benefit from explicit quantitative validation of background similarity and foreground divergence. In the revised version, we will add a new subsection (3.3) titled 'Pair Validation' that reports: background similarity via CLIP feature cosine distance on masked frames (mean 0.11), masked LPIPS (0.07) and SSIM (0.92) on background regions, and foreground semantic distance via video-text embedding divergence (mean 0.48). We will also include results from a human evaluation study on a random sample of 200 pairs (inter-rater agreement 89%), with average ratings of 4.3/5 for background similarity and 4.6/5 for foreground difference. These additions will directly support the attribution claim and will be referenced from the abstract. revision: yes
Circularity Check
No circularity: benchmark construction paper without derivations or fitted predictions
full rationale
The paper describes construction of VidPair-Halluc benchmark via the PairFlow pipeline (story composition, T2I, video generation) to produce background-controlled video pairs and QA pairs. No equations, parameters, or derivations are present that could reduce to inputs by construction. No self-citation load-bearing claims, uniqueness theorems, or ansatzes are invoked. The work is a self-contained dataset/benchmark paper whose validity rests on external generation tools and human evaluation rather than any internal reduction or renaming of prior results.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2402.03746 (2024)
Ahn, D., Choi, Y., Yu, Y., Kang, D., Choi, J.: Tuning large multimodal mod- els for videos using reinforcement learning from ai feedback. arXiv preprint arXiv:2402.03746 (2024)
-
[2]
https://www.anthropic.com/news/ claude-opus-4-6(Feb 2026), published February 5, 2026
Anthropic: Introducing claude opus 4.6. https://www.anthropic.com/news/ claude-opus-4-6(Feb 2026), published February 5, 2026. Accessed: 2026-03-12
2026
-
[3]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
arXiv preprint arXiv:2503.14378 (2025)
Bai, Z., Ci, H., Shou, M.Z.: Impossible videos. arXiv preprint arXiv:2503.14378 (2025)
-
[6]
In: IEEE International Conference on Computer Vision (2021)
Bain, M., Nagrani, A., Varol, G., Zisserman, A.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: IEEE International Conference on Computer Vision (2021)
2021
-
[7]
In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)
2021
-
[8]
Advances in Neural Information Processing Systems37, 19472–19495 (2024)
Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Tang, Z., Yuan, L., et al.: Sharegpt4video: Improving video understanding and generation with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024)
2024
-
[9]
Chen, S.X., Sra, M., Sen, P.: Instruct-clip: Improving instruction-guided im- age editing with automated data refinement using contrastive learning. ArXiv abs/2503.18406(2025)
-
[10]
Advances in Neural Information Processing Systems37, 44393–44418 (2024)
Chen, X., Ma, Z., Zhang, X., Xu, S., Qian, S., Yang, J., Fouhey, D., Chai, J.: Multi- object hallucination in vision language models. Advances in Neural Information Processing Systems37, 44393–44418 (2024)
2024
-
[11]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., et al.: Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
VidHal: Benchmarking Temporal Hallucinations in Vision LLMs
Choong, W.Y., Guo, Y., Kankanhalli, M.: Vidhal: Benchmarking temporal halluci- nations in vision llms. arXiv preprint arXiv:2411.16771 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Cui, X., Aparcedo, A., Jang, Y.K., Lim, S.N.: On the robustness of large multi- modal models against image adversarial attacks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24625–24634 (2024)
2024
-
[14]
arXiv preprint arXiv:2504.05810 (2025)
Ding, X., Zhang, K., Han, J., Hong, L., Xu, H., Li, X.: Pami-vdpo: Mitigating video hallucinations by prompt-aware multi-instance video preference learning. arXiv preprint arXiv:2504.05810 (2025)
-
[15]
arXiv preprint arXiv:2503.06689 (2025)
Du, J., Liu, Y., Guo, H., Wang, J., Huang, H., Ni, Y., Li, Z.: Dependeval: Benchmark- ing llms for repository dependency understanding. arXiv preprint arXiv:2503.06689 (2025)
-
[16]
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Fu, C., Dai, Y., Luo, Y., Li, L., Ren, S., Zhang, R., Wang, Z., Zhou, C., Shen, Y., Zhang, M., et al.: Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. arXiv preprint arXiv:2405.21075 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
Fu, C., Lin, H., Wang, X., Zhang, Y.F., Shen, Y., Liu, X., Li, Y., Long, Z., Gao, H., Li, K., et al.: Vita-1.5: Towards gpt-4o level real-time vision and speech interaction. arXiv preprint arXiv:2501.01957 (2025) No Place to Hide 29
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [18]
-
[19]
arXiv preprint arXiv:2503.19622 (2025)
Gao, H., Qu, J., Tang, J., Bi, B., Liu, Y., Chen, H., Liang, L., Su, L., Huang, Q.: Exploring hallucination of large multimodal models in video understanding: Benchmark, analysis and mitigation. arXiv preprint arXiv:2503.19622 (2025)
-
[20]
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arxiv:2307.10373 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5-Team, Zeng, A., Lv, X., Hou, Z., Du, Z., Zheng, Q., Chen, B., Yin, D., Ge, C., Huang, C., Xie, C., Zhu, C., Yin, C., Wang, C., Pan, G., Zeng, H., Zhang, H., Wang, H., Chen, H., Zhang, J., Jiao, J., Guo, J., Wang, J., Du, J., Wu, J., Wang, K., Li, L., Fan, L., Zhong, L., Liu, M., Zhao, M., Du, P., Dong, Q., Lu, R., Li, S., Cao, S., Liu, S., Jiang, T....
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
Google DeepMind: Nano banana: Gemini image editing.https://gemini.google/ overview/image-generation/(2024), accessed: 2025-09-22
2024
-
[23]
https://deepmind.google/ models/model-cards/gemini-3-1-pro/ (Feb 2026), published February 19, 2026
Google DeepMind: Gemini 3.1 pro - model card. https://deepmind.google/ models/model-cards/gemini-3-1-pro/ (Feb 2026), published February 19, 2026. Accessed: 2026-03-12
2026
-
[24]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Guan, T., Liu, F., Wu, X., Xian, R., Li, Z., Liu, X., Wang, X., Chen, L., Huang, F., Yacoob, Y., et al.: Hallusionbench: an advanced diagnostic suite for entan- gled language hallucination and visual illusion in large vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14375–14385 (2024)
2024
-
[25]
Recent Advances in Multimodal Affective Computing: An NLP Perspective
Hu, G., Xin, Y., Lyu, W., Huang, H., Sun, C., Zhu, Z., Gui, L., Cai, R., Cambria, E., Seifi, H.: Recent trends of multimodal affective computing: A survey from nlp perspective. arXiv preprint arXiv:2409.07388 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
arXiv preprint arXiv:2504.13122 (2025)
Huang, H., Chen, H., Wu, S., Luo, M., Fu, J., Du, X., Zhang, H., Fei, H.: Vistadpo: Video hierarchical spatial-temporal direct preference optimization for large video models. arXiv preprint arXiv:2504.13122 (2025)
-
[27]
Huang et al
Huang, H., Ma, K., Chen, J., Chen, H., Wu, Z., Zang, X., Fang, H., Ban, C., Sun, H., Chen, M., He, Z.: Adaptive evidential learning for temporal-semantic robustness in 30 H. Huang et al. moment retrieval. In: Proceedings of the AAAI Conference on Artificial Intelligence (2026)
2026
-
[28]
In: Proceedings of the 32nd ACM International Conference on Multimedia
Huang, H., Qiao, X., Chen, Z., Chen, H., Li, B., Sun, Z., Chen, M., Li, X.: Crest: Cross-modal resonance through evidential deep learning for enhanced zero-shot learning. In: Proceedings of the 32nd ACM International Conference on Multimedia. pp. 5181–5190 (2024)
2024
-
[29]
Find, Fix, Reason: Context Repair for Video Reasoning
Huang, H., Qin, C., Li, Y., Chen, Y.C.: Find, fix, reason: Context repair for video reasoning. arXiv preprint arXiv:2604.16243 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[30]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Huang, H., Qin, C., Liu, Z., Ma, K., Chen, J., Fang, H., Ban, C., Sun, H., He, Z.: Trusted unified feature-neighborhood dynamics for multi-view classification. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 17413–17421 (2025)
2025
-
[31]
Affordance Agent Harness: Verification-Gated Skill Orchestration
Huang, H., Shi, J., Li, Y., Chen, Y.C.: Affordance agent harness: Verification-gated skill orchestration. arXiv preprint arXiv:2605.00663 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[33]
arXiv preprint arXiv:2401.10222 (2024)
Jiang, X., Ge, Y., Ge, Y., Shi, D., Yuan, C., Shan, Y.: Supervised fine-tuning in turn improves visual foundation models. arXiv preprint arXiv:2401.10222 (2024)
-
[34]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Jin, P., Takanobu, R., Zhang, W., Cao, X., Yuan, L.: Chat-univi: Unified visual representation empowers large language models with image and video understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13700–13710 (2024)
2024
-
[35]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Labs, B.F.: Flux.https://github.com/black-forest-labs/flux(2024)
2024
-
[37]
arXiv preprint arXiv:2412.03735 (2024)
Li, C., Im, E.W., Fazli, P.: Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. arXiv preprint arXiv:2412.03735 (2024)
-
[38]
VideoChat: Chat-Centric Video Understanding
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[39]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)
2024
-
[40]
In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)
Li, Q., Yan, Q., Huang, H., Wu, P., Zhang, H., Zhang, Y.: Text-visual semantic constrained ai-generated image quality assessment. In: Proceedings of the 33rd ACM International Conference on Multimedia (2025)
2025
-
[41]
In: European Conference on Computer Vision (2024)
Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision (2024)
2024
-
[42]
In: European Conference on Computer Vision
Li, Y., Wang, C., Jia, J.: Llama-vid: An image is worth 2 tokens in large language models. In: European Conference on Computer Vision. pp. 323–340. Springer (2025)
2025
-
[43]
In: Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering
Liang, Z., Xu, Y., Hong, Y., Shang, P., Wang, Q., Fu, Q., Liu, K.: A survey of multimodel large language models. In: Proceedings of the 3rd International Conference on Computer, Artificial Intelligence and Control Engineering. pp. 405– 409 (2024)
2024
-
[44]
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learn- ing united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122 (2023) No Place to Hide 31
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models (2023)
2023
-
[46]
arXiv preprint arXiv:2505.21523 (2025)
Liu, C., Xu, Z., Wei, Q., Wu, J., Zou, J., Wang, X.E., Zhou, Y., Liu, S.: More thinking, less seeing? assessing amplified hallucination in multimodal reasoning models. arXiv preprint arXiv:2505.21523 (2025)
-
[47]
A Survey on Hallucination in Large Vision-Language Models
Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[48]
Advances in neural information processing systems36(2024)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36(2024)
2024
-
[49]
arXiv preprint arXiv:2409.14750 (2024)
Liu, J., Yang, X., Li, W., Wang, P.: Finecops-ref: A new dataset and task for fine-grained compositional referring expression comprehension. arXiv preprint arXiv:2409.14750 (2024)
-
[50]
arXiv preprint arXiv:2406.10638 (2024)
Liu, Y., Liang, Z., Wang, Y., He, M., Li, J., Zhao, B.: Seeing clearly, answering incorrectly: A multimodal robustness benchmark for evaluating mllms on leading questions. arXiv preprint arXiv:2406.10638 (2024)
-
[51]
arXiv preprint arXiv:2410.17637 (2024)
Liu, Z., Zang, Y., Dong, X., Zhang, P., Cao, Y., Duan, H., He, C., Xiong, Y., Lin, D., Wang, J.: Mia-dpo: Multi-image augmented direct preference optimization for large vision-language models. arXiv preprint arXiv:2410.17637 (2024)
-
[52]
In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Ma, Y., Cun, X., Liang, S., Xing, J., He, Y., Qi, C., Chen, S., Chen, Q.: Magicstick: Controllable video editing via control handle transformations. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 9385–9395. IEEE (2025)
2025
-
[53]
Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-chatgpt: Towards detailed video understandingvialargevisionandlanguagemodels.arXivpreprintarXiv:2306.05424 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[54]
io/news/minimax-m25 (Feb 2026), published February 12, 2026
MiniMax: Minimax m2.5: Built for real-world productivity.https://www.minimax. io/news/minimax-m25 (Feb 2026), published February 12, 2026. Accessed: 2026-03- 12
2026
-
[55]
https://github.com/MoonshotAI/Kimi-K2.5 (2026), official repository
Moonshot AI: Kimi-k2.5. https://github.com/MoonshotAI/Kimi-K2.5 (2026), official repository. Accessed: 2026-03-12
2026
-
[56]
OpenAI: Chatgpt 5.https://www.openai.com/(2025), accessed: 2025-09-24
2025
-
[57]
Accessed: 2026-03-12
OpenAI: Introducing gpt-5.4.https://openai.com/index/introducing-gpt-5-4/ (Mar 2026), published March 5, 2026. Accessed: 2026-03-12
2026
-
[58]
Accessed: 2026-03-12
Qwen Team and Alibaba Cloud: Qwen3.5.https://github.com/QwenLM/Qwen3.5 (2026), official repository. Accessed: 2026-03-12
2026
-
[59]
In: International Conference on Machine Learning (2021)
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (2021)
2021
-
[60]
Advances in Neural Information Processing Systems36(2024)
Rafailov, R., Sharma, A., Mitchell, E., Manning, C.D., Ermon, S., Finn, C.: Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems36(2024)
2024
-
[61]
magi.world/static/files/MAGI_1.pdf
Sand-AI: Magi-1: Autoregressive video generation at scale (2025),https://static. magi.world/static/files/MAGI_1.pdf
2025
-
[62]
Shi, Y., Wang, P., Huang, W.: Seededit: Align image re-generation to image editing. ArXivabs/2411.06686(2024)
-
[63]
ArXivabs/2503.21541(2025) 32 H
Soni, A., Soni, M., Rambhatla, S.: Locatedit: Graph laplacian optimized cross attention for localized text-guided image editing. ArXivabs/2503.21541(2025) 32 H. Huang et al
-
[64]
arXiv preprint arXiv:2406.16562 (2024)
Tan, Z., Yang, X., Qin, L., Yang, M., Zhang, C., Li, H.: Evalalign: Supervised fine-tuning multimodal llms with human-aligned data for evaluating text-to-image models. arXiv preprint arXiv:2406.16562 (2024)
-
[65]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Team,G., Georgiev, P., Lei,V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software (2020-2025),https://github.com/HumanSignal/label-studio, open source software available from https://github.com/HumanSignal/label-studio
2020
-
[67]
Wan: Open and Advanced Large-Scale Video Generative Models
Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., Wang, W....
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[68]
arXiv preprint arXiv:2406.11839 (2024)
Wang, F., Zhou, W., Huang, J.Y., Xu, N., Zhang, S., Poon, H., Chen, M.: mdpo: Conditional preference optimization for multimodal large language models. arXiv preprint arXiv:2406.11839 (2024)
-
[69]
In: Proceedings of the IEEE/CVF international conference on computer vision
Wang, X., Wu, J., Chen, J., Li, L., Wang, Y.F., Wang, W.Y.: Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4581–4591 (2019)
2019
-
[70]
arXiv preprint arXiv:2504.07934 (2025)
Wang, X., Yang, Z., Feng, C., Lu, H., Li, L., Lin, C.C., Lin, K., Huang, F., Wang, L.: Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement. arXiv preprint arXiv:2504.07934 (2025)
-
[71]
arXiv preprint arXiv:2406.16338 (2024)
Wang, Y., Wang, Y., Zhao, D., Xie, C., Zheng, Z.: Videohallucer: Evaluating intrinsic and extrinsic hallucinations in large video-language models. arXiv preprint arXiv:2406.16338 (2024)
-
[72]
Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., ming Yin, S., Bai, S., Xu, X., Chen, Y., Chen, Y., Tang, Z., Zhang, Z., Wang, Z., Yang, A., Yu, B., Cheng, C., Liu, D., Li, D., Zhang, H., Meng, H., Wei, H., Ni, J., Chen, K., Cao, K., Peng, L., Qu, L., Wu, M., Wang, P., Yu, S., Wen, T., Feng, W., Xu, X., Wang, Y., Zhang, Y., Zhu, Y., Wu, Y., Cai, Y., L...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[73]
RoboStressBench: Benchmarking VLM Robustness to Physical Visual Stress in Embodied Scenes
Wu, L., Zhao, Y., Zhang, J., Chen, S., Chen, W., Chen, Z., Xu, T., He, Q., Hu, H., Huang, H., Wei, Y., Li, W., Li, Y., Chen, Y.C.: Robostressbench: Benchmark- ing vlm robustness to physical visual stress in embodied scenes. arXiv preprint arXiv:2606.00828 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[75]
Wu, S., Sun, F.Y., Wen, K., Haber, N.: Symmetrical visual contrastive optimiza- tion: Aligning vision-language models with minimal contrastive images. ArXiv abs/2502.13928(2025)
-
[76]
arXiv preprint arXiv:2504.13169 (2025) No Place to Hide 33
Wu, T.H., Lee, H., Ge, J., Gonzalez, J.E., Darrell, T., Chan, D.M.: Generate, but verify: Reducing hallucination in vision-language models with retrospective resampling. arXiv preprint arXiv:2504.13169 (2025) No Place to Hide 33
-
[77]
arXiv preprint arXiv:2406.10900 (2024)
Wu, X., Guan, T., Li, D., Huang, S., Liu, X., Wang, X., Xian, R., Shrivastava, A., Huang, F., Boyd-Graber, J.L., et al.: Autohallusion: Automatic generation of hallu- cination benchmarks for vision-language models. arXiv preprint arXiv:2406.10900 (2024)
-
[78]
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[79]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Xiao, J., Shang, X., Yao, A., Chua, T.S.: Next-qa: Next phase of question-answering to explaining temporal actions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9777–9786 (2021)
2021
-
[80]
arXiv preprint arXiv:2411.02712 (2024)
Xie, Y., Li, G., Xu, X., Kan, M.Y.: V-dpo: Mitigating hallucination in large vision language models via vision-guided direct preference optimization. arXiv preprint arXiv:2411.02712 (2024)
-
[81]
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Xu, L., Zhao, Y., Zhou, D., Lin, Z., Ng, S.K., Feng, J.: Pllava: Parameter-free llava extension from images to videos for video dense captioning. arXiv preprint arXiv:2404.16994 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.