TuringViT: Making SOTA Vision Transformers Accessible to All
Pith reviewed 2026-06-29 05:30 UTC · model grok-4.3
The pith
TuringViT trains Vision Transformers that outperform standard baselines using only 10% of the data through linear attention, curated datasets, and built-in dynamic resolution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TuringViT replaces standard softmax attention with Turing Linear Attention for efficient sequence modeling, pairs it with VISTA-Curation to build supervision-rich image-video training sets, and adds native dynamic-resolution pretraining that accepts flexible inputs from the start. These elements together let the model exceed leading open-source ViT baselines while using only 10% of the data, deliver stronger results when plugged into downstream VLMs, and exhibit markedly better latency scaling at high resolutions. Scaling-law plots indicate performance keeps rising predictably with more curated data rather than saturating.
What carries the argument
Turing Linear Attention (TLA) for efficient sequence modeling, together with VISTA-Curation of supervision-rich data and native dynamic-resolution pretraining.
If this is right
- Outperforms open-source ViT baselines while using only 10% of the data
- Produces stronger results when used as the visual encoder inside vision-language models
- Shows substantially better latency scaling when inputs increase in resolution
- Continues to improve according to scaling laws as the volume of curated data grows
- Supplies a reproducible pipeline that lowers the barrier for training and deploying custom ViTs
Where Pith is reading between the lines
- The same three-component recipe could be tested on non-vision transformer backbones to check whether the efficiency gains transfer
- If the curation step proves reusable, teams might experiment with swapping only the attention module while keeping the data pipeline fixed
- Real-world deployment across multiple AI systems suggests the design choices also simplify hardware mapping and temporal modeling integration
- One could measure whether the latency advantage persists when the model is quantized or run on edge devices
Load-bearing premise
That the combination of linear attention, curated data, and dynamic pretraining delivers generalizable gains without hidden costs in quality or compatibility that standard training methods avoid.
What would settle it
Retrain a standard softmax-attention ViT on the same VISTA-Curated data using the same dynamic-resolution schedule and compare its accuracy, VLM downstream scores, and high-resolution latency directly against TuringViT.
Figures
read the original abstract
Modern VLMs and VLA systems commonly adopt off-the-shelf ViTs such as SigLIP2 as visual encoders, but diverse downstream requirements in latency, temporal modeling, and VLM integration often call for customized SOTA-level ViTs. Training such encoders remains beyond the reach of much of the community, as it requires massive image-text data, while standard softmax attention makes high-resolution or dynamic-resolution pretraining prohibitively costly and often forces low-resolution pretraining followed by post-hoc adaptation. TuringViT addresses these challenges with three key designs: Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to construct supervision-rich image-video training data, and native dynamic-resolution pretraining that supports flexible inputs from the start and transfers seamlessly to downstream VLMs. As a result, TuringViT outperforms leading open-source ViT baselines with only 10% of the data, achieves stronger downstream VLM performance, and delivers substantially better latency scaling on high-resolution inputs. Our scaling-law analysis further shows that TuringViT continues to improve predictably with curated data scale, far from saturation. Its fast adaptation, hardware-friendly design, and efficient deployment have made it a unified visual foundation across XPeng's AI systems. More broadly, TuringViT provides a reproducible pipeline that dramatically lowers the cost for the community to train, customize, and deploy SOTA-level ViTs, moving toward making such Vision Transformers accessible to all.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TuringViT, a Vision Transformer that integrates Turing Linear Attention (TLA) for efficient sequence modeling, VISTA-Curation to build supervision-rich image-video datasets, and native dynamic-resolution pretraining. It claims that these designs together enable outperforming leading open-source ViT baselines using only 10% of typical data, stronger downstream VLM performance, substantially improved latency scaling on high-resolution inputs, predictable scaling-law behavior with additional curated data, and successful deployment as a unified visual foundation in XPeng's AI systems, while providing a reproducible pipeline to lower barriers for the community.
Significance. If the empirical claims hold after isolating the contributions of each component and providing quantitative metrics, the work would meaningfully lower the data and compute thresholds for training customized high-performance Vision Transformers, facilitating broader adoption in latency-sensitive, high-resolution, or temporally-aware applications.
major comments (2)
- [Abstract] Abstract: The claim that TuringViT 'outperforms leading open-source ViT baselines with only 10% of the data' and delivers 'stronger downstream VLM performance' and 'substantially better latency scaling' is presented without any metrics, baselines, ablation tables, or experimental controls, rendering the central empirical outcomes unverifiable from the manuscript.
- [Abstract] Abstract: The attribution of gains to the joint effect of TLA and native dynamic-resolution pretraining (as opposed to VISTA-Curation alone) lacks isolating evidence; no experiment is described that trains a standard softmax-attention ViT on the identical VISTA-Curated dataset to test whether the architectural choices are load-bearing for the reported 10%-data advantage.
minor comments (1)
- [Abstract] Abstract: The phrase 'scaling-law analysis' is invoked without specifying the functional form, fitted parameters, or plotted ranges that would allow readers to assess the 'far from saturation' statement.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below. Where the concerns identify gaps in the current presentation, we have outlined specific revisions to strengthen the empirical grounding and isolating evidence.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that TuringViT 'outperforms leading open-source ViT baselines with only 10% of the data' and delivers 'stronger downstream VLM performance' and 'substantially better latency scaling' is presented without any metrics, baselines, ablation tables, or experimental controls, rendering the central empirical outcomes unverifiable from the manuscript.
Authors: We agree that the abstract as written is too high-level and does not include quantitative anchors. The full manuscript (Sections 4.1–4.3 and Tables 2–5) reports the concrete metrics, including top-1 accuracy deltas, data-volume comparisons against SigLIP2/EVA-CLIP baselines, VLM downstream scores on VQAv2 and COCO, and latency curves at 224/448/1024 resolutions. To make these outcomes immediately verifiable from the abstract itself, we will revise the abstract to incorporate the key numerical results (e.g., “+1.8% ImageNet accuracy at 10% data, 2.3× faster high-resolution inference, +3.1% VLM average”) while preserving the original length constraints. revision: yes
-
Referee: [Abstract] Abstract: The attribution of gains to the joint effect of TLA and native dynamic-resolution pretraining (as opposed to VISTA-Curation alone) lacks isolating evidence; no experiment is described that trains a standard softmax-attention ViT on the identical VISTA-Curated dataset to test whether the architectural choices are load-bearing for the reported 10%-data advantage.
Authors: The referee correctly identifies that the current manuscript does not contain a controlled ablation that trains a standard softmax-attention ViT on exactly the same VISTA-Curated corpus. While the paper does compare TuringViT against public softmax baselines trained on larger, non-curated corpora (Table 3), this does not isolate the architectural contribution under identical data conditions. We will therefore add a new controlled experiment (new Table 6) that trains a standard ViT-B/16 with softmax attention on the identical VISTA-Curated dataset at the same compute budget; the revised manuscript will report the resulting accuracy gap and latency numbers to quantify the load-bearing role of TLA and dynamic-resolution pretraining. revision: yes
Circularity Check
No circularity detected in derivation chain
full rationale
The paper presents purely empirical results from training and evaluating TuringViT using TLA, VISTA-Curation, and native dynamic-resolution pretraining. No mathematical derivations, equations, fitted-parameter predictions, or self-citation load-bearing steps appear in the abstract or described content. Performance claims (outperformance with 10% data, scaling laws, downstream VLM gains) are framed as experimental outcomes rather than reductions to inputs by construction. The reader's assessment of score 2.0 is consistent with this self-contained empirical structure.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, et al. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023. URLhttps://arxiv.org/abs/2308.12966
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Wenhai Wang, Enze Xie, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks.CoRR, abs/2312.14238, 2023. URL https://arxiv.org/abs/2312.14238
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Physical Intelligence, Bo Ai, Ali Amin, et al.π 0.7: a steerable generalist robotic foundation model with emergent capabilities. 2026. URLhttps://arxiv.org/abs/2604.15483
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[5]
Dm0: An embodied-native vision-language-action model towards physical ai
En Yu, Haoran Lv, Jianjian Sun, et al. Dm0: An embodied-native vision-language-action model towards physical ai. 2026. URLhttps://arxiv.org/abs/2602.14974
-
[6]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021
2021
-
[7]
Mobileclip: Fast image-text models through multi-modal reinforced training
Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, et al. Mobileclip: Fast image-text models through multi-modal reinforced training. 2024. URLhttps://arxiv.org/abs/2311.17049
-
[8]
Michael Tschannen, Alexey Gritsenko, Xiao Wang, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. 2025. URL https://arxiv.org/abs/2502.14786
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
Shuai Bai, Keqin Chen, Xuejing Liu, et al. Qwen2.5-vl technical report. 2025. URL https://arxiv.org/abs/2502.13923
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang, Kunchang Li, Yizhuo Li, et al. Internvideo: General video foundation models via generative and discriminative learning.arXiv preprint arXiv:2212.03191, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[12]
An image is worth 16x16 words: Transformers for image recognition at scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations, 2021
2021
-
[13]
Swin transformer: Hierarchical vision transformer using shifted windows
Ze Liu, Yutong Lin, Yue Cao, et al. Swin transformer: Hierarchical vision transformer using shifted windows. InProceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021
2021
-
[14]
Transformers are rnns: Fast autoregressive transformers with linear attention
Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, et al. Transformers are rnns: Fast autoregressive transformers with linear attention. InInternational conference on machine learning, pages 5156–5165. PMLR, 2020
2020
-
[15]
LAION-5b: An open large-scale dataset for training next generation image-text models
Christoph Schuhmann, Romain Beaumont, Cade Vencu, et al. LAION-5b: An open large-scale dataset for training next generation image-text models. 2022
2022
-
[16]
Datacomp: In search of the next generation of multimodal datasets
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, et al. Datacomp: In search of the next generation of multimodal datasets. 2023. URLhttps://arxiv.org/abs/2304.14108
-
[17]
Data filtering networks
Alex Fang, Albin Madappally Jose, Amit Jain, et al. Data filtering networks. InInternational Conference on Learning Representations, volume 2024, pages 36221–36237, 2024
2024
-
[18]
Sharegpt4v: Improving large multi-modal models with better captions
Lin Chen, Jinsong Li, Xiaoyi Dong, et al. Sharegpt4v: Improving large multi-modal models with better captions. InEuropean Conference on Computer Vision, pages 370–387. Springer, 2024
2024
-
[19]
Llavanext: Improved reasoning, ocr, and world knowledge, 2024
Haotian Liu, Chunyuan Li, Yuheng Li, et al. Llavanext: Improved reasoning, ocr, and world knowledge, 2024
2024
-
[20]
Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks
Zhe Chen, Jiannan Wu, Wenhai Wang, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024
2024
-
[21]
Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252–2274, 2023
Mostafa Dehghani, Basil Mustafa, Josip Djolonga, et al. Patch n’pack: Navit, a vision transformer for any aspect ratio and resolution.Advances in Neural Information Processing Systems, 36:2252–2274, 2023
2023
-
[22]
Long-CLIP: Unlocking the long-text capability of CLIP
Beichen Zhang, Pan Zhang, Xiaoyi Dong, et al. Long-CLIP: Unlocking the long-text capability of CLIP. arXiv preprint arXiv:2403.15378, 2024
-
[23]
TULIP: Token-length upgraded CLIP.arXiv preprint arXiv:2410.10034, 2024
Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M Asano, et al. TULIP: Token-length upgraded CLIP.arXiv preprint arXiv:2410.10034, 2024
-
[24]
Efficient attention: Attention with linear complexities
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, et al. Efficient attention: Attention with linear complexities. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3531–3539, 2021. 16 FOUNDATIONMODELTEAMTuringViTTECHNICALREPORT
2021
-
[25]
Rethinking Attention with Performers
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2009
-
[26]
Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen, et al. Gated linear attention transformers with hardware-efficient training.arXiv preprint arXiv:2312.06635, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Linformer: Self-Attention with Linear Complexity
Sinong Wang, Belinda Z Li, Madian Khabsa, et al. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[28]
Oriane Siméoni, Huy V . V o, Maximilian Seitzer, et al. Dinov3. 2025. URL https://arxiv.org/abs/2508.10104
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
Jianlin Su, Murtadha Ahmed, Yu Lu, et al. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024
2024
-
[30]
Rotary position embedding for vision transformer
Byeongho Heo, Song Park, Dongyoon Han, et al. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024
2024
-
[31]
Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
Biao Zhang, Rico Sennrich. Root mean square layer normalization.Advances in neural information processing systems, 32, 2019
2019
-
[32]
GLU Variants Improve Transformer
Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[33]
Cliploss and norm-based data selection methods for multimodal contrastive learning.Advances in Neural Information Processing Systems, 37:15028–15069, 2024
Yiping Wang, Yifang Chen, Wendan Yan, et al. Cliploss and norm-based data selection methods for multimodal contrastive learning.Advances in Neural Information Processing Systems, 37:15028–15069, 2024
2024
-
[34]
Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171, 2024
Yuxin Fang, Quan Sun, Xinggang Wang, et al. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171, 2024
2024
-
[35]
EVA-CLIP: Improved Training Techniques for CLIP at Scale
Quan Sun, Yuxin Fang, Ledell Wu, et al. Eva-clip: Improved training techniques for clip at scale.arXiv preprint arXiv:2303.15389, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[36]
Sigmoid loss for language image pre-training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, et al. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023
2023
-
[37]
Classification done right for vision-language pre-training
Zilong Huang, Qinghao Ye, Bingyi Kang, et al. Classification done right for vision-language pre-training. Advances in Neural Information Processing Systems, 37:96483–96504, 2024
2024
-
[38]
Mobileclip2: Improving multi-modal reinforced training.arXiv preprint arXiv:2508.20691, 2025
Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, et al. Mobileclip2: Improving multi-modal reinforced training.arXiv preprint arXiv:2508.20691, 2025
-
[39]
Dong Guo, Faming Wu, Feida Zhu, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[40]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, et al. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009
2009
-
[41]
Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400
Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, et al. Do imagenet classifiers generalize to imagenet? InInternational conference on machine learning, pages 5389–5400. PMLR, 2019
2019
-
[42]
Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.Advances in neural information processing systems, 32, 2019
Andrei Barbu, David Mayo, Julian Alverio, et al. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models.Advances in neural information processing systems, 32, 2019
2019
-
[43]
Natural adversarial examples
Dan Hendrycks, Kevin Zhao, Steven Basart, et al. Natural adversarial examples. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15262–15271, 2021
2021
-
[44]
The many faces of robustness: A critical analysis of out-of-distribution generalization
Dan Hendrycks, Steven Basart, Norman Mu, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021
2021
-
[45]
Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019
Haohan Wang, Songwei Ge, Zachary Lipton, et al. Learning robust global representations by penalizing local predictive power.Advances in neural information processing systems, 32, 2019
2019
-
[46]
Microsoft COCO: Common Objects in Context
Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. Microsoft coco: Common objects in context. 2015. URLhttps://arxiv.org/abs/1405.0312
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[47]
Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models
Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, et al. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. 2016. URL https://arxiv.org/abs/1505.04870
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[48]
Indoor segmentation and support inference from rgbd images
Nathan Silberman, Derek Hoiem, Pushmeet Kohli, et al. Indoor segmentation and support inference from rgbd images. InEuropean conference on computer vision, pages 746–760. Springer, 2012
2012
-
[49]
Scene parsing through ade20k dataset
Bolei Zhou, Hang Zhao, Xavier Puig, et al. Scene parsing through ade20k dataset. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017
2017
-
[50]
The 2017 DAVIS Challenge on Video Object Segmentation
Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, et al. The 2017 davis challenge on video object segmentation.arXiv preprint arXiv:1704.00675, 2017. 17 FOUNDATIONMODELTEAMTuringViTTECHNICALREPORT
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[51]
Vision transformers for dense prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021
2021
-
[52]
Unified perceptual parsing for scene understanding
Tete Xiao, Yingcheng Liu, Bolei Zhou, et al. Unified perceptual parsing for scene understanding. In Proceedings of the European conference on computer vision (ECCV), pages 418–434, 2018
2018
-
[53]
The Kinetics Human Action Video Dataset
Will Kay, Joao Carreira, Karen Simonyan, et al. The kinetics human action video dataset.arXiv preprint arXiv:1705.06950, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[54]
Msr-vtt: A large video description dataset for bridging video and language
Jun Xu, Tao Mei, Ting Yao, et al. Msr-vtt: A large video description dataset for bridging video and language. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016
2016
-
[55]
Modeling context in referring expressions
Licheng Yu, Patrick Poirson, Shan Yang, et al. Modeling context in referring expressions. InEuropean Conference on Computer Vision, pages 69–85. Springer, 2016
2016
-
[56]
Generation and comprehension of unambiguous object descriptions
Junhua Mao, Jonathan Huang, Alexander Toshev, et al. Generation and comprehension of unambiguous object descriptions. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11–20, 2016
2016
-
[57]
Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023
Roni Paiss, Ariel Ephrat, Omer Tov, et al. Teaching CLIP to count to ten.arXiv preprint arXiv:2302.12066, 2023
-
[58]
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, et al. BLINK: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[59]
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
Yuliang Liu, Zhang Li, Mingxin Huang, et al. OCRBench: On the hidden mystery of OCR in large multimodal models.arXiv preprint arXiv:2305.07895, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
Ling Fu, Biao Yang, Zhebin Kuang, et al. OCRBench v2: An improved benchmark for evaluating large multimodal models on visual text localization and reasoning.arXiv preprint arXiv:2501.00321, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[61]
ChartQA: A benchmark for question answering about charts with visual and logical reasoning
Ahmed Masry, Do Xuan Long, Jia Qing Tan, et al. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. InFindings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, 2022
2022
-
[62]
Towards VQA models that can read
Amanpreet Singh, Vivek Natarajan, Meet Shah, et al. Towards VQA models that can read. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019
2019
-
[63]
Minesh Mathew, Dimosthenis Karatzas, C. V . Jawahar. DocVQA: A dataset for VQA on document images. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209, 2021
2021
-
[64]
A diagram is worth a dozen images
Aniruddha Kembhavi, Mike Salvato, Eric Kolve, et al. A diagram is worth a dozen images. InEuropean Conference on Computer Vision, pages 235–251. Springer, 2016
2016
-
[65]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI.arXiv preprint arXiv:2311.16502, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
Ke Wang, Junting Pan, Weikang Shi, et al. Measuring multimodal mathematical reasoning with MATH-Vision dataset.arXiv preprint arXiv:2402.14804, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[67]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[68]
Chengke Zou, Xingang Guo, Rui Yang, et al. DynaMath: A dynamic visual benchmark for evaluating mathematical reasoning robustness of vision language models.arXiv preprint arXiv:2411.00836, 2024
-
[69]
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
Yijia Xiao, Edward Sun, Tianyu Liu, et al. LogicVista: Multimodal LLM logical reasoning benchmark in visual contexts.arXiv preprint arXiv:2407.04973, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[70]
MMBench: Is Your Multi-modal Model an All-around Player?
Yuan Liu, Haodong Duan, Yuanhan Zhang, et al. MMBench: Is your multi-modal model an all-around player?arXiv preprint arXiv:2307.06281, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[71]
Grok-1.5 vision preview.https://x.ai/news/grok-1.5v, 2024
xAI. Grok-1.5 vision preview.https://x.ai/news/grok-1.5v, 2024
2024
-
[72]
Are We on the Right Way for Evaluating Large Vision-Language Models?
Lin Chen, Jinsong Li, Xiaoyi Dong, et al. Are we on the right way for evaluating large vision-language models?arXiv preprint arXiv:2403.20330, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[73]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, et al. MM-Vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[74]
Tianrui Guan, Fuxiao Liu, Xiyang Wu, et al. HallusionBench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models.arXiv preprint arXiv:2310.14566, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[75]
Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans?arXiv preprint arXiv:2408.13257, 2024. 18 FOUNDATIONMODELTEAMTuringViTTECHNICALREPORT
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[76]
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension
Bohao Li, Rui Wang, Guangzhi Wang, et al. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension.arXiv preprint arXiv:2307.16125, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[77]
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He, et al. MVBench: A comprehensive multi-modal video understanding benchmark.arXiv preprint arXiv:2311.17005, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[78]
MLVU: Benchmarking Multi-task Long Video Understanding
Junjie Zhou, Yan Shu, Bo Zhao, et al. MLVU: Benchmarking multi-task long video understanding.arXiv preprint arXiv:2406.04264, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[79]
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
Haoning Wu, Dongxu Li, Bei Chen, et al. LongVideoBench: A benchmark for long-context interleaved video-language understanding.arXiv preprint arXiv:2407.15754, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[80]
Qwen, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. 19
work page internal anchor Pith review Pith/arXiv arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.