Deep Pre-Alignment for VLMs

arXiv: 2605.15300 · v1 · pith:4ZAXKMRQ · submitted 2026-05-14 · 💻 cs.CV

Tianyu Yu, Kechen Fang, Zihao Wan, Kaidong Zhang, Yicheng Zhang, Jun Song, Bo Zheng, Yuan Yao

Pith reviewed 2026-05-19 16:22 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision language models · deep pre-alignment · multimodal alignment · perceiver · vision transformer · language model forgetting · VLM architecture

The pith

Deep Pre-Alignment replaces the ViT encoder with a small VLM perceiver to align visual features deeply with the LLM's text space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models often struggle because visual features from standard encoders require the language model to spend its early layers on basic alignment rather than complex reasoning. This paper tests whether inserting a small vision-language model as a perceiver before the main language model can handle that alignment in advance. Experiments show gains on multimodal tasks that grow with model size and less loss of text-only performance. A reader might care if this modular swap proves to be a reliable way to build stronger multimodal systems without retraining everything from scratch.

Core claim

By using a small VLM as the visual perceiver instead of a ViT plus projector, the architecture ensures that visual features enter the large language model already aligned with its text space, freeing the LLM's layers for deeper understanding and reducing forgetting of language capabilities.

What carries the argument

The small VLM perceiver that maps visual inputs into features deeply aligned with the target LLM's text space.

If this is right

Outperforms standard architectures by 1.9 points on 8 multimodal benchmarks at 4B scale.
Gains increase to 3.0 points at 32B scale.
Reduces language capability forgetting by 32.9% across 3 text benchmarks.
Delivers consistent improvements across Qwen3 and LLaMA 3.2 model families.
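
The 32.9% figure is a relative reduction. The sketch below shows one plausible way such a number could be computed, assuming forgetting is measured as the average score drop from the original LLM to the tuned VLM on text-only benchmarks. The paper's exact definition is not stated here, the function names are hypothetical, and the scores are made-up placeholders rather than results from the paper.

```python
def average_drop(base_scores, tuned_scores):
    """Mean score lost on text-only benchmarks after multimodal tuning."""
    drops = [b - t for b, t in zip(base_scores, tuned_scores)]
    return sum(drops) / len(drops)


def relative_forgetting_reduction(base, baseline_vlm, dpa_vlm):
    """Fraction of the baseline's forgetting that the second model avoids."""
    baseline_drop = average_drop(base, baseline_vlm)
    dpa_drop = average_drop(base, dpa_vlm)
    return (baseline_drop - dpa_drop) / baseline_drop


# Hypothetical scores on three text benchmarks (illustrative only).
original_llm = [70.0, 60.0, 50.0]
baseline_vlm = [66.0, 57.0, 47.0]  # drops of 4, 3, 3 points
dpa_vlm = [67.0, 58.0, 48.0]       # drops of 3, 2, 2 points

reduction = relative_forgetting_reduction(original_llm, baseline_vlm, dpa_vlm)
print(f"relative forgetting reduction: {reduction:.1%}")
```

With these placeholder numbers the baseline loses 10/3 points on average and the second model loses 7/3, so the relative reduction is 30%. A relative figure like this says nothing about the absolute size of the drop, which is why the referee report asks for the definition and the benchmarks.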

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could simplify scaling VLMs by allowing pre-trained small models to handle initial alignment for larger ones.
Future architectures might explore varying the size or training of the perceiver independently of the main LLM.
Similar pre-alignment ideas could apply to other multimodal combinations like audio or video with language.

Load-bearing premise

That the outputs from the small VLM perceiver are aligned enough with the LLM text space to prevent the LLM from using its early layers for superficial modality matching.

What would settle it

Measuring attention patterns in the LLM's first layers with and without the perceiver to check if modality alignment still occurs early on, or observing no performance gain on benchmarks.
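
The attention check named above could be sketched roughly as follows: for one layer, measure how much attention mass text-token queries place on visual-token positions, then compare that share between a baseline and a DPA model. This is an illustrative diagnostic, not one taken from the paper; the function name and the toy attention matrix are assumptions, and rows are assumed to be already softmax-normalized.

```python
def visual_attention_share(attn, visual_positions, query_positions):
    """Average fraction of attention that the given queries put on visual tokens.

    attn[i][j] is the weight query token i assigns to key token j;
    each row is assumed to sum to 1 (post-softmax).
    """
    visual = set(visual_positions)
    shares = []
    for i in query_positions:
        shares.append(sum(w for j, w in enumerate(attn[i]) if j in visual))
    return sum(shares) / len(shares)


# Toy 4-token sequence: positions 0-1 are visual tokens, 2-3 are text tokens.
# The matrix is causal (no attention to later positions).
layer_attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.3, 0.3, 0.4, 0.0],
    [0.1, 0.1, 0.4, 0.4],
]

share = visual_attention_share(layer_attn, visual_positions=[0, 1], query_positions=[2, 3])
print(f"text-to-visual attention share: {share:.2f}")
```

How a given share should be read as evidence for or against early-layer alignment work is an interpretive question that this sketch does not settle.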

Figures

Figures reproduced from arXiv: 2605.15300 by Bo Zheng, Jun Song, Kaidong Zhang, Kechen Fang, Tianyu Yu, Yicheng Zhang, Yuan Yao, Zihao Wan.

**Figure 1.** (a) Architectural overview of DPA. By simply replacing the ViT encoder with a perceiver VLM, DPA offloads the superficial modality alignment burden from the target large language model and deeply aligns visual features inside the perceiver language blocks. The input visual features are thus better aligned with the text space. (b) DPA significantly minimizes the modality gap (Huang et al., 2025) and also improves…

**Figure 2.** Correlation between perceiver standalone performance and corresponding DPA model performance. ρ denotes the Pearson correlation coefficient in each task group. 0, 10K, 250K, 1M: number of instruction samples used to train the corresponding perceivers; untrained: the perceiver used in DPA consists of randomly initialized projections.

**Figure 3.** Modality gap comparison of different perceiver layers. We compute the layer-wise MIR between the per-layer output of the perceiver and the text space of the Qwen3 0.6B model. All models exhibit fast convergence of the modality gap in deep layers and finally reach a similar level.

… do not use Qwen3 4B or Qwen3 32B as the target large language model since MIR requires dimensions of both spaces to be the same. …

**Figure 4.** Cross-layer intra-modal similarity matrices of text spaces and visual spaces. “T” and “V” denote text and visual spaces, respectively. (Lighter colors indicate higher similarities.) The DPA visual space (d) exhibits “block-diagonal” subspaces that resemble the subspaces found in text spaces (a-c), whereas the baseline visual space (e) remains fuzzy. (Extracted plot labels: x-axis Model Block Index, y-axis Modality Gap; series DPA-Qw…)

**Figure 5.** Per-layer modality gap comparison of different models. DPA consistently minimizes the modality gap on most layers, and the reduction in the 32B setting is more significant. (Extracted plot labels: x-axis Epoch, y-axis Modality Gap; series DPA-Qwen3-4B, LLaVA-NeXT-Qwen3-4B.)

**Figure 6.** Modality gap dynamics during the instruction tuning stage. DPA consistently achieves a smaller modality gap during the whole training process.

… LLaVA (Liu et al., 2023; 2024a), use a fixed-resolution CLIP (Radford et al., 2021) directly as the visual encoder. The output visual features are then injected into the large language model through different connectors. Subsequent works (Liu et al., 2024b; Guo et al., 20…
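
Figure 2 reports a Pearson correlation coefficient ρ between perceiver standalone performance and DPA model performance. As a reminder of what that statistic measures, here is a minimal pure-Python sketch. The score lists are hypothetical placeholders standing in for perceivers trained on increasing amounts of instruction data; they are not values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical standalone perceiver scores and matching DPA model scores.
perceiver_scores = [10.0, 20.0, 30.0, 40.0]
dpa_scores = [52.0, 55.0, 57.0, 62.0]

rho = pearson(perceiver_scores, dpa_scores)
print(f"rho = {rho:.3f}")
```

A value of ρ near 1 means the two scores rise together almost linearly. With only a handful of perceiver checkpoints per task group, such a coefficient is sensitive to any single point, so it is best read alongside the scatter plot itself.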

Original abstract

Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth (Zhang et al., 2024; Artzy and Schwartz, 2024) on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

DPA swaps the ViT for a small VLM perceiver to pre-align visual features, showing 1.9 to 3.0 point benchmark gains and 32.9% less language forgetting, but the mechanism claim rests on indirect evidence.

The paper's core move is to replace the standard ViT-plus-projector setup with a small VLM that acts as a perceiver, so the features reaching the main LLM are already closer to text space. They report this helps at both 4B and 32B scales, with gains on eight multimodal benchmarks and a clear drop in forgetting on text-only tasks. The approach is presented as a modular drop-in that works across the Qwen3 and LLaMA 3.2 families with only modest extra cost.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Deep Pre-Alignment (DPA) for vision-language models. It replaces the standard ViT encoder plus projector with a small VLM acting as perceiver, motivated by prior work showing that visual features remain distant from text space in early LLM layers. Experiments report that DPA yields 1.9-point average gains across 8 multimodal benchmarks at the 4B scale (widening to 3.0 points at 32B) and a 32.9% reduction in language forgetting on 3 text benchmarks, with results consistent across Qwen3 and LLaMA 3.2 families. The approach is presented as a modular, low-overhead upgrade.

Significance. If the performance numbers hold under fuller controls, DPA offers a practical architectural alternative that could reduce alignment overhead and forgetting during VLM scaling. The reported consistency across model scales and LLM families provides a useful empirical signal for the community, though the specific mechanistic attribution to 'deep pre-alignment' would need direct verification to strengthen the contribution beyond benchmark deltas.

major comments (1)

Motivation and abstract: The central interpretive claim that the small VLM perceiver produces 'deep pre-alignment' (freeing the LLM's initial layers from superficial modality matching) is load-bearing for the paper's narrative but is not directly tested. No layer-wise cosine similarity, feature-distance metrics, or attention-to-modality diagnostics are reported, nor is there an ablation that holds perceiver capacity fixed while varying only alignment depth. The 1.9–3.0 point gains and 32.9% forgetting reduction are consistent with the story but do not distinguish it from alternative explanations such as richer or differently distributed features.

minor comments (2)

Experimental details: The abstract and results sections should specify the exact baselines (including whether they use the same training data and schedule), perceiver size relative to the target LLM, and any statistical tests or variance estimates for the reported point gains.
Presentation: Clarify the precise definition of 'language capability forgetting' (which three text benchmarks and how measured) and provide a short table comparing perceiver compute overhead to the standard ViT+projector baseline.
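
The variance estimates requested in the first minor comment could take several forms. One simple option, sketched below under stated assumptions, is a percentile bootstrap over per-benchmark score differences. The scores are hypothetical placeholders for eight benchmarks, the function names are illustrative, and treating benchmarks as the resampling unit is itself an assumption; variance across training seeds would need repeated runs instead.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_ci(diffs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(diffs) for _ in diffs]  # resample with replacement
        means.append(mean(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-benchmark scores (illustrative only, not from the paper).
baseline_scores = [60.0, 55.0, 70.0, 48.0, 66.0, 72.0, 58.0, 63.0]
dpa_scores = [62.0, 56.0, 73.0, 50.0, 67.0, 75.0, 60.0, 65.0]

diffs = [d - b for d, b in zip(dpa_scores, baseline_scores)]
gain = mean(diffs)
ci_low, ci_high = bootstrap_ci(diffs)
print(f"mean gain {gain:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

If the interval excludes zero, the average gain is unlikely to be an artifact of which benchmarks happened to be chosen, though eight data points give only a coarse estimate.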

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript proposing Deep Pre-Alignment for VLMs. We address the major comment below and outline planned revisions to strengthen the mechanistic interpretation.

Referee: Motivation and abstract: The central interpretive claim that the small VLM perceiver produces 'deep pre-alignment' (freeing the LLM's initial layers from superficial modality matching) is load-bearing for the paper's narrative but is not directly tested. No layer-wise cosine similarity, feature-distance metrics, or attention-to-modality diagnostics are reported, nor is there an ablation that holds perceiver capacity fixed while varying only alignment depth. The 1.9–3.0 point gains and 32.9% forgetting reduction are consistent with the story but do not distinguish it from alternative explanations such as richer or differently distributed features.

Authors: We agree that direct mechanistic verification would strengthen the central claim. The manuscript motivates the approach from cited prior work on early-layer modality misalignment and presents the performance gains plus forgetting reduction as empirical outcomes of offloading alignment to the perceiver. These results are consistent with reduced superficial processing in the LLM but do not isolate depth from feature quality. In the revision we will add layer-wise cosine similarity and feature-distance metrics between visual embeddings and text-space representations across LLM layers for both the baseline and DPA models. We will also include a control discussion (and experiment where compute permits) that compares the small VLM perceiver against a capacity-matched ViT projector to better separate alignment depth from richer feature distributions. revision: yes
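
A layer-wise cosine similarity diagnostic of the kind promised in this response might look like the sketch below. It compares the centroid of visual-token hidden states with the centroid of text-token hidden states at each layer. This centroid measure is an assumption chosen for simplicity; it is not the MIR metric mentioned in the figure captions, and the toy two-dimensional hidden states are placeholders.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_alignment(visual_by_layer, text_by_layer):
    """Cosine similarity between visual and text centroids, one value per layer."""
    return [
        cosine(centroid(v), centroid(t))
        for v, t in zip(visual_by_layer, text_by_layer)
    ]


# Toy hidden states: 2 layers, 2 tokens per modality, 2 dimensions.
visual_states = [
    [[1.0, 0.0], [1.0, 0.2]],  # layer 0: visual tokens point mostly along axis 0
    [[0.6, 0.8], [0.8, 0.6]],  # layer 1: visual tokens have moved toward the diagonal
]
text_states = [
    [[0.0, 1.0], [0.2, 1.0]],  # layer 0: text tokens point mostly along axis 1
    [[0.7, 0.7], [0.5, 0.5]],  # layer 1: text tokens lie on the diagonal
]

scores = layerwise_alignment(visual_states, text_states)
print([round(s, 3) for s in scores])
```

In this toy case similarity rises from layer 0 to layer 1. Running the same measurement on a baseline and a DPA model, with real hidden states, would show whether the visual centroid starts closer to the text centroid when the perceiver is used.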

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with external benchmark validation

The paper motivates its Deep Pre-Alignment architecture by citing external prior analyses on VLM alignment challenges, then replaces the ViT+projector with a small VLM perceiver and reports direct performance gains on 8 multimodal benchmarks plus forgetting reduction on 3 text benchmarks. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. No self-citations appear in the provided text as load-bearing for the central claim. The evaluation relies on standard external benchmarks that are independent of any internal definitions or fits, rendering the work self-contained against measurable outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that current ViT-to-LLM pipelines suffer from superficial alignment in early layers, plus the introduction of a new perceiver entity whose alignment properties are validated only internally via experiments.

axioms (1)

Domain assumption: Visual features from standard ViT encoders remain distant from text space in the initial layers of the LLM, forcing superficial alignment work.
Invoked in the opening motivation and supported only by citations to zhang-etal-2024-investigating and artzy-schwartz-2024-attend.

invented entities (1)

Small VLM as perceiver (no independent evidence)
Purpose: To produce deeply aligned visual features for the target LLM.
New modular component introduced to replace the ViT encoder; no independent falsifiable prediction outside the reported experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

172 extracted references · 172 canonical work pages · 29 internal anchors

[1] Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning, 2023.
[2] Flamingo: a visual language model for few-shot learning. NeurIPS.
[3] Changpinyo, Soravit; Sharma, Piyush; Ding, Nan; Soricut, Radu.
[4] Byeon, Minwoo; Park, Beomhee; Kim, Haecheon; Lee, Sungjun; Baek, Woonhyuk; Kim, Saehoon.
[5] Schuhmann, Christoph; Beaumont, Romain; Vencu, Richard; Gordon, Cade; Wightman, Ross; Cherti, Mehdi; Coombes, Theo; Katta, Aarush; Mullis, Clayton; Wortsman, Mitchell; and others.
[6] Visual instruction tuning. Advances in Neural Information Processing Systems.
[7] OpenAI. Hello…
[8] Introducing the next generation of…
[9] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. 2023.
[10] Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, Zhiyuan Liu, and Hai. Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants. 2023. doi:10.48550/ARXIV.2310.00653.
[11] Liu, Haotian; Li, Chunyuan; Li, Yuheng; Lee, Yong Jae. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
[12] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. Proceedings of NeurIPS.
[13] LoRA vs Full Fine-tuning: An Illusion of Equivalence. 2025.
[14] FirstName Alpher and FirstName Gamow.
[15] Tsung. Microsoft… Proceedings of ECCV, 2014. doi:10.1007/978-3-319-10602-1_48.
[16] Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object Hallucination in Image Captioning. 2018. doi:10.18653/V1/D18-1437.
[17] Yash Goyal, Tejas Khot, and Douglas Summers. Making the… Proceedings of CVPR, 2017. doi:10.1109/CVPR.2017.670.
[18] Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394.
[19] Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback. Proceedings of the AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/aaai.v39i24.34744.
[20] Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045.
[21] Wang, Bin; Wu, Fan; Han, Xiao; Peng, Jiahui; Zhong, Huaping; Zhang, Pan; Dong, Xiaoyi; Li, Weijia; Li, Wei; Wang, Jiaqi; and others.
[22] Wen, Licheng; Yang, Xuemeng; Fu, Daocheng; Wang, Xiaofeng; Cai, Pinlong; Li, Xin; Ma, Tao; Li, Yingxuan; Xu, Linran; Shang, Dengke; and others. On the Road with…
[23] Fu, Chaoyou; Chen, Peixian; Shen, Yunhang; Qin, Yulei; Zhang, Mengdan; Lin, Xu; Qiu, Zhenyu; Lin, Wei; Yang, Jinrui; Zheng, Xiawu; and others.
[24] Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, and Liang. Aligning Large Multimodal Models with Factually Augmented RLHF. CoRR, 2023. doi:10.48550/ARXIV.2309.14525.
[25] Touvron, Hugo; Martin, Louis; Stone, Kevin; Albert, Peter; Almahairi, Amjad; Babaei, Yasmine; Bashlykov, Nikolay; Batra, Soumya; Bhargava, Prajjwal; Bhosale, Shruti; and others.
[26] BEiT: BERT Pre-Training of Image Transformers. 2022.
[27] Wang, Wenhui; Bao, Hangbo; Dong, Li; Bjorck, Johan; Peng, Zhiliang; Liu, Qiang; Aggarwal, Kriti; Mohammed, Owais Khan; Singhal, Saksham; Som, Subhojit; and others. Image as a Foreign Language:…
[28] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, and Marie. LLaMA: Open and Efficient Foundation Language Models. 2023. doi:10.48550/ARXIV.2302.13971.
[29] Chiang, Wei-Lin; Li, Zhuohan; Lin, Zi; Sheng, Ying; Wu, Zhanghao; Zhang, Hao; Zheng, Lianmin; Zhuang, Siyuan; Zhuang, Yonghao; Gonzalez, Joseph E.; Stoica, Ion; Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing…
[30] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
[31] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.
[32] Learning transferable visual models from natural language supervision. ICML, 2021.
[33] Training language models to follow instructions with human feedback. NeurIPS.
[34] Lu, Pan; Bansal, Hritik; Xia, Tony; Liu, Jiacheng; Li, Chunyuan; Hajishirzi, Hannaneh; Cheng, Hao; Chang, Kai-Wei; Galley, Michel; Gao, Jianfeng.
[35] Yang, Zhengyuan; Li, Linjie; Lin, Kevin; Wang, Jianfeng; Lin, Chung-Ching; Liu, Zicheng; Wang, Lijuan. The Dawn of…
[36] Liu, Fuxiao; Guan, Tianrui; Li, Zongxia; Chen, Lichang; Yacoob, Yaser; Manocha, Dinesh; Zhou, Tianyi.
[37] Li, Lei; Yin, Yuwei; Li, Shicheng; Chen, Liang; Wang, Peiyi; Ren, Shuhuai; Li, Mukai; Yang, Yazheng; Xu, Jingjing; Sun, Xu; and others.
[38] Awadalla, Anas; Gao, Irena; Gardner, Josh; Hessel, Jack; Hanafy, Yusuf; Zhu, Wanrong; Marathe, Kalyani; Bitton, Yonatan; Gadre, Samir; Sagawa, Shiori; and others.
[39] Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726.
[40] Language Is Not All You Need: Aligning Perception with Language Models. arXiv preprint arXiv:2302.14045.
[41] Chen, Xi; Wang, Xiao; Changpinyo, Soravit; Piergiovanni, AJ; Padlewski, Piotr; Salz, Daniel; Goodman, Sebastian; Grycner, Adam; Mustafa, Basil; Beyer, Lucas; and others.
[42] Ye, Qinghao; Xu, Haiyang; Xu, Guohai; Ye, Jiabo; Yan, Ming; Zhou, Yiyang; Wang, Junyang; Hu, Anwen; Shi, Pengcheng; Shi, Yaya; and others.
[43] Zhang, Renrui; Han, Jiaming; Zhou, Aojun; Hu, Xiangfei; Yan, Shilin; Lu, Pan; Li, Hongsheng; Gao, Peng; Qiao, Yu.
[44] Zhu, Deyao; Chen, Jun; Shen, Xiaoqian; Li, Xiang; Elhoseiny, Mohamed.
[45] Bavishi, Rohan; Elsen, Erich; Hawthorne, Curtis; Nye, Maxwell; Odena, Augustus; Somani, Arushi; Ta. Introducing our Multimodal Models.
[46] Driess, Danny; Xia, Fei; Sajjadi, Mehdi SM; Lynch, Corey; Chowdhery, Aakanksha; Ichter, Brian; Wahid, Ayzaan; Tompson, Jonathan; Vuong, Quan; Yu, Tianhe; and others.
[47] Wang, Weihan; Lv, Qingsong; Yu, Wenmeng; Hong, Wenyi; Qi, Ji; Wang, Yan; Ji, Junhui; Yang, Zhuoyi; Zhao, Lei; Song, Xixuan; and others.
[48] OpenAI. GPT-4 Technical Report. CoRR, 2023. doi:10.48550/ARXIV.2303.08774.
[49] Learning to summarize with human feedback. NeurIPS.
[50] Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv preprint arXiv:2305.14233.
[51] UltraFeedback: Boosting Language Models with Scaled AI Feedback. arXiv preprint arXiv:2310.01377.
[52] Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. arXiv preprint arXiv:2306.01693.
[53] Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.
[54] Alignment of language agents. arXiv preprint arXiv:2103.14659.
[55] Taori, Rohan; Gulrajani, Ishaan; Zhang, Tianyi; Dubois, Yann; Li, Xuechen; Guestrin, Carlos; Liang, Percy; Hashimoto, Tatsunori B. Stanford…
[56] Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.
[57] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023.
[58] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. CoRR, 2017. arXiv:1707.06347.
[59] Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; Askell, Amanda; Kernion, Jackson; Jones, Andy; Chen, Anna; Goldie, Anna; Mirhoseini, Azalia; McKinnon, Cameron; and others.
[60] John Schulman - Reinforcement Learning from Human Feedback: Progress and Challenges.
[61] A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. Proceedings of ECCV.
[62] Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of ICCV.
[63] Ilharco, Gabriel; Wortsman, Mitchell; Wightman, Ross; Gordon, Cade; Carlini, Nicholas; Taori, Rohan; Dave, Achal; Shankar, Vaishaal; Namkoong, Hongseok; Miller, John; Hajishirzi, Hannaneh; Farhadi, Ali; Schmidt, Ludwig. doi:10.5281/zenodo.5143773.
[64] The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 2020.
[65] Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, and Hai. Proceedings of CVPR.
[66] OpenBMB. Large multi-modal models for strong performance and efficient deployment.
[67] SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125.
[68] Yinghui Li, Zishan Xu, Shaoshen Chen, Haojing Huang, Yangning Li, Yong Jiang, Zhongli Li, Qingyu Zhou, and Hai. Towards Real-World Writing Assistance:… CoRR, 2023. doi:10.48550/ARXIV.2311.11268.
[69] SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding. 2023.
[70] LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.
[71]

Proceedings of ICLR , year=

Analyzing and mitigating object hallucination in large vision-language models , author=. Proceedings of ICLR , year=

work page
[72]

Processing of CVPR , year=

Qidong Huang and Xiaoyi Dong and Pan Zhang and Bin Wang and Conghui He and Jiaqi Wang and Dahua Lin and Weiming Zhang and Nenghai Yu , title =. Processing of CVPR , year=

work page
[73]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee and Samrat Phatale and Hassan Mansoor and Kellie Lu and Thomas Mesnard and Colton Bishop and Victor Carbune and Abhinav Rastogi , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.00267 , eprinttype =. 2309.00267 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.00267 2023
[74]

CoRR , volume =

Lei Li and Zhihui Xie and Mukai Li and Shunian Chen and Peiyi Wang and Liang Chen and Yazheng Yang and Benyou Wang and Lingpeng Kong , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2312.10665 , eprinttype =. 2312.10665 , timestamp =

work page doi:10.48550/arxiv.2312.10665 2023
[75]

Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, and Lichao Sun. CoRR, 2024. doi:10.48550/ARXIV.2402.04788.
[76]

Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, and Jason Weston. The. Proceedings of ACL, 2023. doi:10.18653/V1/2023.ACL-LONG.493.
[77]

Leo Gao, John Schulman, and Jacob Hilton. Scaling Laws for Reward Model Overoptimization. 2023.
[78]

Aligning Modalities in Vision Large Language Models via Preference Fine-tuning. arXiv preprint arXiv:2402.11411.
[79]

Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, and Jitao Sang. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. CoRR, 2023. doi:10.48550/ARXIV.2311.07397.
[80]

Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. CoRR, 2024. doi:10.48550/ARXIV.2403.18814.

