
arXiv: 2605.15300 · v1 · pith:4ZAXKMRQ · submitted 2026-05-14 · 💻 cs.CV

Deep Pre-Alignment for VLMs

Pith reviewed 2026-05-19 16:22 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision language models · deep pre-alignment · multimodal alignment · perceiver · vision transformer · language model forgetting · VLM architecture

The pith

Deep Pre-Alignment replaces the ViT encoder with a small VLM perceiver to align visual features deeply with the LLM's text space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models often struggle because visual features from standard encoders require the language model to spend its early layers on basic alignment rather than complex reasoning. This paper tests whether inserting a small vision-language model as a perceiver before the main language model can handle that alignment in advance. Experiments show gains on multimodal tasks that grow with model size and less loss of text-only performance. A reader might care if this modular swap proves to be a reliable way to build stronger multimodal systems without retraining everything from scratch.

Core claim

By using a small VLM as the visual perceiver instead of a ViT plus projector, the architecture ensures that visual features enter the large language model already aligned with its text space, freeing the LLM's layers for deeper understanding and reducing forgetting of language capabilities.

What carries the argument

The small VLM perceiver that maps visual inputs into features deeply aligned with the target LLM's text space.
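
As a rough illustration of the modular swap described above, the sketch below plugs two toy encoders into the same projector interface. Everything in it is hypothetical: `make_visual_path`, `toy_vit`, `toy_perceiver_vlm`, and the projector weights are invented stand-ins, and the assumption that a projector still sits between the perceiver and the target LLM is ours, not a detail confirmed by the abstract.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[Vector]], List[Vector]]


def project(features: Sequence[Vector], weight: Sequence[Vector]) -> List[Vector]:
    """Bias-free linear map; each row of `weight` produces one output unit."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in features]


def make_visual_path(encoder: Encoder, projector_weight: Sequence[Vector]) -> Encoder:
    """Compose an encoder with a projector (assumed to remain in both designs)."""
    def path(patches: Sequence[Vector]) -> List[Vector]:
        return project(encoder(patches), projector_weight)
    return path


def toy_vit(patches: Sequence[Vector]) -> List[Vector]:
    # Hypothetical stand-in for a ViT encoder: passes patches through unchanged.
    return [list(p) for p in patches]


def toy_perceiver_vlm(patches: Sequence[Vector]) -> List[Vector]:
    # Hypothetical stand-in for a small VLM: vision encoding followed by
    # "language blocks", modeled here as a fixed rescaling of the features.
    hidden = toy_vit(patches)
    return [[2.0 * x for x in vec] for vec in hidden]


# Toy projector mapping 2-d features into a 3-d "target LLM" input space.
projector = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patches = [[1.0, 2.0], [3.0, 4.0]]

baseline_tokens = make_visual_path(toy_vit, projector)(patches)
dpa_tokens = make_visual_path(toy_perceiver_vlm, projector)(patches)
print(baseline_tokens)
print(dpa_tokens)
```

The only point of the sketch is the interface: both paths take patches and return token vectors in the target LLM's input dimension, so the encoder can be exchanged without touching the rest of the pipeline.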

If this is right

  • Outperforms standard architectures by 1.9 points on 8 multimodal benchmarks at 4B scale.
  • Gains increase to 3.0 points at 32B scale.
  • Reduces language capability forgetting by 32.9% across 3 text benchmarks.
  • Delivers consistent improvements across Qwen3 and LLaMA 3.2 model families.
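
The 32.9% figure is a relative quantity, so it helps to see one way such a number could be computed. The sketch below assumes forgetting is the drop in a text benchmark score between the original LLM and the trained VLM; that definition, the function names, and all scores are hypothetical and not taken from the paper.

```python
def forgetting(llm_text_score: float, vlm_text_score: float) -> float:
    """Points lost on a text benchmark after multimodal training (assumed definition)."""
    return llm_text_score - vlm_text_score


def relative_reduction(baseline_forgetting: float, dpa_forgetting: float) -> float:
    """Percentage by which forgetting shrinks relative to the baseline."""
    if baseline_forgetting <= 0:
        raise ValueError("baseline must show positive forgetting")
    return 100.0 * (baseline_forgetting - dpa_forgetting) / baseline_forgetting


# Hypothetical scores, chosen only to make the arithmetic easy to follow.
base_llm = 70.0
baseline_vlm = 65.0  # forgets 5.0 points
dpa_vlm = 66.0       # forgets 4.0 points

reduction = relative_reduction(
    forgetting(base_llm, baseline_vlm),
    forgetting(base_llm, dpa_vlm),
)
print(f"{reduction:.1f}% less forgetting")
```

Under this reading, a small absolute difference in text scores can translate into a large relative reduction, which is one reason the exact definition matters.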

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This method could simplify scaling VLMs by allowing pre-trained small models to handle initial alignment for larger ones.
  • Future architectures might explore varying the size or training of the perceiver independently of the main LLM.
  • Similar pre-alignment ideas could apply to other multimodal combinations like audio or video with language.

Load-bearing premise

That the outputs from the small VLM perceiver are aligned enough with the LLM text space to prevent the LLM from using its early layers for superficial modality matching.

What would settle it

Measuring attention patterns in the LLM's first layers with and without the perceiver to check if modality alignment still occurs early on, or observing no performance gain on benchmarks.
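
One simple diagnostic in this spirit is a per-layer distance between visual and text representations. The sketch below uses one minus the cosine similarity of the two modality centroids at each layer. This is a deliberately crude proxy chosen for illustration, not the MIR metric the paper's figures refer to; the function names and toy activations are hypothetical, and both modalities are assumed to share one hidden dimension.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Mean vector of a set of token representations."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def layerwise_centroid_gap(
    visual_by_layer: Sequence[Sequence[Vector]],
    text_by_layer: Sequence[Sequence[Vector]],
) -> List[float]:
    """1 - cosine similarity between visual and text centroids, per layer."""
    return [
        1.0 - cosine_similarity(centroid(v), centroid(t))
        for v, t in zip(visual_by_layer, text_by_layer)
    ]


# Toy activations: two layers, two tokens per modality, hidden size 2.
visual_by_layer = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
text_by_layer = [[[0.0, 1.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]]]

gaps = layerwise_centroid_gap(visual_by_layer, text_by_layer)
print(gaps)  # large gap at layer 0, near-zero gap at layer 1
```

If pre-alignment works as claimed, a curve like this computed on the target LLM would start low for DPA, while the baseline would only reach a comparable level after several layers.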

Figures

Figures reproduced from arXiv: 2605.15300 by Bo Zheng, Jun Song, Kaidong Zhang, Kechen Fang, Tianyu Yu, Yicheng Zhang, Yuan Yao, Zihao Wan.

Figure 1: (a) Architectural overview of DPA. By simply replacing the ViT encoder with a perceiver VLM, DPA offloads the superficial modality alignment burden from the target large language model and deeply aligns visual features inside the perceiver language blocks. The input visual features are thus better aligned with the text space. (b) DPA significantly minimizes the modality gap (Huang et al., 2025) and also improves…
Figure 2: Correlation between perceiver standalone performance and corresponding DPA model performance. ρ denotes the Pearson correlation coefficient in each task group. 0, 10K, 250K, 1M: number of instruction samples used to train the corresponding perceivers; untrained: the perceiver used in DPA consists of randomly initialized projections.
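
For readers who want to reproduce the kind of statistic reported in Figure 2, the Pearson coefficient can be computed as below. This is a minimal sketch; the function name and the score pairs are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (perceiver score, DPA model score) pairs for one task group.
perceiver_scores = [40.0, 48.0, 55.0, 61.0]
dpa_scores = [57.0, 60.0, 64.0, 66.0]
print(f"rho = {pearson(perceiver_scores, dpa_scores):.3f}")
```

With only a handful of perceiver checkpoints per task group, a coefficient computed this way is sensitive to individual points, so it is best read alongside the scatter itself.
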
Figure 3: Modality gap comparison of different perceiver layers. We compute the layer-wise MIR between the per-layer output of the perceiver and the text space of the Qwen3 0.6B model. All models exhibit fast convergence of the modality gap in deep layers and finally reach a similar level. … do not use Qwen3 4B or Qwen3 32B as the target large language model since MIR requires dimensions of both spaces to be the same. We …
Figure 4: Cross-layer intra-modal similarity matrices of text spaces and visual spaces. “T” and “V” denote text and visual spaces, respectively. (Lighter colors indicate higher similarities.) The DPA visual space (d) exhibits “block-diagonal” subspaces that resemble the subspaces found in text spaces (a-c), whereas the baseline visual space (e) remains fuzzy.
Figure 5: Per-layer modality gap comparison of different models. DPA consistently minimizes the modality gap on most layers, and the reduction in the 32B setting is more significant.
Figure 6: Modality gap dynamics during the instruction tuning stage. DPA consistently achieves a smaller modality gap during the whole training process. … LLaVA (Liu et al., 2023; 2024a), use a fixed-resolution CLIP (Radford et al., 2021) directly as the visual encoder. The output visual features are then injected into the large language model through different connectors. Subsequent works (Liu et al., 2024b; Guo et al., 20…
Original abstract

Most Vision Language Models (VLMs) directly map outputs from ViT encoders to the LLM via a lightweight projector. While effective, recent analysis suggests this architecture suffers from an alignment challenge: visual features remain distant from the text space in the initial layers of the LLM, forcing the model to waste critical depth [zhang-etal-2024-investigating; artzy-schwartz-2024-attend] on superficial modality alignment rather than deep understanding and complex reasoning. In this work, we propose Deep Pre-Alignment (DPA), a novel architecture that replaces the standard ViT encoder with a small VLM as perceiver, ensuring visual features are deeply aligned with the text space of the target large language model. Comprehensive experiments demonstrate the effectiveness of DPA. On the 4B parameter scale, DPA outperforms baselines by 1.9 points across 8 multimodal benchmarks, with gains widening to 3.0 points at the 32B scale. Moreover, by offloading alignment to the perceiver, DPA achieves a 32.9% reduction in language capability forgetting over 3 text benchmarks. We further demonstrate that these gains are consistent across different LLM families including Qwen3 and LLaMA 3.2, highlighting the generality of our approach. Beyond performance, DPA also offers a seamless upgrade path for current VLM development, requiring only a modular replacement for the visual encoder with marginal computation overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Deep Pre-Alignment (DPA) for vision-language models. It replaces the standard ViT encoder plus projector with a small VLM acting as perceiver, motivated by prior work showing that visual features remain distant from text space in early LLM layers. Experiments report that DPA yields 1.9-point average gains across 8 multimodal benchmarks at the 4B scale (widening to 3.0 points at 32B) and a 32.9% reduction in language forgetting on 3 text benchmarks, with results consistent across Qwen3 and LLaMA 3.2 families. The approach is presented as a modular, low-overhead upgrade.

Significance. If the performance numbers hold under fuller controls, DPA offers a practical architectural alternative that could reduce alignment overhead and forgetting during VLM scaling. The reported consistency across model scales and LLM families provides a useful empirical signal for the community, though the specific mechanistic attribution to 'deep pre-alignment' would need direct verification to strengthen the contribution beyond benchmark deltas.

major comments (1)
  1. Motivation and abstract: The central interpretive claim that the small VLM perceiver produces 'deep pre-alignment' (freeing the LLM's initial layers from superficial modality matching) is load-bearing for the paper's narrative but is not directly tested. No layer-wise cosine similarity, feature-distance metrics, or attention-to-modality diagnostics are reported, nor is there an ablation that holds perceiver capacity fixed while varying only alignment depth. The 1.9–3.0 point gains and 32.9% forgetting reduction are consistent with the story but do not distinguish it from alternative explanations such as richer or differently distributed features.
minor comments (2)
  1. Experimental details: The abstract and results sections should specify the exact baselines (including whether they use the same training data and schedule), perceiver size relative to the target LLM, and any statistical tests or variance estimates for the reported point gains.
  2. Presentation: Clarify the precise definition of 'language capability forgetting' (which three text benchmarks and how measured) and provide a short table comparing perceiver compute overhead to the standard ViT+projector baseline.
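
A variance estimate of the kind requested in minor comment 1 could be as simple as a percentile bootstrap over per-benchmark gains. The sketch below is one possible approach, not something the paper reports; the function name and the per-benchmark deltas are hypothetical.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    deltas: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `deltas`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(deltas)
    means = sorted(
        sum(rng.choice(deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-benchmark gains (DPA score minus baseline score).
deltas = [0.5, 1.0, 1.5, 2.0]
low, high = bootstrap_mean_ci(deltas)
print(f"mean gain {sum(deltas) / len(deltas):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only 8 benchmarks the interval will be wide, and resampling benchmarks does not capture seed-to-seed training variance, so repeated training runs would still be the stronger control.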

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript proposing Deep Pre-Alignment for VLMs. We address the major comment below and outline planned revisions to strengthen the mechanistic interpretation.

Point-by-point responses
  1. Referee: Motivation and abstract: The central interpretive claim that the small VLM perceiver produces 'deep pre-alignment' (freeing the LLM's initial layers from superficial modality matching) is load-bearing for the paper's narrative but is not directly tested. No layer-wise cosine similarity, feature-distance metrics, or attention-to-modality diagnostics are reported, nor is there an ablation that holds perceiver capacity fixed while varying only alignment depth. The 1.9–3.0 point gains and 32.9% forgetting reduction are consistent with the story but do not distinguish it from alternative explanations such as richer or differently distributed features.

    Authors: We agree that direct mechanistic verification would strengthen the central claim. The manuscript motivates the approach from cited prior work on early-layer modality misalignment and presents the performance gains plus forgetting reduction as empirical outcomes of offloading alignment to the perceiver. These results are consistent with reduced superficial processing in the LLM but do not isolate depth from feature quality. In the revision we will add layer-wise cosine similarity and feature-distance metrics between visual embeddings and text-space representations across LLM layers for both the baseline and DPA models. We will also include a control discussion (and experiment where compute permits) that compares the small VLM perceiver against a capacity-matched ViT projector to better separate alignment depth from richer feature distributions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with external benchmark validation

full rationale

The paper motivates its Deep Pre-Alignment architecture by citing external prior analyses on VLM alignment challenges, then replaces the ViT+projector with a small VLM perceiver and reports direct performance gains on 8 multimodal benchmarks plus forgetting reduction on 3 text benchmarks. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction. No self-citations appear in the provided text as load-bearing for the central claim. The evaluation relies on standard external benchmarks that are independent of any internal definitions or fits, rendering the work self-contained against measurable outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that current ViT-to-LLM pipelines suffer from superficial alignment in early layers, plus the introduction of a new perceiver entity whose alignment properties are validated only internally via experiments.

axioms (1)
  • domain assumption: Visual features from standard ViT encoders remain distant from text space in the initial layers of the LLM, forcing superficial alignment work.
    Invoked in the opening motivation and supported only by citations to zhang-etal-2024-investigating and artzy-schwartz-2024-attend.
invented entities (1)
  • Small VLM as perceiver (no independent evidence)
    purpose: To produce deeply aligned visual features for the target LLM.
    New modular component introduced to replace ViT encoder; no independent falsifiable prediction outside the reported experiments.




Reference graph

Works this paper leans on

172 extracted references · 172 canonical work pages · 29 internal anchors

  1. [1]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning, 2023.

  2. [2]

    Flamingo: a visual language model for few-shot learning. NeurIPS.

  3. [3]

    Changpinyo, Soravit; Sharma, Piyush; Ding, Nan; Soricut, Radu.

  4. [4]

    Byeon, Minwoo; Park, Beomhee; Kim, Haecheon; Lee, Sungjun; Baek, Woonhyuk; Kim, Saehoon.

  5. [5]

    Schuhmann, Christoph; Beaumont, Romain; Vencu, Richard; Gordon, Cade; Wightman, Ross; Cherti, Mehdi; Coombes, Theo; Katta, Aarush; Mullis, Clayton; Wortsman, Mitchell; et al.

  6. [6]

    Visual instruction tuning. Advances in Neural Information Processing Systems.

  7. [7]

    OpenAI. Hello

  8. [8]

    Introducing the next generation of

  9. [9]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. 2023.

  10. [10]

    Tianyu Yu, Jinyi Hu, Yuan Yao, Haoye Zhang, Yue Zhao, Chongyi Wang, Shan Wang, Yinxv Pan, Jiao Xue, Dahai Li, Zhiyuan Liu, Hai. Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants. 2023. arXiv:2310.00653

  11. [11]

    Liu, Haotian; Li, Chunyuan; Li, Yuheng; Lee, Yong Jae. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  12. [12]

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven C. H. Hoi. Proceedings of NeurIPS.

  13. [13]

    LoRA vs Full Fine-tuning: An Illusion of Equivalence. 2025.

  14. [14]

    FirstName Alpher, FirstName Gamow.

  15. [15]

    Tsung. Microsoft. Proceedings of ECCV, 2014. doi:10.1007/978-3-319-10602-1_48

  16. [16]

    Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko. Object Hallucination in Image Captioning. 2018. doi:10.18653/V1/D18-1437

  17. [17]

    Yash Goyal, Tejas Khot, Douglas Summers. Making the. Proceedings of CVPR, 2017. doi:10.1109/CVPR.2017.670

  18. [18]

    Detecting and preventing hallucinations in large vision language models. arXiv preprint arXiv:2308.06394.

  19. [19]

    Detecting and Mitigating Hallucination in Large Vision Language Models via Fine-Grained AI Feedback. Proceedings of the AAAI Conference on Artificial Intelligence, 2025. doi:10.1609/aaai.v39i24.34744

  20. [20]

    Woodpecker: Hallucination Correction for Multimodal Large Language Models. arXiv preprint arXiv:2310.16045.

  21. [21]

    Wang, Bin; Wu, Fan; Han, Xiao; Peng, Jiahui; Zhong, Huaping; Zhang, Pan; Dong, Xiaoyi; Li, Weijia; Li, Wei; Wang, Jiaqi; et al.

  22. [22]

    Wen, Licheng; Yang, Xuemeng; Fu, Daocheng; Wang, Xiaofeng; Cai, Pinlong; Li, Xin; Ma, Tao; Li, Yingxuan; Xu, Linran; Shang, Dengke; et al. On the Road with

  23. [23]

    Fu, Chaoyou; Chen, Peixian; Shen, Yunhang; Qin, Yulei; Zhang, Mengdan; Lin, Xu; Qiu, Zhenyu; Lin, Wei; Yang, Jinrui; Zheng, Xiawu; et al.

  24. [24]

    Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang. Aligning Large Multimodal Models with Factually Augmented RLHF. CoRR, 2023. arXiv:2309.14525

  25. [25]

    Touvron, Hugo; Martin, Louis; Stone, Kevin; Albert, Peter; Almahairi, Amjad; Babaei, Yasmine; Bashlykov, Nikolay; Batra, Soumya; Bhargava, Prajjwal; Bhosale, Shruti; et al.

  26. [26]

    BEiT: BERT Pre-Training of Image Transformers. 2022.

  27. [27]

    Wang, Wenhui; Bao, Hangbo; Dong, Li; Bjorck, Johan; Peng, Zhiliang; Liu, Qiang; Aggarwal, Kriti; Mohammed, Owais Khan; Singhal, Saksham; Som, Subhojit; et al. Image as a Foreign Language:

  28. [28]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie. LLaMA: Open and Efficient Foundation Language Models. 2023. arXiv:2302.13971

  29. [29]

    Chiang, Wei-Lin; Li, Zhuohan; Lin, Zi; Sheng, Ying; Wu, Zhanghao; Zhang, Hao; Zheng, Lianmin; Zhuang, Siyuan; Zhuang, Yonghao; Gonzalez, Joseph E.; Stoica, Ion; Xing, Eric P. Vicuna: An Open-Source Chatbot Impressing

  30. [30]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

  31. [31]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862.

  32. [32]

    Learning transferable visual models from natural language supervision. ICML, 2021.

  33. [33]

    Training language models to follow instructions with human feedback. NeurIPS.

  34. [34]

    Lu, Pan; Bansal, Hritik; Xia, Tony; Liu, Jiacheng; Li, Chunyuan; Hajishirzi, Hannaneh; Cheng, Hao; Chang, Kai-Wei; Galley, Michel; Gao, Jianfeng.

  35. [35]

    Yang, Zhengyuan; Li, Linjie; Lin, Kevin; Wang, Jianfeng; Lin, Chung-Ching; Liu, Zicheng; Wang, Lijuan. The Dawn of

  36. [36]

    Liu, Fuxiao; Guan, Tianrui; Li, Zongxia; Chen, Lichang; Yacoob, Yaser; Manocha, Dinesh; Zhou, Tianyi.

  37. [37]

    Li, Lei; Yin, Yuwei; Li, Shicheng; Chen, Liang; Wang, Peiyi; Ren, Shuhuai; Li, Mukai; Yang, Yazheng; Xu, Jingjing; Sun, Xu; et al.

  38. [38]

    Awadalla, Anas; Gao, Irena; Gardner, Josh; Hessel, Jack; Hanafy, Yusuf; Zhu, Wanrong; Marathe, Kalyani; Bitton, Yonatan; Gadre, Samir; Sagawa, Shiori; et al.

  39. [39]

    Otter: A Multi-Modal Model with In-Context Instruction Tuning. arXiv preprint arXiv:2305.03726.

  40. [40]

    Language Is Not All You Need: Aligning Perception with Language Models. arXiv preprint arXiv:2302.14045.

  41. [41]

    Chen, Xi; Wang, Xiao; Changpinyo, Soravit; Piergiovanni, AJ; Padlewski, Piotr; Salz, Daniel; Goodman, Sebastian; Grycner, Adam; Mustafa, Basil; Beyer, Lucas; et al.

  42. [42]

    Ye, Qinghao; Xu, Haiyang; Xu, Guohai; Ye, Jiabo; Yan, Ming; Zhou, Yiyang; Wang, Junyang; Hu, Anwen; Shi, Pengcheng; Shi, Yaya; et al.

  43. [43]

    Zhang, Renrui; Han, Jiaming; Zhou, Aojun; Hu, Xiangfei; Yan, Shilin; Lu, Pan; Li, Hongsheng; Gao, Peng; Qiao, Yu.

  44. [44]

    Zhu, Deyao; Chen, Jun; Shen, Xiaoqian; Li, Xiang; Elhoseiny, Mohamed.

  45. [45]

    Bavishi, Rohan; Elsen, Erich; Hawthorne, Curtis; Nye, Maxwell; Odena, Augustus; Somani, Arushi; Ta. Introducing our Multimodal Models.

  46. [46]

    Driess, Danny; Xia, Fei; Sajjadi, Mehdi SM; Lynch, Corey; Chowdhery, Aakanksha; Ichter, Brian; Wahid, Ayzaan; Tompson, Jonathan; Vuong, Quan; Yu, Tianhe; et al.

  47. [47]

    Wang, Weihan; Lv, Qingsong; Yu, Wenmeng; Hong, Wenyi; Qi, Ji; Wang, Yan; Ji, Junhui; Yang, Zhuoyi; Zhao, Lei; Song, Xixuan; et al.

  48. [48]

    OpenAI. GPT-4 Technical Report. CoRR, 2023. arXiv:2303.08774

  49. [49]

    Learning to summarize with human feedback. NeurIPS.

  50. [50]

    Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. arXiv preprint arXiv:2305.14233.

  51. [51]

    UltraFeedback: Boosting Language Models with Scaled AI Feedback. arXiv preprint arXiv:2310.01377.

  52. [52]

    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. arXiv preprint arXiv:2306.01693.

  53. [53]

    Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871.

  54. [54]

    Alignment of language agents. arXiv preprint arXiv:2103.14659.

  55. [55]

    Taori, Rohan; Gulrajani, Ishaan; Zhang, Tianyi; Dubois, Yann; Li, Xuechen; Guestrin, Carlos; Liang, Percy; Hashimoto, Tatsunori B. Stanford

  56. [56]

    Let's Verify Step by Step. arXiv preprint arXiv:2305.20050.

  57. [57]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, Chelsea Finn. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. 2023.

  58. [58]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov. Proximal Policy Optimization Algorithms. CoRR, 2017. arXiv:1707.06347

  59. [59]

    Bai, Yuntao; Kadavath, Saurav; Kundu, Sandipan; Askell, Amanda; Kernion, Jackson; Jones, Andy; Chen, Anna; Goldie, Anna; Mirhoseini, Azalia; McKinnon, Cameron; et al.

  60. [60]

    John Schulman. Reinforcement Learning from Human Feedback: Progress and Challenges.

  61. [61]

    A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. Proceedings of ECCV.

  62. [62]

    Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. Proceedings of ICCV.

  63. [63]

    Ilharco, Gabriel; Wortsman, Mitchell; Wightman, Ross; Gordon, Cade; Carlini, Nicholas; Taori, Rohan; Dave, Achal; Shankar, Vaishaal; Namkoong, Hongseok; Miller, John; Hajishirzi, Hannaneh; Farhadi, Ali; Schmidt, Ludwig. doi:10.5281/zenodo.5143773

  64. [64]

    The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 2020.

  65. [65]

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai. Proceedings of CVPR.

  66. [66]

    OpenBMB. Large multi-modal models for strong performance and efficient deployment.

  67. [67]

    SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension. arXiv preprint arXiv:2307.16125.

  68. [68]

    Yinghui Li, Zishan Xu, Shaoshen Chen, Haojing Huang, Yangning Li, Yong Jiang, Zhongli Li, Qingyu Zhou, Hai. Towards Real-World Writing Assistance: CoRR, 2023. arXiv:2311.11268

  69. [69]

    SeqGPT: An Out-of-the-box Large Language Model for Open Domain Sequence Understanding. 2023.

  70. [70]

    LLaVA-NeXT: Improved reasoning, OCR, and world knowledge.

  71. [71]

    Analyzing and mitigating object hallucination in large vision-language models. Proceedings of ICLR.

  72. [72]

    Qidong Huang, Xiaoyi Dong, Pan Zhang, Bin Wang, Conghui He, Jiaqi Wang, Dahua Lin, Weiming Zhang, Nenghai Yu. Proceedings of CVPR.

  73. [73]

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, Abhinav Rastogi. RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. CoRR, 2023. arXiv:2309.00267

  74. [74]

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong. CoRR, 2023. arXiv:2312.10665

  75. [75]

    Dongping Chen, Ruoxi Chen, Shilin Zhang, Yinuo Liu, Yaochen Wang, Huichi Zhou, Qihui Zhang, Pan Zhou, Yao Wan, Lichao Sun. CoRR, 2024. arXiv:2402.04788

  76. [76]

    Leonard Adolphs, Tianyu Gao, Jing Xu, Kurt Shuster, Sainbayar Sukhbaatar, Jason Weston. The. Proceedings of ACL, 2023. doi:10.18653/V1/2023.ACL-LONG.493

  77. [77]

    Leo Gao, John Schulman, Jacob Hilton. Scaling Laws for Reward Model Overoptimization. 2023.

  78. [78]

    Aligning Modalities in Vision Large Language Models via Preference Fine-tuning. arXiv preprint arXiv:2402.11411.

  79. [79]

    Junyang Wang, Yuhang Wang, Guohai Xu, Jing Zhang, Yukai Gu, Haitao Jia, Ming Yan, Ji Zhang, Jitao Sang. AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation. CoRR, 2023. arXiv:2311.07397

  80. [80]

    Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia. Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models. CoRR, 2024. arXiv:2403.18814

Showing first 80 references.