How Far Are We from Generating Missing Modalities with Foundation Models?
Pith reviewed 2026-05-25 08:11 UTC · model grok-4.3
The pith
Foundation models need dynamic mining and self-refinement to reconstruct missing modalities accurately, as direct use often yields misaligned outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multimodal foundation models often fall short at missing modality reconstruction in two respects: fine-grained semantic extraction from the available modalities and robust validation of the generated modalities. The paper formalizes three paradigms and evaluates them across 42 model variants. It then introduces an agentic framework that dynamically formulates modality-aware mining strategies from the input context to obtain richer, more discriminative features, and adds a self-refinement mechanism that iterates verification and quality enhancement through internal feedback. The reported result is a reduction of at least 14 percent in FID for missing image reconstruction and of at least 10 percent in MER for missing text reconstruction, compared to baselines.
What carries the argument
The agentic framework that dynamically formulates modality-aware mining strategies from input context and applies iterative self-refinement via internal feedback to improve generated modality quality.
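The mine, generate, verify, refine cycle described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `reconstruct`, its parameters `max_rounds` and `accept_at`, and the three callables are hypothetical names, and plain strings stand in for images or text.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    candidate: str   # generated modality (a string stands in for an image or text)
    score: float     # verifier score in [0, 1]; higher means better aligned
    feedback: str    # verifier critique that is fed into the next round


def reconstruct(
    available: str,
    mine: Callable[[str], str],
    generate: Callable[[str, str], str],
    verify: Callable[[str, str], Tuple[float, str]],
    max_rounds: int = 3,      # hypothetical budget, not taken from the paper
    accept_at: float = 0.8,   # hypothetical acceptance threshold
) -> Attempt:
    """Mine the input once, then generate and verify until accepted or out of rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    mined = mine(available)   # context-dependent description of the available modality
    best: Optional[Attempt] = None
    feedback = ""
    for _ in range(max_rounds):
        candidate = generate(mined, feedback)
        score, feedback = verify(available, candidate)
        if best is None or score > best.score:
            best = Attempt(candidate, score, feedback)
        if score >= accept_at:
            break
    return best


# Toy stand-ins for the foundation-model calls, for illustration only.
def toy_mine(x: str) -> str:
    return f"objects: {x}"


def toy_generate(mined: str, feedback: str) -> str:
    return mined + (" | revised" if feedback else "")


def toy_verify(available: str, candidate: str) -> Tuple[float, str]:
    ok = "revised" in candidate
    return (0.9 if ok else 0.4), ("" if ok else "add missing detail")


result = reconstruct("a dog on a beach", toy_mine, toy_generate, toy_verify)
```

With the toy stand-ins, the first candidate is rejected, the verifier's critique is passed back to the generator, and the second candidate is accepted. Keeping the best-scoring attempt means a late regression cannot replace an earlier, better output.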
If this is right
- Reconstruction accuracy rises for both missing images and missing text across the tested paradigms.
- Generated modalities support better performance on downstream tasks that rely on complete multimodal inputs.
- The same framework operates on multiple foundation model variants without requiring model-specific retraining.
- The two identified limitations, when addressed, directly reduce cases of semantically misaligned generations.
Where Pith is reading between the lines
- Similar dynamic mining and refinement steps may be required for other generation tasks that rely on foundation models.
- The evaluation suggests that future foundation models could benefit from native support for context-aware feature mining.
- Practical systems that complete partial multimodal data may need explicit validation loops to avoid propagating errors.
Load-bearing premise
The selected metrics, datasets, and 42 model variants give an unbiased picture of reconstruction quality and downstream adaptability.
What would settle it
A replication on a fresh dataset or with additional unseen model variants in which the agentic framework produces no reduction or an increase in FID and MER scores.
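The test above reduces to simple arithmetic on a lower-is-better metric. The sketch below shows that arithmetic; the function names are hypothetical, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
def relative_reduction(baseline: float, method: float) -> float:
    """Fractional reduction of a lower-is-better metric such as FID or MER."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - method) / baseline


def meets_claim(baseline: float, method: float, threshold: float) -> bool:
    """True if the method improves on the baseline by at least `threshold`."""
    return relative_reduction(baseline, method) >= threshold


# Placeholder values for illustration only, not numbers from the paper.
example = relative_reduction(50.0, 42.0)        # 8 / 50
supports = meets_claim(50.0, 42.0, 0.14)        # reduction of 16 percent
contradicts = relative_reduction(50.0, 55.0)    # negative: the metric got worse
```

A replication would count against the claim when `relative_reduction` is zero or negative for the framework relative to the same baseline on the same data.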
Original abstract
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code are released at: https://github.com/Guanzhou-Ke/AFM2.
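For readers unfamiliar with the text-side metric, the sketch below computes a word-level match error rate using the standard definition MER = (S + D + I) / (H + S + D + I), where H, S, D, and I count hits, substitutions, deletions, and insertions in an alignment of hypothesis against reference. It is an assumption that the paper tokenizes on whitespace and scores this way. The tie-break used here (fewest edits first, then most hits) is also a choice made for this example, since different minimal alignments can yield different hit counts.

```python
def match_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level MER: edits divided by (hits + edits) under a minimal alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # Each cell holds (edits, -hits), so min() prefers fewer edits, then more hits.
    prev = [(j, 0) for j in range(len(hyp) + 1)]   # empty reference: j insertions
    for i in range(1, len(ref) + 1):
        curr = [(i, 0)]                            # empty hypothesis: i deletions
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                diag = (prev[j - 1][0], prev[j - 1][1] - 1)   # hit
            else:
                diag = (prev[j - 1][0] + 1, prev[j - 1][1])   # substitution
            delete = (prev[j][0] + 1, prev[j][1])
            insert = (curr[j - 1][0] + 1, curr[j - 1][1])
            curr.append(min(diag, delete, insert))
        prev = curr
    edits, neg_hits = prev[-1]
    total = edits - neg_hits                       # H + S + D + I
    return edits / total if total else 0.0


# One substitution (sat -> sit) and one deletion (the) against four hits.
mer = match_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower is better: identical strings score 0.0, and a hypothesis that shares no words with the reference scores 1.0.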
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates using multimodal foundation models for missing modality reconstruction. It formalizes three paradigms, evaluates 42 model variants on reconstruction accuracy and downstream adaptability, identifies two failure modes (fine-grained semantic extraction from available modalities and robust validation of generated modalities), proposes an agentic framework with dynamic modality-aware mining strategies and a self-refinement mechanism, and reports that the proposed method reduces FID by at least 14% for missing image reconstruction and MER by at least 10% for missing text reconstruction relative to baselines. Code is released at the cited GitHub repository.
Significance. If the empirical claims hold under rigorous protocols, the work would usefully document limitations of current foundation models on missing-modality tasks and supply a concrete agentic baseline that improves reconstruction metrics. The public code release supports reproducibility and is a clear strength.
major comments (2)
- [Abstract] Abstract: the central quantitative claim (≥14% FID reduction for images, ≥10% MER reduction for text) is load-bearing, yet the abstract supplies no experimental protocol, dataset names, variant-enumeration procedure, pre-specification of metrics, or statistical tests. This directly matches the stress-test concern that the reported deltas could arise from post-hoc selection among the 42 variants or metric choice.
- [Experimental evaluation] Experimental evaluation (throughout): the paper states that 42 variants were tested across three paradigms and that downstream adaptability was measured, but provides no description of how variants were chosen (exhaustive vs. iterative), no ablation tables isolating the contribution of the mining strategy versus self-refinement, and no quantitative downstream-task numbers to support the adaptability claim.
minor comments (1)
- [Abstract] Abstract: 'Code are released' should read 'Code is released'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Abstract] Abstract: the central quantitative claim (≥14% FID reduction for images, ≥10% MER reduction for text) is load-bearing, yet the abstract supplies no experimental protocol, dataset names, variant-enumeration procedure, pre-specification of metrics, or statistical tests. This directly matches the stress-test concern that the reported deltas could arise from post-hoc selection among the 42 variants or metric choice.
  Authors: We agree that the abstract is overly concise and omits key experimental details. In the revision we will expand the abstract to name the primary datasets, briefly note the three paradigms and the 42-variant enumeration (covering representative models per paradigm), reference the metrics (FID, MER) and their pre-specification in Section 4, and point to the main results tables. The reported deltas are taken directly from the primary experimental tables rather than post-hoc selection; we will also add a short statement on statistical significance where applicable.
  Revision: yes
- Referee: [Experimental evaluation] Experimental evaluation (throughout): the paper states that 42 variants were tested across three paradigms and that downstream adaptability was measured, but provides no description of how variants were chosen (exhaustive vs. iterative), no ablation tables isolating the contribution of the mining strategy versus self-refinement, and no quantitative downstream-task numbers to support the adaptability claim.
  Authors: We acknowledge these omissions in the current draft. The 42 variants were selected to exhaustively cover the main model families within each of the three formalized paradigms; we will add an explicit paragraph and supplementary table documenting the selection criteria. We will insert ablation tables that isolate the modality-aware mining strategy from the self-refinement mechanism. For downstream adaptability we will add quantitative results (accuracy or F1 on representative tasks) with the corresponding numbers and statistical comparisons.
  Revision: yes
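The ablation the referee asks for amounts to a two-by-two grid over the framework's components. The sketch below lays out that grid and the differences one would read off it. The function names and dictionary keys are hypothetical, and the scores in the usage lines are placeholders, not measurements.

```python
from itertools import product
from typing import Dict, List, Tuple


def ablation_grid() -> List[Dict[str, bool]]:
    """Return the four on/off combinations of the two components."""
    return [
        {"mining": mining, "self_refinement": refine}
        for mining, refine in product([False, True], repeat=2)
    ]


def isolated_effects(scores: Dict[Tuple[bool, bool], float]) -> Dict[str, float]:
    """Improvement over the no-component run for a lower-is-better metric.

    `scores` maps (mining, self_refinement) to the metric value of that run.
    """
    base = scores[(False, False)]
    return {
        "mining_only": base - scores[(True, False)],
        "self_refinement_only": base - scores[(False, True)],
        "both": base - scores[(True, True)],
    }


# Placeholder scores for illustration only, not results from the paper.
placeholder = {
    (False, False): 10.0,
    (True, False): 8.0,
    (False, True): 9.0,
    (True, True): 6.0,
}
effects = isolated_effects(placeholder)
```

If the improvement with both components differs from the sum of the two isolated improvements, the components interact, which is exactly what a single combined number cannot show.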
Circularity Check
No circularity: empirical comparisons to baselines are independent of method definitions
The paper's central claims consist of direct experimental measurements (FID reduced by at least 14% and MER by at least 10% versus baselines) obtained by evaluating 42 model variants across three paradigms plus a proposed agentic framework. No equations, fitted parameters, or self-citations are used to derive these percentages; the reported deltas arise from straightforward metric computation on held-out reconstructions. The derivation chain is therefore observational and externally falsifiable against the same baselines and metrics, satisfying the self-contained criterion with no load-bearing reductions to the method's own inputs.
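For reference, the image-side metric is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. In the expression below, \(\mu_r, \Sigma_r\) and \(\mu_g, \Sigma_g\) denote the feature mean and covariance for the real and generated sets. This page does not state which feature extractor or sample size the paper uses, so read this as the textbook form rather than the paper's exact protocol.

```latex
\[
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Lower values indicate that generated images are statistically closer to real ones, so a reported reduction is an improvement.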
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: foundation models can be adapted as plug-and-play solutions for missing modality reconstruction.