{"total":115,"items":[{"citing_arxiv_id":"2606.18570","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Streamlining Analysis and Design of Two-Dimensional Electronic Spectroscopy using Machine Learning","primary_cat":"physics.chem-ph","submitted_at":"2026-06-17T00:42:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A Gaussian mixture model is used to learn spectral densities from 2DES experiments, enabling extraction of vibronic couplings, spectral extrapolation, and optimized experiment selection across simulated and experimental systems.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"(c,f) GMM and reference spectral densities and linear absorption spectra (inset). In panels c and f a simulated reference spectral density is shown as it is not experimentally measurable. V. NILE BLUE IN ETHANOL EXPERIMENTAL DETAILS 2DES measurements of Nile blue in ethanol were performed using a fully non-collinear, BOXCARS geometry setup. Full details can be found in Son et al. [4]. In short, the 800 nm output out of a Ti:Saph regenerative amplifier (Coherent Libra) was passed through an argon tube (∼20 UNIT) wherein it undergoes self-phase modulation to generate white light. 
This white light is then compressed using double-angle chirp mirrors (Ultrafast Innovations) to a pulse width of ∼12 fs and usable spectral range of ∼550-700 nm, as determined by transient grating frequency-resolved optical"},{"citing_arxiv_id":"2606.16009","ref_index":103,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bridging the Usability Gap: Lessons from Interpreting Studies for Machine Interpreting Design","primary_cat":"cs.CL","submitted_at":"2026-06-14T20:41:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28741","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Prophetic Decoding to Unlock Visual Search in LVLMs","primary_cat":"cs.CV","submitted_at":"2026-05-27T17:01:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SeProD is a plug-and-play self-prophetic decoding framework that combines pre- and post-training LVLM capabilities via probability-based sampling to improve coherent visual search and multi-step reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27894","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs","primary_cat":"cs.CV","submitted_at":"2026-05-27T03:18:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes the first unified 
incomplete video-language model that processes missing modalities and serves as a plug-and-play module to boost existing VLMs on multi-modal tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21611","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation","primary_cat":"cs.CV","submitted_at":"2026-05-20T18:17:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniVL unifies vision and language into one mask-rendered input processed by an OCR backbone to condition diffusion models for spatially grounded image generation without a standalone text encoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19859","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Eyes on VLM: Benchmarking Gaze Following and Social Gaze Prediction in Vision Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T13:50:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EyeVLM benchmark finds that current VLMs underperform specialized visual models on gaze following and social gaze prediction, with fine-tuning narrowing but not closing the gap.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17152","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual and Multimodal LLMs in the Wild: Building for Low-Resource Languages","primary_cat":"cs.CL","submitted_at":"2026-05-16T20:56:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A tutorial 
synthesizing foundations, recent models such as PALO and Maya, and low-cost methods for tri-modal multilingual AI in resource-constrained settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16932","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MORN: Metacognitive Object-Goal Regulation for Resource-Rational Long-Horizon Navigation","primary_cat":"cs.RO","submitted_at":"2026-05-16T10:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MORN augments frozen VLM-based object navigation agents with a System 2 meta-controller using Potentiality Index, Persistence Gating, and Evidence Accumulation to improve goal completion rate from 0.23 to 0.30 and reduce wasted steps on the HM3D dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14708","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StyleTextGen: Style-Conditioned Multilingual Scene Text Generation","primary_cat":"cs.CV","submitted_at":"2026-05-14T11:24:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StyleTextGen proposes a dual-branch style encoder, text style consistency loss, and mask-guided inference to achieve superior style consistency and cross-lingual performance in multilingual scene text generation on a new bilingual benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08709","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniShield: Unified Face Attack Detection via KG-Informed Multimodal 
Reasoning","primary_cat":"cs.CV","submitted_at":"2026-05-09T05:44:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UniShield introduces a knowledge-graph-informed multimodal framework that improves unified detection of physical and digital face attacks through instruction tuning and consistency-optimized reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Green circles are shared features across attacks, and purple circles are type-specific features; edges indicate relations. 2.2 MLLMs for Multimodal Reasoning in Face Security MLLMs align a vision encoder with large language models, enabling joint visual understanding and textual generation with emergent reasoning capabilities. General alignment paradigms such as BLIP-2 [11] and MiniGPT-4 [32] demonstrate strong instruction-following and multimodal reasoning, motivating their adoption in security-oriented perception. In face security, emerging MLLM-based approaches explore explainable detection. 
For FAS, FaceShield [25] employs MLLMs to provide spoofing decisions with textual explanations, and CEPL [30] further investigates promptable/consistent evidence learning."},{"citing_arxiv_id":"2605.07544","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Pixels to Prompts: Vision-Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-08T10:17:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"An explanatory book that supplies a clear mental map and intuition for how Vision-Language Models combine vision and language capabilities.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Conceptual Captions (CC3M) [72] follows a related philosophy at smaller scale: captions are extracted from web alt-text and titles, then aggressively cleaned, normalized, and hypernymed to remove personally identifiable information and low-quality strings, yielding ∼3.3M image-caption pairs suitable for training captioning models (see Figure 5.1). A relaxed version of this pipeline leads to Conceptual 12M (CC12M) [13], which trades some annotation cleanliness for greater diversity and long-tail coverage, making it more appropriate for large-scale representation learning. Open datasets such as LAION-400M and LAION-5B push this paradigm to the billion-example regime. LAION-5B [71] contains 5.85B CLIP-filtered image-text pairs collected from the web, with associated CLIP embeddings and metadata."},{"citing_arxiv_id":"2605.03352","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Can Multimodal Large Language Models Understand Pathologic Movements? 
A Pilot Study on Seizure Semiology","primary_cat":"cs.CV","submitted_at":"2026-05-05T04:14:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MLLMs achieve zero-shot recognition of seizure semiological features better than fine-tuned vision models on most tested features, with signal enhancement and faithful explanations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01449","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VisInject: Disruption != Injection -- A Dual-Dimension Evaluation of Universal Adversarial Attacks on Vision-Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-02T13:56:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Universal adversarial attacks cause output perturbation 90 times more often than precise target injection in VLMs, with only 2 verbatim successes out of 6615 tests.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec), 2023. doi: 10.1145/3605764.3623985. URL https://arxiv.org/abs/2302.12173. [12] J. Richard Landis and Gary G. Koch. The measurement of observer agreement for categorical data. Biometrics, 33(1):159-174, 1977. doi: 10.2307/2529310. URL https://doi.org/10.2307/2529310. [13] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning (ICML), 2023. URL https://arxiv.org/abs/2301.12597. 
[14] Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, and Ji-Rong Wen."},{"citing_arxiv_id":"2604.22492","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MTT-Bench: Predicting Social Dominance in Mice via Multimodal Large Language Models","primary_cat":"eess.IV","submitted_at":"2026-04-24T12:20:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuned multimodal LLMs predict mouse social dominance from raw tube test videos with high agreement to traditional rankings.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"els rely heavily on pose accuracy and handcrafted behavioral priors. In contrast, we introduce a new paradigm where pre-trained MLLMs are fine-tuned to capture subtle behavioral dynamics and produce zero-shot dominance predictions from raw interaction videos. Multimodal Large Language Models for Scientific Reasoning. Multimodal large language models (MLLMs), such as BLIP-2 [25], OpenFlamingo [26], and InternVL [27], have shown remarkable success in visual reasoning, captioning, and video understanding tasks. By jointly modeling image/video inputs with textual prompts, these models can support few-shot or zero-shot generalization in open-ended domains. 
Several recent studies (e.g., MouseGPT, 2025) have explored using vision-language pretraining to analyze rodent behavior with"},{"citing_arxiv_id":"2604.19728","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-04-21T17:51:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"coupled to specific algorithmic decisions, limiting research flexibility. At the same time, data scarcity remains a fundamental bottleneck in robotics. Robot interaction data is severely constrained relative to data used for language and vision models, especially in diversity and in signal density per token [5]. As robot policies continue to scale, the relative importance of non-robotics data only grows [35]. Despite this data disparity, most open-source VLA frameworks focus narrowly on the action training stage, treating the upstream data recipe as fixed or out-of-scope. Such separation is problematic: data decisions made during LLM and VLM pretraining have direct consequences for downstream robotics performance. 
Exploring the design space requires a framework that treats the entire pipeline, from pretraining"},{"citing_arxiv_id":"2604.15994","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams","primary_cat":"cs.AI","submitted_at":"2026-04-17T12:16:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReactBench benchmark shows MLLMs suffer over 30% performance drop on complex topological reasoning tasks versus basic ones when evaluated on chemical reaction diagrams.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15804","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2026-04-17T08:05:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14779","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning","primary_cat":"cs.CV","submitted_at":"2026-04-16T08:39:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual 
VQA v2 and GQA while preserving generalization to novel compositions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14707","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery","primary_cat":"cs.MM","submitted_at":"2026-04-16T07:15:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[35] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv preprint arXiv:2301.12597 (2023). [36] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv preprint arXiv:1908.03557 (2019). [37] Tingle Li, Baihe Huang, Xiaobin Zhuang, Dongya Jia, Jiawei Chen, Yuping Wang, Zhuo Chen, Gopala Anumanchipalli, and Yuxuan Wang. 2025. Sounding that Object: Interactive Object-Aware Image-to-Audio Generation. In International Conference on Machine Learning (ICML). [38] Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, and Xie Chen. 2025. 
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows."},{"citing_arxiv_id":"2604.16517","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SmoGVLM: A Small, Graph-enhanced Vision-Language Model","primary_cat":"cs.CV","submitted_at":"2026-04-15T13:44:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A graph-enhanced 1.3B-parameter VLM achieves up to 16.24% gains and outperforms larger VLMs by integrating structured knowledge via GNNs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13803","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation","primary_cat":"cs.CV","submitted_at":"2026-04-15T12:38:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"unclear","context_text":"Definition 4 (Pressure Conversion Rate). Let σ_k^{(1)}(p_i) ∈ {0, 1} indicate sycophancy at Turn 1 and σ_k^{(2)}(p_i) ∈ {0, 1} indicate sycophancy at Turn 2 (only administered if σ_k^{(1)}(p_i) = 0). The pressure conversion rate quantifies how often a model that initially resists is subsequently persuaded: Π(m_k) = [Σ_{i: σ_k^{(1)}(p_i)=0} σ_k^{(2)}(p_i)] / [Σ_{i=1}^{M} 𝟙[σ_k^{(1)}(p_i) = 0]]. (4) With these quantities defined, our central research question can be stated precisely. Proposition 1 (Brain Alignment and Sycophancy Resistance). 
If a model's visual encoder develops representations that more faithfully mirror the computations of the human visual cortex, then that model should be less susceptible to adversarial linguistic pressure that contradicts visual evidence."},{"citing_arxiv_id":"2604.13448","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Study of Failure Modes in Two-Stage Human-Object Interaction Detection","primary_cat":"cs.CV","submitted_at":"2026-04-15T04:01:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A diagnostic study shows that two-stage HOI models fail differently across scene configurations like multi-person and rare interactions, revealing that aggregate benchmark accuracy does not imply robust visual reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11095","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bottleneck Tokens for Unified Multimodal Retrieval","primary_cat":"cs.LG","submitted_at":"2026-04-13T07:12:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"production, rather than as a structural constraint on how information is compressed. The generative loss operates alongside or after embedding extraction, without being architecturally coupled with a dedicated compression mechanism. 
2.3 Learnable Tokens for Representation Compression Input-side compression. Learnable tokens are widely used as input-side compressors (e.g., Perceiver [5], Q-Former [13], Flamingo [1]) to adapt visual features for LLMs. These modules facilitate modality alignment but do not produce retrieval embeddings. Learnable tokens for embedding extraction. Conversely, output-side learnable token approaches are rarer in unified retrieval. NV-Embed [11] introduces a Latent Attention Layer as an explicit replacement for <EOS> pooling."},{"citing_arxiv_id":"2604.04905","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality","primary_cat":"cs.CV","submitted_at":"2026-04-06T17:50:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ClickAIXR combines controller-based object selection in XR with on-device VLM inference to enable private, precise multimodal queries about real objects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.03231","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning","primary_cat":"cs.CV","submitted_at":"2026-04-03T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[33] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. 
Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. (2022), 12888-12900. [34] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Evaluating Object Hallucination in Large Vision-Language Models. arXiv preprint arXiv:2305.10355 (2023). [35] Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. 2024. Eagle-2: Faster inference of language models with dynamic draft trees. arXiv preprint arXiv:2406.16858 (2024). [36] Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Jinfa Huang, Junwu Zhang, Munan Ning, and Li Yuan. 2024. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models."},{"citing_arxiv_id":"2604.03117","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-03T15:42:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UCGP is a universal physical adversarial patch that compromises cross-modal semantic alignment in IR-VLMs through curved-grid parameterization and representation-space disruption.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LLaVA[2], we build both classification models and generative IR-VLMs on a unified infrared semantic corpus. Infrared-COCO provides about 17k infrared images and 12k matched QA pairs[2, 29]. We evaluate LLaVA-1.5[33], LLaVA-1.6[34], OpenFlamingo[35], BLIP-2[36], and InstructBLIP[37] for captioning/VQA, and OpenAI CLIP[1], OpenCLIP[38], MetaCLIP[39], and EVA-CLIP[40] for classification. This model suite spans both contrastive classification and generative semantic interfaces. 
All models start from natural-image-pretrained weights, freeze the backbone, and use LoRA for adaptation [41]; captioning/VQA are fine-tuned with next-token prediction and classification with InfoNCE. Unless otherwise specified, later"},{"citing_arxiv_id":"2603.09921","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition","primary_cat":"cs.CV","submitted_at":"2026-03-10T17:18:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"WikiCLIP delivers an efficient contrastive baseline for open-domain visual entity recognition that improves accuracy by 16% on OVEN unseen entities and runs nearly 100 times faster than leading generative models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.19423","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UniRec: Unified Multimodal Encoding for LLM-Based Recommendations","primary_cat":"cs.IR","submitted_at":"2026-01-27T10:02:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniRec unifies heterogeneous recommendation modalities via specialized encoders, triplet representations, and hierarchical modeling to outperform prior multimodal LLM recommenders by up to 15% on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.05127","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LooseRoPE: Content-aware Attention Manipulation for Semantic Harmonization","primary_cat":"cs.GR","submitted_at":"2026-01-08T17:17:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LooseRoPE modulates 
RoPE in diffusion attention maps to continuously trade off between preserving a pasted object's identity and harmonizing it with its new surroundings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10554","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grounding Everything in Tokens for Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-11T11:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.10362","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Visual Funnel: Resolving Contextual Blindness in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-11T07:22:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Visual Funnel resolves contextual blindness in MLLMs by constructing an entropy-scaled portfolio of hierarchically structured image crops that preserves both local detail and global context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.21740","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A cross-species neural foundation model for end-to-end speech decoding","primary_cat":"cs.CL","submitted_at":"2025-11-21T21:25:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A cross-species pretrained neural encoder combined with end-to-end training and audio LLMs reduces word 
error rate in neural speech decoding from 24.69% to 10.22% while aligning attempted and imagined speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.16719","ref_index":69,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SAM 3: Segment Anything with Concepts","primary_cat":"cs.CV","submitted_at":"2025-11-20T18:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SAM 3 introduces promptable concept segmentation that doubles accuracy of prior systems on images and videos while improving standard SAM segmentation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.15578","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning","primary_cat":"cs.CV","submitted_at":"2025-11-19T16:09:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AVATAAR reports relative gains of 5-8% over baseline on CinePile benchmark categories through agentic feedback for long video QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.10292","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models","primary_cat":"cs.CV","submitted_at":"2025-11-13T13:29:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RUDDER creates a persistent visual anchor by extracting CARD from prefill residuals and modulating its injection via an adaptive Beta Gate, cutting CHAIR_S by 24.4% and CHAIR_i by 
23.6% on average across LLaVA, Idefics2, InstructBLIP and Qwen2.5-VL with >96% throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.17765","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen3-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-09-22T13:26:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal Qwen counterparts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09794","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity","primary_cat":"cs.AI","submitted_at":"2025-09-11T18:53:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.10236","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Navigating the Challenges of AI-Generated Image Detection in the Wild: What Truly Matters?","primary_cat":"cs.CV","submitted_at":"2025-07-14T12:56:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The ITW-SM dataset and targeted optimization of detector design choices yield a 26.87% average AUC improvement for state-of-the-art AI-generated image detectors under real-world social media 
conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.01955","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Well Does GPT-4o Understand Vision? Evaluating Multimodal Foundation Models on Standard Computer Vision Tasks","primary_cat":"cs.CV","submitted_at":"2025-07-02T17:59:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multimodal foundation models achieve respectable but sub-specialist performance on semantic vision tasks and weaker results on geometric tasks when evaluated through prompt chaining on established benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.20215","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Qwen2.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-26T04:17:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.06223","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RedDiffuser: Auditing Multimodal Safety Failures in Vision-Language Models via Reinforced Diffusion","primary_cat":"cs.CV","submitted_at":"2025-03-08T13:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RedDiffuser is a reinforced diffusion framework that generates adversarial visual 
contexts to audit and expose widespread multimodal safety failures in VLMs, increasing unsafe response rates by up to 10.69% on LLaVA with transfer to other models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.14721","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts","primary_cat":"cs.CL","submitted_at":"2024-11-22T04:28:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MolReFlect introduces a teacher-student framework that automatically creates fine-grained molecule-text alignments to achieve SOTA results on molecule-caption translation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2411.02327","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance","primary_cat":"cs.CV","submitted_at":"2024-11-04T17:50:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.18715","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ChatSearch: a Dataset and a Generative Retrieval Model for General Conversational Image 
Retrieval","primary_cat":"cs.CV","submitted_at":"2024-10-24T13:19:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Presents ChatSearch dataset and ChatSearcher generative model for conversational image retrieval on open-domain images, claiming superior performance on the new dataset and competitive results elsewhere.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17247","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction","primary_cat":"cs.CV","submitted_at":"2024-10-22T17:59:53+00:00","verdict":"ACCEPT","verdict_confidence":"HIGH","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.02713","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLaVA-Video: Video Instruction Tuning With Synthetic Data","primary_cat":"cs.CV","submitted_at":"2024-10-03T17:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.07388","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Recent Advances in Multimodal Affective 
Computing: An NLP Perspective","primary_cat":"cs.CL","submitted_at":"2024-09-11T16:24:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Survey organizing multimodal affective computing research around four NLP tasks, method paradigms, datasets, evaluation protocols, and future directions while releasing a resource repository.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Among the open-source innovations, Flamingo [70] represents an early effort to integrate visual features with LLMs using cross-attention layers. BLIP-2 [71] introduces a trainable adaptor module (Q-Former) that efficiently connects a pre-trained image encoder with a pre-trained LLM, ensuring precise alignment of visual and textual information. Similarly, MiniGPT-4 [72] achieves visual and textual alignment through a linear projection layer. InstructBLIP [73] advances the field by focusing on vision-language instruction tuning, building upon BLIP-2, and requiring a deeper understanding and larger datasets for effective training. 
LLaVA [74] integrates CLIP's image encoder with LLaMA's language decoder to enhance"},{"citing_arxiv_id":"2409.01704","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model","primary_cat":"cs.CV","submitted_at":"2024-09-03T08:41:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.01652","ref_index":98,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2024-09-03T06:45:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ReKep encodes robotic tasks as optimizable Python functions over 3D keypoints that are generated automatically from language and RGB-D input, enabling real-time hierarchical planning on single- and dual-arm platforms without task-specific data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888-12900. PMLR, 2022. [97] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023. [98] J. Li, D. Li, S. Savarese, and S. Hoi. 
Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023. [99] H. Huang, F. Lin, Y. Hu, S. Wang, and Y. Gao. Copa: General robotic manipulation through spatial constraints of parts with foundation models. arXiv preprint arXiv:2403."},{"citing_arxiv_id":"2408.13257","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?","primary_cat":"cs.CV","submitted_at":"2024-08-23T17:59:51+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}