AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling. arXiv preprint arXiv:2402.12226

AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling · 2025 · arXiv 2402.12226

Canonical reference. 16 Pith papers cite this work; 83% of classified citations cite it as background.

citation-role summary

background 4 · baseline 1 · method 1

citation-polarity summary

background 5 · baseline 1

representative citing papers

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs

cs.CV · 2026-06-01 · unverdicted · novelty 7.0

AVI-Bench is a cognitively inspired benchmark that evaluates Omni-MLLMs on joint audio-visual tasks and reveals substantial limitations in current models.

PolySLGen: Online Multimodal Speaking-Listening Reaction Generation in Polyadic Interaction

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PolySLGen generates contextually appropriate and temporally coherent multimodal speaking and listening reactions for polyadic interactions by fusing group motion and social cues.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

Deep Multimodal Learning with Missing Modality: A Survey

cs.CV · 2024-09-12 · unverdicted · novelty 7.0

This survey provides the first comprehensive overview of deep multimodal learning methods designed to remain robust when some input modalities are absent.

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.

Benchmarking and Enhancing VLM for Compressed Image Understanding

cs.CV · 2025-12-24 · unverdicted · novelty 6.0

Introduces a benchmark for VLMs on compressed images and a universal adaptor to improve performance across codecs and bitrates.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

cs.CV · 2024-09-06 · unverdicted · novelty 6.0

VILA-U unifies visual understanding and generation inside one autoregressive next-token prediction model, removing separate diffusion components while claiming near state-of-the-art results.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Context Unrolling in Omni Models

cs.CV · 2026-04-23 · unverdicted · novelty 5.0

Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

Qwen2.5-Omni Technical Report

cs.CL · 2025-03-26 · conditional · novelty 5.0

Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

cs.CV · 2024-03-27 · unverdicted · novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

cs.CV · 2025-01-03 · conditional · novelty 4.0

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

cs.CV · 2025-03-16 · unverdicted · novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
