SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
Sharegpt4v: Improving large multi-modal models with better captions
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 6roles
background 3polarities
background 3representative citing papers
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
Inferix provides an optimized inference engine for semi-autoregressive block-diffusion decoding to support high-quality, variable-length video generation in world simulation applications.
citing papers explorer
-
SOMA: Efficient Multi-turn LLM Serving via Small Language Model
SOMA estimates a local response manifold from early turns and adapts a small surrogate model via divergence-maximizing prompts and localized LoRA fine-tuning for efficient multi-turn serving.
-
LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?
LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.
-
Anisotropic Modality Align
Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.
-
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.
-
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset
MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.
-
Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix provides an optimized inference engine for semi-autoregressive block-diffusion decoding to support high-quality, variable-length video generation in world simulation applications.