Pith Number

pith:DANAGOQM

pith:2025:DANAGOQMYSONYCLBHI5VPKZFV3

not attested · not anchored · not stored · refs resolved

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Caifeng Shan, Chaoyou Fu, Haojia Lin, Haoyu Cao, Heting Gao, Ke Li, Long Ma, Ran He, Rongrong Ji, Xiaoyu Liu, Xiawu Zheng, Xing Sun, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Zuwei Long

A multi-stage training method allows large language models to handle vision and speech together for near real-time interaction.

arxiv:2501.01957 v4 · 2025-01-03 · cs.CV · cs.SD · eess.AS


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{DANAGOQMYSONYCLBHI5VPKZFV3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
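
The page does not spell out the merge algorithm, so the following is only a generic sketch of what a deterministic merge can look like: deduplicate events, sort them by a total order, and fold them into a state. The event fields (`id`, `ts`, `field`, `value`) and the sample events are hypothetical and are not taken from the Pith schema.

```python
from typing import Iterable


def merge_events(events: Iterable[dict]) -> dict:
    """Fold events into a state that does not depend on arrival order.

    Hypothetical event shape: {"id", "ts", "field", "value"}.
    Assumes events sharing an id are identical copies of each other.
    """
    # Deduplicate by event id so mirrored copies are applied only once.
    unique = {event["id"]: event for event in events}
    state: dict = {}
    # Sort by (timestamp, id): a total order, so every mirror replays
    # the events in exactly the same sequence.
    for event in sorted(unique.values(), key=lambda e: (e["ts"], e["id"])):
        state[event["field"]] = event["value"]
    return state


# Made-up sample events, deliberately listed out of order.
events = [
    {"id": "e2", "ts": "2000-01-02T00:00:00Z", "field": "anchored", "value": True},
    {"id": "e1", "ts": "2000-01-01T00:00:00Z", "field": "anchored", "value": False},
    {"id": "e3", "ts": "2000-01-02T00:00:00Z", "field": "stored", "value": True},
]

merged = merge_events(events)
# Replaying the same events in reverse arrival order gives the same state.
assert merged == merge_events(list(reversed(events)))
print(merged)  # {'anchored': True, 'stored': True}
```

The sort key is the part that makes the result reproducible: two mirrors that hold the same set of events compute the same state regardless of the order in which they received them.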

Claims

C1 strongest claim

By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

C2 weakest assumption

The multi-stage training can be balanced so that speech capabilities are added without degrading the pre-existing vision-language capacity, an assumption stated in the description of the progressive training methodology.

C3 one-line summary

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

References

77 extracted · 77 resolved · 28 Pith anchors

[1] Visual Instruction Tuning 2023 · arXiv:2304.08485

[2] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling 2024

[3] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding 2023 · arXiv:2306.02858

[4] SpeechAct: Towards Generating Whole-Body Motion from Speech 2025 · IEEE TVCG

[5] video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models 2024

Formal links

2 machine-checked theorem links

Cited by

24 papers in Pith

Beyond Words: Multimodal LLM Knows When to Speak

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action

Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

Receipt and verification

First computed	2026-05-17T23:38:13.077493Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd

Aliases

arxiv: 2501.01957 · arxiv_version: 2501.01957v4 · doi: 10.48550/arxiv.2501.01957 · pith_short_12: DANAGOQMYSON · pith_short_16: DANAGOQMYSONYCLB · pith_short_8: DANAGOQM
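
The values on this page suggest how the identifier relates to the canonical hash: base32-encoding the SHA-256 digest bytes and keeping the first 26 characters reproduces the Pith Number, and the short aliases are its 8-, 12-, and 16-character prefixes. This is an observation from the values shown here, not a documented derivation rule, and the function name below is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
CANONICAL_HASH = "181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


full = pith_number_from_hash(CANONICAL_HASH)
# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full[:n] for n in (8, 12, 16)}

print(full)  # DANAGOQMYSONYCLBHI5VPKZFV3
for name, value in aliases.items():
    print(name, value)
```

If the relationship holds in general, a client could check that an identifier and a canonical hash belong together without contacting the resolver.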

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/DANAGOQMYSONYCLBHI5VPKZFV3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "0226ae62a6421b6814d91b94483cc947f92cefd1a23c4c3619dbc03f7bc32d4b",
    "cross_cats_sorted": [
      "cs.SD",
      "eess.AS"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-03T18:59:52Z",
    "title_canon_sha256": "6e3f027dc6145dfab500c2837be05fae09a466ca6207ecf900e63fdf2471478b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.01957",
    "kind": "arxiv",
    "version": 4
  }
}
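
The verification command above can also be run offline against the record as printed here. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to a copy of the record; the function name is hypothetical, and the result should be compared against the canonical hash listed on this page rather than taken on trust.

```python
import hashlib
import json

# The canonical record, copied from this page.
RECORD_TEXT = """
{
  "metadata": {
    "abstract_canon_sha256": "0226ae62a6421b6814d91b94483cc947f92cefd1a23c4c3619dbc03f7bc32d4b",
    "cross_cats_sorted": ["cs.SD", "eess.AS"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-03T18:59:52Z",
    "title_canon_sha256": "6e3f027dc6145dfab500c2837be05fae09a466ca6207ecf900e63fdf2471478b"
  },
  "schema_version": "1.0",
  "source": {"id": "2501.01957", "kind": "arxiv", "version": 4}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_TEXT)
digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed above
```

Because keys are sorted before hashing, whitespace and key order in the stored JSON do not affect the digest; only the values do.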