Pith Number

pith:DANAGOQM

pith:2025:DANAGOQMYSONYCLBHI5VPKZFV3
not attested · not anchored · not stored · refs resolved

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Caifeng Shan, Chaoyou Fu, Haojia Lin, Haoyu Cao, Heting Gao, Ke Li, Long Ma, Ran He, Rongrong Ji, Xiaoyu Liu, Xiawu Zheng, Xing Sun, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Zuwei Long

A multi-stage training method allows large language models to handle vision and speech together for near real-time interaction.

arxiv:2501.01957 v4 · 2025-01-03 · cs.CV · cs.SD · eess.AS

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{DANAGOQMYSONYCLBHI5VPKZFV3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
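One way a mirror could recompute the same state regardless of the order in which it receives events is sketched below. This is an illustration only: the page does not describe the bundle format or the actual merge algorithm, so the Event fields (event_id, timestamp, field, value) and the "later event wins" rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not given on this page.
    event_id: str
    timestamp: str  # ISO 8601 UTC string, so lexical order matches time order
    field: str
    value: str


def merge(events):
    """Fold events into a state dict using a total order that ignores arrival order."""
    # Drop duplicate deliveries of the same event.
    unique = {e.event_id: e for e in events}
    # Sort by (timestamp, event_id); the id breaks ties so the order is total.
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = {}
    for e in ordered:
        state[e.field] = e.value  # assumed rule: the later event wins
    return state


events = [
    Event("e1", "2026-05-17T23:38:13Z", "citations", "open"),
    Event("e2", "2026-05-18T00:00:00Z", "citations", "closed"),
    Event("e3", "2026-05-17T23:00:00Z", "replications", "open"),
]
# Same result whether events arrive in order, reversed, or duplicated.
print(merge(events) == merge(list(reversed(events))) == merge(events + events))
```

Because the fold depends only on the set of events and a fixed sort key, two mirrors holding the same events reach the same state.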

Claims

C1 strongest claim

By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, enabling near real-time vision and speech interaction.

C2 weakest assumption

The multi-stage training can be balanced so that speech capabilities are added without degrading the pre-existing vision-language capacity, an assumption stated in the description of the progressive training methodology.

C3 one line summary

VITA-1.5 integrates vision and speech into a single LLM through multi-stage training, delivering competitive benchmark results on image, video, and speech tasks with near real-time response speed.

References

77 extracted · 77 resolved · 28 Pith anchors

[1] Visual Instruction Tuning 2023 · arXiv:2304.08485
[2] AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling 2024
[3] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding 2023 · arXiv:2306.02858
[4] SpeechAct: Towards Generating Whole-Body Motion from Speech. IEEE TVCG 2025
[5] video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models 2024

Formal links

2 machine-checked theorem links

Cited by

24 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:13.077493Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd
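For this record, the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash bytes. The check below is a sketch based only on the two values shown on this page; the page does not document this derivation, so treat it as an observation rather than a specification.

```python
import base64

# Values copied from this page.
canonical_hash = "181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd"
pith_number = "DANAGOQMYSONYCLBHI5VPKZFV3"

# Base32-encode the raw 32 hash bytes, then keep as many characters as the Pith Number has.
encoded = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived = encoded[: len(pith_number)]

print(derived == pith_number)  # True for the values on this page
```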

Aliases

arxiv: 2501.01957 · arxiv_version: 2501.01957v4 · doi: 10.48550/arxiv.2501.01957 · pith_short_12: DANAGOQMYSON · pith_short_16: DANAGOQMYSONYCLB · pith_short_8: DANAGOQM
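The aliases listed above follow a simple pattern: the short forms are prefixes of the full Pith Number, and the arXiv forms are built from the source id and version. The helper below is a hypothetical sketch (derive_aliases is not part of any Pith API) that reproduces the listed values from three inputs.

```python
def derive_aliases(arxiv_id, version, pith_number):
    """Rebuild the alias set shown on the page from the source id, version, and Pith Number."""
    aliases = {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        # 10.48550 is the DOI prefix arXiv uses for its papers.
        "doi": f"10.48550/arxiv.{arxiv_id}",
    }
    # Short aliases are the first 8, 12, and 16 characters of the full number.
    for length in (8, 12, 16):
        aliases[f"pith_short_{length}"] = pith_number[:length]
    return aliases


aliases = derive_aliases("2501.01957", 4, "DANAGOQMYSONYCLBHI5VPKZFV3")
for name in sorted(aliases):
    print(f"{name}: {aliases[name]}")
```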
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/DANAGOQMYSONYCLBHI5VPKZFV3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "0226ae62a6421b6814d91b94483cc947f92cefd1a23c4c3619dbc03f7bc32d4b",
    "cross_cats_sorted": [
      "cs.SD",
      "eess.AS"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-01-03T18:59:52Z",
    "title_canon_sha256": "6e3f027dc6145dfab500c2837be05fae09a466ca6207ecf900e63fdf2471478b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2501.01957",
    "kind": "arxiv",
    "version": 4
  }
}
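The canonicalization step in the verification command above can also be run offline against the record as printed here. The sketch below uses the same json.dumps settings as the one-liner (sorted keys, compact separators, ensure_ascii=False) and compares the digest with the hash shown on this page. It assumes the record served by the API is identical to the JSON printed above; the function names are illustrative.

```python
import hashlib
import json

# Hash shown on this page under "Canonical hash".
EXPECTED = "181a033a0cc49cdc09613a3b57ab25aef8b643bb7933a9cc4b8dc3d30518d1bd"

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "0226ae62a6421b6814d91b94483cc947f92cefd1a23c4c3619dbc03f7bc32d4b",
        "cross_cats_sorted": ["cs.SD", "eess.AS"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2025-01-03T18:59:52Z",
        "title_canon_sha256": "6e3f027dc6145dfab500c2837be05fae09a466ca6207ecf900e63fdf2471478b",
    },
    "schema_version": "1.0",
    "source": {"id": "2501.01957", "kind": "arxiv", "version": 4},
}


def canonical_bytes(obj):
    """Serialize with sorted keys and no extra whitespace, then encode as UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj):
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches page hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators means the digest does not depend on key order or whitespace in the input, which is what lets independent parties recompute the same hash.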