Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Haoyu Zhang; Tianshu Yu; Yiwen Guo; Zhipeng Li


Ex-Omni claims that an omni-modal language model can natively generate speech-synchronized 3D facial animation by decoupling semantic reasoning from dense temporal generation.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · deepseek-v4-flash

2026-08-03 03:43 UTC pith:IKW3NDTO

Load-bearing objection: A solid systems contribution (a new integrated capability for OLLMs), but the headline synchronization claim is teacher-referential and the human study is too small; worth a serious referee.

arxiv 2602.07106 v2 pith:IKW3NDTO submitted 2026-02-06 cs.CV cs.AI cs.CL


keywords: omni-modal large language models; 3D facial animation; blendshape coefficients; speech-to-face generation; token-as-query gated fusion; speech units; multi-stage training; audio-visual synchronization

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Ex-Omni sets out to give omni-modal large language models (OLLMs) a missing output: 3D facial animation that moves in sync with generated speech. The paper's central claim is that the mismatch between the LLM's discrete, token-level semantics and the dense, smooth time structure of facial motion can be overcome by decoupling the two. The model uses discrete speech units as temporal scaffolding and a token-as-query gated fusion mechanism to decide when and how semantic information enters the face generator. It also contributes a large multi-stage dataset of text, speech, and synthetically labeled facial animation. The payoff, if correct, is a single open-source model that speaks and makes a face at the same time, with better lip-sync and lower marginal face-generation latency than cascaded speech-then-face pipelines, while keeping speech understanding competitive.

Core claim

On its own terms, the paper establishes that an omni-modal LLM can natively generate ARKit-52 blendshape coefficients together with speech, in a non-autoregressive facial decoder, rather than routing through a separate speech-then-animation pipeline. The key move is to stop asking the LLM to emit dense motion directly: the LLM reasons, a lightweight speech generator produces discrete speech units, and those units become the temporal skeleton on which the face decoder builds, with gated cross-attention selectively injecting semantic context. The paper reports that this joint native generation is preferred by human evaluators for lip-speech synchronization and scores lower lip vertex error tha

What carries the argument

The load-bearing design is the unified token-as-query gated fusion (TQGF): in every fusion step, the incoming token sequence always acts as the query and upstream semantic representations act as key/value context, with a sigmoid gate learned from the query deciding how much semantic conditioning enters at each frame. Around it sit two supporting choices: discrete speech units predicted autoregressively by a small speech generator give the face a stable temporal grid, and the facial decoder predicts all 52 blendshape coefficients in parallel with a hybrid frame-wise and velocity loss, using periodic positional encodings that bias toward rhythmic mouth motion.
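To make the fusion step concrete, here is a minimal sketch of a token-as-query gated cross-attention layer in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function name `gated_cross_attention`, the identity Q/K/V projections, the single attention head, the scalar per-frame gate, and the residual form `query + gate * attended` are all simplifications introduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_cross_attention(queries, context, gate_w, gate_b):
    """Fuse semantic context into an incoming token sequence.

    queries: T x d list of incoming tokens (always the attention query).
    context: S x d list of upstream semantic states (keys and values).
    gate_w, gate_b: hypothetical gate parameters; the gate depends on
    the query only, so each frame decides how much context it admits.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product attention of one query over all context states.
        scores = [dot(q, k) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(d)]
        # Sigmoid gate computed from the query alone (assumed scalar).
        gate = sigmoid(dot(gate_w, q) + gate_b)
        # Assumed residual form: the query plus gated semantic context.
        fused.append([qj + gate * aj for qj, aj in zip(q, attended)])
    return fused
```

With a strongly negative gate bias the layer passes the query through almost unchanged, which is the behavior the gate is meant to make learnable: frames that need no semantic conditioning can ignore it.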

Load-bearing premise

The evaluation assumes that the audio-to-face teacher model that created the facial training labels is a valid reference for natural facial animation, so the central quantitative advantage partly measures how well the model imitates its own teacher.

What would settle it

Take a small set of real, professionally captured speech-and-blendshape sequences that no model has been trained on, run the native model and the best cascaded pipeline on them, and compare lip vertex error against those real recordings; if the native advantage disappears, the claimed benefit is teacher-imitation rather than generalizable joint generation. A cheaper version is a larger blinded human study that scores full-face naturalness, not just lip-sync, with more than eight evaluators.
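The comparison described above depends on how lip vertex error is computed. The sketch below uses the definition common in the speech-driven facial animation literature: the maximum L2 distance over lip vertices in each frame, averaged over frames. Whether the paper uses exactly this variant is an assumption, and the step that renders blendshape coefficients into mesh vertices is not shown; `lip_vertex_error` and `lip_idx` are names chosen for illustration.

```python
import math


def lip_vertex_error(pred, ref, lip_idx):
    """Mean over frames of the max L2 error among lip vertices.

    pred, ref: lists of frames; each frame is a list of (x, y, z) vertices.
    lip_idx: indices of the vertices that belong to the lip region
    (hypothetical; depends on the mesh template in use).
    """
    if len(pred) != len(ref):
        raise ValueError("sequences must have the same number of frames")
    per_frame = []
    for p_frame, r_frame in zip(pred, ref):
        errs = [math.dist(p_frame[i], r_frame[i]) for i in lip_idx]
        per_frame.append(max(errs))
    return sum(per_frame) / len(per_frame)
```

For the proposed test, `ref` would be the professionally captured sequence rather than teacher output, and the same function would score both the native model and the cascaded pipeline.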


If this is right

Speech and 3D facial animation can be produced by one model rather than a speech model followed by a separate face model.
The decoupling principle (semantic tokens as scaffolding, gated injection for dense outputs) likely transfers to other dense temporal modalities such as hand gestures or body motion.
Because blendshape coefficients are identity-agnostic, the same trained decoder can drive any avatar rig, not just the training template.
Cascaded pipelines of this kind derive most of their facial quality from the downstream face model; native generation avoids that information bottleneck.
The reported sub-20 millisecond marginal face latency means adding a face to an omni-model need not add a perceptible delay.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The headline quantitative gain is measured against the same teacher model that produced the training labels, so part of the gap likely reflects imitation fidelity; a fair reader should weight the human A/B study at least as heavily as the lip-vertex numbers.
A natural extension would be to replace synthetic teacher labels with real motion-capture data for a subset and test whether the native advantage survives outside the teacher's distribution.
The same gated bridge could let LLMs drive other continuous outputs, e.g., gesture, gaze, or prosody, where hidden semantic states are too coarse to supervise directly.
The long-form speech truncation observed in the paper suggests the speech-unit budget, not the face decoder, is the next bottleneck for full conversational use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

A solid systems contribution — new integrated capability for OLLMs — but the headline synchronization claim is teacher-referential and the human study is too small; worth a serious referee.


Ex-Omni is a real systems contribution: it is one of the first open-source OLLMs that natively outputs speech plus ARKit-52 3D facial animation, and the speech-unit scaffolding plus TQGF design are sensible engineering choices. The InstructEx dataset (1.2M samples, staged training) is also a tangible asset. The paper is honestly written — it flags the teacher-bias issue in §5.2 and Appendix A.3, and lists limitations in A.5. That level of candor counts for something.

What is genuinely new is the task integration, not the components. The authors admit that TQGF follows published gated attention (Qiu et al., 2025), the speech units follow GLM-4-Voice, and periodic positional encoding follows UniTalk. The contribution is the system and the dataset, not new math.

Now the soft spots, in proportion. The central S2F evaluation is teacher-referential: Audio2Face-3D generated the training blendshape labels in Stages III and IV and is also the fixed LVE reference in §5.1. The large LVE margins in Table 2 (e.g., 3.866 vs 6.530 on A2F-Bench) therefore mostly measure how well the model imitates its teacher. The defense that Audio2Face-3D is a strong professional-quality proxy does not remove the confound: an independent S2F model could produce equally natural but different motion and be unfairly penalized. The paper explicitly acknowledges this in §5.2 and Appendix A.3, which is good, but acknowledgment is not a fix.

The human A/B study is independent but small: 8 evaluators, 20 pairs per comparison, no significance testing, and it only compares Native Ex-Omni against Ex-Omni-based cascaded variants. That supports the direction but is not enough to carry the headline claim on its own.
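As a sketch of the missing significance test, an exact two-sided sign test on A/B preference counts needs only the standard library. The function name `sign_test_p_value` is introduced here, and the test treats every judgment as independent. With 8 evaluators each rating 20 pairs, judgments cluster within evaluators, so this simple test likely overstates the evidence; a per-evaluator analysis or a mixed-effects model would be more defensible.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided binomial sign test for A/B preferences.

    wins, losses: counts of judgments preferring system A or system B.
    Ties are assumed to have been dropped beforehand.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided under p = 0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

Calling `sign_test_p_value(wins, losses)` on the actual per-comparison counts from Table 3 would give the kind of inferential statistic the study currently lacks.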

The latency claim is weaker still: Table 10 reports only Ex-Omni's own RTF, speech TTFT, and face latency, with no cascaded baselines. The abstract's "lower face-generation latency than cascaded pipelines" is therefore unsupported. Also, the TQGF ablation is mixed: removing it improves Ex-A2F-EN (3.184 vs 3.377) while slightly worsening A2F-Bench (3.682 vs 3.667). The paper interprets this as a balance across languages and notes higher overhead, but that is not a clean win. Minor point: the abstract calls the dataset "InstructS2SF-1200K" while the body calls it "InstructEx."

My take: the paper deserves a serious referee. The integrated capability is real, and the training recipe is detailed enough to reproduce. But the evaluation needs rework before the synchronization and latency claims can be accepted. A revised version should use an independent reference (or a much larger, significance-tested human study), add actual cascaded latency comparisons, and clarify what the LVE numbers do and do not mean. If the authors deliver that, this becomes a useful baseline for the subfield. If not, the paper remains a promising but unvalidated system.

Referee Report

4 major / 6 minor

Summary. The paper proposes Ex-Omni, an open-source omni-modal large language model (OLLM) that natively generates speech and 3D facial animation (ARKit-52 blendshapes) from text/speech instructions. The core design decouples semantic reasoning from temporal generation: an LLM (Qwen3-8B) produces text hidden states, a speech unit generator uses token-as-query gated fusion (TQGF) to predict discrete speech units, and a non-autoregressive facial decoder predicts blendshape coefficients from speech-unit-derived queries plus speech-generator context. The authors introduce a multi-stage training recipe (ASR alignment, TTS pretraining, speech-face co-training, joint fine-tuning) and a dataset called InstructEx/InstructS2SF; the face labels are synthesized with NVIDIA Audio2Face-3D. Experiments report competitive speech understanding (VoiceBench), reasonable TTS (Seed-TTS-Eval), a large LVE advantage over cascaded S2F baselines, a small human A/B preference study, ablations of the proposed components, and latency measurements. The paper claims better audio-visual synchronization and lower face-generation latency than cascaded pipelines while preserving OLLM capabilities.

Significance. If the empirical claims held, this would be a useful early open-source contribution: it demonstrates a plausible way to extend OLLMs with native speech-aligned 3D facial animation, with a clear separation of semantic and temporal modeling. The architecture is well motivated, the training stages are detailed, and the ablations in Table 6 (e.g., the effect of the velocity loss and of speech-context conditioning) are informative. The dataset, once released, could support further work. However, the central quantitative evidence for the headline synchronization advantage is currently self-referential (the evaluation reference is also the label generator), and the supporting human and latency evidence is too thin to independently carry the claim. The contribution is therefore promising and likely salvageable, but the paper as submitted does not yet establish its central claim.

major comments (4)

[§5.1 and §4, Table 2] The headline S2F/T2F advantage is measured by LVE against Audio2Face-3D, the same model that generated the blendshape training labels in Stages III and IV (10K TTS&Face and 59.34K S2S&Face + 20K TTS&Face). Native Ex-Omni is thus compared with a reference it was trained to imitate, while cascaded EmoTalk/UniTalker baselines are not. The large margins (CommonEval 4.754 vs 6.527; A2F-Bench 3.866 vs 6.530) may substantially reflect teacher reproduction rather than general facial-animation quality. The acknowledgement in §5.2 that Audio2Face-3D is a strong proxy does not remove this confound. Please evaluate on held-out professional mocap data not produced by the teacher, or provide independent perceptual evidence on the actual Table 2 comparisons with inferential statistics.
[Table 3 and §5.1] The human A/B study uses 8 evaluators and 20 pairs per comparison, with no confidence intervals or significance tests. Moreover, the comparison is only Native Ex-Omni vs Ex-Omni-based cascaded variants (Ex-Omni+EmoTalk, Ex-Omni+UniTalker), so the strongest Table 2 cascaded baselines (e.g., Qwen2.5-Omni+UniTalker) are absent. This does not yet establish that native generation beats cascaded OLLM pipelines perceptually. Report per-comparison sample sizes, a paired test (e.g., Wilcoxon), and include at least one competitive non-Ex-Omni cascaded baseline.
[Table 10 and §A.4] The abstract claims 'lower face-generation latency than cascaded pipelines', but Table 10 reports only Ex-Omni (RTF 2.158, Speech TTFT 0.029 s, Face Latency 0.012 s) with no cascaded comparison. 'Face Latency' appears to measure decoder time after speech units are available, whereas a cascaded system's face latency must include the OLLM speech-generation stage plus the downstream S2F model. Without measuring the same metric on EmoTalk/UniTalker cascades, the latency claim is unsupported. Either add direct comparisons or qualify the claim.
[§3.5 vs Table 7] Stage III is described as training the speech generator on TTS data paired with blendshape annotations, but Table 7 freezes the Speech Generator (lr = 0) and lists only the Facial Decoder with lr = 1e-3. This is internally contradictory and matters for reproducing the co-training protocol. Clarify which modules are updated in Stage III; if the speech generator is frozen, revise the prose accordingly.
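The latency objection in the third major comment comes down to which stages each number includes. The sketch below separates a marginal face latency (face stage only) from an end-to-end face latency (speech stage plus face stage), and computes a real-time factor using the common convention of processing time divided by audio duration. The `StageTimings` fields and function names are hypothetical, and the paper's own RTF convention is not restated in this review.

```python
from dataclasses import dataclass


@dataclass
class StageTimings:
    """Hypothetical per-request timings in seconds."""
    speech_stage_s: float  # time until the face model's input is ready
    face_stage_s: float    # time the face model itself takes


def real_time_factor(processing_s, audio_s):
    """Processing time divided by the duration of the generated audio."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return processing_s / audio_s


def marginal_face_latency(t):
    """Face stage only: what a decoder-only measurement captures."""
    return t.face_stage_s


def end_to_end_face_latency(t):
    """Speech stage plus face stage: what a user of a pipeline waits for."""
    return t.speech_stage_s + t.face_stage_s
```

A fair comparison would report the same one of these two quantities for the native model and for each cascaded baseline, measured on the same hardware.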

minor comments (6)

[Abstract/§1 and §4] The dataset name is inconsistent: the abstract calls it 'InstructS2SF-1200K' while §1 and §4 call it 'InstructEx'. Also, the abstract's '1200K samples' is not obviously consistent with Table 1, whose listed counts sum to well over 1.2M; please define exactly which subset has 1200K samples.
[Table 3 caption] The caption says 'D3 and D6 denotes UniTalker-B-D3 and UniTalker-B-D3'; the second should presumably be 'UniTalker-B-D6'.
[Eq. (14)] 'Therfore' should be 'Therefore'.
[Figure 2] 'EX-Omni' in the figure caption/architecture label is inconsistent with the paper's 'Ex-Omni' naming.
[Table 10 caption] 'Latency ana' appears to be an incomplete word; use 'Latency analysis'.
[§5.1] The translation of A2F-Bench is attributed to 'GPT-4o (Fu et al., 2025)', but Fu et al. 2025 is VITA-1.5; the GPT-4o system card is Hurst et al. 2024. The citation should be corrected.

Circularity Check

1 step flagged

Headline S2F advantage is teacher-referential: Audio2Face-3D generates the blendshape training labels (Stages III/IV) and is also the fixed LVE reference, so the large quantitative gap partly measures imitation of the teacher rather than independent synchronization quality.

specific steps

fitted input called prediction [Section 4 (Stages III/IV) and Section 5.1/5.2 (LVE evaluation)]
"Therefore, we adopt the state-of-the-art Audio2Face-3D model (Chung et al., 2025) as a high-quality teacher to generate blendshape annotations as structured temporal supervision based on the Stage II speech data. ... we construct a corresponding synthetic S2F subset by pairing these same 59.34k samples with high-fidelity blendshape annotations generated by the Audio2Face-3D model. ... Therefore, we adopt a reference-based evaluation protocol using NVIDIA Audio2Face-3D as a fixed external reference. ... Finally, we note that Ex-Omni is trained using blendshape annotations generated by Audio2Fac"

The model's facial decoder is trained to regress exactly to Audio2Face-3D blendshape outputs (Lbs + Lvel losses), and the headline S2F metric (LVE) then measures distance to that same model. Native Ex-Omni is therefore optimized to minimize this specific distance, while cascaded baselines were not trained on Audio2Face-3D labels; the large LVE advantage in Table 2 is at least partly a fit-to-teacher artifact. The paper acknowledges this confound but defends Audio2Face-3D as a strong proxy, which does not remove the circularity: an equally natural but independently generated animation would be penalized. The human A/B study is independent but small (8 evaluators, 20 pairs), lacks significance testing, and only compares Native Ex-Omni against Ex-Omni-based cascades, so the central quantitati

full rationale

This is not a mathematical derivation, so circularity must be assessed on the evidence chain. The paper's central claim, 'better audio-visual synchronization' than cascaded pipelines, rests on Table 2's LVE numbers, which use NVIDIA Audio2Face-3D as the fixed reference. Section 4 shows that Audio2Face-3D also generated the blendshape annotations used as supervision in Stage III (10K TTS&Face) and Stage IV (59.34K S2S&Face). Thus the facial decoder is trained to imitate the same model against which it is scored. Native Ex-Omni's reported advantage (e.g., 3.866 vs. best cascaded 6.530 on A2F-Bench) is therefore confounded by construction. The paper explicitly acknowledges this bias in §5.2 and Appendix A.3, and attempts to compensate with a human A/B preference study; that study provides some independent grounding and prevents the score from reaching 8-10, but its scale (8 evaluators × 20 pairs), lack of significance testing, and restriction to Ex-Omni-based cascades are insufficient to carry the headline synchronization claim alone. The latency claim is also not supported by comparison with cascaded baselines, though that is a correctness gap rather than circularity. On balance: one central quantitative result partially reduces to its own training signal, so the appropriate score is 6.
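A toy numerical example may make the confound easier to see. All values below are invented for illustration and are not taken from the paper: a "real" one-channel mouth-opening curve, a teacher that is offset from it, a student trained to copy the teacher, and an independent model that is about as far from the real curve as the teacher but in the opposite direction.

```python
def mean_abs_error(a, b):
    """Mean absolute difference between two equal-length curves."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical curves; none of these numbers come from the paper.
real = [0.0, 0.2, 0.5, 0.3, 0.1]          # stand-in for motion capture
teacher = [x + 0.10 for x in real]        # teacher with a systematic offset
student = [x + 0.01 for x in teacher]     # model trained on teacher labels
independent = [x - 0.10 for x in real]    # equally accurate, different bias

scores_vs_teacher = {
    "student": mean_abs_error(student, teacher),
    "independent": mean_abs_error(independent, teacher),
}
scores_vs_real = {
    "student": mean_abs_error(student, real),
    "independent": mean_abs_error(independent, real),
}
```

Scored against the teacher, the student looks far better than the independent model; scored against the real curve, the two are nearly tied. That is the pattern the held-out motion-capture evaluation is designed to detect.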

Axiom & Free-Parameter Ledger

3 free parameters · 4 axioms · 0 invented entities

The paper introduces no new physical or conceptual entities. TQGF is a named module but is an instantiation of published gated cross-attention. The main postulates are domain assumptions about blendshape sufficiency, the Audio2Face-3D teacher/reference, and the usefulness of speech-unit scaffolding.

free parameters (3)

λvel (velocity loss weight) = 0.3
Weight of the temporal smoothness term in Lface = Lbs + λvel·Lvel; set empirically in §3.6. Facial quality directly depends on this hand-tuned value.
P (periodic RoPE period) = 25
Period of the periodic positional encoding for the face decoder; empirically set to 25 in Appendix A.2. Affects temporal modeling of facial motion.
α (periodic RoPE scale) = 1.0
Scaling factor in ṽt = (t mod P) / α in Appendix A.2; set empirically alongside P.
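The three free parameters above can be read off a short sketch of the face objective and the periodic position index. Two things are assumptions here: the review does not say whether Lbs and Lvel are L1 or L2 terms, so mean squared error is used as a placeholder, and the function names are illustrative. The defaults λvel = 0.3, P = 25 and α = 1.0 are the values listed in the ledger.

```python
def _mse(a, b):
    """Mean squared error over two equal-shaped lists of frames."""
    total, count = 0.0, 0
    for fa, fb in zip(a, b):
        for x, y in zip(fa, fb):
            total += (x - y) ** 2
            count += 1
    return total / count if count else 0.0


def _velocity(frames):
    """First differences between consecutive frames."""
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]


def face_loss(pred, target, lambda_vel=0.3):
    """Lface = Lbs + lambda_vel * Lvel, with MSE assumed for both terms.

    pred, target: T x C lists of blendshape coefficients (C = 52 in the
    paper; any C works in this sketch).
    """
    l_bs = _mse(pred, target)
    l_vel = _mse(_velocity(pred), _velocity(target))
    return l_bs + lambda_vel * l_vel


def periodic_position(t, period=25, alpha=1.0):
    """Periodic position index (t mod P) / alpha for frame t."""
    return (t % period) / alpha
```

Because the velocity term compares frame-to-frame differences, a prediction that matches every frame on average but jitters between frames is still penalized, which is the smoothness role the ledger attributes to λvel.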

axioms (4)

domain assumption ARKit-52 blendshape coefficients are identity-agnostic and sufficient for natural, transferable 3D facial animation.
Used to justify the 52-dimensional face output and the use of shared rendering templates (§3.4, Appendix A.2). The paper's own limitation section admits higher-level expressions and emotions are not modeled.
domain assumption Audio2Face-3D is a strong proxy for high-quality facial motion and is valid as both a label generator and an evaluation reference.
This is the load-bearing measurement assumption: A2F-3D generates all S2F supervision in Stages III–IV (§4) and is the LVE reference in §5.1. The paper asserts this but does not validate against independent motion-capture data.
domain assumption Discrete speech units provide sufficient temporal scaffolding, and speech-generator hidden states carry facially relevant cues.
Core architectural design choice of the paper. Ablations support it internally, but no independent evidence establishes that speech units are the right interface for facial dynamics.
domain assumption Frozen Qwen3-8B, Whisper, and GLM4-Voice decoder retain their capabilities when used inside Ex-Omni.
The LLM and speech decoder are used as fixed semantic and acoustic backbones (§A.2); their robustness inside the new pipeline is assumed rather than separately evaluated.

pith-pipeline@v1.3.0-alltime-deepseek · 19440 in / 14471 out tokens · 144608 ms · 2026-08-03T03:43:25.884361+00:00 · methodology


Original abstract

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

Figures

Figures reproduced from arXiv: 2602.07106 by Haoyu Zhang, Tianshu Yu, Yiwen Guo, Zhipeng Li.

**Figure 2.** Model architecture of EX-Omni.

**Figure 3.** Case study on 3D facial animation generation. The figure highlights mouth-opening behaviors aligned …

**Figure 4.** Loss curves on different stages with different parameters’ LLMs.

**Figure 5.** Response audio duration distribution

**Figure 6.** Average WER distribution across different …


Reference graph

Works this paper leans on

3 extracted references · 3 linked inside Pith

[2]

In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396

Depth-aware generative adversarial network for talking head video generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3387–3396. Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Madry, Alex Baker-Whitcomb, Alex Beutel, A...

Pith/arXiv arXiv 2024
[2021]

In 2021 IEEE/CVF International Conference on Computer Vision, pages 3847–3856

FACIAL: synthesizing dynamic talking face with implicit attribute learning. In 2021 IEEE/CVF International Conference on Computer Vision, pages 3847–3856. Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of...

Pith/arXiv arXiv 2023
[2022]

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 5723–5738

Speecht5: Unified-modal encoder-decoder pre-training for spoken language processing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, pages 5723–5738. Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In IEEE Conferen...

Pith/arXiv arXiv 2019
