PrivFedTalk: Privacy-Aware Federated Diffusion with Identity-Stable Adapters for Personalized Talking-Head Generation

Soumya Mazumdar; Tapas Samanta; Vineet Kumar Rakesh

arXiv: 2604.08037 · v1 · submitted 2026-04-09 · cs.CR · cs.AI · cs.CV · cs.LG


Pith reviewed 2026-05-10 18:08 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.CV · cs.LG

keywords: federated learning · privacy preservation · talking-head generation · diffusion models · LoRA adapters · personalized generation · secure aggregation · differential privacy


The pith

A federated framework trains personalized talking-head generators on local devices, combining a shared diffusion backbone with private identity adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Centralized training of talking-head generators requires pooling sensitive face videos and speech, which users and devices often cannot share. PrivFedTalk instead keeps all raw data local: each client learns small LoRA identity adapters while a common diffusion backbone is updated through federated aggregation. The method weights client updates by device-computed identity-consistency and temporal-stability signals, and adds temporal regularization to keep generated frames coherent. The reported experiments show that the system trains and evaluates end to end on constrained hardware, which supports the feasibility of collaborative personalized generation without data centralization.
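To make the communication argument concrete, here is a small arithmetic sketch of why a LoRA pair is cheaper to transmit than a dense weight matrix. The layer size and rank are hypothetical values chosen for illustration; the source does not state the dimensions the authors used.

```python
def full_params(d_out, d_in):
    """Number of parameters in a dense weight matrix of shape (d_out, d_in)."""
    return d_out * d_in


def lora_params(d_out, d_in, rank):
    """Number of parameters in a LoRA pair: B is (d_out x rank), A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, not taken from the paper.
d_out, d_in, rank = 1024, 1024, 8
dense = full_params(d_out, d_in)
adapter = lora_params(d_out, d_in, rank)
ratio = adapter / dense
print(dense, adapter, ratio)  # prints: 1048576 16384 0.015625
```

Under these assumed dimensions a client would upload about 1.6% of the parameters of the dense layer per adapted matrix.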

Core claim

The framework trains a conditional latent diffusion model across clients by having each device learn a lightweight LoRA identity adapter from its private audio-visual data, aggregates the adapters with Identity-Stable Federated Aggregation that weights updates by on-device identity consistency and temporal stability scalars, applies Temporal-Denoising Consistency regularization to limit inter-frame drift, and protects updates with secure aggregation plus client-level differential privacy.

What carries the argument

Identity-Stable Federated Aggregation (ISFA), which derives privacy-safe scalar weights from each client's local estimates of identity consistency and temporal stability to combine updates from heterogeneous devices.
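The summary above does not give the ISFA weighting formula, so the following is only an illustrative sketch of how two per-client scalars could be turned into aggregation weights. The product-of-scores rule, the uniform fallback, the function name `isfa_aggregate`, and the example numbers are all assumptions, not the authors' method.

```python
def isfa_aggregate(updates, id_scores, temp_scores, eps=1e-8):
    """Weighted average of flattened client updates.

    updates: list of equal-length lists of floats (hypothetical adapter deltas)
    id_scores, temp_scores: one scalar per client, assumed to lie in [0, 1]
    Returns (aggregated_update, weights).
    """
    if not updates:
        raise ValueError("need at least one client update")
    if not (len(updates) == len(id_scores) == len(temp_scores)):
        raise ValueError("one score pair is required per client update")

    # Assumed rule: reliability is the product of the two scalars.
    raw = [max(s, 0.0) * max(t, 0.0) for s, t in zip(id_scores, temp_scores)]
    total = sum(raw)
    if total <= eps:
        # All clients look unreliable: fall back to a uniform average.
        weights = [1.0 / len(updates)] * len(updates)
    else:
        weights = [r / total for r in raw]

    dim = len(updates[0])
    aggregated = [
        sum(w * u[i] for w, u in zip(weights, updates)) for i in range(dim)
    ]
    return aggregated, weights


# Two toy clients; the first reports a higher identity-consistency score.
agg, w = isfa_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.9, 0.3], [1.0, 1.0])
print(w, agg)
```

With equal scores for every client this reduces to an unweighted average, which is one way to see what ISFA would have to add beyond a FedAvg-style baseline.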

If this is right

Raw personal video and audio never leave the device, satisfying strict privacy constraints for media generation.
Lightweight adapters cut communication volume while still allowing the shared backbone to benefit from many users.
Temporal regularization reduces flicker and identity shifts that otherwise appear in federated video generation.
The pipeline runs on low-memory GPUs and supports client-parallel training across varied hardware.
Secure aggregation and differential privacy can be added on top of the adapter updates without breaking the training loop.
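As a rough illustration of the last point, client-level differential privacy is commonly implemented by clipping each client's update to a fixed L2 norm and adding Gaussian noise to the sum before averaging. The sketch below follows that general recipe; the function names, clip norm, and noise multiplier are hypothetical, and the source does not say how the paper configures this step.

```python
import math
import random


def clip_update(update, clip_norm):
    """Scale an update down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in update]


def dp_average(updates, clip_norm, noise_multiplier, rng):
    """Average clipped updates after adding Gaussian noise to their sum.

    The noise standard deviation is noise_multiplier * clip_norm, because
    clip_norm bounds how much one client can change the sum.
    """
    clipped = [clip_update(u, clip_norm) for u in updates]
    n = len(clipped)
    dim = len(clipped[0])
    sigma = noise_multiplier * clip_norm
    noisy_sum = [
        sum(u[i] for u in clipped) + rng.gauss(0.0, sigma) for i in range(dim)
    ]
    return [s / n for s in noisy_sum]


rng = random.Random(0)  # fixed seed so the toy run is repeatable
toy_updates = [[3.0, 4.0], [0.1, -0.2], [-1.0, 0.5]]
print(dp_average(toy_updates, clip_norm=1.0, noise_multiplier=0.5, rng=rng))
```

Because the operation only touches the update vectors, it can sit between local training and aggregation without changing the rest of the loop, which is the sense in which it is "added on top".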

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern of local adapters plus stability-weighted aggregation could apply to other personalized video or audio generative tasks.
Future work could replace the scalar reliability signals with learned client-selection mechanisms that further reduce communication.
Once the shared backbone is trained, fully on-device fine-tuning of the adapter could produce per-user models that never require federation again.
Standardized privacy-utility benchmarks for generative models would make it easier to compare this approach against centralized alternatives.

Load-bearing premise

On-device estimates of identity consistency and temporal stability give reliable signals for weighting client updates without leaking identity information or creating bias.

What would settle it

A side-by-side evaluation in which the federated model produces measurably higher identity drift or lower frame-to-frame coherence than an otherwise identical centralized baseline would disprove the core feasibility claim.
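One way such a comparison could be scored is with cosine similarity over per-frame face embeddings. The sketch below is a hedged illustration only: the metric definitions and function names are assumptions, and the embeddings would come from some face-recognition encoder that the source does not name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        raise ValueError("zero-norm embedding")
    return dot / (nu * nv)


def identity_drift(frame_embs, reference_emb):
    """1 minus the mean similarity to a reference identity; higher is worse."""
    sims = [cosine(e, reference_emb) for e in frame_embs]
    return 1.0 - sum(sims) / len(sims)


def temporal_coherence(frame_embs):
    """Mean similarity between consecutive frames; higher is smoother."""
    if len(frame_embs) < 2:
        raise ValueError("need at least two frames")
    sims = [cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]
    return sum(sims) / len(sims)


# Toy embeddings: the third frame wanders away from the reference identity.
reference = [1.0, 0.0]
frames = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
print(identity_drift(frames, reference), temporal_coherence(frames))
```

Running both functions on outputs of the federated model and of a centralized baseline trained on the same data would give the two numbers the test above asks for.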

Figures

Figures reproduced from arXiv: 2604.08037 by Soumya Mazumdar, Tapas Samanta, Vineet Kumar Rakesh.

**Figure 2.** Practical execution pipeline of the PrivFedTalk implementation. Configuration files specify model, fed…

**Figure 3.** Validation curves over 100 federated communication rounds for PrivFedTalk. The three plots show validation…

Original abstract

Talking-head generation has advanced rapidly with diffusion-based generative models, but training usually depends on centralized face-video and speech datasets, raising major privacy concerns. The problem is more acute for personalized talking-head generation, where identity-specific data are highly sensitive and often cannot be pooled across users or devices. PrivFedTalk is presented as a privacy-aware federated framework for personalized talking-head generation that combines conditional latent diffusion with parameter-efficient identity adaptation. A shared diffusion backbone is trained across clients, while each client learns lightweight LoRA identity adapters from local private audio-visual data, avoiding raw data sharing and reducing communication cost. To address heterogeneous client distributions, Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals computed from on-device identity consistency and temporal stability estimates. Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising. To limit update-side privacy risk, secure aggregation and client-level differential privacy are applied to adapter updates. The implementation supports both low-memory GPU execution and multi-GPU client-parallel training on heterogeneous shared hardware. Comparative experiments on the present setup across multiple training and aggregation conditions with PrivFedTalk, FedAvg, and FedProx show stable federated optimization and successful end-to-end training and evaluation under constrained resources. The results support the feasibility of privacy-aware personalized talking-head training in federated environments, while suggesting that stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.
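For readers unfamiliar with the two baselines named in the abstract: FedProx differs from FedAvg by adding a proximal term (mu/2)·||w − w_global||² to each client's local objective, which pulls local weights back toward the current global model. The sketch below shows one local gradient step under that objective; the function name and the toy numbers are illustrative and are not taken from the paper's implementation.

```python
def fedprox_step(w, grad, w_global, lr, mu):
    """One local gradient step on f(w) + (mu / 2) * ||w - w_global||^2.

    grad is the gradient of the task loss f at w. The proximal term
    contributes mu * (w - w_global) to the total gradient.
    """
    return [
        wi - lr * (gi + mu * (wi - wgi))
        for wi, gi, wgi in zip(w, grad, w_global)
    ]


w_local = [1.0, -2.0]
task_grad = [0.5, 0.0]
w_server = [0.0, 0.0]

# mu = 0 gives a plain SGD step, as used in FedAvg local training.
print(fedprox_step(w_local, task_grad, w_server, lr=0.1, mu=0.0))
# mu > 0 additionally shrinks the gap to the server weights.
print(fedprox_step(w_local, task_grad, w_server, lr=0.1, mu=1.0))
```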

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PrivFedTalk sketches a federated privacy pipeline for personalized talking-head diffusion using local LoRA adapters and new weighting/regularization tricks, but the experiments stay too high-level to show whether those tricks deliver.


The paper combines a shared conditional latent diffusion backbone with per-client LoRA adapters trained on private audio-visual data, then aggregates via ISFA that weights updates by on-device identity consistency and temporal stability scalars, plus TDC to limit frame-to-frame drift. Secure aggregation and client-level differential privacy are layered on top, with some attention to low-memory and multi-GPU execution.
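The secure-aggregation layer mentioned in the note can be pictured with a toy pairwise-masking scheme: each pair of clients shares a random mask that one adds and the other subtracts, so the masks cancel in the server's sum while individual uploads look random. This is a simplified sketch of the general idea only. It omits key agreement, dropout handling, and fixed-point encoding of real-valued updates, and the seeded generator merely stands in for pairwise shared secrets.

```python
import random


def masked_updates(updates, seed=0, modulus=2**32):
    """Return integer updates with pairwise masks that cancel in the sum."""
    n = len(updates)
    dim = len(updates[0])
    rng = random.Random(seed)  # placeholder for per-pair shared secrets
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = [rng.randrange(modulus) for _ in range(dim)]
            for d in range(dim):
                # Client i adds the mask, client j subtracts the same mask.
                masked[i][d] = (masked[i][d] + mask[d]) % modulus
                masked[j][d] = (masked[j][d] - mask[d]) % modulus
    return masked


def server_sum(masked, modulus=2**32):
    """Sum the masked uploads; the pairwise masks cancel modulo modulus."""
    dim = len(masked[0])
    return [sum(m[d] for m in masked) % modulus for d in range(dim)]


client_updates = [[1, 2], [3, 4], [5, 6]]  # toy integer-encoded updates
uploads = masked_updates(client_updates)
print(server_sum(uploads))  # prints: [9, 12]
```

The server learns only the sum, which is why any per-client weighting scalar has to be folded into the update before masking or handled by a separate protocol.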

Referee Report

3 major / 2 minor

Summary. The paper proposes PrivFedTalk, a privacy-aware federated framework for personalized talking-head generation. It combines conditional latent diffusion models with parameter-efficient LoRA adapters for identity-specific fine-tuning on local client data. A key component is Identity-Stable Federated Aggregation (ISFA), which weights client LoRA updates using on-device computed scalars for identity consistency and temporal stability. Temporal-Denoising Consistency (TDC) regularization is used to maintain coherence across frames. Privacy protections include secure aggregation and client-level differential privacy. The abstract indicates that experiments comparing PrivFedTalk to FedAvg and FedProx under various conditions demonstrate stable optimization and successful training, supporting the feasibility of such privacy-preserving personalized generation, while noting the need for more rigorous evaluations.

Significance. This work tackles a timely and important challenge in privacy-preserving machine learning for generative models, particularly for sensitive applications like personalized video synthesis. By avoiding data sharing and using efficient adapters, it could enable collaborative training across devices. The integration of custom aggregation and regularization tailored to diffusion processes is conceptually novel. However, given the preliminary nature of the reported results, the significance is currently limited until stronger empirical support is provided. If validated, it may inspire further research in federated generative AI.

major comments (3)

[Abstract] The assertion of 'stable federated optimization and successful end-to-end training and evaluation under constrained resources' lacks any accompanying quantitative metrics, such as loss values, generation quality scores (e.g., FID, SSIM), identity similarity measures, or statistical significance tests. Without these, the central claim of feasibility cannot be properly evaluated.
[§4.2 (ISFA)] The weighting of client updates in ISFA depends on privacy-safe on-device scalars. There is no provided analysis, formula, or ablation study demonstrating that these scalars (identity consistency and temporal stability) are robust to data heterogeneity, do not leak identity information, or correlate with improved generation quality. This assumption is critical to distinguishing ISFA from standard methods like FedAvg.
[§5 (Experiments)] The manuscript itself states that 'stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.' This self-assessment highlights that the current experiments are insufficient to support the paper's conclusions, particularly regarding privacy-utility trade-offs and component contributions.

minor comments (2)

[Terminology] New terms such as ISFA and TDC are introduced without sufficient contrast to existing techniques in federated learning or diffusion model regularization.
[Implementation] Details on how low-memory GPU execution and multi-GPU training are achieved would benefit from pseudocode or specific configuration parameters.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript. We appreciate the recognition of the timeliness of privacy-preserving federated approaches for generative talking-head models. We address each major comment below and commit to revisions that strengthen the empirical and analytical support without overstating current results.


Referee: [Abstract] The assertion of 'stable federated optimization and successful end-to-end training and evaluation under constrained resources' lacks any accompanying quantitative metrics, such as loss values, generation quality scores (e.g., FID, SSIM), identity similarity measures, or statistical significance tests. Without these, the central claim of feasibility cannot be properly evaluated.

Authors: We agree that the abstract statement would be strengthened by explicit quantitative references. The manuscript reports comparative training runs across aggregation methods that exhibit convergence and successful local generation, but these are described qualitatively in the abstract. In the revised version we will update the abstract to cite specific observed outcomes from Section 5 (e.g., relative loss reduction and stability indicators) while preserving the feasibility focus, and we will add a concise metrics summary table.
Revision: yes
Referee: [§4.2 (ISFA)] The weighting of client updates in ISFA depends on privacy-safe on-device scalars. There is no provided analysis, formula, or ablation study demonstrating that these scalars (identity consistency and temporal stability) are robust to data heterogeneity, do not leak identity information, or correlate with improved generation quality. This assumption is critical to distinguishing ISFA from standard methods like FedAvg.

Authors: Section 4.2 defines ISFA as weighting LoRA updates by locally computed scalars for identity consistency and temporal stability, chosen to avoid raw data sharing. We acknowledge that the current text does not supply the explicit weighting formula, a formal privacy argument, or an ablation against FedAvg. In revision we will insert the mathematical definition of the scalar-based weighting, a short analysis showing that only aggregated scalars are shared under secure aggregation (preventing direct identity leakage), and an ablation table comparing ISFA to FedAvg/FedProx on stability and generation metrics.
Revision: yes
Referee: [§5 (Experiments)] The manuscript itself states that 'stronger component-wise, privacy-utility, and qualitative claims need further standardized evaluation.' This self-assessment highlights that the current experiments are insufficient to support the paper's conclusions, particularly regarding privacy-utility trade-offs and component contributions.

Authors: We note our own caveat in the manuscript. The experiments demonstrate end-to-end feasibility under resource and privacy constraints by comparing convergence behavior of PrivFedTalk against FedAvg and FedProx. To address the referee's concern we will expand Section 5 with component ablations for ISFA and TDC, privacy-utility curves across differential-privacy budgets, and additional standardized metrics (FID, identity cosine similarity) on held-out test sets.
Revision: yes
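To indicate what a sweep over differential-privacy budgets involves, the sketch below computes the noise scale of the classical Gaussian mechanism, sigma = sensitivity · sqrt(2 ln(1.25/delta)) / epsilon, which holds for 0 < epsilon < 1. It covers a single release only; training over many rounds needs a composition accountant, and the budgets and delta shown are hypothetical rather than values from the paper.

```python
import math


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise std for the classical Gaussian mechanism (single release)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this bound is stated for 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Hypothetical budgets; sensitivity would be the update clipping norm.
for eps in (0.1, 0.25, 0.5, 0.9):
    sigma = gaussian_sigma(eps, delta=1e-5, sensitivity=1.0)
    print(f"epsilon={eps}: sigma={sigma:.2f}")
```

A privacy-utility curve would then plot a quality metric, such as identity cosine similarity, against epsilon, with each point trained under the corresponding noise scale.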

Circularity Check

0 steps flagged

No significant circularity; framework described via experimental feasibility without self-referential derivations


The paper describes a federated diffusion framework (PrivFedTalk) with LoRA adapters, ISFA weighting via on-device scalars, and TDC regularization, supported by comparative experiments against FedAvg and FedProx. No equations, first-principles derivations, or predictions are presented in the abstract or high-level description that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on observed training stability and end-to-end feasibility under privacy constraints rather than internal consistency loops, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Based on abstract only; the framework assumes standard properties of diffusion models and LoRA adaptation hold in federated settings, with no explicit free parameters or new physical entities listed.

axioms (2)

domain assumption: Diffusion models can be trained collaboratively via federated averaging of a shared backbone while keeping identity-specific parameters local.
Invoked in the description of the shared backbone and per-client adapters.
domain assumption: On-device estimates of identity consistency and temporal stability can serve as privacy-safe weighting signals.
Central to the ISFA component.

invented entities (2)

Identity-Stable Federated Aggregation (ISFA) · no independent evidence
purpose: Weight client updates using privacy-safe scalar reliability signals from identity consistency and temporal stability.
New aggregation method introduced to handle heterogeneous client distributions.
Temporal-Denoising Consistency (TDC) regularization · no independent evidence
purpose: Reduce inter-frame drift, flicker, and identity drift during federated denoising.
New regularization term added to the training objective.
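Since the ledger lists TDC only by purpose, here is one guess at the shape such a regularizer could take: a penalty on the squared difference between denoised latents of consecutive frames, added to the usual denoising loss. The exact form of TDC is not given in the source, so the penalty, the weight `lam`, and the function names below are all assumptions.

```python
def tdc_penalty(denoised_frames):
    """Mean squared difference between consecutive denoised frame latents.

    denoised_frames: list of equal-length lists of floats, one per frame.
    Returns 0.0 when fewer than two frames are given.
    """
    if len(denoised_frames) < 2:
        return 0.0
    total = 0.0
    count = 0
    for prev, cur in zip(denoised_frames, denoised_frames[1:]):
        for a, b in zip(prev, cur):
            total += (b - a) ** 2
            count += 1
    return total / count


def total_loss(denoise_loss, denoised_frames, lam):
    """Denoising loss plus a weighted temporal-consistency penalty."""
    return denoise_loss + lam * tdc_penalty(denoised_frames)


# Toy latents: a jump between the first two frames, then no change.
frames = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
print(tdc_penalty(frames), total_loss(1.0, frames, lam=0.1))
```

A plain smoothness penalty like this also discourages legitimate motion such as lip movement, so any practical version would need a small weight or a motion-aware formulation.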



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Identity-Stable Federated Aggregation (ISFA) weights client updates using privacy-safe scalar reliability signals derived from on-device identity consistency and temporal stability estimates.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: Temporal-Denoising Consistency (TDC) regularization is introduced to reduce inter-frame drift, flicker, and identity drift during federated denoising.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Denoising diffusion probabilistic models,

J. Ho, A. N. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851

work page 2020

[2] [2]

High-resolution image synthesis with latent diffusion models,

R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 684–10 695

work page 2022

[3] [3]

Makeittalk: Speaker-aware talking- head animation,

Y . Zhou, X. Han, E. Shechtman, J. Echevarria, E. Kalogerakis, and D. Li, “Makeittalk: Speaker-aware talking- head animation,”ACM Transactions on Graphics, vol. 39, no. 6, pp. 221:1–221:15, 2020

work page 2020

[4] [4]

A lip sync expert is all you need for speech to lip generation in the wild,

K. R. Prajwal, R. Mukhopadhyay, V . Namboodiri, and C. V . Jawahar, “A lip sync expert is all you need for speech to lip generation in the wild,” inProceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 484–492

work page 2020

[5] [5]

HR-Net: A landmark based high realistic face reenactment network,

Q. Ren, Z. Lu, H. Wu, J. Zhang, and Z. Dong, “HR-Net: A landmark based high realistic face reenactment network,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 11, pp. 6347–6359, 2023

work page 2023

[6] [6]

Multimodal emotional talking face generation based on action units,

J. Lyu, X. Lan, G. Hu, H. Jiang, W. Gan, J. Wang, and J. F. Xue, “Multimodal emotional talking face generation based on action units,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4026–4038, 2025. 14 APREPRINT- APRIL10, 2026

work page 2025

[7] [7]

PFedLAH: Personalized federated learning with lookahead for adaptive cross-modal hashing,

Y . Chen, H. Lin, Z. Yang, and J. Long, “PFedLAH: Personalized federated learning with lookahead for adaptive cross-modal hashing,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. 8359–8371, 2025

work page 2025

[8] D. An, X. Zhang, D. Hao, R. Zhao, and Y.-S. Zhang, “Privacy-preserving image retrieval based on thumbnail-preserving visual features,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 8, pp. 7719–7731, 2025.

[9] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, “Communication-efficient learning of deep networks from decentralized data,” in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 54, 2017.

[10] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” in Proceedings of Machine Learning and Systems (MLSys), 2020.

[11] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017.

[12] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of Cryptography Conference (TCC), ser. Lecture Notes in Computer Science, vol. 3876, 2006, pp. 265–284.

[13] R. C. Geyer, T. Klein, and M. Nabi, “Differentially private federated learning: A client level perspective,” 2017.

[14] B. McMahan, D. Ramage, K. Talwar, and L. Zhang, “Learning differentially private recurrent language models,” in International Conference on Learning Representations (ICLR), 2018.

[15] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2Face: Real-time face capture and reenactment of RGB videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.

[16] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe, “First order motion model for image animation,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 7135–7145.

[17] Y. Guo, K. Chen, S. Liang, Y.-J. Liu, H. Bao, and J. Zhang, “AD-NeRF: Audio driven neural radiance fields for talking head synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 5764–5774.

[18] Y. Ren, G. Li, Y. Chen, T. H. Li, and S. Liu, “PIRenderer: Controllable portrait image generation via semantic neural rendering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13 739–13 748.

[19] W. Li, L. Zhang, D. Wang, B. Zhao, Z. Wang, M. Chen, B. Zhang, Z. Wang, L. Bo, and X. Li, “One-shot high-fidelity talking-head synthesis with deformable neural radiance field,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 17 969–17 978.

[20] S. Ma, Y. Weng, T. Shao, and K. Zhou, “3D Gaussian blendshapes for head avatar animation,” in ACM SIGGRAPH 2024 Conference Papers, 2024.

[21] K. Cho, J. Lee, H. Yoon, Y. Hong, J. Ko, S. Ahn, and S. Kim, “GaussianTalker: Real-time talking head synthesis with 3D Gaussian splatting,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 10 985–10 994.

[22] J. S. Chung and A. Zisserman, “Out of time: Automated lip sync in the wild,” in Computer Vision – ACCV 2016 Workshops, ser. Lecture Notes in Computer Science, vol. 10117. Springer, 2017, pp. 251–263.

[23] S. Aneja, J. Thies, A. Dai, and M. Nießner, “FaceTalk: Audio-driven motion diffusion for neural parametric head models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 21 263–21 273.

[24] Z. Yu, Z. Yin, D. Zhou, D. Wang, F. Wong, and B. Wang, “Talking head generation with probabilistic audio-to-visual diffusion priors,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7611–7621.

[25] H. Wei, Z. Yang, and Z. Wang, “AniPortrait: Audio-driven synthesis of photorealistic portrait animation,” 2024.

[26] V. K. Rakesh, S. Mazumdar, R. P. Maity, S. Pal, A. Das, and T. Samanta, “Advancements in talking head generation: A comprehensive review of techniques, metrics, and challenges,” The Visual Computer, vol. 42, p. 9, 2026.

[27] T. Chen, Z. Tan, T. Gong, Q. Chu, Y. Wu, B. Liu, N. Yu, L. Lu, and J. Ye, “Bootstrapping audio-visual video segmentation by strengthening audio cues,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 3, pp. 2398–2409, 2025.

[28] S. P. Karimireddy, S. Kale, M. Mohri, S. J. Reddi, S. U. Stich, and A. T. Suresh, “SCAFFOLD: Stochastic controlled averaging for federated learning,” in Proceedings of the 37th International Conference on Machine Learning (ICML), 2020, pp. 5132–5143.

[29] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang, “Deep learning with differential privacy,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2016, pp. 308–318.

[30] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.

[31] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership inference attacks against machine learning models,” in 2017 IEEE Symposium on Security and Privacy (SP), 2017, pp. 3–18.

[32] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” in Advances in Neural Information Processing Systems, 2019, pp. 14 747–14 756.

[33] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov, “How to backdoor federated learning,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS), ser. Proceedings of Machine Learning Research, vol. 108, 2020, pp. 2938–2948.

[34] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Efficient parametrization of multi-domain deep neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8119–8127.

[35] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, “Towards a unified view of parameter-efficient transfer learning,” in International Conference on Learning Representations (ICLR), 2022.

[36] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations (ICLR), 2022.

[37] M. Jian, R. Wang, X. Yu, F. Xu, H. Yu, and K. M. Lam, “UniFRD: A unified method for facial image restoration based on diffusion probabilistic model,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 13 494–13 506, 2024.

[38] T. Afouras, J. S. Chung, and A. Zisserman, “LRS3-TED: a large-scale dataset for visual speech recognition,” arXiv (Cornell University), 9 2018.

[39] Z. Zhang, L. Li, Y. Ding, and C. Fan, “Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 3660–3669.

Author Biographies

Soumya Mazumdar is a student researcher pursuing a B.S. in Data Science and Appl...