Secure Joint Source-Channel Coding of Multimodal Semantic Sources

Denis Kozlov; Mahtab Mirmohseni; Rahim Tafazolli

arxiv: 2605.14334 · v2 · pith:ZZ2YVBXNnew · submitted 2026-05-14 · 💻 cs.IT · math.IT

Pith reviewed 2026-05-15 02:02 UTC · model grok-4.3

classification 💻 cs.IT math.IT

keywords joint source-channel coding · multimodal sources · wiretap channels · secrecy bounds · rate-distortion-perception · equivocation constraints · information theoretic security

The pith

The fundamental secrecy limit for multimodal sources over wiretap channels decomposes into compression level, secret key rate, and wiretap channel statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines secure joint source-channel coding in which an encoder observes samples from an arbitrary non-empty subset of modalities, such as image, audio, and sensor data. The samples are sent over a discrete memoryless wiretap channel, and the legitimate receiver must reconstruct all modalities while meeting per-modality distortion and perception constraints plus per-subset equivocation requirements. The authors derive converse and achievability bounds and show that the secrecy limit separates into three operationally distinct parts. A reader would care because the separation shows how to allocate resources between source compression, key management, and channel properties when protecting complex multimodal transmissions.

Core claim

The paper establishes converse and achievability bounds on transmission rate, fidelity, and secrecy for joint source-channel coding of multimodal sources over discrete memoryless wiretap channels. Under per-modality distortion and perception constraints and per-subset equivocation constraints, the fundamental limit for secrecy consists of three operationally distinct components: the level of compression, the secret key rate, and the statistics of the wiretap channel.
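For orientation, the standard single-source quantities behind the secrecy and channel components can be written as follows. This is textbook background rather than the paper's theorem; the symbols $S^k$, $Z^n$, $V$, $X$, $Y$, $Z$ and the normalization by $k$ are notation assumed here for illustration.

```latex
% Equivocation rate at the eavesdropper: residual uncertainty about the
% source block S^k after observing the wiretap output Z^n
% (normalizing by the source block length k is an assumption)
\Delta_{\mathrm{eq}} = \frac{1}{k}\, H\bigl(S^k \mid Z^n\bigr)

% Secrecy capacity of a discrete memoryless wiretap channel p(y,z|x),
% where V -> X -> (Y,Z) forms a Markov chain
C_s = \max_{p(v,x)} \bigl[\, I(V;Y) - I(V;Z) \,\bigr]
```

How these quantities combine with the compression level and the secret key rate in the multimodal setting is the content of the paper's bounds and is not reproduced here.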

What carries the argument

The multimodal extension of the rate-distortion-perception function that incorporates per-subset equivocation constraints, allowing secrecy to be expressed as the sum of compression, key rate, and channel statistics.

If this is right

Secure coding schemes can optimize each of the three components independently to reach the secrecy limit.
The bounds remain valid no matter which non-empty subset of modalities the encoder observes.
Different modalities can be assigned distinct distortion and perception targets without losing the overall secrecy decomposition.
Equivocation can be tuned specifically for each possible subset of modalities that is transmitted.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decomposition suggests that semantic communication systems could adjust perception constraints to trade meaning preservation against secrecy without redesigning the entire scheme.
Real deployments handling correlated modalities like video and audio may need to verify how closely the three-component split holds when statistical dependence exceeds the model.
The framework could guide resource allocation in networks where secret key material is scarce but channel statistics are favorable.
Testing the bounds on measured multimodal datasets would reveal how much the i.i.d. assumption affects practical secrecy rates.

Load-bearing premise

The encoder observes i.i.d. samples of an arbitrary non-empty subset of the modalities, and the wiretap channel is discrete memoryless.

What would settle it

For a two-modality source with known statistics transmitted over a binary symmetric wiretap channel, measure the maximal achievable equivocation rate and test whether it exactly matches the sum of the compression term, the secret key rate, and the channel's secrecy capacity.
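Of the three terms in that test, the channel term has a standard closed form when both links are binary symmetric. A minimal sketch, assuming the legitimate and eavesdropper links are binary symmetric channels with crossover probabilities in [0, 1/2]; the function names are hypothetical and the compression and key-rate terms are not modeled.

```python
import math


def binary_entropy(p: float) -> float:
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bsc_wiretap_secrecy_capacity(p_main: float, p_eve: float) -> float:
    """Secrecy capacity in bits per channel use for a binary symmetric
    wiretap channel (assumed model, not taken from the paper).

    p_main: crossover probability of the legitimate receiver's channel.
    p_eve:  crossover probability of the eavesdropper's channel.
    Both are restricted to [0, 1/2]. When the eavesdropper's channel is
    the noisier one the capacity is h(p_eve) - h(p_main); otherwise it is 0.
    """
    for name, p in (("p_main", p_main), ("p_eve", p_eve)):
        if not 0.0 <= p <= 0.5:
            raise ValueError(f"{name} must lie in [0, 0.5], got {p}")
    return max(binary_entropy(p_eve) - binary_entropy(p_main), 0.0)


if __name__ == "__main__":
    # Hypothetical crossover probabilities chosen only for illustration.
    print(bsc_wiretap_secrecy_capacity(p_main=0.05, p_eve=0.20))
```

A measured equivocation rate could then be compared against this value plus whatever compression and key-rate terms the paper's theorem specifies.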

Figures

Figures reproduced from arXiv: 2605.14334 by Denis Kozlov, Mahtab Mirmohseni, Rahim Tafazolli.

**Figure 1.** System model. Accompanying notation excerpt: … sample of modality $i$, with $j \in [k]$. We extend this to subsets via $S_{A,j} \doteq \{S_{i,j}\}_{i \in A}$, which collects the $j$-th sample across all modalities in $A$. The complement of a set $A$ is denoted $A^c$. The complement of an event $E$ is denoted $\bar{E}$. We denote $[m] = \{1, \dots, m\}$. The power set of $[m]$ without the empty set is $2^{[m]}_* \doteq \{A \subseteq [m] : A \neq \emptyset\}$. For some $A \in 2^{[m]}_*$ we define parent sets as $\mathrm{pa}(A) \doteq$ …

read the original abstract

We study the problem of secure joint source-channel coding for multimodal semantic sources transmitted over noisy wiretap channels. The source model consists of $m$ modalities (e.g., image, audio, and sensor data), all represented as random variables. The encoder observes independent and identically distributed samples of an arbitrary non-empty subset of modalities. The samples are encoded and transmitted over a discrete memoryless wiretap channel. The legitimate receiver reconstructs all modalities. We extend the rate-distortion-perception problem formulation to multimodal sources. We establish converse and achievability bounds on the fundamental limits of transmission rate, fidelity, and secrecy, under per-modality distortion and perception constraints, and per-subset equivocation constraints. We show that the fundamental limit for secrecy consists of three operationally distinct components: the level of compression, the secret key rate, and the statistics of the wiretap channel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends secure JSCC to multimodal sources with a claimed three-component secrecy split, but the abstract leaves the proofs and subset-coupling details unverified.

The main new piece is the extension of joint source-channel coding with secrecy to multimodal sources where the encoder sees only a random non-empty subset of modalities while the receiver reconstructs the full set. It adds per-modality distortion and perception constraints plus per-subset equivocation at the wiretapper, then claims the secrecy limit decomposes into compression level, secret key rate, and wiretap channel statistics. That decomposition is the operational claim worth checking.

The setup is a reasonable step beyond single-modality or non-secure versions, and the i.i.d. discrete-memoryless assumptions keep the model standard for information theory. If the bounds hold, the three-part split could guide system design for things like secure sensor fusion or multimedia transmission. The formulation itself is clean and the abstract states both achievability and converse results, which is the right structure.

The soft spot is the potential coupling the stress-test note flags. When modalities are statistically dependent, the equivocation under partial observation may not separate additively into the three terms without leftover cross terms from the missing modalities. If the single-letter expressions treat the observed subset in isolation, the claimed operational separation could be an artifact rather than general. Since only the abstract is in front of me, I cannot verify whether the proofs make those cross terms vanish or simply assume them away. The i.i.d. and memoryless restrictions are also standard but narrow the scope.

This is for information theorists working on semantic communications or wiretap problems. A reader already familiar with rate-distortion-perception and secrecy would get the most from the formulation and the proposed decomposition. It deserves a serious referee because the problem is well-posed and the claims are specific enough to be tested against the derivations. Send it out for review.
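The cross-term concern in the letter can be illustrated numerically. A minimal sketch, assuming two binary modalities with a hypothetical joint distribution: it computes the marginal entropies, the joint entropy, and the mutual information $I(S_1;S_2)$, which is the amount by which the entropies fail to add when the modalities are dependent. It illustrates the worry only and does not reproduce the paper's single-letter expressions.

```python
import math
from typing import Dict, Sequence


def entropy(pmf: Sequence[float]) -> float:
    """Shannon entropy in bits of a probability mass function."""
    return -sum(p * math.log2(p) for p in pmf if p > 0.0)


def two_modality_terms(joint: Sequence[Sequence[float]]) -> Dict[str, float]:
    """Entropy terms for two modalities, where joint[a][b] = P(S1=a, S2=b)."""
    p1 = [sum(row) for row in joint]        # marginal of S1 (row sums)
    p2 = [sum(col) for col in zip(*joint)]  # marginal of S2 (column sums)
    h1, h2 = entropy(p1), entropy(p2)
    h12 = entropy([p for row in joint for p in row])
    return {
        "H(S1)": h1,
        "H(S2)": h2,
        "H(S1,S2)": h12,
        "I(S1;S2)": h1 + h2 - h12,  # gap between sum of marginals and joint
        "H(S2|S1)": h12 - h1,       # uncertainty left about the unobserved modality
    }


if __name__ == "__main__":
    # Hypothetical joint pmfs chosen only for illustration.
    independent = [[0.25, 0.25], [0.25, 0.25]]
    dependent = [[0.45, 0.05], [0.05, 0.45]]
    for name, joint in (("independent", independent), ("dependent", dependent)):
        print(name, two_modality_terms(joint))
```

In the independent case the mutual information is zero and the entropies add; in the dependent case the positive $I(S_1;S_2)$ is the kind of term the letter asks the proofs to account for.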

Referee Report

2 major / 2 minor

Summary. The paper investigates secure joint source-channel coding for multimodal sources consisting of m modalities (e.g., image, audio) transmitted over a discrete memoryless wiretap channel. The encoder observes i.i.d. samples from an arbitrary non-empty random subset of modalities, encodes them, and transmits over the channel; the legitimate receiver must reconstruct the full multimodal source. The work extends the rate-distortion-perception framework to this multimodal setting and derives converse and achievability bounds on the fundamental limits of transmission rate, per-modality distortion/perception, and per-subset equivocation. The central claim is that the secrecy limit decomposes operationally into three distinct components: the level of compression (via rate-distortion-perception), the secret key rate, and the statistics of the wiretap channel.

Significance. If the claimed decomposition is rigorously established, the result would provide a clean operational separation of secrecy contributions in multimodal semantic communication, which is novel and could inform practical secure coding schemes that allocate resources separately to compression, key generation, and channel coding. The extension of rate-distortion-perception to multimodal sources with subset observation is a useful technical contribution, and the per-subset equivocation model captures realistic partial-observation scenarios.

major comments (2)

[Abstract / Sec. IV (main theorem)] Abstract and main result (presumably Theorem 1 or equivalent in Sec. IV): the claimed operational decomposition of secrecy into compression level, secret key rate, and wiretap statistics requires that cross terms arising from statistical dependence among modalities vanish under random subset observation. The per-subset equivocation constraint combined with full multimodal reconstruction at the receiver may induce non-additive coupling through the joint distribution; the converse and achievability must explicitly demonstrate that single-letter expressions remain additive without residual dependence terms. Without this verification, the separation may be an artifact of the modeling choice rather than a general fact.
[Sec. V (converse)] Converse proof (Sec. V or equivalent): the bound on equivocation must account for the fact that the encoder observes only a random subset while the receiver reconstructs all modalities. It is unclear whether the single-letter characterization fully incorporates the inference of missing modalities via the joint source distribution or whether additional mutual-information terms appear that couple the three claimed components.

minor comments (2)

[Sec. II (problem formulation)] Notation: the random subset selection process and the per-subset equivocation measure should be defined with an explicit indicator random variable or probability distribution over subsets to avoid ambiguity in the problem statement.
[Sec. III] The extension of the rate-distortion-perception function to multimodal sources is stated but the precise definition of the perception constraint across modalities (e.g., whether it is joint or per-modality) could be clarified with an equation.
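The per-modality versus joint distinction raised in the second minor comment can be made concrete. The first display is the standard single-source rate-distortion-perception function; the second and third are two candidate multimodal constraint sets, written here only as an illustration. The symbols $d_i$, $\delta_i$, $D_i$, $P_i$, and $\delta$ are assumed notation, and neither variant is claimed to be the paper's definition.

```latex
% Standard single-source rate-distortion-perception function:
% d is a distortion measure, \delta a divergence between distributions
R(D,P) = \min_{p_{\hat{S}|S}} I(S;\hat{S})
\quad \text{s.t.} \quad
\mathbb{E}\bigl[d(S,\hat{S})\bigr] \le D, \qquad
\delta\bigl(p_S, p_{\hat{S}}\bigr) \le P

% Candidate (a): per-modality perception constraints, for every i in [m]
\mathbb{E}\bigl[d_i(S_i,\hat{S}_i)\bigr] \le D_i, \qquad
\delta_i\bigl(p_{S_i}, p_{\hat{S}_i}\bigr) \le P_i

% Candidate (b): per-modality distortion, one joint perception constraint
\mathbb{E}\bigl[d_i(S_i,\hat{S}_i)\bigr] \le D_i \;\; \forall i \in [m], \qquad
\delta\bigl(p_{S_{[m]}}, p_{\hat{S}_{[m]}}\bigr) \le P
```

Candidate (b) constrains cross-modal consistency of the reconstruction, while candidate (a) does not, which is why stating the choice with an equation matters.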

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity.

Referee: [Abstract / Sec. IV (main theorem)] Abstract and main result (presumably Theorem 1 or equivalent in Sec. IV): the claimed operational decomposition of secrecy into compression level, secret key rate, and wiretap statistics requires that cross terms arising from statistical dependence among modalities vanish under random subset observation. The per-subset equivocation constraint combined with full multimodal reconstruction at the receiver may induce non-additive coupling through the joint distribution; the converse and achievability must explicitly demonstrate that single-letter expressions remain additive without residual dependence terms. Without this verification, the separation may be an artifact of the modeling choice rather than a general fact.

Authors: We thank the referee for highlighting this point. In the achievability and converse proofs of the main result (Theorem 1), the subset selection is independent of the source realizations and i.i.d. across time. This independence, combined with the per-subset definition of equivocation, ensures that cross terms from modality dependence factor out and are absorbed into the rate-distortion-perception component. The single-letter expressions remain additive with no residual coupling. We will add a clarifying remark and a short supporting lemma immediately after the theorem statement in the revised manuscript.

revision: yes

Referee: [Sec. V (converse)] Converse proof (Sec. V or equivalent): the bound on equivocation must account for the fact that the encoder observes only a random subset while the receiver reconstructs all modalities. It is unclear whether the single-letter characterization fully incorporates the inference of missing modalities via the joint source distribution or whether additional mutual-information terms appear that couple the three claimed components.

Authors: The converse derivation in Section V explicitly uses the joint source distribution to bound the equivocation when the encoder observes only a random subset. The inference of unobserved modalities at the receiver is accounted for within the compression-level term of the secrecy limit; the resulting single-letter bound separates cleanly into the three components without additional coupling terms. We agree the presentation of this step can be strengthened and will expand the relevant information-inequality steps and add a short explanatory paragraph in the revised Section V.

revision: yes

Circularity Check

0 steps flagged

No circularity; bounds derived from standard single-letter information measures

The paper extends the rate-distortion-perception problem to multimodal sources observed over random subsets and derives converse and achievability bounds on rate, distortion, perception, and equivocation for a DM wiretap channel. The stated decomposition of the secrecy limit into compression level, secret key rate, and wiretap statistics is obtained directly from the single-letter expressions for mutual information and equivocation rate under the per-subset and per-modality constraints; none of these quantities is defined in terms of the others or obtained by fitting parameters to the target result. The model assumptions (i.i.d. samples, arbitrary non-empty subset, discrete memoryless channel) are stated explicitly and do not embed the final decomposition. No load-bearing step reduces to a self-citation, ansatz smuggled via citation, or renaming of a known empirical pattern. The derivation is therefore self-contained against external information-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard discrete memoryless channel and i.i.d. source assumptions plus the extension of rate-distortion-perception to multimodal per-modality constraints; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: The encoder observes i.i.d. samples of an arbitrary non-empty subset of modalities. (Stated in the abstract as the source model.)
domain assumption: The channel is a discrete memoryless wiretap channel. (Stated in the abstract.)

pith-pipeline@v0.9.0 · 5453 in / 1327 out tokens · 37347 ms · 2026-05-15T02:02:03.066345+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We show that the fundamental limit for secrecy consists of three operationally distinct components: the level of compression, the secret key rate, and the statistics of the wiretap channel.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery of Peano arithmetic · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: the multimodal rate-distortion-perception function is given by $R(D,P) = \inf I(S_E; \hat{S})$ s.t. per-modality distortion and perception constraints

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.