ChronoSC: Task-Oriented Semantic Communication via Temporal-to-Color Encoding

Phuc H. Nguyen; Quy N. Duong; Trung T. Nguyen; Van-Dinh Nguyen

arxiv: 2605.16388 · v1 · pith:WK6DCDOEnew · submitted 2026-05-11 · 💻 cs.CV

Pith reviewed 2026-05-20 22:05 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic communication · video question answering · temporal encoding · deep joint source-channel coding · bandwidth reduction · Chrono-Color Stacking · CLEVRER dataset

The pith

Chrono-Color Stacking projects video temporal dynamics into one static image to enable extreme bandwidth reduction for VideoQA while reusing pre-trained models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a semantic communication system for video question answering that first collapses the time axis of a video clip into color variations across a single composite frame. This compact image travels through a joint source-channel coder and is reconstructed explicitly at the receiver so that an off-the-shelf vision-language model can answer questions directly. The approach avoids pixel-perfect video reconstruction or heavy spatiotemporal networks and instead measures success by end-task accuracy. Experiments indicate that the resulting pipeline uses far less spectrum than sending raw frames yet still supports reliable answers on the CLEVRER benchmark.

Core claim

ChronoSC shows that a lightweight, lossless temporal-to-color projection called Chrono-Color Stacking can encode the dynamics of a video sequence into one static image, which a DeepJSCC transceiver then transmits and reconstructs; the resulting image is fed to a frozen BLIP model that achieves high VideoQA accuracy, delivering up to 192 times bandwidth savings relative to raw video transmission on the CLEVRER dataset.
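The 192 times figure can be related to the bandwidth compression ratio quoted in the passage excerpt further down this page, BCR = k/(T×H×W×3) with T=8. The sketch below is only an arithmetic illustration: it assumes k is the number of transmitted channel symbols and that the reduction factor is the reciprocal of BCR. The 64×64 resolution and k = 512 are hypothetical values chosen so the arithmetic lands on 192; the paper's actual settings are not available on this page.

```python
from fractions import Fraction


def bandwidth_compression_ratio(k, T, H, W, channels=3):
    """BCR = k / (T * H * W * channels), as quoted in the page excerpt.

    k is assumed here to be the number of transmitted channel symbols.
    """
    return Fraction(k, T * H * W * channels)


def reduction_factor(k, T, H, W, channels=3):
    """Reciprocal of BCR: how many raw values each transmitted symbol replaces."""
    return 1 / bandwidth_compression_ratio(k, T, H, W, channels)


def symbols_for_reduction(target, T, H, W, channels=3):
    """Symbol budget k that would give exactly `target` times reduction."""
    return Fraction(T * H * W * channels, target)


# Hypothetical clip: T=8 frames (from the excerpt), 64x64 RGB (assumed).
k = symbols_for_reduction(192, T=8, H=64, W=64)
print("symbols per clip:", k)
print("reduction factor:", reduction_factor(int(k), T=8, H=64, W=64))
```

Under these assumptions, stacking 8 frames into one image accounts for a factor of 8, and the remaining factor of 24 would have to come from the DeepJSCC encoder compressing the stacked image.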

What carries the argument

Chrono-Color Stacking, a projection that maps successive video frames into the color channels of a single static image to compress temporal information before transmission.
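A minimal sketch of what such a projection might look like, based only on the excerpt quoted later on this page (θ_t ← t/T · θ_max; F'_t ← HueRotate; I_sem ← max_t(F')). It is not the authors' implementation. The value of `theta_max`, the per-channel pixelwise maximum, and the representation of frames as nested lists of RGB floats in [0, 1] are all assumptions made for illustration.

```python
import colorsys


def hue_rotate(pixel, degrees):
    """Rotate the hue of one RGB pixel (floats in [0, 1]) by `degrees`."""
    r, g, b = pixel
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360.0) % 1.0
    return colorsys.hsv_to_rgb(h, s, v)


def chrono_color_stack(frames, theta_max=240.0):
    """Collapse T frames into one image by hue-coding time.

    frames: list of T frames; each frame is a list of rows; each row is a
    list of (r, g, b) tuples. theta_max (degrees) is a hypothetical choice.
    """
    T = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    stacked = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    for t, frame in enumerate(frames):
        theta_t = (t / T) * theta_max  # later frames get a larger hue shift
        for y in range(height):
            for x in range(width):
                rotated = hue_rotate(frame[y][x], theta_t)
                # Assumed combination rule: per-channel pixelwise maximum.
                stacked[y][x] = tuple(
                    max(a, b) for a, b in zip(stacked[y][x], rotated)
                )
    return stacked


# A red dot moving left to right across a 1x3 black image over 3 frames.
frames = [
    [[(1.0, 0.0, 0.0) if x == t else (0.0, 0.0, 0.0) for x in range(3)]]
    for t in range(3)
]
trail = chrono_color_stack(frames)
print(trail[0])
```

In this toy case the dot leaves a trail whose hue increases with time, so motion direction can be read from a single image. Under the assumed maximum rule, two frames that light up the same pixel are merged, so whether the projection is lossless in the paper's sense depends on details not available here.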

If this is right

Video question answering can be performed with data volumes comparable to single images rather than full video streams.
Pre-trained vision-language models can be used at the receiver without task-specific fine-tuning or retraining on compressed representations.
Semantic communication systems for video can prioritize task accuracy over visual fidelity.
Low-resource edge devices become viable for video reasoning tasks when bandwidth is limited.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same temporal-to-color idea could extend to other video tasks such as action recognition or event detection where timing matters.
Real-time analytics pipelines at the network edge would see lower latency and energy use from the reduced transmission volume.
Datasets with faster motion or denser object interactions would expose whether the color stacking loses critical timing cues.
Adding lightweight error protection targeted at temporal features could further stabilize accuracy under severe channel noise.

Load-bearing premise

The stacking projection keeps all task-relevant temporal information intact and the reconstructed image remains clear enough for a pre-trained model to answer questions correctly without retraining.

What would settle it

A controlled test on CLEVRER videos in which temporal events essential to correct answers, such as object interaction order, produce systematically wrong outputs after chrono-image reconstruction and noisy transmission.

Figures

Figures reproduced from arXiv: 2605.16388 by Phuc H. Nguyen, Quy N. Duong, Trung T. Nguyen, Van-Dinh Nguyen.

**Figure 1.** End-to-end SC framework for ChronoSC, where the input video is compressed into a single stacked semantic image

**Figure 2.** Comparison of VQA accuracy across different methods.

**Figure 3.** Visualization of semantic transmission under extreme noise (SNR = 0 dB): (a) Sampled input video frames showing a moving sphere and a stationary cylinder. (b) Chrono-Color Stacking encodes the temporal dynamics into a single static image (i). Despite severe channel noise (ii), the resulting color trails remain discernible, enabling the fine-tuned BLIP decoder to correctly infer motion and object attributes…
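To make the SNR = 0 dB setting in Figure 3 concrete: at 0 dB the noise power equals the signal power. The sketch below adds real-valued Gaussian noise to a symbol vector at a chosen SNR. It is an illustrative assumption, not the paper's channel model; DeepJSCC systems commonly use complex symbols with an explicit power constraint, and the function name `awgn` is hypothetical.

```python
import math
import random


def awgn(symbols, snr_db, rng=None):
    """Add white Gaussian noise so that signal power / noise power = SNR.

    symbols: list of real-valued channel symbols.
    snr_db:  target signal-to-noise ratio in decibels.
    """
    rng = rng or random.Random(0)
    signal_power = sum(s * s for s in symbols) / len(symbols)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in symbols]


# Unit-power symbols sent through a 0 dB channel.
clean = [1.0, -1.0] * 5000
noisy = awgn(clean, snr_db=0.0, rng=random.Random(1))
measured = sum((n - c) ** 2 for n, c in zip(noisy, clean)) / len(clean)
print("measured noise power:", round(measured, 3))
```

With unit-power symbols the measured noise power should come out close to 1, which is the regime in which the caption reports that the color trails remain discernible.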

Abstract

Semantic communication (SC) aims to reduce transmission overhead by conveying task-relevant information rather than raw data. However, existing SC approaches for video largely focus on pixel-level reconstruction or rely on complex spatiotemporal pipelines, leading to excessive bandwidth usage and latency that are unsuitable for low-resource deployments. In this paper, we propose ChronoSC, a task-oriented semantic communication framework for Video Question Answering (VideoQA). ChronoSC introduces Chrono-Color Stacking, a lightweight and lossless projection scheme that encodes temporal video dynamics into a single static image, enabling extreme temporal compression before transmission. This compact semantic representation is transmitted using a lightweight Deep Joint Source-Channel Coding (DeepJSCC) transceiver and explicitly reconstructed at the receiver. Unlike latent-space methods, explicit visual reconstruction enables the direct reuse of pre-trained vision-language models; specifically, a pre-trained BLIP model is employed to infer answers from noisy, reconstructed chrono-images. Experiments on the CLEVRER dataset show that ChronoSC achieves up to 192 times bandwidth reduction compared to raw video transmission while maintaining high VideoQA accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ChronoSC turns video time into color in one image for cheap transmission and direct BLIP reuse on VideoQA, but the 192x bandwidth claim rests on unshown checks that the encoding and reconstruction keep event order intact under noise.

The main thing here is that ChronoSC stacks temporal video information into the colors of a single static image, sends the compact result over a DeepJSCC link, reconstructs the image explicitly, and feeds it straight to a pre-trained BLIP model for VideoQA. On CLEVRER this reportedly cuts bandwidth by up to 192 times versus raw video while holding accuracy up.

The explicit reconstruction step is the practical piece that lets them avoid training a new model from scratch and just reuse an existing vision-language one. That choice keeps the whole pipeline lightweight and suitable for low-resource devices, which is a clear engineering win over heavier spatiotemporal pipelines common in prior semantic communication work. The projection itself is presented as lossless, which is what makes the bandwidth numbers possible without losing task-relevant dynamics. The paper shows honest engagement with the usual trade-offs in this area by focusing on direct visual output rather than opaque latents.

The soft spots sit in the missing verification steps. The abstract states that the color mapping preserves temporal order and that reconstruction quality is good enough for off-the-shelf BLIP, yet it gives no ablations on how distinct sequences map to unique hues or how channel noise shifts those values and scrambles cause-effect chains. CLEVRER questions hinge on precise collision order and trajectories, so any collapse or distortion there would undercut the accuracy claim. The stress-test note on causal order under reconstruction noise matches what is not shown.

This work is for people in task-oriented semantic communication who want a deployable compression trick that plays nicely with current vision models. A reader looking for practical bandwidth savings in video QA would get something concrete to test. It deserves peer review so the authors can supply the fidelity checks and noise ablations that would let others judge whether the central numbers hold.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes ChronoSC, a task-oriented semantic communication framework for Video Question Answering. It introduces Chrono-Color Stacking, a projection that encodes temporal video dynamics into a single static image for extreme compression. This representation is transmitted via a lightweight DeepJSCC transceiver, explicitly reconstructed at the receiver, and fed to a pre-trained BLIP model for inference. Experiments on the CLEVRER dataset report up to 192 times bandwidth reduction relative to raw video transmission while maintaining high VideoQA accuracy.

Significance. If the central claims hold, the work offers a practical route to extreme bandwidth reduction for video tasks by converting temporal information into a static image that reuses off-the-shelf vision-language models without task-specific fine-tuning. This could be valuable for low-resource deployments where complex spatiotemporal pipelines are prohibitive.

major comments (1)

Experimental evaluation: The headline result of 192x bandwidth reduction with high VideoQA accuracy on CLEVRER rests on the unverified assumption that Chrono-Color Stacking preserves causal event order (e.g., collision sequences and object trajectories) under DeepJSCC reconstruction noise. No ablation is reported on temporal fidelity metrics or color-distortion effects that would directly test whether the pre-trained BLIP can recover task-critical dynamics without fine-tuning.

minor comments (2)

Notation: The description of the Chrono-Color Stacking projection would benefit from an explicit equation or pseudocode block showing how frame indices are mapped to color channels.
Figure clarity: The diagram illustrating the overall pipeline should include a side-by-side comparison of original video frames, the stacked chrono-image, and the reconstructed version to allow visual assessment of information loss.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the single major comment on experimental evaluation below and agree that additional analysis will strengthen the claims.

Referee: Experimental evaluation: The headline result of 192x bandwidth reduction with high VideoQA accuracy on CLEVRER rests on the unverified assumption that Chrono-Color Stacking preserves causal event order (e.g., collision sequences and object trajectories) under DeepJSCC reconstruction noise. No ablation is reported on temporal fidelity metrics or color-distortion effects that would directly test whether the pre-trained BLIP can recover task-critical dynamics without fine-tuning.

Authors: We appreciate this observation. Chrono-Color Stacking is a deterministic, order-preserving projection that maps video frames to distinct color channels of a single image while retaining exact temporal sequence information prior to transmission. The end-to-end DeepJSCC training and the downstream VideoQA accuracy on CLEVRER (which explicitly queries causal events, trajectories, and collision sequences) provide indirect but task-relevant evidence that critical dynamics survive reconstruction noise. Nevertheless, we agree that explicit ablations would be more direct. In the revised manuscript we will add: (i) quantitative color-distortion analysis (per-channel PSNR and CIEDE2000) between original and reconstructed chrono-images, and (ii) a temporal-fidelity metric consisting of a lightweight frame-order classifier applied to the reconstructed images, together with its correlation to VideoQA accuracy. These additions will be reported without any fine-tuning of the pre-trained BLIP model. revision: yes
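The per-channel PSNR part of the proposed color-distortion analysis could be computed along the following lines. This is a sketch under assumptions, not the authors' evaluation code: images are taken to be nested lists of RGB floats in [0, 1], and the function name `per_channel_psnr` is hypothetical. CIEDE2000 is left out because it needs a full color-difference implementation.

```python
import math


def per_channel_psnr(original, reconstructed, peak=1.0):
    """Return [PSNR_R, PSNR_G, PSNR_B] in dB for two same-sized RGB images.

    Each image is a list of rows, each row a list of (r, g, b) tuples with
    values in [0, peak]. A channel with zero error yields math.inf.
    """
    squared_error = [0.0, 0.0, 0.0]
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for pix_o, pix_r in zip(row_o, row_r):
            for c in range(3):
                squared_error[c] += (pix_o[c] - pix_r[c]) ** 2
            count += 1
    psnr = []
    for total in squared_error:
        mse = total / count
        psnr.append(math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse))
    return psnr


# Toy check: only the red channel is perturbed, by 0.1 per pixel.
chrono = [[(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]]
received = [[(0.6, 0.5, 0.5), (0.6, 0.5, 0.5)]]
print(per_channel_psnr(chrono, received))
```

Reporting the three channels separately matters here because, if time is encoded in hue, a distortion concentrated in one channel could shift the apparent timing of an event even when the overall PSNR looks acceptable.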

Circularity Check

0 steps flagged

No circularity: framework and results are independent engineering proposal

The paper introduces Chrono-Color Stacking as a new projection method to encode temporal video into a static image, transmits it via DeepJSCC, and evaluates end-to-end on the external CLEVRER dataset using a pre-trained BLIP model. Bandwidth reduction (192x) and VideoQA accuracy are reported from direct experiments comparing to raw transmission; no equations define a quantity in terms of itself, no fitted parameters are relabeled as predictions, and no self-citations form a load-bearing chain. The derivation chain consists of a concrete pipeline whose performance claims rest on external benchmarks rather than internal redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review performed on abstract only; full methods, equations, and experimental details unavailable. No free parameters, axioms, or invented entities can be audited beyond the high-level scheme names.

invented entities (1)

Chrono-Color Stacking (no independent evidence)
purpose: Lossless projection of temporal video dynamics into a single static image
Introduced as the core lightweight encoding scheme in the abstract.

pith-pipeline@v0.9.0 · 5729 in / 1234 out tokens · 74409 ms · 2026-05-20T22:05:16.560599+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean · period8, flipAt512 · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Passage: Chrono-Color Stacking... parameter-free projection... θ_t ← t/T · θ_max; F'_t ← HueRotate... I_sem ← max_t(F')... T=8 frames... BCR = k/(T×H×W×3)

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z, z_monotone_absolute · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

Passage: lossless projection scheme that encodes temporal video dynamics... explicit visual reconstruction enables direct reuse of pre-trained vision-language models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] S. H. Wanasekara, V.-D. Nguyen, K.-S. Wong, M.-D. Nguyen, S. Chatzinotas, and O. A. Dobre, “SC-GIR: Goal-oriented semantic communication via invariant representation learning for image transmission,” IEEE Trans. Mobi. Comput., pp. 1–15, 2025.

[2] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Trans. Circ. Syst. Video Tech., vol. 13, no. 7, pp. 560–576, 2003.

[3] Z. Qin, X. Tao, J. Lu, W. Tong, and G. Y. Li, “Semantic communications: Principles and challenges,” arXiv preprint arXiv:2201.01389, 2021.

[4] G. Li, S. Wang, Z. Gao, Q. Guo, and G. Y. Li, “VideoQA-SC: Adaptive semantic communication for video question answering,” IEEE J. Sel. Areas Commun., 2025, (To appear).

[5] Z. Wu, N. Dvornik, K. Greff, T. Kipf, and A. Garg, “Slotformer: Unsupervised visual dynamics simulation with object-centric models,” arXiv preprint arXiv:2210.05861, 2022.

[6] T. M. Le, V. Le, S. Venkatesh, and T. Tran, “Hierarchical conditional relation networks for video question answering,” in Proc. IEEE/CVF Conf. Comp. Visi. Patt. Recog. (CVPR), 2020, pp. 9969–9978.

[7] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proc. Inter. Conf. Machine Learning (ICML), 2022.

[8] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum, “CLEVRER: Collision events for video representation and reasoning,” in Proc. Inter. Conf. Machine Learning (ICML), 2020.

[9] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Trans. Cogn. Commun. Networ., vol. 5, no. 3, pp. 567–579, 2019.

[10] K. Yang, S. Wang, J. Dai, X. Qin, K. Niu, and P. Zhang, “SwinJSCC: Taming Swin transformer for deep joint source-channel coding,” IEEE Trans. Cogn. Commun. Networ., 2024.

[11] H. Xie, Z. Qin, and G. Y. Li, “Task-oriented multi-user semantic communications for VQA task,” IEEE Wireless Commun. Lett., vol. 11, no. 3, pp. 553–557, 2021.

[12] E. Yosef, S. Elmalem, and R. Giryes, “Video reconstruction from a single motion blurred image using learned dynamic phase coding,” Scientific Reports, vol. 13, p. 13625, 2023.

[13] Y. Zhong et al., “VDM-MD: Video diffusion model for motion deblurring,” arXiv preprint arXiv:2501.12604, 2025.
