TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control

Haoran Wang; Jun Li; Meng Yee Michael Chuah; Ruofei Bai; Wei Yun Yau; Yuteng Sun; Zhengguo Li

arxiv: 2601.14945 · v2 · pith:NZYKX3PJnew · submitted 2026-01-21 · 💻 cs.RO · cs.AI

Yuteng Sun, Haoran Wang, Ruofei Bai, Zhengguo Li, Jun Li, Meng Yee Michael Chuah, Wei Yun Yau

classification 💻 cs.RO cs.AI

keywords semantic · tidal · loop · action · baselines · execution · high-frequency · latency

Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency, limiting them to a low-frequency batch-and-execute paradigm. This frequency mismatch creates an execution blind spot, causing failures in dynamic environments where targets move during the open-loop execution window. We propose TIDAL (Temporally Interleaved Diffusion and Action Loop), a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture to redistribute the computational budget. Specifically, a low-frequency macro-intent loop caches semantic embeddings, while a high-frequency micro-control loop interleaves single-step flow integration with execution. This design enables approximately 9 Hz control updates on edge hardware (vs. approximately 2.4 Hz for baselines) without increasing marginal overhead. To handle the resulting latency shift, we introduce a temporally misaligned training strategy in which the policy learns predictive compensation using stale semantic intent alongside real-time proprioception. Additionally, we address the insensitivity of static vision encoders to velocity by incorporating a differential motion predictor. TIDAL is architectural, making it orthogonal to system-level optimizations. Experiments show a 2x performance gain over open-loop baselines in dynamic interception tasks. Despite a marginal regression in static success rates, our approach yields a 4x increase in feedback frequency and extends the effective horizon of semantic embeddings beyond the native action chunk size. Under non-paused inference protocols, TIDAL remains robust where standard baselines fail due to latency.
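The dual-frequency idea in the abstract can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the authors' implementation: the function `interleaved_control` and its parameters (`macro_period`, `encode_intent`, `velocity`, `read_proprio`, `dt`) are hypothetical names, the two loops run sequentially in one thread rather than concurrently, and the velocity field is a hand-written stand-in for a learned flow model.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def interleaved_control(
    num_ticks: int,
    macro_period: int,
    encode_intent: Callable[[int], Vector],
    velocity: Callable[[Vector, Vector, Vector], Vector],
    read_proprio: Callable[[int], Vector],
    dt: float,
    action0: Vector,
) -> Tuple[List[Vector], int]:
    """Toy interleaving of a slow semantic update with fast control steps.

    All callables are hypothetical placeholders for illustration only.
    """
    action = list(action0)
    cached_intent: Vector = []
    executed: List[Vector] = []
    macro_updates = 0

    for tick in range(num_ticks):
        # Low-frequency macro-intent loop: refresh the cached semantic
        # embedding only every `macro_period` ticks.
        if tick % macro_period == 0:
            cached_intent = encode_intent(tick)
            macro_updates += 1

        # High-frequency micro-control loop: one Euler step of the flow,
        # conditioned on the (possibly stale) cached intent and on fresh
        # proprioception, followed immediately by execution.
        proprio = read_proprio(tick)
        v = velocity(action, cached_intent, proprio)
        action = [a + dt * vi for a, vi in zip(action, v)]
        executed.append(list(action))  # stand-in for sending a command

    return executed, macro_updates


# Toy usage: the "intent" is a fixed target and the flow points toward it.
target = [1.0, -2.0]
executed, macro_updates = interleaved_control(
    num_ticks=12,
    macro_period=4,
    encode_intent=lambda tick: target,
    velocity=lambda a, z, p: [zi - ai for ai, zi in zip(a, z)],
    read_proprio=lambda tick: [0.0, 0.0],
    dt=0.5,
    action0=[0.0, 0.0],
)
```

In this toy run the expensive encoder is called on 3 of the 12 ticks, while an action is issued on every tick. That is the budget redistribution the abstract describes: each control update costs a single integration step rather than a full semantic pass.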

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Flash-WAM: Modality-Aware Distillation for World Action Models
cs.LG · 2026-06 · unverdicted · novelty 6.0

Flash-WAM introduces modality-specific consistency parametrizations to distill joint video-action diffusion models to single-step inference, delivering 23x speedup with preserved benchmark performance.