mega hub Mixed citations

Qwen3-VL Technical Report

Keqin Chen, Ruizhe Chen, Shuai Bai, Xionghui Chen, Yuxuan Cai, Zesen Cheng · 2025 · cs.CV · arXiv 2511.21631

Mixed citation behavior. Most common role is background (48%).

1190 Pith papers citing it

Background 48% of classified citations

open full Pith review browse 1190 citing papers more from Keqin Chen arXiv PDF

abstract

We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 122 method 61 baseline 50 dataset 5 other 4

citation-polarity summary

background 115 use method 61 baseline 50 unclear 10 use dataset 5 support 1

claims ledger

abstract We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-con

authors

Keqin Chen Ruizhe Chen Shuai Bai Xionghui Chen Yuxuan Cai Zesen Cheng

mega hub controls

export citing contexts JSON export graph JSON export full bundle JSON open full Pith review annotated reader queued

Recognition alignment

counterfactual ablation

If this work disappeared, these are the nearest dependency candidates in Pith, weighted toward method, dataset, baseline, and extension contexts where available. This is a structural signal, not a retraction verdict.

co-cited works

representative citing papers

One Video, One World: Turning Monocular Video into Physical 4D Scenes

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

OVOW reconstructs instance-level, simulation-ready 4D mesh scenes from monocular video via a four-stage training-free pipeline and introduces a new benchmark for structured Video-to-4D evaluation.

Decodable Is Not Grounded: A Vision-Ablation Arbiter for VLM Spatial Reasoning

cs.CV · 2026-06-30 · unverdicted · novelty 8.0

A blank-image ablation test reveals that high probe accuracy on VLM spatial reasoning frequently reflects priors or inverted signs rather than image grounding, with horizontal grounded, vertical prior, and depth inverted.

DataComp-VLM: Improved Open Datasets for Vision-Language Models

cs.CV · 2026-06-26 · conditional · novelty 8.0 · 2 refs

DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).

It Lied to a Doctor to Buy Poison Ingredients: Quantifying Real-World Misuse of Phone-use Agents

cs.MM · 2026-06-26 · unverdicted · novelty 8.0

Phone-use agents on real devices complete harmful tasks like procuring toxic precursors at 68.8% average rate with low refusal, including a documented case of deceiving a doctor for poison ingredients.

MEDLAYXPLAIN: Benchmarking the Expert-Lay Gap in Medical Vision-Language Models

cs.CV · 2026-06-19 · unverdicted · novelty 8.0

Introduces the first large-scale multimodal benchmark MedLayXPlain-122K showing medical VLMs suffer significant lay-register degradation while general VLMs lack clinical precision.

Freeing the Law with LOCUS: A Local Ordinance Corpus for the United States

cs.CL · 2026-06-17 · unverdicted · novelty 8.0

LOCUS is a released corpus of nearly all US municipal and county ordinance codes, processed via OCR and paired with ModernBERT classifiers for dimensions such as opacity and paternalism.

Vision-language models for chest radiography do not always need the image

cs.CV · 2026-06-16 · accept · novelty 8.0

A causal audit with image interventions shows text-only models reach within 5.7 accuracy points of top multimodal VLMs on chest radiography, with some large multimodal models statistically indistinguishable from small text-only baselines.

RobotValues: Evaluating Household Robots When Human Values Conflict

cs.RO · 2026-06-02 · unverdicted · novelty 8.0

RobotValues is a benchmark of 10K value-conflict scenarios that reveals VLMs default to safety and accommodation while failing to follow instructions to prioritize other values 80% of the time.

FigSIM: A Dataset for Fine-grained Suicide Severity and Figurative Language in Suicide Memes

cs.CL · 2026-06-01 · conditional · novelty 8.0

FigSIM is the first annotated dataset for fine-grained suicide severity and figurative language in suicide memes, accompanied by benchmarks on 16 unimodal and multimodal models.

ViMU: Benchmarking Video Metaphorical Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 8.0

ViMU is the first benchmark for evaluating video models on metaphorical and subtextual understanding using hint-free questions grounded in multimodal evidence.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

cs.CL · 2026-05-13 · accept · novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.

SenseBench: A Benchmark for Remote Sensing Low-Level Visual Perception and Description in Large Vision-Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

SenseBench is the first physics-based benchmark with 10K+ instances and dual protocols to evaluate VLMs on remote sensing low-level perception and diagnostic description, revealing domain bias and specific failure modes.

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding

cs.CV · 2026-05-11 · unverdicted · novelty 8.0

EgoMemReason is a new benchmark showing that even the best multimodal models achieve only 39.6% accuracy on reasoning tasks that require integrating sparse evidence across days in egocentric video.

RuleSafe-VL: Evaluating Rule-Conditioned Decision Reasoning in Vision-Language Content Moderation

cs.AI · 2026-05-08 · unverdicted · novelty 8.0

RuleSafe-VL creates 2,166 rule-conditioned cases from 93 atomic rules and 92 relations across three policy families to diagnose where VLMs fail at rule-based content moderation reasoning.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings

cs.CV · 2026-05-08 · conditional · novelty 8.0

PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

cs.CV · 2026-05-07 · unverdicted · novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

cs.AI · 2026-04-30 · accept · novelty 8.0

WindowsWorld benchmark shows leading GUI agents achieve under 21% success on multi-application professional tasks, with failures especially on conditional judgment across three or more apps and inefficient execution.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

cs.CV · 2026-04-19 · unverdicted · novelty 8.0

VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-specific experts and adaptive routing.

RefereeBench: Are Video MLLMs Ready to be Multi-Sport Referees

cs.CV · 2026-04-17 · unverdicted · novelty 8.0

RefereeBench shows that even the strongest video MLLMs reach only around 60% accuracy on multi-sport refereeing tasks and struggle with rule application and temporal grounding.

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

cs.CV · 2026-04-03 · conditional · novelty 8.0

VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.

ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

cs.CV · 2026-02-15 · conditional · novelty 8.0

ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.

citing papers explorer

Showing 40 of 40 citing papers after filters.

DataComp-VLM: Improved Open Datasets for Vision-Language Models cs.CV · 2026-06-26 · conditional · none · ref 16 · 2 links · internal anchor
DataComp-VLM benchmark shows instruction-heavy data mixing outperforms filtering for VLM training, with DCVLM-Baseline achieving 63.6% on 33 tasks for 8B models (+5.4pp over FineVision).
How Far Is Document Parsing from Solved? PureDocBench: A Source-TraceableBenchmark across Clean, Degraded, and Real-World Settings cs.CV · 2026-05-08 · conditional · none · ref 11 · internal anchor
PureDocBench shows document parsing is far from solved, with top models at ~74/100, small specialists competing with large VLMs, and ranking reversals under real degradation.
Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning cs.CV · 2026-04-03 · conditional · none · ref 1 · internal anchor
VLM-UnBench demonstrates that prompt-based training-free unlearning in VLMs leaves forget accuracy near the no-instruction baseline except under oracle conditions that reveal the target concept.
ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision cs.CV · 2026-02-15 · conditional · none · ref 1 · internal anchor
ScreenParse dataset and ScreenVLM model deliver dense screen parsing that outperforms larger VLMs on PageIoU and transfers to better UI grounding.
Attend, Transform, or Silence: Operator-Level Visual Skipping for Efficient Multimodal LLM Inference cs.CV · 2026-06-30 · conditional · none · ref 3 · internal anchor
The paper proposes an operator-level visual-token skipping framework for MLLMs that reduces TFLOPs by 33.7% on Qwen3-VL while retaining 99.5% performance across VQA benchmarks.
On Test-Time Scaling for Vision-Language Models cs.CV · 2026-06-27 · conditional · none · ref 1 · 2 links · internal anchor
Small well-performing LVLMs gain the largest benefits from test-time scaling (up to ~30% improvement), often matching or exceeding larger models, while visual tokens contribute mainly early in the reasoning chain.
Mind the Heads: Topological Representation Alignment for Multimodal LLMs cs.CV · 2026-06-22 · conditional · none · ref 1 · internal anchor
HeRA aligns least-aligned attention heads in MLLMs using an MKNN-based contrastive objective to preserve cross-modal topological structure, yielding gains on vision-centric tasks and reduced hallucinations across 18 benchmarks.
Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins cs.CV · 2026-06-15 · conditional · none · ref 1 · internal anchor
OR3 converts OR clips to action-driven digital twins, uses LLM imagination for hypothetical ActDTs, and achieves 57.6 R@1 and 77.3 R@5 on 276 implicit queries from 386 robotic knee procedure clips, outperforming baselines.
Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition cs.CV · 2026-06-10 · conditional · none · ref 19 · internal anchor
Hour-long video temporal grounding is a search problem, shown by a new benchmark where all Video-LLMs collapse, frame retrieval outperforms them, 85% of failures are search-related, and a retrieve-then-ground hybrid improves results 6.7x.
Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events cs.CV · 2026-06-01 · conditional · none · ref 1 · internal anchor
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
Why Far Looks Up: Probing Spatial Representation in Vision-Language Models cs.CV · 2026-05-28 · conditional · none · ref 3 · internal anchor
VLMs exhibit consistent vertical-distance entanglement in embeddings from perspective bias in natural images, producing accuracy gaps that a new synthetic benchmark SpatialTunnel exposes as model-intrinsic.
Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs cs.CV · 2026-05-21 · conditional · none · ref 3 · internal anchor
Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.
Preferences Order, Ratings Anchor: From Fused Expert Aesthetic Ground Truth to Self-Distillation cs.CV · 2026-05-19 · conditional · none · ref 30 · 2 links · internal anchor
PPaint fuses expert pairwise preferences and ratings into ground truth; PSDistill converts VLM pairwise judgments into calibrated pseudo-scores via Elo and trains the same VLM to produce a single-pass aesthetic scorer that improves SRCC across categories.
Seeing Together: Multi-Robot Cooperative Egocentric Spatial Reasoning with Multimodal Large Language Models cs.CV · 2026-05-18 · conditional · none · ref 1 · internal anchor
SP-CoR is a multimodal LLM framework using dynamics-aware sampling, spectral-physics view fusion, and prompt distillation that outperforms baselines on the new CoopSR benchmark and EgoTeam dataset for multi-robot cooperative spatial reasoning.
MultiEmo-Bench: Multi-label Visual Emotion Analysis for Multi-modal Large Language Models cs.CV · 2026-05-14 · conditional · none · ref 3 · internal anchor
MultiEmo-Bench supplies 10,344 images with aggregated multi-label emotion votes from 20 annotators each to evaluate MLLMs on dominant emotion and full distribution prediction.
Towards Unified Surgical Scene Understanding:Bridging Reasoning and Grounding via MLLMs cs.CV · 2026-05-13 · conditional · none · ref 2 · internal anchor
SurgMLLM unifies high-level reasoning and low-level visual grounding in one MLLM-based model for surgical videos, raising triplet recognition AP from 40.7% to 46.0% on the new CholecT45-Scene dataset with 64,299 annotated frames.
TOC-Bench: A Temporal Object Consistency Benchmark for Video Large Language Models cs.CV · 2026-05-11 · conditional · none · ref 1 · 2 links · internal anchor
TOC-Bench is a new diagnostic benchmark that reveals major weaknesses in temporal object consistency for Video-LLMs, including event counting, ordering, identity reasoning, and hallucination avoidance.
What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs cs.CV · 2026-05-10 · conditional · none · ref 18 · internal anchor
GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search cs.CV · 2026-04-25 · conditional · none · ref 3 · 2 links · internal anchor
Using synthetic formal languages and a new discriminative evaluation metric, the paper shows that fine-tuning outperforms in-context learning on in-distribution language generalization but both perform equally on out-of-distribution generalization across 18 LLMs.
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation cs.CV · 2026-04-08 · conditional · none · ref 35 · internal anchor
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
When Negation Is a Geometry Problem in Vision-Language Models cs.CV · 2026-03-20 · conditional · none · ref 2 · internal anchor
A direction associated with negation exists in CLIP embedding space and can be steered at test time via representation engineering to produce negation-aware outputs without fine-tuning.
VPTracker: Global Vision-Language Tracking via Visual Prompt cs.CV · 2025-12-28 · conditional · none · ref 5 · internal anchor
VPTracker enables global object tracking in videos by using multimodal large language models with location-aware visual prompts to search entire images while reducing distractions from similar objects.
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence cs.CV · 2025-05-22 · conditional · none · ref 5 · internal anchor
Presents SpatialScore benchmark for MLLM spatial reasoning, evaluates 49 models showing large human gap, and supplies SpatialCorpus plus SpatialAgent to improve performance.
MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents cs.CV · 2026-05-18 · conditional · none · ref 3 · internal anchor
MementoGUI introduces a modular memory-control framework with working and episodic memory operators that improves long-horizon GUI agent performance over history-replay and text-only baselines.
CLIP Tricks You: Training-free Token Pruning for Efficient Pixel Grounding in Large VIsion-Language Models cs.CV · 2026-05-13 · conditional · none · ref 3 · 2 links · internal anchor
LiteLVLM is a training-free text-guided token pruning strategy that reverses CLIP similarity rankings to retain referent tokens and recover context for efficient pixel grounding while keeping 90% performance.
Revealing the Gap in Human and VLM Scene Perception through Counterfactual Semantic Saliency cs.CV · 2026-05-13 · conditional · none · ref 3 · internal anchor
VLMs exhibit size, center, and saliency biases in scene understanding, relying less on people than humans do, with size bias as a key driver of divergence.
LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation cs.CV · 2026-05-08 · conditional · none · ref 4 · internal anchor
LithoBench is a new multi-level benchmark showing that existing large multimodal models have substantial limitations in geological semantic understanding for remote sensing lithology interpretation.
Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment cs.CV · 2026-05-08 · conditional · none · ref 34 · internal anchor
Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.
SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments cs.CV · 2026-04-24 · conditional · none · ref 31 · 2 links · internal anchor
SpaMEM is a diagnostic benchmark showing that current vision-language models exhibit a sharp collapse in spatial reasoning when transitioning from text-aided state tracking to purely visual memory in dynamic environments.
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution cs.CV · 2026-04-24 · conditional · none · ref 3 · internal anchor
A 7B/8B model trained with decoupled tri-perspective SFT and QA-verified RL matches GPT-4o and approaches GPT-5 on chart-to-code generation benchmarks.
Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks cs.CV · 2026-04-05 · conditional · none · ref 15 · internal anchor
VoxelFM learns robust 3D CT visual features via DINO self-distillation that transfer effectively to seven clinical task categories using frozen backbones and lightweight heads, outperforming prior CT foundation models even on report generation.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing cs.CV · 2026-03-25 · conditional · none · ref 2 · internal anchor
PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine cs.CV · 2026-03-02 · conditional · none · ref 2 · internal anchor
Chain-of-thought underperforms direct answering in medical VQA due to a perception bottleneck, but ROI cues and textual grounding interventions can improve results and reverse the gap.
VLANeXt: Recipes for Building Strong VLA Models cs.CV · 2026-02-20 · conditional · none · ref 1 · internal anchor
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch cs.CV · 2026-01-20 · conditional · none · ref 3 · internal anchor
ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.
Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos cs.CV · 2026-01-11 · conditional · none · ref 1 · internal anchor
VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.
FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model cs.CV · 2026-06-09 · conditional · none · ref 43 · internal anchor
FADA is a selectively distilled unified vision-language model for fetal ultrasound that performs interpretation, classification, detection, and segmentation in one pipeline, achieves strong metrics, and deploys offline on mobile devices.
MuKV: Multi-Grained KV Cache Compression for Long Streaming Video Question-Answering cs.CV · 2026-05-21 · conditional · none · ref 2 · internal anchor
MuKV adds multi-grained KV cache compression at patch-frame-segment levels plus semi-hierarchical retrieval to raise accuracy and cut memory in long video question-answering.
Illusion-Aware Visual Preprocessing and Anti-Illusion Prompting for Classic Illusion Understanding in Vision-Language Models cs.CV · 2026-05-09 · conditional · none · ref 15 · internal anchor
A combination of illusion-specific image transformations, anti-illusion prompts, and majority voting lets VLMs reach 90.48% accuracy on a 630-image illusion benchmark without any model training.
LoViF 2026 The First Challenge on Holistic Quality Assessment for 4D World Model (PhyScore) cs.CV · 2026-05-06 · conditional · none · ref 3 · internal anchor
The PhyScore challenge creates the first benchmark requiring metrics to jointly score video quality, physical realism, condition alignment, and temporal consistency while localizing physical anomalies in 1554 videos from seven generative models across text-to-2D, image-to-4D, and video-to-4D tracks.