SymTrack is the first systematic detection-free framework for scene text tracking that constructs benchmarks from video text spotting datasets and reports up to 11.97% AUC gains over prior trackers.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is method (56%).
citation-role summary
citation-polarity summary
representative citing papers
Single-shot HDR is achieved by conditioning a video diffusion model on an LDR input to generate an exposure bracket and fusing the bracket with per-pixel weights from a lightweight UNet.
HairGPT reframes 3D hairstyle synthesis as dual-decoupled autoregressive strand sequence modeling with geometric tokenization for semantic control and rare style generation.
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition a diffusion decoder for ultra-low-bitrate video reconstruction.
Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
Dual-Anchoring Framework mitigates progress drift via structured instruction tokens and memory drift via landmark-centric retrospective prediction, yielding 15.2% success rate gain and 24.7% on long trajectories.
Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
A hardware prototype performs gaze estimation by optically encoding task-relevant features with a microlens array and mask, captured on a 4x4 phototransistor array and decoded by a small neural network, reaching 3.4 ms latency with competitive accuracy.
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
ECTraj enhances consistency models for multi-agent trajectory prediction via improved student-teacher supervision and conditional top-K generation, yielding faster inference and competitive accuracy on Argoverse 2.
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
STAR-IOD applies scale-decoupled topology alignment and K-Means-based pseudo-label refinement to reduce catastrophic forgetting in remote sensing incremental object detection, reporting 1.7% and 2.1% mAP gains on new DIOR-IOD and DOTA-IOD datasets.
A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classification tasks.
A UAV-to-3DGS-to-MPM pipeline reconstructs real landslide sites with photorealistic visuals and runs physics-based simulations, validated on a Hong Kong event.
Double-Softmax Prompt Tuning uses sequential softmax normalization to create self-adaptive gradient saturation that filters noisy samples while preserving useful updates in CLIP prompt tuning.
Comparative study of DS-NeRF, TensoRF, and HashNeRF with depth-supervision and architectural variants finds no conclusive outperformance under equal training time but identifies which design choices transfer to low-data, low-compute regimes.
citing papers explorer
-
Single-Shot HDR Recovery via a Video Diffusion Prior
Single-shot HDR is achieved by conditioning a video diffusion model on an LDR input to generate an exposure bracket and fusing the bracket with per-pixel weights from a lightweight UNet.
-
Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
-
DocAtlas: Multilingual Document Understanding Across 80+ Languages
DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
-
Enhancing Consistency Models for Multi-Agent Trajectory Prediction
ECTraj enhances consistency models for multi-agent trajectory prediction via improved student-teacher supervision and conditional top-K generation, yielding faster inference and competitive accuracy on Argoverse 2.
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.