citing papers explorer
- Beyond Detection: A Structure-Aware Framework for Scene Text Tracking
  SymTrack is the first systematic detection-free framework for scene text tracking that constructs benchmarks from video text spotting datasets and reports up to 11.97% AUC gains over prior trackers.
- HairGPT: Strand-as-Language Autoregressive Modeling for Realistic 3D Hairstyle Synthesis
  HairGPT reframes 3D hairstyle synthesis as dual-decoupled autoregressive strand sequence modeling with geometric tokenization for semantic control and rare style generation.
- Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
  A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
- Privatar: Scalable Privacy-preserving Multi-user VR via Secure Offloading
  Privatar uses horizontal frequency partitioning and distribution-aware minimal perturbation to enable private offloading of VR avatar reconstruction, supporting 2.37x more users with modest overhead.
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
  Q-Align trains LMMs on discrete text-defined levels for visual scoring, achieving SOTA on IQA, IAA, and VQA while unifying the tasks in OneAlign.
- Low Latency Gaze Tracking via Latent Optical Sensing
  A hardware prototype performs gaze estimation by optically encoding task-relevant features with a microlens array and mask, captured on a 4x4 phototransistor array and decoded by a small neural network, reaching 3.4 ms latency with competitive accuracy.
- Prognostic Value of Lung Ultrasound Biomarkers for Readmission Risk in Congestive Heart Failure: A Pilot Data-Driven Analysis
  Pilot study uses pretrained video encoder features from lung ultrasound to predict 30-day CHF readmission, finding lower-lung views and temporal differences most informative with top MLP F1 of 0.80.
- DocAtlas: Multilingual Document Understanding Across 80+ Languages
  DocAtlas introduces model-free rendering pipelines to create DocTag-annotated datasets across 82 languages and shows DPO adaptation improves multilingual performance without base-language degradation.
- Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
  Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
- Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
  Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
- Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging
  A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classification tasks.
- UAV-Assisted Scan-to-Simulation for Landslides Using Physics-Informed Gaussian Splatting
  A UAV-to-3DGS-to-MPM pipeline reconstructs real landslide sites with photorealistic visuals and runs physics-based simulations, validated on a Hong Kong event.
- Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models
  Double-Softmax Prompt Tuning uses sequential softmax normalization to create self-adaptive gradient saturation that filters noisy samples while preserving useful updates in CLIP prompt tuning.
- Group Cognition Learning: Making Everything Better Through Governed Two-Stage Agents Collaboration
  Group Cognition Learning uses governed two-stage agents after separate modality encoding to mitigate dominance and spurious coupling, reporting state-of-the-art results on CMU-MOSI, CMU-MOSEI, and MIntRec for regression and classification.
- A Robust Semantic Segmentation Pipeline for the CVPR 2026 8th UG2+ Challenge Track 2
  A semi-supervised pipeline applies UniMatch V2 to the WeatherProof dataset by treating degraded images as unlabeled data plus test-time augmentation for semantic segmentation in adverse weather.
- Low-Cost Neural Radiance Fields
  Comparative study of DS-NeRF, TensoRF, and HashNeRF with depth-supervision and architectural variants finds no conclusive outperformance under equal training time but identifies which design choices transfer to low-data, low-compute regimes.
- Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning
- LIVEditor-14B: Lightning Unified Video Editing via In-Context Sparse Attention
- Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
- Dual-Anchoring: Addressing State Drift in Vision-Language Navigation