{"total":36,"items":[{"citing_arxiv_id":"2605.21272","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent diffusion model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Flow straight and fast: Learning to generate and transfer data with rectified flow. InThe Eleventh International Conference on Learning Representations, 2022. [61] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. MediaPipe: A framework for building perception pipelines.arXiv preprint arXiv:1906.08172, 2019. [62] Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. v: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7739-7751, 2025. 
[63] OpenAI."},{"citing_arxiv_id":"2605.20992","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction","primary_cat":"cs.CV","submitted_at":"2026-05-20T10:31:10+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20941","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PaintCopilot: Modeling Painting as Autonomous Artistic Continuation","primary_cat":"cs.CV","submitted_at":"2026-05-20T09:27:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PaintCopilot models painting as an open-ended autoregressive process that predicts coherent brushstrokes from partial canvas observations using a ViT target predictor, flow-matching stroke generator, and VAE region sampler.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20706","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU","primary_cat":"cs.DC","submitted_at":"2026-05-20T05:05:10+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LlamaWeb is a WebGPU backend for llama.cpp that uses static memory planning, tunable kernels, and templated multi-precision support to cut memory use by 29-33% and raise decode throughput by 45-69% versus prior browser frameworks on tested 
hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12297","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:51:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10087","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Initiation of Interaction Detection Framework using a Nonverbal Cue for Human-Robot Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-11T07:07:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A robot detects initiation of interaction via audio-visual fusion of speech localization and face/gaze cues, implemented as a state machine in ROS and tested on a mobile platform.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ott, F. Ramos, and B. Upcroft, \"Simple online and realtime tracking,\" in2016 IEEE international conference on image processing (ICIP). IEEE, 2016, pp. 3464-3468. [31] E. Goffman,Behavior in public places. Simon and Schuster, 2008. [32] M. Argyle, \"Non-verbal communication in human social interaction,\" Non-verbal communication, vol. 2, no. 1, 1972. [33] A. Saran, S. Majumdar, E. S. Short, A. Thomaz, and S. 
Niekum, \"Human gaze following for human-robot interaction,\" in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 8615-8621. [34] H. Kiilavuori, V. Sariola, M. J. Peltola, and J. K. Hietanen, \"Making eye contact with a robot: Psychophysiological responses to eye contact"},{"citing_arxiv_id":"2605.06351","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition","primary_cat":"cs.HC","submitted_at":"2026-05-07T14:33:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05694","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Physical-Facial Representation Fusion via Subject-Invariant Cross-Modal Prompt Tuning for Video-Based Emotion Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-07T05:33:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A subject-invariant cross-modal prompt-tuning method with decoupled shared-specific adapters fuses facial and rPPG features in a frozen ViT to improve video-based emotion recognition accuracy and cross-subject generalization on MAHNOB-HCI and DEAP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05367","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular 
Video","primary_cat":"cs.CV","submitted_at":"2026-05-06T18:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"First 3D SMPL-X annotations for the Ishara-500 Saudi Sign Language dataset plus a specialized monocular reconstruction pipeline claiming up to 32% hand accuracy gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03855","ref_index":23,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior","primary_cat":"cs.RO","submitted_at":"2026-05-05T15:20:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01720","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 55+ Sign Languages","primary_cat":"cs.CV","submitted_at":"2026-05-03T05:26:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SignVerse-2M provides a 2-million-clip multilingual pose-native dataset for sign language derived from public videos via DWPose preprocessing to enable robust modeling in real-world conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00288","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FaceValue: Exploring Real-Time Self-View Overlays to Prompt 
Meaning-Oriented Self-Awareness in Remote Meetings","primary_cat":"cs.HC","submitted_at":"2026-04-30T23:12:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A technology probe called FaceValue uses real-time self-view overlays to support meaning-oriented self-awareness in remote meetings, with participants reporting increased cue awareness and communication improvements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27871","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"D-Rex : Diffusion Rendering for Relightable Expressive Avatars","primary_cat":"cs.GR","submitted_at":"2026-04-30T13:53:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"D-Rex applies a LoRA-fine-tuned video diffusion model as an image-space post-process to add consistent relighting to any expressive full-body avatar pipeline while preserving motion and facial detail.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26186","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FASH-iCNN: Making Editorial Fashion Identity Inspectable Through Multimodal CNN Probing","primary_cat":"cs.CV","submitted_at":"2026-04-29T00:20:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multimodal CNN on 87,547 Vogue images classifies fashion houses at 78.2% top-1 accuracy, decades at 88.6%, and years at 58.3% with 2.2-year mean error, and shows texture and luminance carry most of the house-identity 
signal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00882","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Intervention-Based Self-Supervised Learning: A Causal Probe Paradigm for Remote Photoplethysmography","primary_cat":"cs.CV","submitted_at":"2026-04-26T09:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new intervention-based SSL paradigm for rPPG uses video editing and falsifiability checks to learn the true physiological signal instead of dominant artifacts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23532","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Emotion-Conditioned Short-Horizon Human Pose Forecasting with a Lightweight Predictive World Model","primary_cat":"cs.CV","submitted_at":"2026-04-26T04:56:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Facial emotion embeddings improve short-term pose forecasting accuracy for emotion-driven motions when fused via normalized gating in a lightweight LSTM world model, but not with simple multimodal fusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23141","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"UNSEEN: A Cross-Stack LLM Unlearning Defense against AR-LLM Social Engineering Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-25T04:49:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"UNSEEN combines AR access control, LLM unlearning to suppress profiles, and agent guardrails to defend against AR-LLM social engineering attacks, 
tested in a 60-person user study with 360 conversations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19702","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Face Anything: 4D Face Reconstruction from Any Image Sequence","primary_cat":"cs.CV","submitted_at":"2026-04-21T17:22:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19636","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation","primary_cat":"cs.CV","submitted_at":"2026-04-21T16:25:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoInteract adds a human-aware mixture-of-experts and spatially-structured co-generation to a diffusion transformer to synthesize videos with stable structures and physically plausible human-object contacts.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"texture-stripped HOI structure streamVh. Since this HOI stream is geometry- focused, the model is encouraged to learn spatial andinteractionrelationships rather than exploiting appearance shortcuts. Finally, both RGB videoVr and HOI streamV h are encoded into a shared latent space via a pre-trained VAE for dual-stream training. 
Additionally, we use off-the-shelf detectors [29,36] to obtain face and hand bounding boxes, which provide explicit supervision for the MoE router during training. 10 X. Luo et al. 4 Experiments 4.1 Training Details Dataset. We curate a large-scale HOI video dataset following Section 3.3, comprising 40 hours of product demonstration and live-streaming videos. After quality filtering, 12K high-quality clips are retained with paired RGB-HOI represen-"},{"citing_arxiv_id":"2604.17530","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Real-Time Cellist Postural Evaluation With On-Device Computer Vision","primary_cat":"cs.HC","submitted_at":"2026-04-19T16:45:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Cello Evaluator is a real-time postural feedback system for cellists running on current Android phones via on-device computer vision, validated as user-friendly by experts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16808","ref_index":3,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BioLip: Language-Generalizable Lip-Sync Deepfake Detection via Biomechanical Constraint Violation Modeling","primary_cat":"cs.CV","submitted_at":"2026-04-18T03:32:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16207","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery 
Detection","primary_cat":"cs.CV","submitted_at":"2026-04-17T16:17:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AIFIND stabilizes incremental face forgery detection by aligning volatile features to invariant semantic anchors from low-level artifacts using attention and harmonization modules.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16138","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Sentiment Analysis of German Sign Language Fairy Tales","primary_cat":"cs.CL","submitted_at":"2026-04-17T15:10:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new dataset and XGBoost model predict sentiment in German Sign Language fairy tale videos from motion features at 0.631 balanced accuracy, showing body movements contribute equally to facial ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08435","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HST-HGN: Heterogeneous Spatial-Temporal Hypergraph Networks with Bidirectional State Space Models for Global Fatigue Assessment","primary_cat":"cs.CV","submitted_at":"2026-04-09T16:36:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HST-HGN uses heterogeneous spatial-temporal hypergraph networks combined with bidirectional Mamba state space models to achieve state-of-the-art driver fatigue assessment from untrimmed videos while maintaining computational efficiency for real-time 
use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07606","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Bootstrapping Sign Language Annotations with Sign Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-08T21:26:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A pseudo-annotation pipeline combines fingerspelling and isolated sign recognizers with K-Shot LLM estimation to produce ranked time-aligned gloss annotations from signed video and English input.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05591","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AI-Driven Modular Services for Accessible Multilingual Education in Immersive Extended Reality Settings: Integrating Speech Processing, Translation, and Sign Language Rendering","primary_cat":"cs.CE","submitted_at":"2026-04-07T08:35:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A modular XR platform integrates Whisper, NLLB, AWS Polly, RoBERTa, flan-t5, and MediaPipe to deliver real-time multilingual and International Sign support for education, with benchmarks showing AWS Polly's low latency and EuroLLM's higher BLEU score.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05475","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Synthetic Eye Movement Dataset for Script Reading Detection: Real Trajectory Replay on a 3D Simulator","primary_cat":"cs.CV","submitted_at":"2026-04-07T06:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A replay 
pipeline on a 3D eye simulator generates 144 sessions of synthetic eye movement video that preserves source temporal dynamics for script-reading detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04787","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AvatarPointillist: AutoRegressive 4D Gaussian Avatarization","primary_cat":"cs.CV","submitted_at":"2026-04-06T15:56:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AvatarPointillist autoregressively generates adaptive 3D point clouds via Transformer for photorealistic 4D Gaussian avatars from one image, jointly predicting animation bindings and using a conditioned Gaussian decoder.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04623","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On Optimizing Electrode Configuration for Wrist-Worn sEMG-Based Thumb Gesture Recognition","primary_cat":"cs.HC","submitted_at":"2026-04-06T12:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Extensor-side monopolar electrodes outperform flexor-side and bipolar setups for wrist sEMG thumb gesture recognition, with performance rising but leveling off as channel count increases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15336","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Facial-Expression-Aware Prompting for Empathetic LLM Tutoring","primary_cat":"cs.HC","submitted_at":"2026-03-10T08:16:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Facial expression signals via 
prompt integration improve empathetic responsiveness in LLM-based tutoring systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.06931","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Social Bias in Vision-Language Models with Face-Only Counterfactuals from Real Photos","primary_cat":"cs.CV","submitted_at":"2026-01-11T14:35:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLMs exhibit demographic biases in occupation and salary decisions even when only faces are altered in otherwise identical real photos.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02830","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Densemarks: Learning Canonical Embeddings for Human Heads Images via Point Tracks","primary_cat":"cs.CV","submitted_at":"2025-11-04T18:58:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DenseMarks learns a canonical 3D embedding space for human head images by training a Vision Transformer with contrastive loss on pairwise point tracks from in-the-wild videos, plus landmark and segmentation supervision.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05023","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating Idle Animation Believability: a User Perspective","primary_cat":"cs.HC","submitted_at":"2025-09-05T11:34:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Users cannot distinguish genuine from acted idle animations but perceive handmade and recorded ones differently; ReActIdle 
dataset released to simplify future recording.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.11149","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Comprehensive Survey of Action Quality Assessment: Method and Benchmark","primary_cat":"cs.CV","submitted_at":"2024-12-15T10:47:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"This survey proposes a modality-driven hierarchical taxonomy for AQA methods, establishes a unified benchmark for video-based approaches across datasets, and outlines research trends and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Quality Prediction Direct assessment: C3D-AVG-MTL [37], USDL/MUSDL [61],UD-AQA [62], DAE-AQA [63], LUSD-Net [64], CoFInAl [65] Contrastive assessment: 2S-CNN [40], CoRe [25], RGTSCT [38], PCLN [66], TPT [58], T2CR [67], MCoRe [68], TAQRM [41],Rhythmer [69] Skeleton-BasedApproach (Sec. 3.2) Skeletal Acquisition Methods RGB cameras: OpenPose [70], MediaPipe [71], ViTPose [72] Depth cameras: Microsoft Kinect v1/v2 [11], [73], [74], [75], [76] Marker-based MoCap systems: Qualisys [73], Vicon [77], etc. Skeletal RepresentationLearning Methods Basic methods: DCT+SVC [39], CNN+LSTM [78], ST -GCN [79] Robust mechanisms: EGCN [74], EGCN++ [75], AAST -GCN [80] ⋯ Multi-ModalityApproach (Sec. 
3.3) Audio-Assisted MethodsMLP-Mixer [81], Dance-AQA [82], PAMFN [29]"},{"citing_arxiv_id":"2410.06158","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation","primary_cat":"cs.RO","submitted_at":"2024-10-08T16:00:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"data compared to previous works that utilize video pre-training. The pre-training data includes commonly used public datasets of human activities, e.g., Howto100M [8], Ego4D [9], Something-Something V2 [10], EPIC-KITCHENS [11], and Kinetics-700 [12]. To tailor the pre-training data for robot manipulation tasks, we carefully establish a data processing pipeline that includes hand filtering [13] and re-captioning [14]. In addition, we include publicly available robot datasets, e.g., RT-1 [15] and Bridge [16]. In total, the number of video clips used for pre-training is 38 million, equivalent to approximately 50 billion tokens. The distribution of human activities and video samples are illustrated in Fig. 2. 
GR-2 can be seamlessly fine-tuned on robot data after large-scale pre-training."},{"citing_arxiv_id":"2408.05366","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The DeepSpeak Dataset","primary_cat":"cs.CV","submitted_at":"2024-08-09T22:29:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSpeak provides over 100 hours of consented, identity-matched real and modern deepfake audiovisual content focused on talking heads, with evaluations showing existing detectors fail to generalize without retraining.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}