{"total":31,"items":[{"citing_arxiv_id":"2606.08674","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension","primary_cat":"cs.CV","submitted_at":"2026-06-07T15:23:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BioVid is a data-driven autoregressive model using 2D-encode/3D-decode tokenization and causal Transformer with EOS termination that reproduces real action duration distributions (W1 distance 1.24 frames) on NTU RGB+D drinking clips, outperforming fixed-length baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30341","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GPIC: A Giant Permissive Image Corpus for Visual Generation","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPIC is a new 28-trillion-pixel permissively licensed image corpus with 100M training examples for visual generative modeling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28940","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Neural Scaling Laws for Jet Generation","primary_cat":"hep-ph","submitted_at":"2026-05-27T18:00:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Scaling laws hold logarithmically for model size in autoregressive jet generation, with next-token loss correlating to physical metrics via sliced Wasserstein distance, but show weaker scaling for dataset size and compute due to rapid 
saturation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28101","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction","primary_cat":"cs.SD","submitted_at":"2026-05-27T07:54:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EigeNet applies a cross-view alternate-attention transformer with geometry modulation for few-shot novel-view RIR prediction, reporting SOTA results on simulated and real data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21977","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:11:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VINA trains a single detector on images plus video frames using a cross-modal supervised contrastive objective, yielding bidirectional gains and SOTA results on 14 image, video, and in-the-wild benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16899","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map","primary_cat":"cs.CV","submitted_at":"2026-05-16T09:21:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LASAR pairs a dual-memory system with spatio-temporal contrastive learning to induce latent cognitive maps, reporting 2-3.5% zero-shot gains on VLN-CE and VSI-Bench plus 
high map self-consistency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16165","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models","primary_cat":"cs.CV","submitted_at":"2026-05-15T16:45:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces ML-FOP-SOAP optimizer using Fisher-Orthogonal Projection and hierarchical folding to mitigate modality competition in multimodal autoregressive training, reporting gains over AdamW on Janus and Emu3.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13789","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ENSEMBITS: an alphabet of protein conformational ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-13T17:08:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Ensembits is the first tokenizer of protein conformational ensembles that outperforms static tokenizers on RMSF prediction and matches them on function and mutation tasks while using less pretraining data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10046","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PixelFlowCast: Latent-Free Precipitation Nowcasting via Pixel Mean Flows","primary_cat":"cs.CV","submitted_at":"2026-05-11T06:16:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PixelFlowCast delivers high-fidelity precipitation nowcasts from radar sequences using a latent-free Pixel Mean Flows predictor guided by 
a deterministic coarse stage and KANCondNet features.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Raissi, P. Perdikaris, and G. Karniadakis, \"Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations,\" Journal of Computational Physics, vol. 378, pp. 686-707, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0021999118307125 [39] V. L. Guen and N. Thome, \"Disentangling physical dynamics from unknown factors for unsupervised video prediction,\" in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020. [40] L. Geng, J. Min, H. Geng, and X. Zhuang, \"Three-dimensional radar echo extrapolation using a physics-constrained deep learning"},{"citing_arxiv_id":"2605.09886","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Network-Efficient World Model Token Streaming","primary_cat":"cs.RO","submitted_at":"2026-05-11T02:19:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An adaptive delta-prioritization algorithm using cosine distance and Hamming-drift thresholds improves embedding distortion by 4.8-7.2% and next-token perplexity by 2.1-6.3% over periodic keyframing at matched low bitrates for tokenized driving world models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07230","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CASCADE: Context-Aware Relaxation for Speculative Image Decoding","primary_cat":"cs.CV","submitted_at":"2026-05-08T04:32:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CASCADE formalizes semantic interchangeability and 
convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[43] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning, 2018. URL https://arxiv.org/abs/1711.00937. 2 [44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023. URL https://arxiv.org/abs/1706.03762. 1 [45] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023. 3 [46] Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, and Xihui Liu."},{"citing_arxiv_id":"2604.21035","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Masked-Token Prediction for Anomaly Detection at the Large Hadron Collider","primary_cat":"hep-ph","submitted_at":"2026-04-22T19:29:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The work demonstrates masked-token prediction with transformers for model-independent anomaly detection in LHC data, achieving strong results on top-rich BSM signatures like four-top production using VQ-VAE tokenization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16232","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Neuro-Symbolic ODE Discovery with Latent Grammar 
Flow","primary_cat":"cs.LG","submitted_at":"2026-04-17T16:46:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Latent Grammar Flow discovers ODEs by placing grammar-based equation representations in a discrete latent space, using a behavioral loss to cluster similar equations, and sampling via a discrete flow model guided by data fit and constraints.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"For example, the solutions of the ODEs u(t) + 3t, u ′(t) = sin(2t) and u(t) + 3t 2u′(t) = sin(2t) exhibit markedly different dynamics, despite having similar symbolic structure and corresponding rule sequences. This highlights that structural similarity does not necessarily imply behavioural similarity. To formalise this notion of behavioural similarity, we build on the work of Mežnar et al. [32], who define a distance between mathematical expressions based on distributions over their behaviour under varying constants, rather than relying on purely structural measures such as the graph edit distance. We adopt a similar perspective and tailor it to the setting of ODEs. 
To ensure computational tractability, we restrict the evaluation of behaviour to the intrinsic"},{"citing_arxiv_id":"2604.10471","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SID-Coord: Coordinating Semantic IDs for ID-based Ranking in Short-Video Search","primary_cat":"cs.IR","submitted_at":"2026-04-12T05:51:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SID-Coord coordinates semantic IDs with hashed item IDs via attention fusion, adaptive gating, and interest alignment, yielding +0.664% long-play rate and +0.369% playback duration gains in production search ranking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04974","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data","primary_cat":"cs.RO","submitted_at":"2026-04-04T15:37:11+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.28816","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering","primary_cat":"cs.DL","submitted_at":"2026-03-28T17:09:17+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic 
clusters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.09691","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PaCo-FR: Patch-Pixel Aligned End-to-End Codebook Learning for Facial Representation Pre-training","primary_cat":"cs.CV","submitted_at":"2025-08-13T10:37:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PaCo-FR introduces a structured-masking and patch-codebook framework for unsupervised facial representation pre-training that claims state-of-the-art results on multiple facial tasks after training on only 2 million unlabeled images.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.09747","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FAST: Efficient Action Tokenization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2025-01-16T18:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FAST applies discrete cosine transform to robot action sequences for efficient tokenization, enabling autoregressive VLAs to succeed on high-frequency dexterous tasks and scale to 10k hours of data while matching diffusion VLA performance with up to 5x faster training.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"it scale to training dexterous generalist policies? To test this, we train the π0-FAST model from the previous section on the cross-embodied robot data mixture used by π0 [7], the largest dexterous robot manipulation dataset to date. It includes 903M timesteps from our own datasets. Additionally, 9.1% of the training mixture consists of the open-source datasets BRIDGE v2 [60], DROID [38], and OXE [52]. 
We compare zero-shot performance to the diffusion π0 model on the tasks from Black et al. [7] in Figure 11. Overall, we find that the autoregressive π0-FAST model matches the performance of the diffusion π0 model, including on the most challenging laundry folding task, while requiring significantly less compute for training."},{"citing_arxiv_id":"2310.16828","ref_index":68,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TD-MPC2: Scalable, Robust World Models for Continuous Control","primary_cat":"cs.LG","submitted_at":"2023-10-25T17:57:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.02463","ref_index":65,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Shap-E: Generating Conditional 3D Implicit Functions","primary_cat":"cs.CV","submitted_at":"2023-05-03T23:59:13+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Shap-E encodes 3D assets into implicit function parameters then uses a conditional diffusion model to generate new ones from text, enabling fast multi-representation 3D asset creation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.15657","ref_index":207,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Conditional Generative Modeling all you need for Decision-Making?","primary_cat":"cs.LG","submitted_at":"2022-11-28T18:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Return-conditional diffusion 
models for policies outperform offline RL on benchmarks by circumventing dynamic programming and enable constraint or skill composition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.13221","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Video Diffusion Models for High-Fidelity Long Video Generation","primary_cat":"cs.CV","submitted_at":"2022-11-23T18:58:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.05221","ref_index":114,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.06125","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Text-Conditional Image Generation with CLIP Latents","primary_cat":"cs.CV","submitted_at":"2022-04-13T01:10:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while 
preserving photorealism and caption fidelity.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Figure 3: Variations of an input image by encoding with CLIP and then decoding with a diffusion model. The variations preserve both semantic information like presence of a clock in the painting and the overlapping strokes in the logo, as well as stylistic elements like the surrealism in the painting and the color gradients in the logo, while varying the non-essential details. predict the resulting sequence using a Transformer [53] model with a causal attention mask. This results in a threefold reduction in the number of tokens predicted during inference, and improves training stability. We condition the AR prior on the text caption and the CLIP text embedding by encoding them as a prefix to the sequence. Additionally, we prepend a token indicating the (quantized) dot product between the text"},{"citing_arxiv_id":"2112.10741","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models","primary_cat":"cs.CV","submitted_at":"2021-12-20T18:42:55+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference 
modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2110.04627","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Vector-quantized Image Modeling with Improved VQGAN","primary_cat":"cs.CV","submitted_at":"2021-10-09T18:36:00+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2105.05233","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diffusion Models Beat GANs on Image Synthesis","primary_cat":"cs.LG","submitted_at":"2021-05-11T17:50:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Diffusion models with architecture improvements and classifier guidance achieve superior FID scores to GANs on unconditional and conditional ImageNet image synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[51] proposed to use classifier rejection sampling to filter out bad samples from an autoregressive likelihood-based model, and found that this technique improved FID. Most likelihood-based models also allow for low-temperature sampling [1], which provides a natural way to emphasize modes of the data distribution (see Appendix G). Other likelihood-based models have been shown to produce high-fidelity image samples. 
VQ-VAE [65] and VQ-VAE-2 [51] are autoregressive models trained on top of quantized latent codes, greatly reducing the computational resources required to train these models on large images. These models produce diverse and high quality images, but still fall short of GANs without expensive rejection sampling and special metrics to compensate for blurriness."},{"citing_arxiv_id":"2102.01293","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2010.14701","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Laws for Autoregressive Generative Modeling","primary_cat":"cs.LG","submitted_at":"2020-10-28T02:17:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.06286","ref_index":196,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Autoencoding sensory substitution","primary_cat":"q-bio.NC","submitted_at":"2019-07-14T21:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Deep recurrent autoencoders convert images 
to shortened audio signals that incorporate hearing models, enabling above-chance hand posture discrimination and object reaching after a few hours of training instead of months.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Atari and Nokia games like Asteroids, Night Driver, Snake and Space Impact require rapid control from the user, and manifest simple enough, low-variance visual features, which an implicit conversion logic can exploit, resulting in lower substitution delays. By taking representative screenshots of these games, we can generate datasets that a AEV2A model can train on. As semantic segmentation [196] simplifies the visual scene for self-driving cars by extracting actionable features, such segmentation can be fed as an input to an implicit V2A conversion method to represent the lower complexity imagery in soundscapes for the blind. In certain cases, in which the environment and the task to accomplish within are simple enough, one could manually design the V2A conversion; in other circumstances"}],"limit":50,"offset":0}