Neural swipe decoder trained with geometric augmentations on 1M+ swipes generalizes to unseen keyboard layouts by predicting per-point character locations and mapping via inference-time layout.
Mixed citations
Title resolution pending
Mixed citation behavior. Most common role is background (57%).
citation-role summary
citation-polarity summary
representative citing papers
PolySpeech-100 is a new benchmark for native-level speech comprehension across 110 linguistic variants that evaluates 22 models and reports E2E advantages on dialects, robustness gaps on low-resource languages, and degradation from Chain-of-Thought prompting.
PteroSet is a new strongly annotated dataset of 563 tropical bird recordings (73.62 h) containing 15,372 time-frequency labels for 168 species, released in COCO-style JSON with a binary bird detection baseline.
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Syntactic belief update via generalized Rényi divergence on syntactic trees predicts garden path reading times better than lexical surprisal.
BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.
MS-DKC is a dataset knowledge card framework that maps image, morphology, supervision, context, and risk descriptors to design priors and failure modes, shown to produce dataset-specific model adaptations with improved metrics on DRIVE, ISIC2018, and ACDC.
SeismoGPT is a transformer autoregressive model achieving median normalized cross-correlation above 0.93 when forecasting synthetic three-component seismograms up to 240 s ahead from P- and S-wave context.
A parameter-efficient dual-encoder model with differentiable Choquet integral fusion improves underwater acoustic classification accuracy over single-encoder baselines on DeepShip and ShipsEar datasets.
Proposes adaptive multiple importance sampling for robust Bayesian model evidence estimation under parameter non-identifiability, shown to outperform deterministic methods on ecological case studies while being cheaper than MCMC.
FreeHemoSeg detects fetal GMH-IVH on T2-weighted MRI with high sensitivity and specificity and moderate segmentation accuracy using pseudo-image synthesis from normal scans, outperforming supervised and unsupervised baselines in internal and external validation.
The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage across aspects.
Optimized 3x3 adversarial image filters based on edge detection generate transferable untargeted attacks on neural networks with 30-80% success using only one pass and far fewer parameters than prior methods.
CAHAL introduces a physics-informed mixture-of-experts super-resolution network for clinical MRI that conditions on resolution and anisotropy and uses edge-penalised, Fourier, and segmentation-guided losses to reduce hallucinations compared with prior generative methods.
SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
Listeners detect automatic anonymization in pathological speech at 91-93% accuracy with a 30-point perceived quality drop, yet clinical severity ratings stay nearly unchanged for dysarthria, dysglossia, and dysphonia.
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
AlignGAD is a zero-shot generalized graph anomaly detection framework using a Global Unification Module, Clustering Module, and Node Discrepancy Scoring Module.
Self-supervised speech models encode multiple speaker group attributes, with ASR finetuning discarding phonetically variant information while retaining semantically variant information.
QIF neurons outperform LIF neurons in spike-based gradient descent training of spiking neural networks by avoiding discontinuities that fragment the loss landscape.
Hybrid CoMeTS-GAN plus diffusion model generates multivariate financial time series claimed to better reproduce stylized facts and inter-asset correlations than prior generative methods.
citing papers explorer
-
The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care
Transformer-derived sentiment features from therapy sessions correlate with emotional-valence components of the OQ-45 and differ significantly between patients identified as at risk of deterioration by rational and empirical outcome models.