Mixed citation behavior. Most common role is background (57%).
citing papers explorer
- A strongly annotated passive acoustic dataset for tropical bird monitoring
PteroSet is a new strongly annotated dataset of 563 tropical bird recordings (73.62 h) containing 15,372 time-frequency labels for 168 species, released in COCO-style JSON with a binary bird detection baseline.
- FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
FLARE is a new benchmark with 399 long videos, 87k multimodal clips, and 275k user-style queries for testing audiovisual retrieval under caption and query regimes.
- Geo2Sound: A Scalable Geo-Aligned Framework for Soundscape Generation from Satellite Imagery
Geo2Sound generates geographically realistic soundscapes from satellite imagery via geospatial attribute modeling, semantic hypothesis expansion, and geo-acoustic alignment, achieving SOTA FAD of 1.765 on a new 20k-pair benchmark.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
- Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
- Reliable model selection in the presence of parameter non-identifiability
Proposes adaptive multiple importance sampling for robust Bayesian model evidence estimation under parameter non-identifiability, shown to outperform deterministic methods on ecological case studies while being cheaper than MCMC.
- Annotation-free deep learning for detection and segmentation of fetal germinal matrix-intraventricular hemorrhage in brain MRI
FreeHemoSeg detects fetal GMH-IVH on T2-weighted MRI with high sensitivity and specificity and moderate segmentation accuracy using pseudo-image synthesis from normal scans, outperforming supervised and unsupervised baselines in internal and external validation.
- Aspect-Aware Content-Based Recommendations for Mathematical Research Papers
The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage across aspects.
- Almost for Free: Crafting Adversarial Examples with Convolutional Image Filters
Optimized 3x3 adversarial image filters based on edge detection generate transferable untargeted attacks on neural networks with 30-80% success using only one pass and far fewer parameters than prior methods.
- CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans
CAHAL introduces a physics-informed mixture-of-experts super-resolution network for clinical MRI that conditions on resolution and anisotropy and uses edge-penalised, Fourier, and segmentation-guided losses to reduce hallucinations compared with prior generative methods.
- SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
- Perceptual implications of automatic anonymization in pathological speech
Listeners detect automatic anonymization in pathological speech at 91-93% accuracy with a 30-point perceived quality drop, yet clinical severity ratings stay nearly unchanged for dysarthria, dysglossia, and dysphonia.
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
MoT decouples non-embedding parameters by modality in transformers to match dense multi-modal performance with roughly one-third to one-half the FLOPs.
- The Association of Transformer-based Sentiment Analysis with Symptom Distress and Deterioration in Routine Psychotherapy Care
Transformer-derived sentiment features from therapy sessions correlate with emotional-valence components of the OQ-45 and differ significantly between patients identified as at risk of deterioration by rational and empirical outcome models.
- Audio Spoof Detection with GaborNet
GaborNet replaces sinc functions with Gabor filters in raw-audio neural networks and is tested for audio spoof detection with augmentations in RawNet2 and RawGAT-ST.
- Foundation Models Defining A New Era In Sensor-based Human Activity Recognition: A Survey And Outlook
The survey organizes foundation models for sensor-based HAR into a lifecycle taxonomy and identifies three trajectories: HAR-specific models from scratch, adaptation of general time-series models, and integration with large language models.
- Woosh: A Sound Effects Foundation Model
Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
- Learning to Hear Broken Motors: Signature-Guided Data Augmentation for Induction-Motor Diagnostics
SGDA generates synthetic faults in the frequency domain from healthy signals to augment training data for ML-based induction motor diagnostics, claiming superior accuracy.
- CCNETS: A Modular Causal Learning Framework for Pattern Recognition in Imbalanced Datasets
CCNETS is a new modular causal framework using three cooperative modules and a Zoint mechanism to align synthetic data generation with classifier needs on imbalanced pattern recognition tasks.
- Quantum-inspired tensor networks in machine learning models
Tensor networks developed for quantum states are reviewed as tools for machine learning models, with assessment of their potential computational, explanatory, and privacy advantages alongside remaining challenges.