
AudioCLIP: Extending CLIP to Image, Text and Audio

Ke Chen, Xingjian Du, Bilei Zhu, Zejun Ma, Taylor Berg-Kirkpatrick, Shlomo Dubnov · 2022 · arXiv 3922.2022

Mixed citation behavior. The most common role is background (57% of classified citations).

25 Pith papers citing it

citation-role summary

background: 4 · method: 2 · dataset: 1

citation-polarity summary

background: 4 · extend: 1 · use dataset: 1 · use method: 1

representative citing papers

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

cs.AI · 2026-05-05 · unverdicted · novelty 8.0

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

cs.CL · 2023-05-02 · unverdicted · novelty 8.0

ciwGAN and fiwGAN models trained on isolated words spontaneously generate concatenated multi-word outputs and display early compositionality precursors.

Normative Networks for Source Separation via Local Plasticity and Dendritic Computation

cs.LG · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

Predictive Entropy Maximization performs competitive blind source separation using only local error-driven and Hebbian updates derived from a surrogate entropy objective with spectral error bounds.

Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

KL regularization aligning model predictions with empirical transition patterns improves macro-F1 by 9-42% in next dialogue act prediction on German counselling data and transfers to other datasets.

Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

cs.CV · 2026-04-14 · conditional · novelty 7.0

Introduces the LDD task, ListenForge dataset built from five listening head generation methods, and MANet model that detects listening forgeries via motion inconsistencies guided by audio semantics.

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

stat.ML · 2026-01-18 · conditional · novelty 7.0

Momentum SGD incurs a provable drift-amplification penalty in nonstationary stochastic optimization that makes it worse than vanilla SGD in drift-dominated regimes, confirmed by finite-time upper bounds and minimax lower bounds under gradient-variation constraints.

TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

TB-AVA uses text-mediated gated semantic modulation to enable efficient audio-visual alignment, achieving state-of-the-art results on AVE, AVS, and AVVP benchmarks.

Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SCALE disentangles emotion and cause representations in conversations and uses optimal transport for many-to-many global alignment, achieving SOTA on ECPEC benchmarks.

PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL

cs.LG · 2026-04-09 · unverdicted · novelty 6.0

PRISM-CTG is the first large-scale foundation model for cardiotocography that uses multi-view self-supervised learning on unlabeled data to learn transferable representations, outperforming baselines on seven downstream tasks with external validation.

STAMP: Spatial-Temporal Adapter with Multi-Head Pooling

cs.LG · 2025-11-13 · unverdicted · novelty 6.0

STAMP adapter enables general time series foundation models to match specialized EEG foundation models on clinical classification tasks across 8 benchmarks while using few trainable parameters.

CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents

cs.SD · 2025-09-15 · unverdicted · novelty 6.0

CodecSep performs prompt-driven universal sound separation directly in neural audio codec latents by combining a frozen DAC backbone with a lightweight FiLM-conditioned Transformer masker driven by CLAP embeddings, yielding efficiency gains over AudioSep.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

eess.AS · 2023-11-14 · unverdicted · novelty 6.0

Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.

FusionSense: Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence

cs.LG · 2026-05-19 · unverdicted · novelty 5.0

FusionSense uses server-side fusion learning, filter-out-safe labels, and edge compaction to enable runtime-adaptive multimodal sensing that cuts energy up to 33x while preserving task quality on RGB+Depth data.

WorldSpeech: A Multilingual Speech Corpus from Around the World

cs.CL · 2026-05-09 · unverdicted · novelty 5.0 · 2 refs

WorldSpeech supplies 65k hours of multilingual aligned speech data across 76 languages and delivers 63.5% average relative WER reduction after fine-tuning ASR models on 11 typologically diverse languages.

Audio Spoof Detection with GaborNet

cs.SD · 2026-04-21 · unverdicted · novelty 5.0

GaborNet replaces sinc functions with Gabor filters in raw-audio neural networks and is tested for audio spoof detection with augmentations in RawNet2 and RawGAT-ST.

R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection

cs.CV · 2026-04-19 · unverdicted · novelty 5.0

R-FLoRA combines Laplacian residual statistics with a frozen vision transformer via gated low-rank adapters, residual fusion, and contrastive alignment to achieve better accuracy and generalization than prior single-image face morphing attack detectors.

Qwen3.5-Omni Technical Report

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

Qwen3.5-Omni scales an omnimodal model to hundreds of billions of parameters with 256k context, introduces ARIA for stable speech synthesis, and reports SOTA performance on 215 audio-visual benchmarks while adding multilingual and audio-visual coding capabilities.

Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

cs.CL · 2025-07-17 · unverdicted · novelty 5.0

Balalaika is a data-centric annotation pipeline for Russian speech that combines semantic VAD, ASR ensembling, and prosody enrichment to build a 5.1k-hour corpus showing gains in denoising and TTS.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

Qwen2-Audio Technical Report

eess.AS · 2024-07-15 · unverdicted · novelty 4.0

Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.

Secure Password Generator Based on Secure Pseudo-Random Number Generator

cs.CR · 2025-08-25 · unverdicted · novelty 2.0

A MAC-based PRNG for passwords is implemented and shown to meet NIST SP 800-90B entropy and IID criteria.

Quantum Adversarial Machine Learning: From Classical Adaptations to Quantum-Native Methods

cs.LG · 2026-05-12 · unverdicted · novelty 1.0

A survey of quantum adversarial machine learning covering attacks, countermeasures, theoretical underpinnings, trends, and challenges.

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

eess.AS · 2026-05-10 · 2 refs

citing papers explorer

Showing 25 of 25 citing papers.

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval · cs.AI · 2026-05-05 · unverdicted · none · ref 17
Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks · cs.CL · 2023-05-02 · unverdicted · none · ref 11
Normative Networks for Source Separation via Local Plasticity and Dendritic Computation · cs.LG · 2026-05-19 · unverdicted · none · ref 23 · 2 links
Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations · cs.CL · 2026-04-20 · unverdicted · none · ref 38
Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis · cs.CV · 2026-04-14 · conditional · none · ref 44
On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization · stat.ML · 2026-01-18 · conditional · none · ref 2
TB-AVA: Text as a Semantic Bridge for Audio-Visual Parameter Efficient Finetuning · cs.CV · 2026-05-12 · unverdicted · none · ref 8 · 2 links
Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment · cs.CL · 2026-04-21 · unverdicted · none · ref 9
PRISM-CTG: A Foundation Model for Cardiotocography Analysis with Multi-View SSL · cs.LG · 2026-04-09 · unverdicted · none · ref 34
STAMP: Spatial-Temporal Adapter with Multi-Head Pooling · cs.LG · 2025-11-13 · unverdicted · none · ref 7
CodecSep: Prompt-Driven Universal Sound Separation on Neural Audio Codec Latents · cs.SD · 2025-09-15 · unverdicted · none · ref 29
Step-Audio 2 Technical Report · cs.CL · 2025-07-22 · unverdicted · none · ref 26
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models · eess.AS · 2023-11-14 · unverdicted · none · ref 16
FusionSense: Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence · cs.LG · 2026-05-19 · unverdicted · none · ref 6
WorldSpeech: A Multilingual Speech Corpus from Around the World · cs.CL · 2026-05-09 · unverdicted · none · ref 26 · 2 links
Audio Spoof Detection with GaborNet · cs.SD · 2026-04-21 · unverdicted · none · ref 9
R-FLoRA: Residual-Statistic-Gated Low-Rank Adaptation for Single-Image Face Morphing Attack Detection · cs.CV · 2026-04-19 · unverdicted · none · ref 28
Qwen3.5-Omni Technical Report · cs.CL · 2026-04-17 · unverdicted · none · ref 47
Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech · cs.CL · 2025-07-17 · unverdicted · none · ref 27
Kimi-Audio Technical Report · eess.AS · 2025-04-25 · unverdicted · none · ref 23
Qwen2-Audio Technical Report · eess.AS · 2024-07-15 · unverdicted · none · ref 14
Secure Password Generator Based on Secure Pseudo-Random Number Generator · cs.CR · 2025-08-25 · unverdicted · none · ref 14
Quantum Adversarial Machine Learning: From Classical Adaptations to Quantum-Native Methods · cs.LG · 2026-05-12 · unverdicted · none · ref 147
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations · eess.AS · 2026-05-10 · unreviewed · ref 5 · 2 links
DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories · cs.CL · 2026-04-22 · unreviewed · ref 42
