Sequence Transduction with Recurrent Neural Networks

Alex Graves

arXiv: 1211.3711 · v1 · pith:YXDVAKSAnew · submitted 2012-11-14 · cs.NE · cs.LG · stat.ML

classification: cs.NE · cs.LG · stat.ML

keywords: sequence · transduction · output · input · learning · sequences · RNNs · alignment

abstract

Many machine learning tasks can be expressed as the transformation, or transduction, of input sequences into output sequences: speech recognition, machine translation, protein secondary structure prediction and text-to-speech, to name but a few. One of the key challenges in sequence transduction is learning to represent both the input and output sequences in a way that is invariant to sequential distortions such as shrinking, stretching and translating. Recurrent neural networks (RNNs) are a powerful sequence learning architecture that has proven capable of learning such representations. However, RNNs traditionally require a pre-defined alignment between the input and output sequences to perform transduction. This is a severe limitation, since finding the alignment is the most difficult aspect of many sequence transduction problems. Indeed, even determining the length of the output sequence is often challenging. This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence. Experimental results for phoneme recognition are provided on the TIMIT speech corpus.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

EviTrack: Selection over Sampling for Delayed Disambiguation
cs.LG 2026-05 unverdicted novelty 6.0
EviTrack is a test-time inference framework that performs selection over latent trajectory hypotheses rather than marginal sampling to handle delayed disambiguation, outperforming baselines on a synthetic benchmark.

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

Voxtral Realtime
cs.AI 2026-02 unverdicted novelty 6.0
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
cs.CV 2024-11 unverdicted novelty 6.0
LLaVA-CoT adds autonomous multistage reasoning to vision-language models, delivering 9.4% gains over its base model and outperforming larger models like Gemini-1.5-pro on reasoning benchmarks via a 100k annotated data...

Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
cs.CL 2024-10 unverdicted novelty 6.0
ToxPrune prunes toxic subwords from BPE tokenizers in LLMs to mitigate toxic dialogue responses and improve diversity on both toxic and non-toxic models.

StepAudio 2.5 Technical Report
eess.AS 2026-05 unverdicted novelty 5.0
StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.

Contextual Biasing for Streaming ASR via CTC-based Word Spotting
eess.AS 2026-05 unverdicted novelty 5.0
A streaming CTC-WS method with stateful token passing and incremental commitment for low-latency contextual biasing that reduces WER and improves keyword F-score in real-time ASR.

Regularized Entropy Information Adaptation with Temporal-Awareness Networks for Simultaneous Speech Translation
cs.LG 2026-04 unverdicted novelty 5.0
REINA-SAN and REINA-TAN add temporal context to information-based read/write policies, improving the quality-latency tradeoff in simultaneous speech translation by up to 7.1% on Normalized Streaming Efficiency.

Enhancing ASR Performance in the Medical Domain for Dravidian Languages
eess.AS 2026-04 unverdicted novelty 5.0
A hybrid confidence-aware ASR training framework with learnable weights reduces Telugu medical WER from 24.3% to 15.8% and Kannada from 31.7% to 25.4%, outperforming standard fine-tuning.

Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition
cs.CL 2026-04 unverdicted novelty 5.0
The authors introduce LLM-based semantic judgment and an agentic interaction loop that improves semantic fidelity and enables iterative corrections in automatic speech recognition beyond traditional WER.

Phoneme-Based Contextualization for Cross-Lingual Speech Recognition in End-to-End Models
cs.CL 2019-06 unverdicted novelty 5.0
An E2E ASR model with mixed wordpieces and phonemes improves foreign proper noun recognition via phoneme-level contextual biasing, showing 16% gain over grapheme-only and 8% over wordpiece-only baselines.

Contextual Biasing for Streaming ASR via CTC-based Word Spotting
eess.AS 2026-05 unverdicted novelty 4.0
Introduces a streaming CTC-WS method with stateful token passing and incremental commitment for low-latency contextual biasing in ASR, claiming reduced WER and improved keyword F-score.

MedASR: An Open-Source Model for High-Accuracy Medical Dictation
eess.AS 2026-05 unverdicted novelty 4.0
MedASR is an open-source 105M-parameter ASR model achieving 58% relative WER reduction versus Whisper Large-v3 on medical dictation.

NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
eess.AS 2026-04 unverdicted novelty 4.0
NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey
cs.CV 2026-04 unverdicted novelty 4.0
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR
eess.AS 2019-07 unverdicted novelty 4.0
KLD-based speaker adaptation of seq2seq ASR achieves 25% relative WER reduction, outperforming the 18.7% gain from conventional acoustic model adaptation.