Visual-Aware Speech Recognition in Noisy Settings


Visual-Aware Speech Recognition Model Flowchart

Figure 1: Flowchart illustrating the integration of audio-visual data for enhanced speech recognition. The model uses video analysis and noise-label classification to inform the speech transcription process, effectively creating a noise-aware, audio-visual speech recognition system.

Introduction

In many real-world scenarios, accurately transcribing speech in noisy environments poses a significant challenge for traditional Automatic Speech Recognition (ASR) systems. Current models often struggle to differentiate between background noise and spoken words, leading to a higher error rate in transcription. To address this issue, we propose an innovative approach that leverages visual cues to enhance speech recognition accuracy in noisy environments.

Our approach builds on the concept that humans use both auditory and visual signals—such as lip movements and environmental context—to improve their understanding of speech. By incorporating these visual cues into the transcription process, our audio-visual speech recognition (AVSR) model becomes more resilient to noise. Unlike traditional models that focus solely on lip motion, our model extracts broader environmental context to distinguish between speech and noise.

This project introduces a scalable data creation pipeline that mixes clean speech from the PeopleSpeech dataset with noise-labeled events from AudioSet. By correlating noise sources with visual information, the model is trained to not only transcribe speech but also predict noise labels. The result is a robust, multimodal system capable of accurate speech recognition even in challenging acoustic environments.

Dataset Creation Pipeline

We curated 44 noise types from the AudioSet dataset to serve as the background-noise sources that are mixed into the clean speech.
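For illustration, here is a minimal sketch of how noise clips might be selected from the AudioSet segment metadata. The label IDs below are placeholders (the real ones come from the AudioSet ontology), and the actual curation used in this project may differ.

```python
import csv

# Placeholder machine IDs for the curated noise classes; the real IDs come from
# the AudioSet ontology (ontology.json), e.g. the entries for "Skateboard", "Rain", etc.
NOISE_LABEL_IDS = {"/m/noise_id_1", "/m/noise_id_2"}

def select_noise_segments(segments_csv: str) -> list[dict]:
    """Return AudioSet segments whose label set intersects the curated noise classes."""
    selected = []
    with open(segments_csv) as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):   # skip header/comment lines
                continue
            ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
            hits = set(labels.split(",")) & NOISE_LABEL_IDS
            if hits:
                selected.append({"ytid": ytid, "start": start, "end": end,
                                 "labels": sorted(hits)})
    return selected

# e.g. clips = select_noise_segments("balanced_train_segments.csv")
```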

Dataset Pipeline and Qualitative Samples

Below is a qualitative sample from our dataset demonstrating the integration of environmental noise (skateboard) with clean speech from PeopleSpeech. This example highlights how the audio changes at different Signal-to-Noise Ratio (SNR) levels.

Video Example: Skateboard Noise

Clean Speech Example: PeopleSpeech

Audio Changes with Various SNR Levels

Audio samples are provided at the following SNR levels: +20 dB, +15 dB, +10 dB, +5 dB, 0 dB, -5 dB, -10 dB, -15 dB, and -20 dB.
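For reference, the mixing step itself can be sketched as follows, assuming both signals are mono waveforms at the same sample rate; this is a minimal illustration rather than the exact pipeline code.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` so that the speech-to-noise power ratio equals `snr_db`."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-10
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # Avoid clipping when writing to fixed-point audio formats.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Example: generate the noisy versions at the SNR levels listed above.
# for snr in [20, 15, 10, 5, 0, -5, -10, -15, -20]:
#     noisy = mix_at_snr(clean_speech, skateboard_noise, snr)
```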

Model Design

Visual cues in our environment significantly aid our understanding of auditory information, a phenomenon well studied in cognitive psychology. Inspired by how humans can focus on relevant sounds even in noisy settings, we aimed to replicate this selective attention in speech recognition models. Traditional models, while effective in controlled environments, often fail to address the complexity of real-world audio scenarios. By integrating visual data, our model aims to pinpoint relevant auditory signals with higher accuracy.

A crucial aspect of our project was the creation of a robust dataset that includes labeled noise events from AudioSet alongside clean speech from the PeopleSpeech dataset. This mix provides a diverse range of training scenarios, mimicking real-world noise conditions.

Conformer-Based Model for AV Speech Recognition

Figure 4: Diagram illustrating our conformer-based model for audio-visual speech recognition. The architecture employs a pre-trained conformer for speech encoding, paired with CLIP-based visual feature extraction for multimodal processing. This design enhances transcription accuracy in noisy environments by correlating visual cues from the environment with noise sources in the audio stream.

Our model uses a conformer-based architecture designed to integrate both audio and visual modalities for improved speech recognition in noisy scenarios. The audio encoder, a pre-trained conformer, processes the speech signals, while the visual features are extracted using a pre-trained CLIP image encoder. These visual features provide contextual information about the noise sources in the environment, enabling the model to distinguish between speech and noise more effectively.
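As a rough illustration of the visual branch, the snippet below encodes sampled video frames with a pre-trained CLIP image encoder from Hugging Face Transformers. The checkpoint name and frame-sampling strategy are assumptions, not necessarily those used in the project.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the project's actual CLIP variant may differ.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def frame_features(frames: list[Image.Image]) -> torch.Tensor:
    """Encode sampled video frames with CLIP; the per-frame embeddings describe the
    visual scene (and hence the likely noise sources) and can later serve as keys and
    values for the cross-modal attention described below."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = clip.get_image_features(**inputs)         # (num_frames, 512)
    return feats / feats.norm(dim=-1, keepdim=True)   # L2-normalised
```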

The model enhances the audio embeddings through a cross-modal multi-headed attention mechanism, aligning the visual and auditory inputs. This allows the model to make use of visual context, such as background scenes or objects, which helps in filtering out noise and improving transcription accuracy. The combined audio-visual representations are then processed by a transformer encoder, followed by a convolutional decoder, which jointly predicts the speech transcription and the noise label using a CTC loss function. This approach enables the model to outperform traditional audio-only systems in noisy conditions, even when visual input is unavailable at inference time.
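The fusion stage can be sketched in PyTorch as follows. Dimensions, head counts, and the way the noise label enters the loss are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class AVFusion(nn.Module):
    """Sketch of the fusion stage: conformer audio embeddings attend over CLIP visual
    features, a transformer encoder refines the fused sequence, a convolutional decoder
    emits CTC logits, and a pooled head predicts the noise label."""

    def __init__(self, d_model=256, d_visual=512, vocab_size=1000, num_noise_labels=44):
        super().__init__()
        self.visual_proj = nn.Linear(d_visual, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.ctc_decoder = nn.Conv1d(d_model, vocab_size, kernel_size=3, padding=1)
        self.noise_head = nn.Linear(d_model, num_noise_labels)  # assumption: pooled classifier

    def forward(self, audio_emb, visual_feats):
        # audio_emb:    (B, T, d_model)   from the pre-trained conformer encoder
        # visual_feats: (B, N, d_visual)  CLIP features for N sampled frames
        vis = self.visual_proj(visual_feats)
        # Audio frames query the visual context; the residual keeps the audio path intact,
        # which also lets the model run when visual input is absent (zeroed) at inference.
        attended, _ = self.cross_attn(query=audio_emb, key=vis, value=vis)
        fused = self.encoder(audio_emb + attended)
        ctc_logits = self.ctc_decoder(fused.transpose(1, 2)).transpose(1, 2)  # (B, T, vocab)
        noise_logits = self.noise_head(fused.mean(dim=1))                     # (B, num_noise_labels)
        return ctc_logits, noise_logits

# Training combines a CTC loss on the transcription with a classification loss on the
# noise label (the exact weighting is an assumption here):
#   loss = nn.CTCLoss()(ctc_logits.log_softmax(-1).transpose(0, 1), targets,
#                       input_lengths, target_lengths) \
#          + nn.CrossEntropyLoss()(noise_logits, noise_targets)
```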

Results

Model Performance at Various SNR Levels

Figure 5: Performance comparison of different speech recognition models at 10 dB SNR. The table shows Word Error Rate (WER) and noise label prediction accuracy, demonstrating that the audio-visual model (AV-UNI-SNR) significantly outperforms the audio-only model, particularly when visual information is provided at both training and inference stages.

Figure 6: Word Error Rate (WER) comparison across various Signal-to-Noise Ratio (SNR) levels, showcasing the robustness of the AV-UNI-SNR model in noisy environments. The AV-UNI-SNR model consistently maintains lower WER, even in extreme noise conditions, compared to the conformer-based audio-only model.

The results demonstrate that our visual-aware speech recognition model achieves significant improvements over traditional audio-only models, especially in noisy environments. Specifically, the AV-UNI-SNR model showed a substantial reduction in Word Error Rate (WER) and higher accuracy in noise label prediction when compared to the conformer-based baseline model.
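As a quick reminder of the metric, WER is the word-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the number of reference words. A minimal, self-contained computation (not the project's evaluation code) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / #reference words,
    computed with the standard Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn the music down please", "turn the music town"))  # 0.4 (1 sub + 1 del)
```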

The experiments highlight that integrating visual information allows the model to handle complex acoustic conditions more effectively. The AV-UNI-SNR model, trained across varied SNR levels, exhibited robustness across diverse noise scenarios, with the most significant gains observed when both visual and audio inputs were utilized during inference. Furthermore, the model's ability to maintain improved performance even without visual input at inference time suggests that training with visual cues enhances the model's overall understanding of acoustic environments.

These results confirm the hypothesis that leveraging visual cues significantly enhances the transcription accuracy and noise label prediction in noisy environments, marking a substantial advancement in the field of audio-visual speech recognition.

Conclusion and Future Work

The integration of visual cues with audio signals in speech recognition systems has proven to significantly enhance transcription accuracy, particularly in noisy environments. Our model, which leverages pre-trained audio and visual encoders with multi-headed attention, has shown marked improvements over traditional audio-only models. By correlating noise sources with visual cues, our approach enables more accurate transcription and noise label prediction.

Future work will involve exploring additional pre-trained speech and visual encoder checkpoints to further improve model performance. We also aim to expand the dataset by incorporating more complex and diverse noise labels from AudioSet. Additionally, extending this model to include other visual cues beyond noise labels, such as related visual events, could open new possibilities for more robust, scalable audio-visual speech recognition systems.

Acknowledgments

I would like to express my gratitude to the GSoC community, Red Hen Lab, and my mentor Karan Singla for their support throughout this project. Their insights and guidance were invaluable in shaping this approach to speech recognition.

Further Progress

This work is currently in the process of being archived on arXiv. More details will be provided here once the paper is published.