Fully Convolutional Neural Network for Denoising
Teammate: Lauren Liao
Work delivered: Deep learning model implementation, audio decoder, UX research
Abstract
A cochlear implant (CI) electronically stimulates the auditory nerve to help people with severe hearing loss. In noisy environments, however, speech perception remains difficult for CI users. Speech enhancement (SE) is therefore a critical component for improving speech perception across different noise scenarios. In this study, we focus on a fully convolutional network (FCN) model that extracts clean speech from non-stationary noise, specifically conversations in the background.
Building on the recent success of SE models that use log power spectrum (LPS) based deep neural networks (DNNs), we propose an FCN model that preserves local information to improve speech perception for CI users. The FCN model uses only convolutional layers to model the waveform. Because it has no fully connected layers, it needs fewer weights than the previously proposed DNN model. Other SE models denoise the magnitude spectrogram and leave the phase in its original noisy form, which lowers the quality of the extracted speech. The FCN model instead takes the waveform as input and produces a waveform as output, which yields better extraction than these traditional models.
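The weight savings from dropping fully connected layers can be illustrated with a small parameter count. This is a minimal sketch, not the model from the paper: the layer sizes below (30 channels, kernel length 55, 1024 hidden units) are hypothetical values chosen only for illustration, and the helper names are assumptions.

```python
def conv1d_params(in_channels, out_channels, kernel_size, bias=True):
    """Weights in one 1-D convolutional layer.

    The same kernels slide over every time step, so the count does not
    depend on the length of the input waveform.
    """
    return out_channels * (in_channels * kernel_size + (1 if bias else 0))


def dense_params(in_features, out_features, bias=True):
    """Weights in one fully connected layer: every input connects to every output."""
    return out_features * (in_features + (1 if bias else 0))


# Hypothetical layer sizes, for illustration only (not taken from the paper).
conv_layer = conv1d_params(in_channels=30, out_channels=30, kernel_size=55)
dense_layer = dense_params(in_features=1024, out_features=1024)

print("conv layer weights: ", conv_layer)
print("dense layer weights:", dense_layer)
```

With these assumed sizes, a single convolutional layer holds far fewer weights than a single fully connected layer, because the convolution shares its kernels across time.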
To evaluate the FCN model’s performance, we conducted a hearing test and analyzed short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) scores. We set up the hearing test with signal-to-noise ratio (SNR) levels and noise types that were mismatched between the training and testing sets, to simulate a realistic situation for CI users. Within each subject, we used the original noisy utterances, without enhancement, as the baseline and compared them with speech enhanced by the DNN model and by the FCN model. Because all of our subjects had normal hearing, we processed the speech signals through a voice encoder (vocoder) to mimic the sounds heard by CI patients. We passed sentence utterances (three to four seconds each) through the vocoder to produce the vocoded speech for the hearing test. We then played the vocoded speech to the normal-hearing subjects and recorded the number of characters they identified correctly.
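Mixing speech with background noise at a chosen SNR can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the study: the function names are hypothetical, the signals are synthetic stand-ins, and SNR is taken as the ratio of mean signal powers in decibels.

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech-to-noise power equals `snr_db`, then add it.

    Returns the noisy mixture and the scaled noise.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    # Solve 10*log10(P_speech / (gain^2 * P_noise)) = snr_db for gain.
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB between a clean signal and a noise signal."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Synthetic stand-ins for a speech utterance and a background-talker noise.
speech = [math.sin(0.10 * i) for i in range(1000)]
noise = [math.cos(0.37 * i) for i in range(1000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-3.0)
print(round(measured_snr_db(speech, scaled_noise), 3))
```

A negative SNR such as -3 dB means the background noise carries more power than the speech.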
Results show that the proposed FCN model obtains higher STOI and PESQ scores than the LPS-based DNN model. In the hearing test, speech enhanced by the DNN model and speech enhanced by the FCN model both showed effective noise reduction, with fifty percent more correct characters than speech without enhancement. Moreover, for sentences mixed with two-talker and four-talker background noise at -3 dB and -6 dB SNR, speech enhanced by the FCN model performs approximately the same as speech enhanced by the DNN model. Although the hearing test results are similar for the two models, the FCN model needs approximately a tenth fewer weights, or parameters, than the DNN model. Together with its higher STOI and PESQ scores, this indicates that the FCN model outperforms the DNN model in SE for CI users. Furthermore, because storage space on a CI is limited, the reduction in parameters makes the FCN model a better choice than the DNN model.
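One way to score a listener's answer by correct characters is sketched below. The scoring rule here is an assumption, not the one documented for the hearing test: it counts characters of the reference sentence that appear in the response in the same order (longest common subsequence), and the function names and sample strings are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        curr = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def character_correct_rate(reference, response):
    """Fraction of reference characters the listener reproduced in order."""
    if not reference:
        raise ValueError("reference sentence must not be empty")
    return lcs_length(reference, response) / len(reference)


# Hypothetical five-character sentence with one character misheard.
print(character_correct_rate("abcde", "abxde"))
```

An order-preserving match tolerates a dropped or substituted character without penalizing every character after it, which a strict position-by-position comparison would do.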

ieee_urtc_2017_paper.pdf
Report:
Paper Poster: