Rationale
This page contains listening examples of noisy and enhanced speech processed in the fully blind setup using algorithms proposed and/or evaluated in [1].
The supplementary MUSHRA test repository contains the GUI, the full set of audio files used for the experiment, the data collected from 14 listeners, and the MATLAB script for statistical analysis of the MUSHRA data, which was used in [1] to generate Figure 9 and Tables III and IV.
Setup, Algorithms and Evaluation Metrics
Below we present examples of speech enhancement under the following noise conditions:
- Female speech in modulated pink noise (SNR = 5 dB)
- Male speech in babble noise (SNR = 3 dB)
- Male speech in factory noise (SNR = 10 dB)
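A minimal sketch of how a speech-in-noise mixture at a prescribed SNR can be produced. This page does not describe how its mixtures were generated, so everything here is an assumption for illustration: the SNR is taken as a global power ratio over the whole signal, the function names are hypothetical, and the toy sinusoid and Gaussian noise merely stand in for real recordings.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def measure_snr_db(speech, noise):
    """Global SNR in dB between a speech signal and a noise signal."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so the global SNR equals `target_snr_db`, then add it to `speech`."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    # Solve 10*log10(P_speech / (gain^2 * P_noise)) = target_snr_db for gain.
    gain = math.sqrt(
        mean_power(speech) / (mean_power(noise) * 10.0 ** (target_snr_db / 10.0))
    )
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


# Toy stand-ins for real recordings (assumption: 1 s at 8 kHz).
rng = random.Random(0)
fs = 8000
speech = [math.sin(2.0 * math.pi * 200.0 * t / fs) for t in range(fs)]
noise = [rng.gauss(0.0, 1.0) for _ in range(fs)]

noisy, scaled_noise = mix_at_snr(speech, noise, target_snr_db=5.0)
achieved = measure_snr_db(speech, scaled_noise)
```

With real recordings, the SNR is often measured over active speech segments only, which would change the gain computed above.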
For each speech-in-noise recording, we consider two enhancement scenarios:
- Phase-only enhancement;
- Combined magnitude & phase enhancement.
For phase enhancement, we first estimate the bi-phase at the harmonics with a bi-phase smoothing scheme and then compute the enhanced harmonic phases from the estimated bi-phase with a Fourier phase recovery scheme.
- Bi-phase smoothing schemes:
  - Smooth Everywhere;
  - Binary Hypothesis.
- Fourier phase recovery schemes:
  - Barysenka-Vorobiov-Mowlaee [2];
  - Bartelt-Lohmann-Wirnitzer [3].
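To make the relation between harmonic phases and the bi-phase concrete, here is a minimal sketch. It assumes the usual bispectral definition, in which the bi-phase of harmonics h1 and h2 is phi(h1) + phi(h2) - phi(h1 + h2), and it uses a generic textbook-style recursion along the (h, 1) slice with a known fundamental phase. It is not the specific recovery schemes evaluated on this page, and the function names are hypothetical.

```python
import math


def wrap(phase):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


def biphase(phases, h1, h2):
    """Bi-phase beta(h1, h2) = phi(h1) + phi(h2) - phi(h1 + h2); harmonics are 1-based."""
    return wrap(phases[h1 - 1] + phases[h2 - 1] - phases[h1 + h2 - 1])


def recover_phases(phi1, biphase_slice):
    """Rebuild phi(1..H) from phi(1) and beta(h, 1), h = 1..H-1.

    Rearranging the definition gives phi(h + 1) = phi(h) + phi(1) - beta(h, 1).
    """
    phases = [wrap(phi1)]
    for beta in biphase_slice:
        phases.append(wrap(phases[-1] + phi1 - beta))
    return phases


# Toy harmonic phases in radians (illustrative values only).
true_phases = [0.3, -1.2, 2.5, 0.9, -2.8]
slice_h1 = [biphase(true_phases, h, 1) for h in range(1, len(true_phases))]
recovered = recover_phases(true_phases[0], slice_h1)
```

In an enhancement setting the bi-phase values would be smoothed estimates rather than exact ones, so the recursion accumulates estimation error toward higher harmonics.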
For magnitude enhancement, we consider the conventional MMSE-LSA algorithm [4].
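For reference, the log-spectral amplitude estimator of [4] applies a gain to the noisy spectral amplitude in each frequency bin. The notation below is chosen here and not taken from this page: R_k is the noisy amplitude in bin k, \xi_k the a priori SNR, and \gamma_k the a posteriori SNR. How the SNRs and the noise power are estimated in the experiments is not stated on this page.

```latex
\hat{A}_k = G(\xi_k, \gamma_k)\, R_k,
\qquad
G(\xi_k, \gamma_k) = \frac{\xi_k}{1 + \xi_k}
  \exp\!\left( \frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt \right),
\qquad
v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k
```

The enhanced amplitude is then combined with a phase spectrum (unprocessed, clean, or enhanced) to resynthesize the signal.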
We report speech enhancement performance using the following objective evaluation metrics:
- Perceptual evaluation of speech quality (PESQ) [5];
- Short-time objective intelligibility measure (STOI) [6];
- Phase deviation (PDev) [7];
- Segmented noise attenuation (NAseg) [8].
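A minimal sketch of one way to compute a segmental noise attenuation figure. It assumes the measure is the frame-averaged ratio of input-noise energy to residual-noise energy, expressed in dB, which requires access to the noise component before and after processing. The frame length of 256 samples, the handling of empty frames, and the function name are assumptions for illustration and may differ from the definition in [8] and from the implementation behind the numbers on this page.

```python
import math


def na_seg_db(noise, residual_noise, frame_len=256):
    """Segmental noise attenuation in dB over non-overlapping frames."""
    if len(noise) != len(residual_noise):
        raise ValueError("noise and residual_noise must have the same length")
    ratios = []
    for start in range(0, len(noise) - frame_len + 1, frame_len):
        e_in = sum(x * x for x in noise[start:start + frame_len])
        e_out = sum(x * x for x in residual_noise[start:start + frame_len])
        if e_in == 0.0:
            continue  # no noise in this frame, nothing to attenuate
        if e_out == 0.0:
            return math.inf  # noise fully removed in at least one frame
        ratios.append(e_in / e_out)
    if not ratios:
        raise ValueError("no frame with non-zero input noise energy")
    return 10.0 * math.log10(sum(ratios) / len(ratios))


# Toy noise signal (illustrative values only).
noise = [math.sin(0.37 * n) + 0.5 * math.cos(1.3 * n) for n in range(1024)]
halved = [0.5 * x for x in noise]

unprocessed_db = na_seg_db(noise, noise)
halved_db = na_seg_db(noise, halved)
```

Under this formulation an unprocessed signal gives 0 dB and a noise-free signal gives infinity, and halving the noise amplitude gives about 6 dB.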
Female Speech in Modulated Pink Noise (SNR = 5 dB)
Clean | Noisy |
---|---|
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NAseg = ∞ dB | PESQ = 1.55, STOI = 0.77, PDev = 0.64, NAseg = 0.00 dB |
Phase-Only Enhancement
Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 1.80, STOI = 0.78, PDev = 0.53, NAseg = 1.58 dB | PESQ = 1.80, STOI = 0.78, PDev = 0.59, NAseg = 1.66 dB |
Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 1.88, STOI = 0.81, PDev = 0.51, NAseg = 1.55 dB | PESQ = 1.82, STOI = 0.81, PDev = 0.53, NAseg = 1.58 dB |
Combined Magnitude & Phase Enhancement
MMSE-LSA + Unprocessed Phase (lower bound) | MMSE-LSA + Clean Phase (upper bound) |
---|---|
PESQ = 2.38, STOI = 0.81, NAseg = 10.03 dB | PESQ = 2.62, STOI = 0.85, NAseg = 11.00 dB |
MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 2.61, STOI = 0.81, PDev = 0.41, NAseg = 10.98 dB | PESQ = 2.60, STOI = 0.81, PDev = 0.49, NAseg = 11.11 dB |
MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 2.66, STOI = 0.82, PDev = 0.40, NAseg = 10.97 dB | PESQ = 2.65, STOI = 0.82, PDev = 0.43, NAseg = 11.07 dB |
Male Speech in Babble Noise (SNR = 3 dB)
Clean | Noisy |
---|---|
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NAseg = ∞ dB | PESQ = 1.66, STOI = 0.81, PDev = 0.62, NAseg = 0.00 dB |
Phase-Only Enhancement
Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 1.81, STOI = 0.77, PDev = 0.67, NAseg = 2.80 dB | PESQ = 1.74, STOI = 0.75, PDev = 0.79, NAseg = 3.11 dB |
Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 1.88, STOI = 0.81, PDev = 0.57, NAseg = 2.57 dB | PESQ = 1.75, STOI = 0.80, PDev = 0.58, NAseg = 2.73 dB |
Combined Magnitude & Phase Enhancement
MMSE-LSA + Unprocessed Phase (lower bound) | MMSE-LSA + Clean Phase (upper bound) |
---|---|
PESQ = 1.90, STOI = 0.82, NAseg = 9.15 dB | PESQ = 2.29, STOI = 0.87, NAseg = 10.65 dB |
MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 2.01, STOI = 0.80, PDev = 0.65, NAseg = 11.08 dB | PESQ = 1.92, STOI = 0.78, PDev = 0.64, NAseg = 12.01 dB |
MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 1.97, STOI = 0.82, PDev = 0.55, NAseg = 10.78 dB | PESQ = 1.93, STOI = 0.80, PDev = 0.57, NAseg = 11.14 dB |
Male Speech in Factory Noise (SNR = 10 dB)
Clean | Noisy |
---|---|
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NAseg = ∞ dB | PESQ = 2.17, STOI = 0.88, PDev = 0.36, NAseg = 0.00 dB |
Phase-Only Enhancement
Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 2.28, STOI = 0.88, PDev = 0.35, NAseg = 1.49 dB | PESQ = 2.26, STOI = 0.87, PDev = 0.41, NAseg = 1.71 dB |
Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 2.28, STOI = 0.86, PDev = 0.32, NAseg = 1.31 dB | PESQ = 2.23, STOI = 0.86, PDev = 0.35, NAseg = 1.43 dB |
Combined Magnitude & Phase Enhancement
MMSE-LSA + Unprocessed Phase (lower bound) | MMSE-LSA + Clean Phase (upper bound) |
---|---|
PESQ = 2.55, STOI = 0.88, NAseg = 9.19 dB | PESQ = 2.80, STOI = 0.91, NAseg = 9.89 dB |
MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee | MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer |
---|---|
PESQ = 2.59, STOI = 0.88, PDev = 0.33, NAseg = 9.86 dB | PESQ = 2.57, STOI = 0.88, PDev = 0.41, NAseg = 10.25 dB |
MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed) | MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed) |
---|---|
PESQ = 2.65, STOI = 0.89, PDev = 0.31, NAseg = 9.67 dB | PESQ = 2.61, STOI = 0.89, PDev = 0.34, NAseg = 9.86 dB |
Bibliography
[1] S.Y. Barysenka and V.I. Vorobiov, “SNR-based inter-component phase estimation using bi-phase prior statistics for single-channel speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2365–2381, 2023.
[2] S.Y. Barysenka, V.I. Vorobiov, and P. Mowlaee, “Single-channel speech enhancement using inter-component phase relations,” Speech Communication, vol. 99, pp. 144–160, 2018.
[3] H. Bartelt, A.W. Lohmann, and B. Wirnitzer, “Phase and amplitude recovery from bispectra,” Applied Optics, vol. 23, no. 18, pp. 3121–3129, Sep. 1984.
[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, Apr. 1985.
[5] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)–a new method for speech quality assessment of telephone networks and codecs,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 749–752, Aug. 2001.
[6] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, Sep. 2011.
[7] A. Gaich and P. Mowlaee, “On speech quality estimation of phase-aware single-channel speech enhancement,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 216–220.
[8] T. Fingscheidt, S. Suhadi, and S. Stan, “Environment-optimized speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 825–834, 2008.