SNR-Based Inter-Component Phase Estimation Using Bi-Phase Prior Statistics for Single-Channel Speech Enhancement

Siarhei Y. Barysenka and Vasili I. Vorobiov

Rationale

This page contains listening examples of noisy and enhanced speech, processed in a fully blind setup using the algorithms proposed or evaluated in [1].

The supplementary MUSHRA test repository contains the GUI, the full set of audio files used in the experiment, the data collected from 14 listeners, and the MATLAB script for the statistical analysis of the MUSHRA data, which was used in [1] to generate Figure 9 and Tables III and IV.

Setup, Algorithms and Evaluation Metrics

Below we present examples of speech enhancement under the following noise conditions:

- Female speech in modulated pink noise (SNR = 5 dB)
- Male speech in babble noise (SNR = 3 dB)
- Male speech in factory noise (SNR = 10 dB)

For each speech-in-noise record, we consider two enhancement scenarios:

- Phase-only enhancement;
- Combined magnitude & phase enhancement.
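
As a rough illustration of the two scenarios, each enhanced signal can be viewed as a pairing of a magnitude estimate with a phase estimate in the spectral domain. The sketch below is not code from [1]; the function name `combine` and all toy values are hypothetical and serve only to show which magnitude is paired with which phase.

```python
import cmath

def combine(magnitudes, phases):
    """Build complex spectral coefficients from per-bin magnitudes and phases (radians)."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitudes, phases)]

# Toy single-frame values (hypothetical, for illustration only).
noisy_mag = [1.0, 0.8, 0.5]
enhanced_mag = [0.9, 0.6, 0.2]      # stands in for a magnitude estimator output
noisy_phase = [0.1, -2.0, 1.5]
enhanced_phase = [0.0, -1.8, 1.2]   # stands in for a phase estimator output

phase_only = combine(noisy_mag, enhanced_phase)    # phase-only enhancement
combined = combine(enhanced_mag, enhanced_phase)   # combined magnitude & phase enhancement
lower_bound = combine(enhanced_mag, noisy_phase)   # enhanced magnitude + unprocessed phase
```

In the phase-only case the noisy magnitude is left untouched, so any change in the metrics is attributable to the phase estimate alone.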

For phase enhancement, we first apply a bi-phase smoothing scheme to estimate the bi-phase at the harmonics, and then a Fourier phase recovery scheme to compute the enhanced harmonic phases from the estimated bi-phase.
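
To make the relation between harmonic phases and bi-phase concrete, the following minimal sketch assumes the common bispectral definition beta(h1, h2) = phi(h1) + phi(h2) - phi(h1 + h2), restricted to h1 = 1. The exact definitions and recovery schemes in [1]-[3] may differ; the function names are hypothetical, and the phase of the first harmonic is assumed to be known, since this bi-phase alone does not determine it.

```python
import cmath

def wrap(angle):
    """Wrap an angle in radians to the principal interval [-pi, pi]."""
    return cmath.phase(cmath.exp(1j * angle))

def biphase_h1(phases):
    """Bi-phase beta(1, h) = phi(1) + phi(h) - phi(h + 1) for h = 1..H-1.

    phases[0] holds the phase of harmonic 1, phases[1] of harmonic 2, and so on.
    """
    phi1 = phases[0]
    return [wrap(phi1 + phases[h - 1] - phases[h]) for h in range(1, len(phases))]

def recover_phases(phi1, biphase):
    """Recursive recovery: phi(h + 1) = phi(1) + phi(h) - beta(1, h)."""
    recovered = [wrap(phi1)]
    for beta in biphase:
        recovered.append(wrap(phi1 + recovered[-1] - beta))
    return recovered

# Round trip on toy harmonic phases (hypothetical values).
phases = [0.3, -1.2, 2.5, 0.9]
beta = biphase_h1(phases)
recovered = recover_phases(phases[0], beta)
```

With an unmodified bi-phase the recursion reproduces the original phases up to multiples of 2*pi; in an enhancement setting the bi-phase would be replaced by its smoothed estimate before recovery.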

Bi-phase smoothing schemes:
- Smooth Everywhere [2] (benchmark): bi-phase trajectories are smoothed along their whole duration;
- Binary Hypothesis [1] (proposed): bi-phase trajectories are smoothed only in regions determined from bi-phase statistics using a binary SNR-dependent detector.
Fourier phase recovery schemes:
- Barysenka-Vorobiov-Mowlaee [2]: leverages only a limited set of three-component bi-phase vectors, namely those with H₁ = 1;
- Bartelt-Lohmann-Wirnitzer [3]: leverages all bi-phase vectors, both three-component and two-component.
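
The two smoothing schemes listed above differ only in where smoothing is applied. The sketch below illustrates that difference with a circular moving average over a bi-phase trajectory and an optional per-frame mask. It is an assumption-laden illustration, not the method of [1] or [2]: the smoother, the window length `half_width`, and the `mask` argument (standing in for the output of the SNR-dependent detector) are all hypothetical.

```python
import cmath

def circular_mean(angles):
    """Mean direction of a list of angles in radians (robust to wrap-around)."""
    return cmath.phase(sum(cmath.exp(1j * a) for a in angles))

def smooth_trajectory(biphase, half_width=2, mask=None):
    """Smooth a bi-phase trajectory over time with a circular moving average.

    mask[t] is True where frame t should be smoothed. With mask=None every
    frame is smoothed, which mimics a "smooth everywhere" strategy. Frames
    with mask[t] False are passed through unchanged, although they can still
    fall inside the averaging window of a neighbouring smoothed frame.
    """
    n = len(biphase)
    if mask is None:
        mask = [True] * n
    smoothed = []
    for t in range(n):
        if mask[t]:
            lo, hi = max(0, t - half_width), min(n, t + half_width + 1)
            smoothed.append(circular_mean(biphase[lo:hi]))
        else:
            smoothed.append(biphase[t])
    return smoothed
```

A circular mean is used rather than an arithmetic mean because bi-phase values are angles: averaging 3.1 rad and -3.1 rad should give a value near pi, not 0.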

For magnitude enhancement, we consider the conventional MMSE-LSA algorithm [4].

We report speech enhancement performance using the following objective evaluation metrics:

- Perceptual evaluation of speech quality (PESQ) [5];
- Short-time objective intelligibility measure (STOI) [6];
- Phase deviation (PDev) [7];
- Segmented noise attenuation (NA_seg) [8].

Female Speech in Modulated Pink Noise (SNR = 5 dB)

Clean	Noisy
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NA_seg = ∞ dB	PESQ = 1.55, STOI = 0.77, PDev = 0.64, NA_seg = 0.00 dB

Phase-Only Enhancement

Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 1.80, STOI = 0.78, PDev = 0.53, NA_seg = 1.58 dB	PESQ = 1.80, STOI = 0.78, PDev = 0.59, NA_seg = 1.66 dB

Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 1.88, STOI = 0.81, PDev = 0.51, NA_seg = 1.55 dB	PESQ = 1.82, STOI = 0.81, PDev = 0.53, NA_seg = 1.58 dB

Combined Magnitude & Phase Enhancement

MMSE-LSA + Unprocessed Phase (lower bound)	MMSE-LSA + Clean Phase (upper bound)
PESQ = 2.38, STOI = 0.81, NA_seg = 10.03 dB	PESQ = 2.62, STOI = 0.85, NA_seg = 11.00 dB

MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 2.61, STOI = 0.81, PDev = 0.41, NA_seg = 10.98 dB	PESQ = 2.60, STOI = 0.81, PDev = 0.49, NA_seg = 11.11 dB

MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 2.66, STOI = 0.82, PDev = 0.40, NA_seg = 10.97 dB	PESQ = 2.65, STOI = 0.82, PDev = 0.43, NA_seg = 11.07 dB

Male Speech in Babble Noise (SNR = 3 dB)

Clean	Noisy
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NA_seg = ∞ dB	PESQ = 1.66, STOI = 0.81, PDev = 0.62, NA_seg = 0.00 dB

Phase-Only Enhancement

Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 1.81, STOI = 0.77, PDev = 0.67, NA_seg = 2.80 dB	PESQ = 1.74, STOI = 0.75, PDev = 0.79, NA_seg = 3.11 dB

Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 1.88, STOI = 0.81, PDev = 0.57, NA_seg = 2.57 dB	PESQ = 1.75, STOI = 0.80, PDev = 0.58, NA_seg = 2.73 dB

Combined Magnitude & Phase Enhancement

MMSE-LSA + Unprocessed Phase (lower bound)	MMSE-LSA + Clean Phase (upper bound)
PESQ = 1.90, STOI = 0.82, NA_seg = 9.15 dB	PESQ = 2.29, STOI = 0.87, NA_seg = 10.65 dB

MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 2.01, STOI = 0.80, PDev = 0.65, NA_seg = 11.08 dB	PESQ = 1.92, STOI = 0.78, PDev = 0.64, NA_seg = 12.01 dB

MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 1.97, STOI = 0.82, PDev = 0.55, NA_seg = 10.78 dB	PESQ = 1.93, STOI = 0.80, PDev = 0.57, NA_seg = 11.14 dB

Male Speech in Factory Noise (SNR = 10 dB)

Clean	Noisy
PESQ = 4.50, STOI = 1.00, PDev = 0.00, NA_seg = ∞ dB	PESQ = 2.17, STOI = 0.88, PDev = 0.36, NA_seg = 0.00 dB

Phase-Only Enhancement

Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 2.28, STOI = 0.88, PDev = 0.35, NA_seg = 1.49 dB	PESQ = 2.26, STOI = 0.87, PDev = 0.41, NA_seg = 1.71 dB

Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 2.28, STOI = 0.86, PDev = 0.32, NA_seg = 1.31 dB	PESQ = 2.23, STOI = 0.86, PDev = 0.35, NA_seg = 1.43 dB

Combined Magnitude & Phase Enhancement

MMSE-LSA + Unprocessed Phase (lower bound)	MMSE-LSA + Clean Phase (upper bound)
PESQ = 2.55, STOI = 0.88, NA_seg = 9.19 dB	PESQ = 2.80, STOI = 0.91, NA_seg = 9.89 dB

MMSE-LSA + Smooth Everywhere + Barysenka-Vorobiov-Mowlaee	MMSE-LSA + Smooth Everywhere + Bartelt-Lohmann-Wirnitzer
PESQ = 2.59, STOI = 0.88, PDev = 0.33, NA_seg = 9.86 dB	PESQ = 2.57, STOI = 0.88, PDev = 0.41, NA_seg = 10.25 dB

MMSE-LSA + Binary Hypothesis + Barysenka-Vorobiov-Mowlaee (proposed)	MMSE-LSA + Binary Hypothesis + Bartelt-Lohmann-Wirnitzer (proposed)
PESQ = 2.65, STOI = 0.89, PDev = 0.31, NA_seg = 9.67 dB	PESQ = 2.61, STOI = 0.89, PDev = 0.34, NA_seg = 9.86 dB

Bibliography

[1] S.Y. Barysenka and V.I. Vorobiov, “SNR-based inter-component phase estimation using bi-phase prior statistics for single-channel speech enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2365–2381, 2023.

[2] S.Y. Barysenka, V.I. Vorobiov, and P. Mowlaee, “Single-channel speech enhancement using inter-component phase relations,” Speech Communication, vol. 99, pp. 144–160, 2018.

[3] H. Bartelt, A.W. Lohmann, and B. Wirnitzer, “Phase and amplitude recovery from bispectra,” Applied Optics, vol. 23, no. 18, pp. 3121–3129, Sep. 1984.

[4] Y. Ephraim and D. Malah, “Speech enhancement using a minimum mean-square error log-spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443–445, Apr. 1985.

[5] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ)–a new method for speech quality assessment of telephone networks and codecs,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 749–752, Aug. 2001.

[6] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, Sep. 2011.

[7] A. Gaich and P. Mowlaee, “On speech quality estimation of phase-aware single-channel speech enhancement,” Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 216–220, Apr. 2015.

[8] T. Fingscheidt, S. Suhadi, and S. Stan, “Environment-optimized speech enhancement,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 4, pp. 825–834, 2008.