Speech spectrum conversion based on speaker interpolation and multi-functional representation with weighting by radial basis function networks

doi:10.1016/0167-6393(94)00051-B

Speech Communication

Volume 16, Issue 2, February 1995, Pages 139-151

https://doi.org/10.1016/0167-6393(94)00051-B

Abstract

This paper describes a speech spectrum transformation method that interpolates the spectral patterns of multiple speakers and uses a multi-functional representation weighted by Radial Basis Function (RBF) networks. The interpolation is carried out on spectral parameters across the pre-stored utterance data of multiple speakers to generate new spectrum patterns. Adaptation to a target speaker can be performed by this interpolation, which needs only a small amount of training data to generate new speech spectrum sequences close to those of the target speaker. Moreover, to obtain more precise adaptation when a larger amount of training data is available, the transformation is represented by multiple interpolating functions. The outputs of these functions are combined in a weighted sum, with the weights given by RBF networks. The parameters of this multi-functional transformation are adapted by the gradient descent method. Adaptation experiments were carried out using the data of four pre-stored speakers. Using only one word spoken by the target speaker for training, the distance between the target speaker's spectrum and the spectrum generated by the single interpolating function was reduced by about 35% compared with the distance between the target speaker's spectrum and the spectrum of the pre-stored speaker closest to the target. Using ten training words, the multi-functional transformation increased the reduction rate to 48%.
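The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the interpolation weights are fitted by least squares and constrained to sum to one, the distance is a mean Euclidean distance over time-aligned frames, the RBF weights are Gaussian and normalised per frame, and the RBF input is the frame of the first pre-stored speaker. All function names and the synthetic data are hypothetical. The gradient descent adaptation of the multi-functional parameters is not shown.

```python
import numpy as np


def interpolation_weights(stored, target):
    """Fit weights w (assumed to sum to 1) so that sum_k w[k] * stored[k] ~ target.

    stored: (K, T, D) time-aligned spectral parameters of K pre-stored speakers.
    target: (T, D) spectral parameters of the target speaker.
    """
    K = stored.shape[0]
    A = stored.reshape(K, -1).T            # (T*D, K): one column per speaker
    b = target.reshape(-1)
    ref = A[:, -1]                         # eliminate w[K-1] = 1 - sum(other weights)
    w_free, *_ = np.linalg.lstsq(A[:, :-1] - ref[:, None], b - ref, rcond=None)
    return np.append(w_free, 1.0 - w_free.sum())


def rbf_gates(x, centers, width):
    """Normalised Gaussian RBF activations (assumed form). x: (T, D), centers: (M, D)."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)   # (T, M)
    g = np.exp(-d2 / (2.0 * width ** 2))
    return g / g.sum(axis=1, keepdims=True)


def multi_functional_transform(stored, W, centers, width):
    """Weighted sum of M interpolating functions; W: (M, K) weights per function."""
    outputs = np.einsum("mk,ktd->mtd", W, stored)       # output of each function
    gates = rbf_gates(stored[0], centers, width)        # assumed RBF input
    return np.einsum("tm,mtd->td", gates, outputs)


def mean_distance(a, b):
    """Mean Euclidean distance between frame sequences (assumed distance measure)."""
    return np.mean(np.linalg.norm(a - b, axis=-1))


def reduction_rate(stored, target, converted):
    """Relative reduction versus the pre-stored speaker closest to the target."""
    closest = min(mean_distance(s, target) for s in stored)
    return 1.0 - mean_distance(converted, target) / closest


# Synthetic demonstration data (not from the paper).
rng = np.random.default_rng(0)
K, T, D = 4, 50, 8
stored = rng.normal(size=(K, T, D))
true_w = np.array([0.1, 0.4, 0.3, 0.2])
target = np.einsum("k,ktd->td", true_w, stored) + 0.05 * rng.normal(size=(T, D))

w = interpolation_weights(stored, target)
converted = np.einsum("k,ktd->td", w, stored)
rate = reduction_rate(stored, target, converted)

# With identical weights in every function, the RBF-weighted sum
# reduces to the single interpolating function.
centers = rng.normal(size=(3, D))
multi = multi_functional_transform(stored, np.tile(w, (3, 1)), centers, width=1.0)
print("weights:", np.round(w, 3), "reduction rate:", round(float(rate), 3))
```

In this sketch a reduction rate of 0.35 would correspond to the 35% figure quoted above: the converted spectrum lies 35% closer to the target than the closest pre-stored speaker does.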

Zusammenfassung (English translation)

This article describes a transformation method for speech spectra in which the spectral patterns of several speakers and the multi-functional representations are interpolated with Radial Basis Function networks. The interpolation is carried out using spectral parameters between stored utterance data from multiple speakers in order to generate new spectral patterns. Adaptation to a target speaker can be achieved through this interpolation, which needs only a small amount of training data to generate new speech spectrum sequences that closely resemble those of the target speaker. In addition, to achieve more precise adaptation by using larger amounts of training data, the transformation is represented by multiple interpolation functions. The outputs of these functions are combined in a weighted sum, using the weights supplied by the RBF network. The parameters of this multi-functional transformation are adapted by the gradient method. Adaptation experiments were carried out with the stored data of 4 speakers. When only one word from the target speaker was used for training, the distance between the target speaker's spectrum and the spectrum generated by a single interpolation was reduced by 35%, compared with the distance between the target speaker's spectrum and the stored spectrum most similar to the target speaker. With 10 training words and a multi-functional transformation, the reduction rate rose to 48%.

Résumé (English translation)

In this contribution, we describe a speaker transformation method based on interpolating reference patterns using Radial Basis Function (RBF) networks. The main advantage of this interpolation method is that it remains usable even when the amount of data available for learning the transformation is limited. The quality of the transformations can be improved incrementally by using more training data and by representing the transformation function with multiple interpolation functions. The interpolation functions of different orders are weighted and summed, using the weighting coefficients given by the RBF network. A number of experiments were conducted on a prototype transformation task (4 speakers). To demonstrate the performance of the method even when the training set is very small, we ran a first experiment in which the training data were reduced to a single word. In that case we use a single interpolation function. With this interpolation function we obtain a 35% reduction relative to the distance between the target speaker's spectrum and the spectrum of the reference speaker closest to the target speaker. In a second experiment, the training set consists of 10 words; the transformation by interpolation of multiple functions achieves a 48% reduction.


Cited by (49)

An overview of voice conversion systems
2017, Speech Communication
Citation Excerpt :
Another scheme for voice conversion is to utilize the conversions built on multiple pre-stored speakers (different from the target speaker) to create the mapping function. A first attempt called speaker interpolation generates the target features using a weighted linear addition (interpolation) of multiple conversions towards multiple other pre-defined target speakers, by minimizing the difference between the target features and the converted features (Iwahashi and Sagisaka, 1994; 1995). The interpolation coefficients are estimated using only one word from the target speaker.
Voice transformation (VT) aims to change one or more aspects of a speech signal while preserving linguistic information. A subset of VT, Voice conversion (VC) specifically aims to change a source speaker’s speech in such a way that the generated output is perceived as a sentence uttered by a target speaker. Despite many years of research, VC systems still exhibit deficiencies in accurately mimicking a target speaker spectrally and prosodically, and simultaneously maintaining high speech quality. In this work we provide an overview of real-world applications, extensively study existing systems proposed in the literature, and discuss remaining challenges.
A unit selection approach for voice transformation
2014, Speech Communication
A voice transformation (VT) method that can make the utterance of a source speaker mimic that of a target speaker is described. Speaker individuality transformation is achieved by altering four feature parameters: the linear prediction coefficients cepstrum (LPCC), ΔLPCC, LP-residual and pitch period. The main objective of this study involves construction of an optimal sequence of features selected from a target speaker’s database, to maximize both the correlation probabilities between the transformed and the source features and the likelihood of the transformed features with respect to the target model. A set of two-pass conversion rules is proposed, where the feature parameters are first selected from a database and the optimal sequence of the feature parameters is then constructed in the second pass. The conversion rules were developed using a statistical approach that employed a maximum likelihood criterion. In constructing an optimal sequence of the features, a hidden Markov model (HMM) with global control variables (GCV) was employed to find the most likely combination of the features with respect to the target speaker’s model.
The effectiveness of the proposed transformation method was evaluated using objective tests and formal listening tests. We confirmed that the proposed method leads to perceptually more preferred results, compared with the conventional methods.
Comparing ANN and GMM in a voice conversion framework
2012, Applied Soft Computing Journal
Citation Excerpt :
These models are specific to the kind of features used for mapping. For instance, Gaussian mixture models (GMMs) [10,2,17,14], vector quantization (VQ) [8], fuzzy vector quantization (FVQ) [18], linear multivariate regression (LMR) [19], dynamic frequency warping (DFW) [17], radial basis function networks (RBFNs) [11], and artificial neural networks (ANNs) [20,12,13,16] are widely used for mapping the vocal tract characteristics. In the GMM-based approach, the joint distribution of features extracted from the speech signals of the source and target speakers is modeled as a mixture of Gaussians.
In this paper, we present a comparative analysis of artificial neural networks (ANNs) and Gaussian mixture models (GMMs) for the design of a voice conversion system using line spectral frequencies (LSFs) as feature vectors. Both the ANN and GMM based models are explored to capture nonlinear mapping functions for modifying the vocal tract characteristics of a source speaker according to a desired target speaker. The LSFs are used to represent the vocal tract transfer function of a particular speaker. Mapping of the intonation patterns (pitch contour) is carried out using a codebook based model at segmental level. The energy profile of the signal is modified using a fixed scaling factor defined between the source and target speakers at the segmental level. Two methods for residual modification, residual copying and residual selection, are used to generate the target residual signal. The performance of the ANN and GMM based voice conversion (VC) systems is evaluated using subjective and objective measures. The results indicate that the proposed ANN-based model using the LSF feature set may be used as an alternative to state-of-the-art GMM-based models used to design a voice conversion system.
Statistical parametric speech synthesis
2009, Speech Communication
This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future.
Audio deepfakes: A survey
2023, Frontiers in Big Data
How deep are the fakes? Focusing on audio deepfake: A survey
2021, arXiv
