DOMINIQUE GENOUD 1), Martigny, Switzerland
GERARD CHOLLET 2), Paris, France
Footnotes:
1) IDIAP, CP 592, CH-1920 Martigny, Switzerland
2) CNRS URA-820, ENST, 46 rue Barrault, 75634 Paris cedex 13
VOICE TRANSFORMATIONS, SOME TOOLS FOR THE IMPOSTURE OF SPEAKER VERIFICATION SYSTEMS
ABSTRACT
The transformation of the voice of one speaker into the voice of another is an important issue for understanding which voice parameters discriminate best between two persons. The study described here was carried out in the context of a text-dependent speaker verification application using digit utterances. A Harmonic plus Noise model is used as the speech representation. The aim of this paper is to explain how to modify the voice of a source speaker (the impostor) in order to mimic the voice of a target speaker (a client of the application). Cepstral coefficients extracted from the harmonic part of the speech are used in the transformation. Then, either the original noise part of the speech or a small random noise is added to the signal. The results show a substantial increase of the false acceptance rate when transformed speakers are used as impostors against a state-of-the-art HMM (Hidden Markov Model) speaker verification system.
1. Introduction
Understanding the characteristics of imposture is a problem for large-scale industrial speaker verification systems. In speaker verification, a trade-off is necessary between the false rejection of a client and the false acceptance of an impostor. Indeed, impostor modelling is used to build the speaker models and the world models, and to set the decision thresholds. The experiments reported here were performed on a Swiss French speaker verification database (Polycode). The client passwords are sequences of connected digits. In a test trial, such a sequence is first recognised automatically and thereby segmented. Speaker verification is done at the word level, allowing better control of the pronounced utterance.
2. The imposture system
If we suppose that an impostor could record some samples of a registered customer, it would be possible for him to transform his voice in order to mimic the client's voice. In this paper, we investigate possibilities of transforming, word by word, a digit sequence uttered by a source speaker (the impostor) into a sequence resembling an utterance of the target speaker (the client). This is only an exploratory system, showing that it is possible to fool a state-of-the-art text-dependent speaker verification application.
2.1. Harmonic plus Noise modelling
The speech is modelled by a Harmonic plus Noise model (H+N), which allows for good-quality spectral modifications. This model is normally used for text-to-speech applications (cf. I. Stylianou, 1996). The H+N model decomposes the speech into a harmonic part and a noise part. From the harmonic analysis of the speech, cepstral coefficients (ci) are extracted, and from the noise part, reflection coefficients (ki) are estimated. These coefficients (ci and ki) are then used to re-synthesise the speech.
2.1.1. Analysis of the harmonic part
The voiced part of the speech signal can be modelled as a fundamental frequency and its harmonics. The fundamental frequency of human speech (often called pitch) varies with the prosody. In order to estimate the amplitudes of the harmonics of the fundamental frequency, a signal analysis is performed pitch-synchronously on short temporal windows (typically 25 ms, overlapped every 10 ms). On every window, the pitch f0 is supposed to be constant, and the harmonic signal is modelled as a sum of complex exponential functions (equation 1).
s_h(t) = \sum_{k=-L}^{L} A_k \, e^{j 2 \pi k f_0 (t - t_a)}   (1)

e = \sum_{t} \left( s(t) - s_h(t) \right)^2   (2)
In equation (1), A_k is the complex amplitude of harmonic k, L is the number of chosen harmonics, and t_a is the pitch-synchronous analysis instant. The harmonic amplitudes are determined by minimisation of the quadratic error e between the original signal s(t) and the estimated harmonic signal s_h(t) over the analysis window (equation 2). The cepstral coefficients are extracted from the harmonic analysis. Phase and amplitude envelope estimations are performed to allow modifications of the pitch when re-synthesising the signal. Real cepstrum coefficients can then be estimated (cf. L. Rabiner, 1993).
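As a sketch of this pitch-synchronous analysis, the harmonic amplitudes of equation (1) can be estimated on one frame by linear least squares, as in equation (2). The code below is a minimal illustration under assumed names and a toy single-cosine frame, not the original system's implementation:

```python
import numpy as np

def estimate_harmonics(s, fs, f0, L):
    """Estimate complex harmonic amplitudes A_k of a windowed frame s by
    minimising the quadratic error between s and a sum of complex
    exponentials at multiples of the pitch f0 (cf. equations 1 and 2)."""
    t = np.arange(len(s)) / fs                  # time axis of the frame [s]
    k = np.arange(-L, L + 1)                    # harmonic indices -L..L
    # Basis matrix E[t, k] = exp(j 2 pi k f0 t); the model is s_h = E @ A
    E = np.exp(2j * np.pi * np.outer(t, k) * f0)
    # Least-squares solution of min ||s - E A||^2
    A, *_ = np.linalg.lstsq(E, s.astype(complex), rcond=None)
    return k, A

# Toy check: a pure 100 Hz cosine should place its energy at k = +/-1
fs, f0 = 8000, 100.0
t = np.arange(int(0.025 * fs)) / fs             # one 25 ms frame
frame = np.cos(2 * np.pi * f0 * t)
k, A = estimate_harmonics(frame, fs, f0, L=3)
```

Since cos(2*pi*f0*t) decomposes exactly into the k = +1 and k = -1 exponentials, the recovered amplitudes at those indices are 0.5 each.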
2.1.2. Synthesis of the harmonic part
The signal can be re-synthesised (equation 3) by recomposing the harmonics at each synthesis time instant t_s using a sum of cosine functions. The amplitudes a_k are directly extracted from the cepstral coefficients, and the phases \phi_k are obtained by re-sampling the spectral phase envelope at the synthesis instant t_s.

\hat{s}(t) = \sum_{k=1}^{L} a_k \cos\left( 2 \pi k f_0 (t - t_s) + \phi_k \right)   (3)
This synthesis method allows for pitch modifications between the analysis instants and synthesis instants.
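The sum of cosines in equation (3) can be sketched directly in code. The function below is a minimal illustration; the amplitudes and phases are toy placeholders rather than values extracted from cepstral coefficients or a re-sampled phase envelope:

```python
import numpy as np

def synthesise_harmonics(a, phi, f0, fs, n_samples):
    """Re-synthesise the voiced part as a sum of cosines (equation 3):
    s_hat(t) = sum_k a_k * cos(2 pi k f0 t + phi_k)."""
    t = np.arange(n_samples) / fs
    s = np.zeros(n_samples)
    for k, (ak, pk) in enumerate(zip(a, phi), start=1):
        s += ak * np.cos(2 * np.pi * k * f0 * t + pk)
    return s

# Pitch modification: re-use the same amplitudes/phases with a new f0
a, phi = [1.0, 0.3], [0.0, 0.0]
orig = synthesise_harmonics(a, phi, f0=100.0, fs=8000, n_samples=200)
shifted = synthesise_harmonics(a, phi, f0=120.0, fs=8000, n_samples=200)
```

Keeping the amplitude and phase parameters fixed while changing f0 is what allows the pitch to differ between analysis and synthesis instants.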
2.1.3. Analysis of the noise part
All unvoiced parts of the speech can be viewed as a noise source passed through filters. In this approach, the spectral density function of the noise is estimated by a 16th-order all-pole filter using the autocorrelation method (cf. L. Rabiner, 1993). The reflection coefficients ki are then estimated on a 40 ms window around the analysis instant. Since an estimation of the maximum voicing frequency is performed (cf. I. Stylianou, 1996), the noise part can also be extracted from the voiced parts of the speech.
2.1.4. Synthesis of the noise part
The noise part is re-synthesised using a Gaussian noise source and a normalised lattice filter using the ki coefficients (cf. J. Markel, 1976) extracted at analysis time.
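The autocorrelation method for the noise part can be sketched with a textbook Levinson-Durbin recursion, which yields the reflection coefficients k_i as a by-product (cf. Rabiner, 1993). This is a generic implementation under assumed names, not the authors' code, and the AR(1) signal is only a toy check:

```python
import numpy as np

def reflection_coefficients(x, order=16):
    """Fit an all-pole model to x by the autocorrelation method and
    return the reflection coefficients k_i (Levinson-Durbin recursion)."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]  # lags 0..order
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    ks = []
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                      # i-th reflection coefficient
        ks.append(k)
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]  # update predictor coefficients
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated prediction error
    return np.array(ks), a

# Toy check: an AR(1) signal x[n] = 0.9 x[n-1] + w[n] has k_1 close to -0.9
rng = np.random.default_rng(0)
w = rng.standard_normal(5000)
x = np.zeros_like(w)
for n in range(1, len(x)):
    x[n] = 0.9 * x[n - 1] + w[n]
ks, a = reflection_coefficients(x, order=4)
```

For synthesis, Gaussian noise would be passed through the lattice filter built from these k_i, as described above.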
2.2. Speaker transformations
Given the coefficients for the harmonic part and the noise part of a source and a target speaker, the idea is to map these coefficients from the source to the target and use them to re-synthesise the utterance of the transformed source. The cepstral coefficients are assumed to be independent (cf. Rabiner, 1993), and we make the assumption that, over short speech events, the distribution of each coefficient follows a Gaussian law. If the duration of an event is a word, the Gaussian distribution N(\mu_{i,source}, \sigma_{i,source}) of each source coefficient c_{i,source} can be mapped to the distribution N(\mu_{i,target}, \sigma_{i,target}) of the target for the duration of the word using equation 4.
c_{i,transformed} = \frac{\sigma_{i,target}}{\sigma_{i,source}} \left( c_{i,source} - \mu_{i,source} \right) + \mu_{i,target}   (4)
The duration of a speech event can also be a part of a word, determined by the state occupation of an HMM. One speaker-independent HMM per word is used to align the vectors of the source and the target state by state. The distribution of each vector component in each state is then assumed Gaussian, and the transformation of the source is performed state by state using the same equation (4).
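The Gaussian mapping of equation (4) amounts to standardising each coefficient against the source distribution and rescaling it with the target statistics. A minimal sketch, with assumed function names and toy data in place of real per-word cepstral vectors:

```python
import numpy as np

def map_coefficients(c_src, mu_src, sd_src, mu_tgt, sd_tgt):
    """Map source cepstral coefficients to the target distribution
    (equation 4): standardise against the source Gaussian, then rescale
    with the target mean and standard deviation."""
    return (c_src - mu_src) / sd_src * sd_tgt + mu_tgt

# Toy data: 80 frames x 12 cepstral coefficients for one word
rng = np.random.default_rng(1)
src = rng.normal(loc=2.0, scale=0.5, size=(80, 12))
mapped = map_coefficients(src, src.mean(axis=0), src.std(axis=0),
                          mu_tgt=np.full(12, -1.0), sd_tgt=np.full(12, 0.2))
```

By construction, the mapped coefficients have exactly the target mean and standard deviation over the word; in the HMM variant the same mapping is applied per state instead of per word.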
2.3. Speaker re-synthesis
The transformed coefficients of the harmonic part are then injected into the synthesis part of the H+N model (see paragraph 2.1.2). As tentative trials of noise transformations gave imposture results worse than unmodified imposture tests, two approaches were followed. The first keeps the original noise source, re-synthesised with the transformed harmonic part; this noise source is extracted by subtracting the harmonic part from the source signal. The second approach adds only a random background noise to the transformed harmonic part. This random background noise is built from randomly selected samples of a non-speech part of an utterance.
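The second approach, building a background noise from randomly drawn non-speech samples, can be sketched as a simple resampling; the function name and the toy silence samples below are illustrative assumptions:

```python
import numpy as np

def background_noise(non_speech, n_samples, rng):
    """Build a low-level background noise by drawing samples at random,
    with replacement, from a non-speech portion of an utterance."""
    return rng.choice(non_speech, size=n_samples, replace=True)

# Toy non-speech samples standing in for a labelled silence segment
silence = np.array([0.01, -0.02, 0.005, -0.01, 0.0])
noise = background_noise(silence, n_samples=1000,
                         rng=np.random.default_rng(0))
```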
3. Experiments
3.1. Database used
The results are obtained on a database (cf. Polycode, 1995) composed of 28 speakers recorded over a telephone line in several sessions. During each session, each speaker had to say, among other sentences in French, his own 7-digit PIN code four times and a 10-digit sequence once (all the digits from 0 to 9, in a different order for each sequence). All these sequences are time-labelled digit by digit using a speech recogniser. Several sub-sets are extracted from this Polycode database (see figure 1).
Figure 1: The sets composing the database
3.2. The reference system
The automatic speaker verification (ASV) system used here as reference is a state-of-the-art HMM (Hidden Markov Model) system working in text-dependent mode. Two HMM models are created. One is speaker-independent (the world model), trained with 300 speakers on a database different from Polycode. This world model is used as a normalisation model (cf. Rosenberg, 1991). A speaker model derived from the world model is then re-estimated with the Training set. The scoring is done by computing the log likelihood ratio (LLR) from the log likelihood Lk_{speaker} of the speaker model and the log likelihood Lk_{world} of the world model along an utterance (equation 5).
\mathrm{LLR} = Lk_{speaker} - Lk_{world}   (5)
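In log-domain scoring, the likelihood ratio of equation (5) becomes a difference of log likelihoods accumulated along the utterance. A minimal sketch, assuming per-frame log likelihoods and averaging over frames (the paper does not specify whether the scores are summed or averaged):

```python
import numpy as np

def llr_score(ll_speaker_frames, ll_world_frames):
    """Utterance score (equation 5): mean over frames of the difference
    between speaker-model and world-model log likelihoods."""
    return float(np.mean(np.asarray(ll_speaker_frames) -
                         np.asarray(ll_world_frames)))

# Toy per-frame log likelihoods for one utterance
score = llr_score([-4.0, -3.5, -5.0], [-6.0, -6.5, -5.0])
```

A positive score means the speaker model explains the utterance better than the world model; the decision then compares this score to the speaker-dependent threshold.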
3.3. Test Protocol
The reference system is trained with the Training set, and a speaker-dependent a priori threshold (set at the Equal Error Rate) is computed using the Evaluation set.
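Setting the threshold at the Equal Error Rate means choosing the operating point where false rejection and false acceptance rates coincide. A brute-force sketch with assumed names and toy, well-separated scores (a real implementation would interpolate between candidate thresholds):

```python
import numpy as np

def eer_threshold(client_scores, impostor_scores):
    """Scan candidate thresholds and return the one where the false
    rejection rate on client scores and the false acceptance rate on
    impostor scores are closest (Equal Error Rate point)."""
    candidates = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_t, best_gap = candidates[0], float("inf")
    for t in candidates:
        fr = np.mean(client_scores < t)       # false rejection rate
        fa = np.mean(impostor_scores >= t)    # false acceptance rate
        if abs(fr - fa) < best_gap:
            best_gap, best_t = abs(fr - fa), t
    return best_t

# Toy evaluation scores for one speaker
clients = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
impostors = np.array([-0.5, 0.2, -0.1, 0.4, -0.3])
thr = eer_threshold(clients, impostors)
```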
The reference system is then used with the Test set in two different ways: 1) The impostor data of the Test set are given to the reference system, and a decision is taken by comparing the scores of the client and impostor utterances to the speaker-dependent a priori threshold. The false acceptance (FA) rate is computed as the percentage of impostor utterances wrongly accepted as client ones. 2) For each speaker, the impostor data of the Test set (source) are transformed, using the mimic systems (Gaussian and HMM), towards the utterances of the Target set. These transformed utterances are then given to the reference system and a new false acceptance rate is computed.
4. Results
Table 1 shows the change of the false acceptance rate when the Gaussian and HMM transformation systems are used with an a priori threshold. The first column gives the result when the noise part of the source speaker is kept; the transformation is not efficient in this case. The second column of table 1 shows a large increase of the false acceptance rate when a small random noise is used instead. Figure 2 shows the ROC curves confirming the results given in table 1.
Reference system: 4.19% ±0.7

                Noise source    Noise random
  Gauss         3.59% ±0.6      14.45% ±1.3
  HMM           4.65% ±0.8      23.09% ±1.5
Table 1: False acceptance rate with an a priori threshold fixed at EER using Gaussian or HMM transformations.
Figure 2: ROC curves for the different impostor transformations.
5. Conclusion
The results show that the harmonic part of the speech signal contains speaker-dependent information that can be transformed to mimic another speaker. It is possible to become partly robust to these kinds of transformations by suppressing the harmonic part of the signal (cf. D. Genoud, 1998). The modelling of the noise is critical: it seems that ASV systems based on HMMs are very sensitive to noise modifications. The best speaker transformations are obtained by suppressing all the speech noise and adding a small random noise. Further investigations will use another approach than the H+N model for the analysis/synthesis part.
6. References
GENOUD D., CHOLLET G., Speech pre-processing against intentional imposture in speaker recognition, to appear in Proceedings of ICSLP-98, Sydney, 1998.
MARKEL J.D., GRAY A.H., Linear Prediction of Speech, Springer Verlag, Berlin, 1976.
RABINER Lawrence, JUANG Biing-Hwang, Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, NJ, 1993.
ROSENBERG A.E., LEE C.H., GOKCEN S., Connected Word Talker Verification Using Whole Word Hidden Markov Models, pp. 381-384, in Proceedings of ICASSP-91, 1991.
STYLIANOU Ioannis, Modèles harmoniques plus bruit combinés avec des méthodes statistiques, pour la modification de la parole et du locuteur, PhD thesis, ENST, Paris, 1996.