Deliberate imposture: a challenge for Automatic Speaker Verification Systems.

 

Authors: Dominique Genoud IDIAP, CP592, CH-1920 Martigny, Switzerland

Gérard Chollet CNRS URA-820, ENST, 46 rue Barrault, 75634 PARIS cedex 13, France

 

An Automatic Speaker Verification (ASV) system must strike a compromise between two competing goals: accepting valid clients and rejecting impostors. Modelling impostors is crucial for ASV systems, as impostor models are used to set the rejection threshold and to build cohort models. It is quite likely that many people in the world have voice characteristics quite similar to those of a given client. Furthermore, the imposture could be deliberate. The impostor could have access to several kinds of information about the client:

 

1- He may have heard the client saying his password.

2- He may have recorded that password.

3- He may have recorded the voice of the client saying something else.

 

Case 2 is a difficult one for speaker verification systems and is not our concern here (text-prompted approaches have been proposed for it). The proposed paper deals with cases 1 and 3: the impostor knows the password and has recordings of the client saying something else. This information can be used in different ways: a) he could train himself to imitate the client, b) he could synthesise the client's password by segmenting and splicing the recordings he obtained from that client, or c) he could train a voice transformation vocoder to adapt his voice to that of the client.

 

To analyse computer-assisted imposture, some experiments on strategies b) and c) are reported here. First, it is supposed that many recorded sentences of the client are available and that his password has been heard. Automatic speech recognition of the client sentences is performed at the phoneme level. Then, a dichotomic search for the longest phonetic segments that make up the password is carried out. These phonetic segments are then concatenated to reconstitute the password. Table 1 gives the results obtained with this approach on the Polyvar database, in which 18 speakers utter 17 different passwords. The reference system is a state-of-the-art HMM-based speaker verification application from the European Picasso project. Around 5 reconstituted items for each password and each speaker are presented to the reference system. The operating point (a priori threshold) is determined on an independent test set.

 

Table 1

FA% normal impostors    FA% concatenated sequences
3.81                    33.86
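The segment search behind the concatenated passwords can be illustrated with a minimal sketch. It assumes the recognised recordings are available as plain lists of phoneme labels, and it reads "dichotomic search" as: look for the whole password as one contiguous run, and if it is not found, split it in two and search each half in turn. The names find_run and dichotomic_cover, the toy phoneme labels and this reading of the search are illustrative assumptions, not the authors' implementation. Halving is a heuristic and does not guarantee the longest possible segments.

```python
from typing import List, Optional, Tuple

# (utterance index, start position, length); hypothetical representation
Segment = Tuple[int, int, int]


def find_run(target: List[str], corpus: List[List[str]]) -> Optional[Segment]:
    """Return the first place where `target` occurs as a contiguous run, or None."""
    n = len(target)
    for u, phones in enumerate(corpus):
        for start in range(len(phones) - n + 1):
            if phones[start:start + n] == target:
                return (u, start, n)
    return None


def dichotomic_cover(target: List[str], corpus: List[List[str]]) -> List[Segment]:
    """Cover `target` with recorded segments, halving it whenever no match exists."""
    if not target:
        return []
    hit = find_run(target, corpus)
    if hit is not None:
        return [hit]
    if len(target) == 1:
        raise LookupError(f"phoneme {target[0]!r} not found in the recordings")
    mid = len(target) // 2
    return dichotomic_cover(target[:mid], corpus) + dichotomic_cover(target[mid:], corpus)


# Toy data: three recognised recordings and a password (made-up labels).
corpus = [
    ["b", "o", "n", "j", "u", "r"],
    ["s", "e", "z", "a", "m"],
    ["u", "v", "r", "i", "r"],
]
password = ["s", "e", "z", "a", "m", "u", "v", "r"]

segments = dichotomic_cover(password, corpus)
# Splicing the selected segments in order gives back the password phonemes.
rebuilt = [p for u, s, n in segments for p in corpus[u][s:s + n]]
```

In a real attack each segment would also carry the time boundaries produced by the recogniser, so that the corresponding waveform pieces could be cut out and spliced.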

 

As a second approach, a spectral transformation of the impostor's voice that mimics the client's voice characteristics is proposed. This constitutes a first step towards a voice transformation vocoder. A Harmonic plus Noise (H+N) model [1] is used to transform the voice of a source speaker (the impostor) into the voice of a target speaker (the client). The analysis step separates the speech into a harmonic part, expressed as a sum of short-term pitch-synchronous harmonics varying in amplitude and phase (represented by regularised cepstral coefficients), and a noise part, defined as a spectral probability density function modelled as a p-order all-pole filter. For the harmonic part, the source-to-target adaptation is done by mapping the source cepstral coefficients onto the target cepstral coefficients, assuming that each coefficient of the source and target parameters follows a Gaussian distribution over a certain period of time. The noise part is not transformed and is replaced by random noise. The transformed voice is then re-synthesised pitch-synchronously. Table 2 shows the increase in false acceptance rate after the transformation process. The subset of the Polycode database used here is made up of 18 speakers recorded over telephone lines. The reference system is an HMM-based text-dependent speaker verification system set up with an a priori threshold. For each client, a session different from the training and test sets is used as the target. Each impostor test is also used as source data for the transformation.

 

Table 2

FA% normal impostors    FA% transformed impostors
4.2                     23.1
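The mapping of the harmonic part can be illustrated with a minimal sketch. The abstract does not give the exact formula; the code below assumes the simplest mapping consistent with a per-coefficient Gaussian model, namely shifting and scaling each source cepstral coefficient so that its mean and standard deviation match those of the target: y = mu_t + (sigma_t / sigma_s) * (x - mu_s). The function names and the toy frames are hypothetical, and the statistics are taken over all supplied frames rather than over a sliding period of time.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Per-coefficient statistics: (mu_source, sigma_source, mu_target, sigma_target)
Stats = List[Tuple[float, float, float, float]]


def fit_gaussian_map(source_frames: Sequence[Sequence[float]],
                     target_frames: Sequence[Sequence[float]]) -> Stats:
    """Estimate mean and standard deviation of every coefficient for both speakers.

    The two frame sets need not be time-aligned or equally long; only their
    per-coefficient statistics are used.
    """
    stats = []
    for src_col, tgt_col in zip(zip(*source_frames), zip(*target_frames)):
        stats.append((mean(src_col), pstdev(src_col), mean(tgt_col), pstdev(tgt_col)))
    return stats


def apply_gaussian_map(frame: Sequence[float], stats: Stats) -> List[float]:
    """Map one source frame towards the target speaker, coefficient by coefficient."""
    mapped = []
    for x, (mu_s, sd_s, mu_t, sd_t) in zip(frame, stats):
        # Guard against a constant source coefficient (zero spread).
        scale = sd_t / sd_s if sd_s > 0 else 1.0
        mapped.append(mu_t + scale * (x - mu_s))
    return mapped


# Toy frames with two cepstral coefficients each (made-up values).
source = [[0.0, 1.0], [2.0, 3.0]]
target = [[10.0, 5.0], [14.0, 5.0]]

stats = fit_gaussian_map(source, target)
converted = apply_gaussian_map([0.0, 1.0], stats)
```

After this mapping the converted coefficients share the target's first- and second-order statistics; the actual system then re-synthesises the speech pitch-synchronously, which this sketch does not cover.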

 

These results demonstrate the need for new strategies that can take this kind of imposture into account [2]; some of them will be developed in the final paper.

 

References

[1] Y. Stylianou. Efficient decomposition of speech signals into a deterministic and a stochastic part. IEEE ISSPA, 1996.

[2] D. Genoud, G. Chollet. Speech pre-processing against intentional imposture in speaker recognition. In Proc. ICSLP 98, Sydney, 1998.