An MLP is used as the classifier [10]. The MLP has
462 input neurons, 100 neurons in the hidden layer and 2 neurons in the
output layer. The 462 input neurons correspond to 11 consecutive input
vectors, in order to capture longer-term speech events. The 2 neurons of
the output layer give the local log likelihood scores (LLS) of the target
speaker ($LLS_{sp}$) and the non-target speaker ($LLS_{ns}$), also named world or cohort. These LLS are summed
along the speech segment (using N frames) to obtain a total log
likelihood $TLL_{sp}$ for the target speaker and $TLL_{ns}$ for
the non-target speaker.
$$TLLR = TLL_{sp} - TLL_{ns}$$
The final score used for each speech segment is TLLR, which corresponds to a log likelihood ratio [9].
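The scoring chain described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes (462, 100, 2) and the 11-frame context come from the text, while the 42-dimensional feature vector is only inferred from 462 / 11. The tanh hidden activation, the log-softmax output normalization, the random weights and the function names `stack_context`, `mlp_local_scores` and `tllr` are all assumptions made for the example.

```python
import numpy as np

CONTEXT = 11                    # consecutive input vectors per MLP input (from the text)
N_INPUT, N_HIDDEN, N_OUTPUT = 462, 100, 2   # layer sizes (from the text)
FEAT_DIM = N_INPUT // CONTEXT   # 42, inferred from 462 / 11 (not stated in the text)


def stack_context(frames, context=CONTEXT):
    """Concatenate every run of `context` consecutive frames into one input row."""
    n_windows = frames.shape[0] - context + 1
    return np.stack([frames[i:i + context].reshape(-1) for i in range(n_windows)])


def log_softmax(z):
    """Numerically stable log-softmax over the last axis of a 2-D array."""
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def mlp_local_scores(x, w1, b1, w2, b2):
    """Return per-frame local scores: column 0 = target, column 1 = non-target."""
    hidden = np.tanh(x @ w1 + b1)          # hidden activation is an assumption
    return log_softmax(hidden @ w2 + b2)   # output normalization is an assumption


def tllr(lls):
    """Sum the local scores over the segment and take the difference."""
    tll_sp = lls[:, 0].sum()   # total log likelihood, target speaker
    tll_ns = lls[:, 1].sum()   # total log likelihood, non-target speaker
    return tll_sp - tll_ns


# Toy data: random features and untrained random weights, for shape checking only.
rng = np.random.default_rng(0)
frames = rng.standard_normal((200, FEAT_DIM))
w1 = rng.standard_normal((N_INPUT, N_HIDDEN)) * 0.01
b1 = np.zeros(N_HIDDEN)
w2 = rng.standard_normal((N_HIDDEN, N_OUTPUT)) * 0.01
b2 = np.zeros(N_OUTPUT)

x = stack_context(frames)                  # one 462-dimensional row per window
lls = mlp_local_scores(x, w1, b1, w2, b2)  # one pair of local scores per window
score = tllr(lls)                          # segment-level score
```

With no padding at the segment edges, a segment of 200 frames yields 190 stacked inputs, so N here counts windows rather than raw frames; how the original system treats the edges is not stated.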
One MLP is built for each target speaker. The cohort speaker data were created from the speech of around 40 male and 40 female speakers extracted from the Switchboard database. The total amount of speech for each training condition was balanced with the amount of data for each target speaker (i.e. 1 minute).
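One plausible reading of this balancing step is sketched below: the cohort pool is subsampled so that its total size matches the target speaker's training data, drawing roughly equally from each cohort speaker. The text does not describe the actual selection procedure, so the equal per-speaker quota, the random sampling, the rate of 100 input vectors per second and the function name `balance_cohort` are all assumptions.

```python
import numpy as np


def balance_cohort(cohort_by_speaker, n_target, seed=0):
    """Pick n_target input vectors from the cohort, spread evenly over speakers.

    cohort_by_speaker: dict mapping a speaker id to an array of shape (n_i, dim).
    Each speaker must hold at least as many vectors as its quota.
    """
    rng = np.random.default_rng(seed)
    speakers = sorted(cohort_by_speaker)
    base, extra = divmod(n_target, len(speakers))
    picked = []
    for i, spk in enumerate(speakers):
        vectors = cohort_by_speaker[spk]
        quota = base + (1 if i < extra else 0)   # first `extra` speakers get one more
        idx = rng.choice(len(vectors), size=quota, replace=False)
        picked.append(vectors[np.sort(idx)])     # keep the original temporal order
    return np.concatenate(picked)


# Toy cohort: 80 speakers (40 + 40 in the text) with 100 random input vectors each.
rng = np.random.default_rng(1)
cohort = {f"spk{i:02d}": rng.standard_normal((100, 462)) for i in range(80)}

# 1 minute of target data at an assumed 100 input vectors per second.
n_target = 60 * 100
non_target = balance_cohort(cohort, n_target)    # 75 vectors from each speaker
```

The resulting non-target set has the same number of rows as the target set, so the two output classes are equally represented during training.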