Creating a Speaker Embedding for Speaker ID

Speaker Identification is the process of automatically determining who is speaking based on a recording of their voice. In the past Gaussian Mixture Models (GMMs) and GMM-UBMs were used for this sort or thing, however Deep Learning has basically replaced all the old classification techniques at this point, so why not speaker ID too?

This tutorial will be about creating a deep neural embedding i.e. a neural network that directly learns a mapping from speech segments to a compact Euclidean space where distances directly correspond to a measure of speaker similarity. Once this space has been trained, tasks such as speaker recognition, verification and clustering can be easily implemented using standard techniques with the embeddings as feature vectors. We'll use a deep convolutional network trained to directly optimize the embedding itself, rather than an intermediate bottleneck layer, similar to how FaceNet is used for face recognition.

There is a fair bit of stuff to get through, from datasets, to network architechture, to applications of the final embeddings, so I'll just get started.


Training and Loss
A Worked DTW Example
Incorporating DTW as SVM Kernel


If you look at FaceNet, they used millions of images to train it, but unfortunately finding a database with millions of speakers is going to be tough. The largest speech dataset I know of is librispeech which has 1266 speakers (meant for ASR, but we can use it for speaker), followed by VoxCeleb (1245 speakers), a rather challenging dataset of celebrities speaking in youtube clips. VoxCeleb is difficult because there are so many different recording environments, background noises and time between recordings. Other datasets include TIMIT (630 speakers), ANDOSL (108 speakers) and YOHO (130 speakers). I also played around with voxforge a little bit, but it would take a lot of cleaning up to get it into a workable state.

I will use librispeech, VoxCeleb, ANDOSL and YOHO pooled together for training, and timit for testing. This allows me to use timit's 630 speakers that the embedding has never seen before to test the system, and since this is the scenario that it will be used in, should provide a good indication of its ability to discriminate arbitrary speakers.

Training and Loss

We are going to be training the network with triplet loss, this allows us to directly optimise the embedding instead of using a bottleneck or some other trick. What we want for an embedding is that all speech segments from one speaker map closely together, while segments from different speakers are far apart e.g.

\begin{equation}\label{tripleteq} \norm{x^a - x^p} < \norm{x^a - x^n} \quad \forall a, \forall p, \forall n \($ In the equation above $x^a\) is the embedding for a given speech segment, \(x^p\) is the embedding for a positive exemplar (another speech segment of the same speaker) and \(x^n\) is a negative exemplar (a speech segment from a different speaker). To translate this into a loss function we can use for deep learning we use:

$$ loss = \sum_i^N \left[ \norm{f(x^a_i) - f(x^p_i)} - \norm{f(x^a_i) - f(x^n_i)} + \alpha \right] $$

Where \(f()\) is the neural network and \(\alpha\) is a small number that represents the margin. This means we need sets of three speech segements where 2 are from the same speaker and 1 from another, this is where 'triplet loss' gets its name. There are some tricks to implementing this efficiently, we don't want to have three copies of our network in memory, and in addition we don't want to use all possible triplets - just the difficult ones! The FaceNet paper goes into this a bit deeper, many triples satisfy equation \ref{tripleteq} but these don't help our training, we need hard triples i.e. the ones that violate equation \ref{tripleteq} (but not too hard!). Want we want is to mine moderately hard triples. This is done easily by taking a minibatch and running it through the network on the GPU, then computing all valid triples in the minibatch on the CPU. At this stage we can discard all the easy triples, and only keep the hard ones in the batch. This satisfies our need for moderately hard triples i.e. they are the hardest in the batch. Once we have the hard triples, we then pass the indexes of the embeddings back into tensorflow which can then index into to batch of embeddings to compute the loss.


This network will be implemented

Often in machine learning we need the distance between two objects. This will usually be e.g. the euclidean distance or cosine similarity between vectors. These distance functions only work on fixed size vectors though, and for some problems we would like to know the distance between sequences of different lengths.

Dynamic Time Warping is a method for aligning sequences and computing the distance between them, these can be time sequences like audio recordings or non-time sequences like protein sequences. In bioinformatics the algorithm is called either the Needleman-Wunsch algorithm or Smith–Waterman (these are slight variations of the same thing). It finds uses in many other areas as well.

This section will introduce DTW in a slightly mathematical way, in the next section I'll go through a worked example that shows how all this stuff is applied. Let \(P\) and \(Q\) be sequences of length \(L_P\) and \(L_Q\) respectively. The first step is to compute the distance between all the different combinations of rows. let \(p_i\) be the \(i^{th}\) row of \(P\), and \(q_j\) be the \(j^{th}\) row of \(Q\).

Here we will use the cosine distance, which is defined between vectors. We could just as easily use euclidean distance between each row here as well. The cosine distance between vectors \(p_i\) and \(q_j\) is defined as:

\begin{equation} d(p_i,q_j) = 1 - \dfrac{p_i q_j^T}{\sqrt{p_i p_i^T q_j q_j^T}}, \quad \textrm{for } i=1,\ldots, L_P \textrm{ and } j=1,\ldots, L_Q \end{equation}

The distance \(d\) is computed for all rows in \(P\) and all rows in \(Q\) to give an \(L_P \times L_Q\) distance matrix \(S\). The dynamic time warping algorithm is applied to \(S\) to find the minimum cost path. This gives a cumulative distance matrix \(D\). The matrix \(D\) defines the total cost of alignment between \((p_1,q_1)\) and \((p_{L_P}, q_{L_Q})\). A lower cost implies a better alignment, which indicates the sequences \(P\) and \(Q\) are more similar. The computation of the cumulative distance matrix \(D\) can be calculated like this:

\begin{equation} \label{eqn_dtw_D} D_{i,j} = \min(D_{i-1,j},D_{i,j-1},D_{i-1,j-1}) + S_{i,j} \end{equation}

The equation above is computed for all \(i=1,2,\ldots,L_P\) and \(j=1,2,\ldots ,L_Q\). Also, \(D_{i,j} = \emptyset\) (the empty set) for \(i\leq 0\) and/or \(j\leq 0\), and \(S_{i,j} = d(p_i,q_j)\). The final distance between \(P\) and \(Q\) is defined as

\begin{equation} D_{DTW}(P,Q) = D(L_P,L_Q) \end{equation}

where \(D\) is the cumulative distance matrix defined above for \(P\) and \(Q\), and \(D(L_P,L_Q)\) indicates the indexing of the bottom right corner of the matrix \(D\), all other values in the matrix are ignored.

A Worked DTW Example

Most DTW tutorials assume that you are aligning two 1-dimensional vectors, but in machine learning problems we almost always deal with multi-dimensional sequences e.g. audio is represented as MFCCs which have 12 dimensions, so our sequence is actually a N x 12 matrix. DTW can handle this sort of situation just fine, but I think it confuses newcomers so the example below aligns 2 sequences with 3-dimensional elements.

Our matrices \(P\) and \(Q\) will be defined as

Sequence \(P\) of length 4 Sequence \(Q\) of length 5
\begin{bmatrix} 1 & 5 & 6 \\ 4 & 6 & -3\\ 3 & 3 & 1 \\ 5 & 3 & -1 \end{bmatrix}
\begin{bmatrix} 2 & 3 & 2 \\ 5 & 2 & -3\\ 5 & 4 & 6 \\ 0 & 4 & 1 \\ 2 & 0 & -1 \end{bmatrix}

These matrices aren't that similar, but we will find the distance between them anyway. For this example we have \(L_P=4\) and \(L_Q=5\). Our first step is to build the matrix \(S\) of pairwise distances between all the rows.

Let \(p_i\) (for \(i=1,\ldots,4\)) and \(q_j\) (for \(j=1,\ldots,5\)) be the row vectors of PSSMs \(P\) and \(Q\). To compute the distance between row 1 of \(P\) (\(p_1 = [1,5,6]\)) and row 1 of \(Q\) (\(q_1 = [2,3,2])\))

\begin{align} d(p_1,q_1) &= 1 - \dfrac{p_1 q_1^T}{\sqrt{p_1 p_1^T q_1 q_1^T}}\\ &= 1 - \dfrac{29}{\sqrt{62 \times 17}}\\ &= 1 - 0.8933 = 0.1067 \end{align}

In the same way, the rest of the matrix \(S\) can be computed between all possible pairs of rows (all other combinations of \(i\) and \(j\)).

$$ S = \begin{bmatrix} 0.1067 & 1.0618 & 0.1171 & 0.1991 & 1.2272\\ 0.3789 & 0.1484 & 0.6206 & 0.3479 & 0.3701\\ 0.0541 & 0.3301 & 0.1372 & 0.2767 & 0.4870\\ 0.3031 & 0.0677 & 0.4029 & 0.5490 & 0.1685\\ \end{bmatrix} $$

The matrix \(S\) is used to compute the cumulative distance matrix \(D\) using equation \ref{eqn_dtw_D}.

\begin{equation} D_{11} = \min(D_{01},D_{10},D_{00}) + S_{11} = S_{11} = 0.1067 \end{equation}

since \(D_{01}\),\(D_{10}\) and \(D_{00}\) do not exist and are considered empty.

In the same way \(D_{21} = D_{11} + S_{21} = 0.4856\), \(D_{12} = D_{11} + S_{12} = 1.1685\) and \(D_{22} = \min(D_{12},D_{21},D_{11}) + S_{22} = 0.1067 + 0.1484 = 0.2552\). The complete matrix \(D\) is calculated to be:

$$ D = \begin{bmatrix} 0.1067 & 1.1685 & 1.2856 & 1.4847 & 2.7119\\ 0.4856 & 0.2551 & 0.8757 & 1.2236 & 1.5937\\ 0.5397 & 0.5852 & 0.3923 & 0.6690 & 1.1560\\ 0.8428 & 0.6074 & 0.7952 & 0.9413 & 0.8375\\ \end{bmatrix} $$

It is concluded that the distance \(D_{DTW}\) between \(P\) and \(Q\) is \(D_{DTW}(P,Q) = D(L_P,L_Q) = D(4,5) = 0.8375\).


In the previous section we went through how compute the distance between two sequences. We're now going to use the distance function \(D_{DTW}\) that we defined to build a nearest neighbour classifier for the utterances in ISOLET. The basic steps are, for each test utterance, to compute the distance between it and every train utterance. We identify the closest train utterance and then use the corresponding class as the class of the test utterance. That's it!

We have to implement the functions defined above in python. First we have to get a list of all the files in ISOLET and extract MFCC features for each utterance, see the source file for that stuff. Below I've shown an implementation of the DTW algorithm in numpy, but thus runs really slowly, so I rewrote it to use numba just-in-time compilation which is 500 times (!!) faster, see the code for the implementation, the basic trick is not to use any numpy or scipy calls in the loops, just pure python.

# this is the dtw distance implemented directly
def dtw_dist(p,q):
    ep = np.sqrt(np.sum(np.square(p),axis=1));
    eq = np.sqrt(np.sum(np.square(q),axis=1));
    D = 1 -,q.T)/np.outer(ep,eq) # work out D all at once
    S = np.zeros_like(D)
    Lp = np.shape(p)[0]
    Lq = np.shape(q)[0]
    N = np.shape(p)[1]
    for i in range(Lp):
        for j in range(Lq):          
            if i==0 and j==0:  S[i,j] = D[i,j]
            elif i==0: S[i,j] = S[i,j-1] + D[i,j]
            elif j==0: S[i,j] = S[i-1,j] + D[i,j]
            else: S[i,j] = np.min([S[i-1,j],S[i,j-1],S[i-1,j-1]]) + D[i,j]
    return S[-1,-1] # return the bottom right hand corner distance

# features_test, features_train are lists of numpy arrays containing MFCCs. 
# label_test and label_train are just python lists of class labels (0='A', 1='B' etc.)    
correct = 0
total = 0
for i in range(len(features_test)):
    best_dist = 10**10
    best_label = -1
    for j in range(len(features_train)):
        dist_ij = dtw2_dist(features_test[i],features_train[j])
        if dist_ij < best_dist:
            best_dist = dist_ij
            best_label = label_train[j]
    if best_label == label_test[i]: correct += 1
    total += 1

Using the above code I could achieve 79.35% accuracy using 26 filterbank, 13 cepstrum MFCC. By adding delta features and twiddling with some other parameters I got this up to 84.4% accuracy. This is not great, the scores listed by the UCI repository have 96% as the best score (using a non-DTW method). The example above uses 1-nearest neighbour, what if we increase the number of neighbours? for 3-NN we get 85.06%, for 5-NN we get 85.58%, for 11-NN we get 88.52%, for 50-NN we get 88.84% and for 100-NN we get 88.91%. So a small increase in accuracy. In the next section we will look at incorporating these distances into an SVM kernel.

Incorporating DTW as SVM Kernel

See here for a good non-mathematical guide to SVM. The SVM algorithm uses a kernel matrix as the basis of its classifications. A kernel function is valid only of it is symmetric i.e. \(K(\mathbf{x},\mathbf{y}) = K(\mathbf{x},\mathbf{y})\) and the resulting kernel matrix is positive semi-definite.

Common kernel functions include linear, polynomial and Radial Basis Function (RBF). For these kernels \(\mathbf{x}\) and \(\mathbf{y}\) must have the same dimensionality to be computable. To create a kernel function that can utilise variable length inputs, we can use \(D_{DTW}\), the distance between sequences that we defined in the previous section.

\begin{equation} K_{DTW}(\mathbf{x},\mathbf{y}) = \exp \left( \dfrac{-D_{DTW}(\mathbf{x},\mathbf{y})^2}{\gamma^2} \right) \end{equation}

One of the problems with using the DTW distance is that it is not a true distance metric like the Euclidean distance. The DTW distance does not satisfy the triangle inequality, and positive semi-definiteness of the kernel cannot be proven as simple counterexamples can be found. Nevertheless kernels built like this show good results.

To apply this we just have to do grid search on the validation set to find good parameters \(C\) and \(\gamma\) for this particular problem.

Comments !