Dynamic Time Warping for Sequence Classification
DTW is a method for aligning two sequences in an optimal manner, and in the end it gives us the alignment as well as a distance between the two sequences. With this distance we can find all the closest sequences to a particular test sequence i.e. a nearest neighbour classifier. We can actually improve on this by using the distances in a kernel and using a Support Vector Machine (SVM), which will beat nearest neighbour pretty much always.
The goals for this tutorial are to introduce Dynamic Time Warping (DTW), show a worked example, provide python code for running DTW on a real problem, and incorporating DTW into an SVM kernel so we can see how much improvement we get.
In this tutorial we'll be using the ISOLET dataset which is a collection of audio recordings each with a single spoken letter e.g. 'A', 'B' etc. Note that the version at UCI has already had a fixed number of features per utterance extracted and isn't suitable for this experiment. It is our job to classify each of the test recordings as a particular letter. This will require the use of DTW because each recording is a different length. Of course, a proper speech recognition system might do better but we are using this problem as an example for a general class of simple problems.
For a short summary of the results: using 3-nearest neighbour and the DTW distance I got 84.4% accracy on the ISOLET database. By incorporating these same distances into an SVM kernel I got 95.5% accuracy. This sort of improvement is typical, SVM is a much more robust classifier than nearest neighbour.
For feature extraction I'll be using MFCCs provided by python_speech_features, see a tutorial here.
Contents
•  Defining DTW
• A Worked DTW Example
• Using DTW on ISOLET
• Incorporating DTW as SVM Kernel
• Further Reading
Defining DTW
Often in machine learning we need the distance between two objects. This will usually be e.g. the euclidean distance or cosine similarity between vectors. These distance functions only work on fixed size vectors though, and for some problems we would like to know the distance between sequences of different lengths.
Dynamic Time Warping is a method for aligning sequences and computing the distance between them, these can be time sequences like audio recordings or non-time sequences like protein sequences. In bioinformatics the algorithm is called either the Needleman-Wunsch algorithm or Smith–Waterman (these are slight variations of the same thing). It finds uses in many other areas as well.
This section will introduce DTW in a slightly mathematical way, in the next section I'll go through a worked example that shows how all this stuff is applied. Let \(P\) and \(Q\) be sequences of length \(L_P\) and \(L_Q\) respectively. Image \(P\) is made up of MFCC features extracted from audio. It will be a \(L_P\) by 12 matrix, where each row represents the MFCC features from one frame. \(Q\) will be the same, except computed from a different piece of audio. The first step for DTW is to compute the distance between all the different combinations of rows of the matrix of features. Let \(p_i\) be the \(i^{th}\) row of \(P\), and \(q_j\) be the \(j^{th}\) row of \(Q\). In the example below I use the cosine distance, but euclidean distance is also popular.
The distance \(d\) is computed for all rows in \(P\) and all rows in \(Q\) to give an \(L_P \times L_Q\) distance matrix \(S\). The dynamic time warping algorithm (described soon) is applied to \(S\) to find the minimum cost path. This gives a cumulative distance matrix \(D\). The matrix \(D\) defines the total cost of alignment between \((p_1,q_1)\) and \((p_{L_P}, q_{L_Q})\). A lower cost implies a better alignment, which indicates the sequences \(P\) and \(Q\) are more similar. The computation of the cumulative distance matrix \(D\) can be calculated like this:
The equation above is computed for all \(i=1,2,\ldots,L_P\) and \(j=1,2,\ldots ,L_Q\). Also, \(D_{i,j} is not defined for $i\leq 0\) and/or \(j\leq 0\), and \(S_{i,j} = d(p_i,q_j)\). The final distance between \(P\) and \(Q\) is defined as
where \(D\) is the cumulative distance matrix defined above for \(P\) and \(Q\), and \(D(L_P,L_Q)\) indicates the indexing of the bottom right corner of the matrix \(D\), all other values in the matrix are ignored (unless you want to compute the path, then you backtrack through \(D\)).
A Worked DTW Example
Hopefully you weren't discouraged by the previous section, it can get a little bit dry. In this section we will use some concrete numbers and go through all the steps you need to do to apply DTW. If you want to see another example see Dan Ellis' page on dtw.
Most DTW tutorials assume that you are aligning two 1-dimensional vectors, but in machine learning problems we almost always deal with multi-dimensional sequences e.g. audio is represented as MFCCs which have 12 dimensions, so our sequence is actually a N x 12 matrix. DTW can handle this sort of situation just fine, but I think it confuses newcomers so the example below aligns 2 sequences with 3-dimensional elements.
Our matrices \(P\) and \(Q\) will be defined as
| Sequence \(P\) of length 4 | Sequence \(Q\) of length 5 | 
|---|---|
| \begin{bmatrix} 1 & 5 & 6 \\ 4 & 6 & -3\\ 3 & 3 & 1 \\ 5 & 3 & -1 \end{bmatrix} | \begin{bmatrix} 2 & 3 & 2 \\ 5 & 2 & -3\\ 5 & 4 & 6 \\ 0 & 4 & 1 \\ 2 & 0 & -1 \end{bmatrix} | 
These matrices aren't that similar, but we will find the distance between them anyway. For this example we have \(L_P=4\) and \(L_Q=5\). Our first step is to build the matrix \(S\) of pairwise distances between all the rows.
Let \(p_i\) (for \(i=1,\ldots,4\)) and \(q_j\) (for \(j=1,\ldots,5\)) be the row vectors of PSSMs \(P\) and \(Q\). To compute the distance between row 1 of \(P\) (\(p_1 = [1,5,6]\)) and row 1 of \(Q\) (\(q_1 = [2,3,2])\))
In the same way, the rest of the matrix \(S\) can be computed between all possible pairs of rows (all other combinations of \(i\) and \(j\)).
The matrix \(S\) is used to compute the cumulative distance matrix \(D\) using equation \ref{eqn_dtw_D}.
since \(D_{01}\),\(D_{10}\) and \(D_{00}\) do not exist and are considered empty.
In the same way \(D_{21} = D_{11} + S_{21} = 0.4856\), \(D_{12} = D_{11} + S_{12} = 1.1685\) and \(D_{22} = \min(D_{12},D_{21},D_{11}) + S_{22} = 0.1067 + 0.1484 = 0.2552\). The complete matrix \(D\) is calculated to be:
It is concluded that the distance \(D_{DTW}\) between \(P\) and \(Q\) is \(D_{DTW}(P,Q) = D(L_P,L_Q) = D(4,5) = 0.8375\).
Using DTW on ISOLET
In the previous section we went through how compute the distance between two sequences. We're now going to use the distance function \(D_{DTW}\) that we defined to build a nearest neighbour classifier for the utterances in ISOLET. The basic steps are, for each test utterance, to compute the distance between it and every train utterance. We identify the closest train utterance and then use the corresponding class as the class of the test utterance. That's it!
We have to implement the functions defined above in python. First we have to get a list of all the files in ISOLET and extract MFCC features for each utterance, see the source file for that stuff. Below I've shown an implementation of the DTW algorithm in numpy, but this runs really slowly, so I rewrote it to use numba just-in-time compilation which is 500 times (!!) faster, see the code for the implementation, the basic trick is not to use any numpy or scipy calls in the loops, just pure python.
# this is the dtw distance implemented directly
def dtw_dist(p,q):
    ep = np.sqrt(np.sum(np.square(p),axis=1));
    eq = np.sqrt(np.sum(np.square(q),axis=1));
    D = 1 - np.dot(p,q.T)/np.outer(ep,eq) # work out D all at once
    S = np.zeros_like(D)
    Lp = np.shape(p)[0]
    Lq = np.shape(q)[0]
    N = np.shape(p)[1]
    for i in range(Lp):
        for j in range(Lq):          
            if i==0 and j==0:  S[i,j] = D[i,j]
            elif i==0: S[i,j] = S[i,j-1] + D[i,j]
            elif j==0: S[i,j] = S[i-1,j] + D[i,j]
            else: S[i,j] = np.min([S[i-1,j],S[i,j-1],S[i-1,j-1]]) + D[i,j]
    return S[-1,-1] # return the bottom right hand corner distance
# features_test, features_train are lists of numpy arrays containing MFCCs. 
# label_test and label_train are just python lists of class labels (0='A', 1='B' etc.)    
correct = 0
total = 0
for i in range(len(features_test)):
    best_dist = 10**10
    best_label = -1
    for j in range(len(features_train)):
        dist_ij = dtw2_dist(features_test[i],features_train[j])
        if dist_ij < best_dist:
            best_dist = dist_ij
            best_label = label_train[j]
    if best_label == label_test[i]: correct += 1
    total += 1
print('accuracy:',correct/total)
Using the above code I could achieve 79.35% accuracy using 26 filterbank, 13 cepstrum MFCC. By adding delta features and twiddling with some other parameters I got this up to 84.4% accuracy. This is not great, the scores listed by the UCI repository have 96% as the best score (using a non-DTW method). The example above uses 1-nearest neighbour, what if we increase the number of neighbours? for 3-NN we get 85.06%, for 5-NN we get 85.58%, for 11-NN we get 88.52%, for 50-NN we get 88.84% and for 100-NN we get 88.91%. So a small increase in accuracy. For the above results I pooled the train and validation sets, though final accuracies are mostly identical with just the train set. In the next section we will look at incorporating these distances into an SVM kernel.
Incorporating DTW as SVM Kernel
See here for a good non-mathematical guide to SVM. The SVM algorithm uses a kernel matrix as the basis of its classifications. A kernel function is valid only of it is symmetric i.e. \(K(\mathbf{x},\mathbf{y}) = K(\mathbf{x},\mathbf{y})\) and the resulting kernel matrix is positive semi-definite.
Common kernel functions include linear, polynomial and Radial Basis Function (RBF). For these kernels \(\mathbf{x}\) and \(\mathbf{y}\) must have the same dimensionality to be computable. To create a kernel function that can utilise variable length inputs, we can use \(D_{DTW}\), the distance between sequences that we defined in the previous section.
One of the problems with using the DTW distance is that it is not a true distance metric like the Euclidean distance. The DTW distance does not satisfy the triangle inequality, and positive semi-definiteness of the kernel cannot be proven as simple counterexamples can be found. Nevertheless kernels built like this show good results.
To apply this we just have to do grid search on the validation set to find good parameters \(C\) and \(\gamma\) for this particular problem. Full code here. It does take quite a while to run.
To implement the SVM component, we first need to compute the kernel matrices. The train kernel is just the DTW distance from equation \(\ref{dtwkernel}\) computed between all pairs of train utterances. I wrote them to a file so I could do grid search later without having to recompute them all, as it can take a little while to compute, but you could very easily skip this step if you wanted.
# building train kernel    
kf = open('tr.kernel','w')
for i in range(len(features_tr)):
    for j in range(len(features_tr)):
        dist_ij = dtw2_dist(features_tr[i],features_tr[j])
        kf.write(str(i)+' '+str(j)+' '+str(dist_ij)+'\n')
kf.close()
The same is done for the test kernel, except we now compute the distance between each of the train utterances and the test utterances. The next step is to load the kernels from the file, and apply LIBSVM to get the final accuracy. The full code includes some grid search to find the best parameters.
# loading train kernel
dtw_dist = np.zeros((len(features_tr),len(features_tr)))
trk = open('tr.kernel2')
for line in trk:
    toks = line.split()
    dtw_dist[int(toks[0]),int(toks[1])] = float(toks[2])
trk.close()
c = 10
gamma = 25
kernel = np.exp(-dtw_dist/gamma)
testkernel = np.exp(-test_dtw_dist/gamma)
clf = svm.SVC(kernel='precomputed',C=c)
clf.fit(kernel, np.array(label_tr))
res = clf.predict(testkernel)
print('accuracy:',np.mean(res==label_ts))
The final accuracy I got with the svm kernel was 95.25% accuracy, much better than the 84.4% accuracy using the nearest neighbour method (this is using the exact same set of distances too!). The only downside with these methods is that they take quite a long time to run, though they are easily parallelised.
 
                
        
Comments !