The triplet loss came to my attention when looking at OpenFace. The concept of directly learning an embedding (without having to fudge things by training with softmax and then using a hidden layer as the embedding) was intriguing. They published a paper, FaceNet: A Unified Embedding for Face Recognition and Clustering, which explains things pretty well:
"In contrast to these approaches, FaceNet directly trains its output to be a compact 128-D embedding using a triplet-based loss function based on LMNN. Our triplets consist of two matching face thumbnails and a non-matching face thumbnail and the loss aims to separate the positive pair from the negative by a distance margin. The thumbnails are tight crops of the face area; no 2D or 3D alignment, other than scale and translation, is performed."
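The loss described in that quote is easy to write down for a single triplet. Here is a minimal sketch in NumPy; the margin value is an illustrative choice, and the function name is mine, not the paper's:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss on one (anchor, positive, negative) triplet of embeddings.

    Pushes the anchor-negative distance to exceed the anchor-positive
    distance by at least `margin`; zero loss once the triplet is
    separated by that margin.
    """
    d_ap = np.sum((anchor - positive) ** 2)  # squared L2 distance
    d_an = np.sum((anchor - negative) ** 2)
    return max(d_ap - d_an + margin, 0.0)
```

When the negative is already far from the anchor the hinge clamps the loss to zero, so only triplets that violate the margin contribute gradient.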
While looking around for more explanations, I found a description of the triplet loss in Tensorflow, which then led me to a blog post by the FaceNet guys talking about ways to speed up implementations of the triplet loss. The gist is that you run a set of examples through the network to get embeddings for each one, then on the CPU find 'hard triplets', i.e. pick some of the hardest triplets within the minibatch, and index into the original embeddings when computing the loss and backpropagating. This means you don't need three copies of the network with tied weights lying around.
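The mining step above can be sketched as follows, assuming one forward pass has already produced embeddings for the whole minibatch. This is the "batch hard" variant (hardest positive and hardest negative per anchor); the function name and margin are illustrative, not taken from either paper:

```python
import numpy as np

def batch_hard_triplet_loss(embeddings, labels, margin=0.2):
    """For each anchor, take its hardest positive (farthest same-label
    example) and hardest negative (closest different-label example),
    then average the hinge losses over usable anchors."""
    # Pairwise squared L2 distances: dists[i, j] = ||e_i - e_j||^2
    sq_norms = np.sum(embeddings ** 2, axis=1)
    dists = sq_norms[:, None] - 2.0 * embeddings @ embeddings.T + sq_norms[None, :]
    dists = np.maximum(dists, 0.0)  # guard against tiny negative values

    same = labels[:, None] == labels[None, :]
    losses = []
    for i in range(len(labels)):
        pos_mask = same[i].copy()
        pos_mask[i] = False          # exclude the anchor itself
        neg_mask = ~same[i]
        if not pos_mask.any() or not neg_mask.any():
            continue                 # need at least one positive and one negative
        hardest_pos = dists[i][pos_mask].max()
        hardest_neg = dists[i][neg_mask].min()
        losses.append(max(hardest_pos - hardest_neg + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0
```

In a real training loop the embeddings would be a tensor that still carries gradients, so indexing into them for the selected triplets backpropagates through the single forward pass, which is the point: no three weight-tied towers.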
Another good paper on the triplet loss is In Defense of the Triplet Loss for Person Re-Identification: https://arxiv.org/abs/1703.07737
My interest in this is towards building a speaker identification embedding. More information soon.