Triplet loss and its uses 2

I spoke about triplet loss in the previous post. I wanted to build a speaker embedding similar to the face embedding used in FaceNet. It turns out that the team at Baidu beat me to it: they have datasets with 50k speakers, and a lot more processing power than I have. I will still try to implement their paper, and perhaps use some of their datasets.
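As a refresher, the triplet loss behind both FaceNet-style face embeddings and speaker embeddings pulls an anchor toward a positive example (same identity) and pushes it away from a negative (different identity) by at least a margin. Here is a minimal NumPy sketch; the margin value 0.2 is an illustrative choice, not taken from either paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(a, p) - d(a, n) + margin) with squared Euclidean distances."""
    d_pos = np.sum((anchor - positive) ** 2)  # distance to same speaker
    d_neg = np.sum((anchor - negative) ** 2)  # distance to different speaker
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings: anchor close to positive, far from negative -> zero loss
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([-1.0, 0.0])
print(triplet_loss(a, p, n))  # 0.0 (the margin is already satisfied)
```

In training the distances are computed over batches of learned embeddings, and picking hard triplets (negatives close to the anchor) matters a lot for convergence, but the loss itself is this simple.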

Previously I identified a few datasets I can get my hands on: VoxCeleb (1,245 speakers), TIMIT (630 speakers), and ANDOSL (108 speakers), plus some smaller ones.
