Music Instrument Source Separation
Current architecture (link to GitHub):
A 1-D, six-layer U-Net operating in the time domain, trained on 70 songs from the DSD100 dataset. A separate model is trained for each instrument: bass, drums, and vocals.
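For illustration, here is a minimal sketch of what a six-layer, time-domain 1-D U-Net could look like in PyTorch. The channel widths, kernel sizes, strides, and activations below are assumptions for the sketch, not the values used in the actual repository.

```python
import torch
import torch.nn as nn

class UNet1D(nn.Module):
    """Sketch of a six-layer 1-D U-Net operating directly on the waveform."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512)):
        super().__init__()
        enc, dec = [], []
        in_ch = 1  # mono waveform in
        for out_ch in channels:
            # Each encoder block halves the time resolution (stride 2).
            enc.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm1d(out_ch),
                nn.LeakyReLU(0.2),
            ))
            in_ch = out_ch
        rev = list(channels[::-1])   # (512, 256, 128, 64, 32, 16)
        outs = rev[1:] + [1]         # decoder output channels, ending in 1 (waveform)
        for i, (in_c, out_c) in enumerate(zip(rev, outs)):
            # After the first decoder block, the skip connection doubles the input channels.
            in_c = in_c if i == 0 else in_c * 2
            dec.append(nn.Sequential(
                nn.ConvTranspose1d(in_c, out_c, kernel_size=4, stride=2, padding=1),
                nn.Tanh() if i == len(rev) - 1 else nn.ReLU(),
            ))
        self.encoders = nn.ModuleList(enc)
        self.decoders = nn.ModuleList(dec)

    def forward(self, x):
        # x: (batch, 1, samples); the length should be divisible by 2**6.
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1][::-1]  # deepest output feeds the decoder; the rest are skips
        for i, dec in enumerate(self.decoders):
            if i > 0:
                x = torch.cat([x, skips[i - 1]], dim=1)
            x = dec(x)
        return x

if __name__ == "__main__":
    model = UNet1D()
    mix = torch.randn(1, 1, 16000 * 4)  # 4 s of audio at 16 kHz
    stem = model(mix)                   # estimated stem, same shape as the input
    print(stem.shape)                   # torch.Size([1, 1, 64000])
```

One such network is trained per stem, so separating bass, drums, and vocals means running three models over the same mixture.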
Good news, everyone!
More data has arrived (MedleyDB). This will enable fine-tuning and exploring a slightly more complex model: adding a VQ-VAE.
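As a rough sketch of the fine-tuning step, the DSD100-trained weights could be loaded and training continued on MedleyDB excerpts at a lower learning rate. The checkpoint filename, learning rate, loss, and data loader below are placeholders, not the project's actual code.

```python
import torch
import torch.nn as nn

# Reuses the UNet1D class from the sketch above.
model = UNet1D()
model.load_state_dict(torch.load("unet1d_vocal_dsd100.pt"))  # placeholder checkpoint name

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller LR for fine-tuning (assumption)
criterion = nn.L1Loss()  # time-domain L1 between estimated and reference stems

def fine_tune(loader, epochs=5):
    """`loader` is assumed to yield (mixture, target_stem) waveform pairs from MedleyDB."""
    model.train()
    for _ in range(epochs):
        for mix, target in loader:
            optimizer.zero_grad()
            loss = criterion(model(mix), target)
            loss.backward()
            optimizer.step()
```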
Don't Go - Sanulrim
Original mix: 48000 samples per second
Bass: 16000 samples per second
Drum: 16000 samples per second
Vocal: 16000 samples per second
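Because the original mix is at 48,000 samples per second while the models work at 16,000, the demo input has to be downmixed and resampled first. A small sketch using torchaudio; the filenames are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Load the 48 kHz original mix (placeholder filename).
waveform, sr = torchaudio.load("dont_go_mix.wav")      # waveform: (channels, samples), sr == 48000
waveform = waveform.mean(dim=0, keepdim=True)          # downmix to mono
waveform_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)

# Save the 16 kHz version that the separation models consume.
torchaudio.save("dont_go_mix_16k.wav", waveform_16k, 16000)
```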
A long way to go!
There are areas of underperformance: for instance, the model struggles to accurately separate vocals when the vocalist talks rather than sings.
I want a better and cleaner separation.
What else is coming?
Increasing the model's depth and using a higher sampling rate: to capture more detailed and nuanced features
Enlarging the training dataset: to help improve the model's performance and its ability to generalize to unseen data
Discretizing the latent spaces: to allow further expansion of the model, such as adding a transformer (see the sketch below).
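A minimal sketch of how the U-Net bottleneck could be discretized with a VQ-VAE-style codebook; the codebook size, latent width, and commitment weight are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """VQ-VAE-style bottleneck: snap each latent vector to its nearest codebook entry."""

    def __init__(self, num_codes=512, code_dim=512, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):
        # z: (batch, code_dim, time) latents from the encoder bottleneck.
        flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])   # (batch*time, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)      # distance to every codebook entry
        indices = dist.argmin(dim=1)                        # discrete token ids
        quantized = self.codebook(indices).view(z.shape[0], z.shape[2], z.shape[1])
        quantized = quantized.permute(0, 2, 1)              # back to (batch, code_dim, time)
        # Codebook and commitment losses (standard VQ-VAE objective terms).
        loss = ((quantized - z.detach()) ** 2).mean() \
             + self.beta * ((quantized.detach() - z) ** 2).mean()
        # Straight-through estimator: pass decoder gradients straight to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[0], z.shape[2]), loss
```

The discrete token ids returned here are what a transformer could later model on top of the quantized latent space.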