Music Instrument Source Separation
Current architecture (link to GitHub):
A 1-D, six-layer U-Net operating in the time domain, trained on 70 songs from the DSD100 dataset. A separate model is trained for each instrument: bass, drums, and vocals.
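For illustration, here is a minimal sketch of what a six-layer, time-domain 1-D U-Net could look like in PyTorch. The channel widths, kernel sizes, strides, and activations below are assumptions for the sketch, not the values used in the actual repository.

```python
import torch
import torch.nn as nn

class UNet1D(nn.Module):
    """Sketch of a six-layer 1-D U-Net operating directly on the waveform."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512)):
        super().__init__()
        enc, dec = [], []
        in_ch = 1  # mono waveform in
        for out_ch in channels:
            # Each encoder block halves the time resolution (stride 2).
            enc.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm1d(out_ch),
                nn.LeakyReLU(0.2),
            ))
            in_ch = out_ch
        rev = list(channels[::-1])   # (512, 256, 128, 64, 32, 16)
        outs = rev[1:] + [1]         # decoder output channels, ending in 1 (waveform)
        for i, (in_c, out_c) in enumerate(zip(rev, outs)):
            # After the first decoder block, the skip connection doubles the input channels.
            in_c = in_c if i == 0 else in_c * 2
            dec.append(nn.Sequential(
                nn.ConvTranspose1d(in_c, out_c, kernel_size=4, stride=2, padding=1),
                nn.Tanh() if i == len(rev) - 1 else nn.ReLU(),
            ))
        self.encoders = nn.ModuleList(enc)
        self.decoders = nn.ModuleList(dec)

    def forward(self, x):
        # x: (batch, 1, samples); the length should be divisible by 2**6.
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        skips = skips[:-1][::-1]  # deepest output feeds the decoder; the rest are skips
        for i, dec in enumerate(self.decoders):
            if i > 0:
                x = torch.cat([x, skips[i - 1]], dim=1)
            x = dec(x)
        return x

if __name__ == "__main__":
    model = UNet1D()
    mix = torch.randn(1, 1, 16000 * 4)  # 4 s of audio at 16 kHz
    stem = model(mix)                   # estimated stem, same shape as the input
    print(stem.shape)                   # torch.Size([1, 1, 64000])
```

One such network is trained per stem, so separating bass, drums, and vocals means running three models over the same mixture.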
Good news, everyone!
More data has arrived (MedleyDB). This will enable fine-tuning and exploring a slightly more complex model: adding a VQ-VAE.
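As a rough sketch of the fine-tuning step, the DSD100-trained weights could be loaded and training continued on MedleyDB excerpts at a lower learning rate. The checkpoint filename, learning rate, loss, and data loader below are placeholders, not the project's actual code.

```python
import torch
import torch.nn as nn

# Reuses the UNet1D class from the sketch above.
model = UNet1D()
model.load_state_dict(torch.load("unet1d_vocal_dsd100.pt"))  # placeholder checkpoint name

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # smaller LR for fine-tuning (assumption)
criterion = nn.L1Loss()  # time-domain L1 between estimated and reference stems

def fine_tune(loader, epochs=5):
    """`loader` is assumed to yield (mixture, target_stem) waveform pairs from MedleyDB."""
    model.train()
    for _ in range(epochs):
        for mix, target in loader:
            optimizer.zero_grad()
            loss = criterion(model(mix), target)
            loss.backward()
            optimizer.step()
```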
Don't Go - Sanulrim
Original mix: 48000 samples per second
Bass: 16000 samples per second
Drum: 16000 samples per second
Vocal: 16000 samples per second
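Because the original mix is at 48,000 samples per second while the models work at 16,000, the demo input has to be downmixed and resampled first. A small sketch using torchaudio; the filenames are placeholders.

```python
import torchaudio
import torchaudio.functional as F

# Load the 48 kHz original mix (placeholder filename).
waveform, sr = torchaudio.load("dont_go_mix.wav")      # waveform: (channels, samples), sr == 48000
waveform = waveform.mean(dim=0, keepdim=True)          # downmix to mono
waveform_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)

# Save the 16 kHz version that the separation models consume.
torchaudio.save("dont_go_mix_16k.wav", waveform_16k, 16000)
```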
A long way to go!
There are areas of underperformance: for instance, the model struggles to accurately separate vocals when the vocalist talks rather than sings.
I want a better and cleaner separation.
What else is coming?
Increasing the model's depth and using a higher sampling rate: to capture more detailed and nuanced features
Enlarging the training dataset: to help improve the model's performance and its ability to generalize to unseen data
Discretizing the latent spaces: to allow further expansion of the model, such as adding a transformer (see the sketch below).
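A minimal sketch of how the U-Net bottleneck could be discretized with a VQ-VAE-style codebook; the codebook size, latent width, and commitment weight are assumptions for illustration.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """VQ-VAE-style bottleneck: snap each latent vector to its nearest codebook entry."""

    def __init__(self, num_codes=512, code_dim=512, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # weight of the commitment term

    def forward(self, z):
        # z: (batch, code_dim, time) latents from the encoder bottleneck.
        flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])   # (batch*time, code_dim)
        dist = torch.cdist(flat, self.codebook.weight)      # distance to every codebook entry
        indices = dist.argmin(dim=1)                        # discrete token ids
        quantized = self.codebook(indices).view(z.shape[0], z.shape[2], z.shape[1])
        quantized = quantized.permute(0, 2, 1)              # back to (batch, code_dim, time)
        # Codebook and commitment losses (standard VQ-VAE objective terms).
        loss = ((quantized - z.detach()) ** 2).mean() \
             + self.beta * ((quantized.detach() - z) ** 2).mean()
        # Straight-through estimator: pass decoder gradients straight to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[0], z.shape[2]), loss
```

The discrete token ids returned here are what a transformer could later model on top of the quantized latent space.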