Musical audio samples generated from joint text embeddings
Zach Evans, Scott Hawley, and Katherine Crowson. "Musical audio samples generated from joint text embeddings." The Journal of the Acoustical Society of America 152, A178 (2022).
The field of machine learning has benefited from the advent of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models remain an active area of research. We present work on generating short samples of musical instrument sounds from a model conditioned on text descriptions and the file-structure labels of large sample libraries. Preliminary findings indicate that generating wide-spectrum sounds such as percussion is not difficult, while generating harmonic musical sounds presents challenges for audio diffusion models.
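As a rough illustration of the kind of conditioning the abstract describes, the sketch below shows one training step of a text-conditioned audio diffusion denoiser. This is a minimal toy, not the authors' model: the network, embedding dimension, noise schedule, and the assumption of frozen caption embeddings are all illustrative placeholders.

```python
# Minimal sketch (assumption: not the paper's code) of conditioning an audio
# diffusion denoiser on a joint text embedding of the sample's description.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy 1-D denoiser: predicts the noise added to an audio chunk,
    given the noisy audio, the diffusion timestep, and a text embedding."""
    def __init__(self, audio_len=16384, embed_dim=512, hidden=1024):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU())
        self.text_proj = nn.Linear(embed_dim, hidden)
        self.net = nn.Sequential(
            nn.Linear(audio_len + hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, audio_len),
        )

    def forward(self, noisy_audio, t, text_emb):
        # Sum the timestep and text-embedding projections into one conditioning vector.
        cond = self.time_embed(t[:, None]) + self.text_proj(text_emb)
        return self.net(torch.cat([noisy_audio, cond], dim=-1))

# One training step: noise the audio at a random timestep and ask the network
# to recover the noise while it sees the caption's (placeholder) embedding.
model = ConditionedDenoiser()
audio = torch.randn(4, 16384)       # batch of mono audio chunks
text_emb = torch.randn(4, 512)      # stand-in for joint text embeddings of the captions
t = torch.rand(4)                   # continuous timesteps in [0, 1]
alpha, sigma = torch.cos(t * torch.pi / 2), torch.sin(t * torch.pi / 2)
noise = torch.randn_like(audio)
noisy = alpha[:, None] * audio + sigma[:, None] * noise
loss = nn.functional.mse_loss(model(noisy, t, text_emb), noise)
loss.backward()
```

At sampling time the same conditioning vector would steer iterative denoising from pure noise toward audio matching the text description; the details of that sampler are beyond this sketch.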