Musical audio samples generated from joint text embeddings

JASA

29 Nov

The field of machine learning has benefited from the appearance of diffusion-based generative models for images and audio. While text-to-image models have become increasingly prevalent, text-to-audio generative models are still an active area of research. We present work on generating short samples of musical instrument sounds with a model conditioned on text descriptions and on the file-structure labels of large sample libraries. Preliminary findings indicate that generating wide-spectrum sounds such as percussion is not difficult, while generating harmonic musical sounds presents challenges for audio diffusion models.

Other Modalities

Stella Biderman
