Music ControlNet, democratizing music, & the power of SLMs (2024)

Breaking down the most advanced AI systems in the world to prepare you for your future.

5-minute weekly reads.

If there’s one ubiquitous element across recent AI solutions, it’s that almost all of them are based on text prompting.

  • ChatGPT for Text-to-text

  • Stable Diffusion for Text-to-image

  • Runway for Text-to-video

And so on.

Basically, AI right now is all about requesting something through text and getting whatever that model specializes in. And that includes music.

In fact, models like MusicGen have become extremely popular, as you can create music using prompts like “Classic reggae track with an electronic guitar solo” and get something back that actually sounds good.

However, the fun ends pretty much there. You have no control over the generated melody, so the practical value is close to nonexistent.

It’s a fun toy and nothing more.

But now, researchers from Adobe Research and Carnegie Mellon University have changed the game by applying to music synthesis the discovery that revolutionized image models, creating the first model that hands you the power to create on your own terms.

For the first time, the AI musician is under your control, and the potential is boundless.

It’s all a Transformation

If you asked me to define AI in one word, that would be ‘transformation’.

In a nutshell, what we describe as AI is nothing but a black box with a statistical algorithm inside that takes data in some form as an input and outputs data in another form, usually in the form of a prediction.

In its purest form, especially if we think about neural networks, these models are simply a huge function that models a certain task.

For instance:

  • ChatGPT is simply a huge function that has learned, upon receiving a sequence of words, to predict the next one.

  • DALL-E, upon receiving a text description, predicts the noise in an image and takes it out to uncover the new image behind the noise.
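To make the "model as a function" idea concrete, here is a toy sketch of next-word prediction. It only counts which word follows which in a tiny made-up corpus; ChatGPT uses a large neural network instead, so the names `train_next_word_model` and `predict_next` and the corpus are purely illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_next_word_model(corpus):
    """Count, for every word, which words follow it in the corpus."""
    follow_counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            follow_counts[current][following] += 1
    return follow_counts

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training data, invented for this illustration.
corpus = [
    "reggae music sounds good",
    "reggae music sounds relaxed",
    "reggae music is fun",
]
model = train_next_word_model(corpus)
print(predict_next(model, "reggae"))  # music
print(predict_next(model, "music"))   # sounds
```

The input is text and the output is a prediction about text: the same input-to-output shape as the real thing, just with a far simpler function in the middle.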

And what makes neural networks so great is that, with enough data and some clever tweaking in the structure of this function, they can learn to approximate almost any data transformation.

From words to music, and everything in between.

Heck, in last week’s newsletter, we saw how they turned brain signals from imagined movements into real robotic movements!

This capacity to model ‘anything’ is why neural networks are described as ‘universal function approximators’.

But why am I telling you this? Simple: this is the intuition we need to understand how incredible Music ControlNet is.
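A small, hedged illustration of that idea in one dimension: a network with a single hidden layer of ReLU units can trace a piecewise-linear curve through any function, and adding units tightens the fit. In this sketch the weights are set by hand rather than learned from data, and `build_relu_net` and `max_error` are names made up for the example.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_net(f, lo, hi, hidden_units):
    """One-hidden-layer ReLU network that interpolates f at evenly spaced knots."""
    step = (hi - lo) / hidden_units
    knots = [lo + i * step for i in range(hidden_units + 1)]
    slopes = [(f(knots[i + 1]) - f(knots[i])) / step for i in range(hidden_units)]
    # The output weight of unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, hidden_units)]
    bias = f(lo)

    def net(x):
        return bias + sum(w * relu(x - k) for w, k in zip(out_weights, knots))

    return net

def max_error(f, net, lo, hi, samples=1000):
    """Largest absolute gap between f and the network on a grid of points."""
    points = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - net(x)) for x in points)

lo, hi = 0.0, 2 * math.pi
err_small = max_error(math.sin, build_relu_net(math.sin, lo, hi, 5), lo, hi)
err_large = max_error(math.sin, build_relu_net(math.sin, lo, hi, 50), lo, hi)
print(err_small, err_large)  # the 50-unit network tracks sin(x) far more closely
```

Real networks find such weights through training instead of construction, but the takeaway is the same: more capacity lets the function bend to fit more complicated transformations.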

Applying Control

Current text-to-music models like MusicGen by Meta or MusicLM by Google are already quite impressive, but lack any sense of control.

A common problem in the space

In short, they have learned very powerful global attributes of a multitude of genres and instruments, the appropriate moods of every music style, and of course the correct tempos.

Consequently, if you ask them to generate a reggae song, they will generate something that undoubtedly sounds like it.

But reggae, just like any other genre, has a multitude of different songs with varying melodies, rhythms, or dynamics you can’t control with those models.

Thus, to understand how this has been solved for music, we first need to understand how it was solved for image synthesis.

What if, for instance, I wanted to generate several images in different scenarios while always depicting the same woman?

You couldn’t… until you could.

Harnessing control

To solve the image problem, researchers at Stanford University developed ControlNet, an architecture that lets an image synthesis model accept additional conditions besides the text prompt, like Canny edges or human poses.

This results in consistent generations that stay true to the text prompt while respecting the newly set conditions.

But how does ControlNet work?

For standard image generation models, the variables, or weights, have specialized in one specific task: generating an image that matches the given text prompt.

But if you want to be able to provide a sketch so that the model generates the image based on it, these weights can’t do that.

You need new ones.

You can try to fine-tune the original weights, but that risks degrading performance on the first objective, which is to generate an image that is semantically relevant to the prompt.

Bottom line, this line of action is almost never an option.

Therefore, they attached a new set of weights at several levels of a standard image generation model (they used Stable Diffusion, but the approach can be applied to other models).

For training, they froze the weights of the standard model and trained only the new weights, so that the delivered image also respects the new condition, such as any of the eight condition types the authors illustrate.

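The ‘zero convolution’ part means that the convolution layers connecting the new weights to the frozen model start with all their parameters set to zero, so at the beginning of training the new branch adds nothing and cannot disturb the original model’s generations.

A weight starting at ‘zero’ doesn’t mean its ‘slope’, i.e. its derivative, is zero: the gradient with respect to that weight is generally non-zero, so the model still learns.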
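The point about zero-valued weights still learning can be checked with a scalar toy, sketched below under loud assumptions: a single frozen weight stands in for the pretrained model, a single zero-initialized weight stands in for the new branch, and all names and numbers are invented for the illustration. This is not ControlNet itself.

```python
def frozen_model(prompt_feature):
    """Stand-in for the pretrained model; its weight is never updated."""
    frozen_weight = 2.0
    return frozen_weight * prompt_feature

def controlled_output(prompt_feature, control_feature, control_weight):
    """Frozen output plus the contribution of the new, trainable weight."""
    return frozen_model(prompt_feature) + control_weight * control_feature

def grad_control_weight(prompt_feature, control_feature, control_weight, target):
    """Derivative of the squared error (output - target)**2 w.r.t. control_weight."""
    error = controlled_output(prompt_feature, control_feature, control_weight) - target
    return 2.0 * error * control_feature

prompt_feature, control_feature, target = 1.0, 0.5, 3.0
w = 0.0  # zero initialization: the new branch contributes nothing at first

initial_output = controlled_output(prompt_feature, control_feature, w)
initial_grad = grad_control_weight(prompt_feature, control_feature, w, target)
print(initial_output)  # 2.0, identical to the frozen model's output
print(initial_grad)    # -1.0, non-zero even though w is zero

# Plain gradient descent on the new weight only.
for _ in range(200):
    w -= 0.1 * grad_control_weight(prompt_feature, control_feature, w, target)

final_output = controlled_output(prompt_feature, control_feature, w)
print(final_output)  # now very close to the target
```

The gradient is the error multiplied by the control feature, so it is non-zero as long as the condition carries signal, regardless of the weight's starting value.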

Therefore, once trained, when combining both sets of weights to generate a new image:

  • The original frozen weights ensure the new image respects the text condition set by the user

  • The new weights ensure the structure of the new image follows the new condition

Et voilà.

And now, this seminal architecture has been brought into music… and the results are amazing.

A New Era

Just like Image ControlNet allows you to fix certain image conditions, Music ControlNet allows you to fix melody, rhythm, and dynamics.

The concept is very similar; they add three control blocks to the model, each a new set of weights specialized in one of the new conditions.

And just like in image synthesis, these weights encode the new conditions and add new information to the model during the decoding process, so that the new music not only respects global style but also applies the new conditions.

But here, researchers add a twist.

Modeling raw sound waves is really hard: audio is typically sampled at rates between 16 and 48 kHz, meaning that the model would have to predict at least 16,000 data points for every second of sound.

Luckily, with sound, we have mel spectrograms, visual representations of the spectrum of frequencies of a signal as they vary with time.

The mel scale is designed to mimic the human ear’s non-linear perception of pitch, meaning it gives a more perceptually relevant representation of audio.
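As a rough illustration of that non-linearity, the sketch below uses one widely used Hz-to-mel formula. The text does not say which mel variant the researchers use, so treat the formula choice and the function names as assumptions.

```python
import math

def hz_to_mel(frequency_hz):
    """Convert a frequency in Hz to mels (a common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 1,000 Hz jump covers far fewer mels at high frequencies,
# mirroring how the ear resolves pitch more finely at the low end.
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(8000.0) - hz_to_mel(7000.0)
print(low_step)   # close to 1000 mels
print(high_step)  # much smaller
```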

A spectrogram is like a series of snapshots of the loudness and pitch content of a sound over time, retaining most of the useful information from the real sound wave while running at a much easier-to-model frame rate (between 50 and 100 frames per second).
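Some back-of-the-envelope arithmetic shows how much shorter the sequence becomes. The hop length and number of mel bands below are assumed values picked for illustration, not figures from the paper.

```python
def waveform_steps(duration_s, sample_rate_hz):
    """Number of raw audio samples in a clip."""
    return int(duration_s * sample_rate_hz)

def spectrogram_frames(duration_s, sample_rate_hz, hop_length):
    """Approximate number of spectrogram frames (ignores edge padding)."""
    return waveform_steps(duration_s, sample_rate_hz) // hop_length

SAMPLE_RATE_HZ = 16_000  # low end of the range quoted in the text
HOP_LENGTH = 200         # assumed: samples between consecutive frames
N_MELS = 80              # assumed: mel bands stored in each frame

samples = waveform_steps(10, SAMPLE_RATE_HZ)
frames = spectrogram_frames(10, SAMPLE_RATE_HZ, HOP_LENGTH)
frame_rate = SAMPLE_RATE_HZ / HOP_LENGTH

print(samples)     # 160000 time steps for a 10-second waveform
print(frames)      # 800 time steps for the spectrogram
print(frame_rate)  # 80.0 frames per second
```

Each frame still holds `N_MELS` values, so the saving is mainly in the length of the time axis the model has to reason over, not necessarily in the total count of numbers.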

Finally, as we already have neural networks that map spectrograms to actual sound, called vocoders, researchers here simply use a pre-trained one.

But they added one more thing: the model also allows you to partially condition your generations, meaning that you can set the conditions for part of the piece and let the model improvise the rest. This makes Music ControlNet the first AI solution that really sets the stage for the complete disruption of the music industry.
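One plausible way to represent such a partially specified condition is a control curve plus a mask marking which frames the user actually pinned down. This is a hypothetical sketch: the text does not describe how the model encodes partial conditions, and `partially_specify` and the sample values are made up.

```python
def partially_specify(control, start_frame, end_frame, fill_value=0.0):
    """Keep control[start_frame:end_frame]; blank the rest and record it in a mask."""
    mask = [1 if start_frame <= i < end_frame else 0 for i in range(len(control))]
    masked = [value if keep else fill_value for value, keep in zip(control, mask)]
    return masked, mask

# Hypothetical per-frame loudness (dynamics) curve for a short clip.
dynamics = [0.2, 0.4, 0.6, 0.8, 0.6, 0.4, 0.2, 0.1]

# Constrain only frames 2 to 4; the model would be free to improvise elsewhere.
masked, mask = partially_specify(dynamics, 2, 5)
print(masked)  # [0.0, 0.0, 0.6, 0.8, 0.6, 0.0, 0.0, 0.0]
print(mask)    # [0, 0, 1, 1, 1, 0, 0, 0]
```

Passing the mask alongside the masked curve lets a model tell "the user asked for silence here" apart from "the user left this part open".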

Democratizing Music

As you can hear for yourself, Music ControlNet represents a great leap in music synthesis.

Now, anyone can create their own music without having to play a single instrument.

The ethical implications of these types of technologies have yet to be understood, but what we can guarantee is that AI is leaving no stone unturned.

Read the paper here

With NLP foundation models like ChatGPT, we have finally broken down the language barrier with machines.

This, added to their core reasoning capabilities, makes them the ideal technology to be deployed, literally, everywhere.

But with current models that’s not possible because of two things: huge computing requirements and costs.

Luckily, the world of Small Language Models (SLMs) is developing rapidly, to the point of getting Microsoft’s CEO, Satya Nadella, “very excited”. SLMs not only address these issues, but their quality/price ratio is so good that they could soon shift the world’s interest away from LLMs, the current kings and queens of the AI industry, and toward these small guys. Undeniably, SLMs represent the key to unlocking AI’s impact across the world.

“Good things come in small packages,” as they say, and AI might just be another example of this.

Author: Cheryll Lueilwitz
