Music ControlNet: Multiple Time-varying Controls for Music Generation (2024)

Shih-Lun Wu$^{1,2*}$, Chris Donahue$^{1}$, Shinji Watanabe$^{1}$, and Nicholas J. Bryan$^{2}$
$^{1}$School of Computer Science, Carnegie Mellon University     $^{2}$Adobe Research
$^{*}$Work done during Shih-Lun's internship at Adobe Research. Correspondence should be addressed to Shih-Lun Wu and Nicholas J. Bryan at shihlunw@cs.cmu.edu and njb@ieee.org, respectively.

Abstract

Text-to-music generation models are now capable of generating high-quality music audio in broad styles. However, text control is primarily suitable for the manipulation of global musical attributes like genre, mood, and tempo, and is less suitable for precise control over time-varying attributes such as the positions of beats in time or the changing dynamics of the music. We propose Music ControlNet, a diffusion-based music generation model that offers multiple precise, time-varying controls over generated audio. To imbue text-to-music models with time-varying control, we propose an approach analogous to the pixel-wise control of the image-domain ControlNet method. Specifically, we extract controls from training audio, yielding paired data, and fine-tune a diffusion-based conditional generative model over audio spectrograms given melody, dynamics, and rhythm controls. While the image-domain Uni-ControlNet method already allows generation with any subset of controls, we devise a new masking strategy at training time to allow creators to input controls that are only partially specified in time. We evaluate both on controls extracted from audio and on controls we expect creators to provide, demonstrating that we can generate realistic music that corresponds to control inputs in both settings. While few comparable music generation models exist, we benchmark against MusicGen, a recent model that accepts text and melody input, and show that our model generates music that is 49% more faithful to input melodies despite having 35x fewer parameters, training on 11x less data, and enabling two additional forms of time-varying control. Sound examples can be found at https://MusicControlNet.github.io/web/.

Index Terms:

music generation, controllable generative modeling, diffusion

I Introduction

One of the pillars of musical expression is the communication of high-level ideas and emotions through precise manipulation of lower-level attributes like notes, dynamics, and rhythms. Recently, there has been an explosion of interest in training text-to-music generative models that allow creators to directly convert high-level intent (expressed as text) into music audio [1, 2, 3, 4, 5]. These models suggest an exciting new paradigm of musical expression wherein creators can instantaneously generate realistic music without the need to write a melody, specify meter and rhythm, or orchestrate instruments. However, while dramatically more efficient, this new paradigm ignores more conventional forms of musical expression rooted in the manipulation of lower-level attributes, limiting the ability to express precise musical intent or to leverage models in existing creative workflows.

[Figure 1]

There are two primary obstacles to adding precise control to text-based music generation methods. First, relative to symbolic music representations like scores, text is a cumbersome interface for conveying precise musical attributes that vary over time. Verbose and mundane text descriptions may be needed to precisely represent even the first note of a musical score, e.g., "the song starts at 80 beats per minute with a quarter note on middle C played mezzo-forte on the saxophone". The second obstacle is an empirical one: text-to-music models tend to faithfully interpret global stylistic attributes (e.g., genre and mood) from text, but struggle to interpret text descriptions of precise musical attributes (e.g., notes or rhythms). This is perhaps a consequence of the relative scarcity of precise descriptions in the training data.

A potential solution to the musical imprecision of natural language is the incorporation of time-varying controls into music generation. For example, one body of work synthesizes music audio from time-varying symbolic music representations like MIDI [6, 7]; however, this approach offers a particularly strict form of control, requiring users to compose entire pieces of music beforehand. Such approaches are closer to typical music composition processes and do not take full advantage of recent text-to-music methods. Another body of work on musical style transfer [8, 9, 10, 11, 12, 13] seeks to transform recordings from one style (e.g., genre, musical ensemble, or mood) to another while preserving the underlying compositional content. However, a majority of these approaches require training an individual model per style, as opposed to the flexibility of using text to control style within a single model.

In this work, we propose Music ControlNet, a diffusion-based music generation model that offers multiple time-varying controls over the melody, dynamics, and rhythm of generated audio, in addition to global text-based style control, as shown in Fig. 1. To incorporate such time-varying controls, we adapt recent work on image generation with spatial control, namely ControlNet [14] and Uni-ControlNet [15], to enable musical controls that are composable (i.e., the model can generate music corresponding to any subset of controls) and that further allow creators to only partially specify each control in time, both for convenience and to direct our model to musically improvise in the remaining time spans of the generation. To overcome the aforementioned scarcity of precise, ground-truth control inputs, following [5, 16], we extract useful control signals directly from music during training. We evaluate our method on two categories of control signals: (1) extracted control signals that come from example songs, similar to those seen during training, and (2) created control signals that we anticipate creators might want to use in a co-creation setting. Our experiments show that we can generate realistic music that accurately corresponds to control inputs in both settings. Moreover, we compare our approach against the melody control of the recently proposed MusicGen [5], showing that our model is 49% more faithful to melody input, despite also controlling dynamics and rhythm, having 35x fewer parameters, and being trained on 11x less data.

Our contributions include:

  • A general framework for augmenting text-to-music models with composable, precise, time-varying musical controls.

  • A method to enable one or more partially-specified time-varying controls at inference.

  • Effective application of our framework to melody, dynamics, and rhythm control using music feature extraction algorithms together with conditional diffusion models.

  • Demonstration that our model generalizes from extracted controls seen during training to ones we expect from creators.

II Background: Diffusion and Image Generation

II-A Diffusion Models

We use denoising diffusion probabilistic models (DDPMs) [17, 18] as our underlying generative modeling approach for music audio. DDPMs are a class of latent-variable generative model. A DDPM generates data $\bm{x}^{(0)} \in \mathcal{X}$ from Gaussian noise $\bm{x}^{(M)} \in \mathcal{X}$ through a denoising Markov process that produces intermediate latents $\bm{x}^{(M-1)}, \bm{x}^{(M-2)}, \dots, \bm{x}^{(1)} \in \mathcal{X}$, where $\mathcal{X}$ is the data space. DDPMs can be formulated as the task of modeling the joint probability distribution of the desired output data $\bm{x}^{(0)}$ and all intermediate latent variables, i.e.,

$p_{\theta}(\bm{x}^{(0)}, \dots, \bm{x}^{(M)}) := p(\bm{x}^{(M)}) \prod_{m=1}^{M} p_{\theta}(\bm{x}^{(m-1)} \,|\, \bm{x}^{(m)}),$   (1)

where $\theta$ denotes the set of parameters to be learned, and $p(\bm{x}^{(M)}) := \mathcal{N}(\bm{0}, \bm{I})$ is a fixed noise prior.

To create training examples, a forward diffusion process $q(\bm{x}^{(0)}, \dots, \bm{x}^{(M)})$ is used to gradually corrupt clean data examples $\bm{x}^{(0)}$ via a Markov chain that iteratively adds noise:

$q(\bm{x}^{(0)}, \dots, \bm{x}^{(M)}) := q(\bm{x}^{(0)}) \prod_{m=1}^{M} q(\bm{x}^{(m)} \,|\, \bm{x}^{(m-1)}), \qquad q(\bm{x}^{(m)} \,|\, \bm{x}^{(m-1)}) := \mathcal{N}\big(\sqrt{1-\beta_m}\,\bm{x}^{(m-1)},\, \beta_m \bm{I}\big),$   (2)

where $q(\bm{x}^{(0)})$ is the true data distribution, and $\beta_1, \dots, \beta_M$ are a sequence of parameters that define the noise level within the forward diffusion process, also known as the noise schedule.

By definition of $q(\bm{x}^{(m)} \,|\, \bm{x}^{(m-1)})$, it follows that the noised data $\bm{x}^{(m)}$ at any noise level $m \in \{1, \dots, M\}$ can be sampled in one step via:

$\bm{x}^{(m)} := \sqrt{\bar{\alpha}_m}\,\bm{x}^{(0)} + \sqrt{1-\bar{\alpha}_m}\,\bm{\epsilon},$   (3)

where $\bar{\alpha}_m := \prod_{m'=1}^{m}(1-\beta_{m'})$, $\bm{\epsilon} \sim \mathcal{N}(\bm{0}, \bm{I})$, and $M$ is the total number of noise levels or steps during training. Ho et al. [18] showed that we can optimize the variational lower bound [19] of the data likelihood, i.e., $p_{\theta}(\bm{x}^{(0)})$, by training a function approximator, e.g., a neural network, $f_{\theta}(\bm{x}^{(m)}, m): \mathcal{X} \times \mathbb{N} \rightarrow \mathcal{X}$, to recover the noise $\bm{\epsilon}$ added via (3). More specifically, $f_{\theta}(\bm{x}^{(m)}, m)$ can be trained by minimizing the mean squared error, i.e.,

$\mathbb{E}_{\bm{x}^{(0)}, \bm{\epsilon}, m}\big[\lVert \bm{\epsilon} - f_{\theta}(\bm{x}^{(m)}, m) \rVert_2^2\big].$   (4)

With a trained $f_{\theta}$, we can transform random noise $\bm{x}^{(M)} \sim \mathcal{N}(\bm{0}, \bm{I})$ into a realistic data point $\bm{x}^{(0)}$ through $M$ denoising iterations. To obtain high-quality generations, a large $M$ (e.g., 1000) is typically used. To reduce computational cost, denoising diffusion implicit models (DDIM) [20] further propose an alternative formulation that allows running far fewer than $M$ sampling steps (e.g., 50~100) at inference with minimal impact on generation quality.
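
To make the training recipe above concrete, below is a minimal PyTorch-style sketch of a single DDPM training step, implementing the one-step corruption of (3) and the noise-regression objective of (4). The network `f_theta`, the noise-schedule tensor `betas`, and all shapes are illustrative assumptions rather than the paper's implementation.

```python
import torch

def ddpm_training_step(f_theta, x0, betas):
    """One DDPM training step (sketch): sample a random noise level per example,
    corrupt x0 in one step via Eq. (3), and regress the added noise, Eq. (4).
    `f_theta(x_m, m)` and the shape of `x0` are assumptions."""
    M = betas.shape[0]
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)               # \bar{alpha}_m for all m
    m = torch.randint(0, M, (x0.shape[0],), device=x0.device)   # random step per example
    a = alpha_bar[m].view(-1, *([1] * (x0.dim() - 1)))          # broadcast to x0's shape
    eps = torch.randn_like(x0)
    x_m = a.sqrt() * x0 + (1.0 - a).sqrt() * eps                # one-step forward diffusion, Eq. (3)
    return torch.mean((eps - f_theta(x_m, m)) ** 2)             # MSE noise-prediction loss, Eq. (4)
```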

II-B UNet Architecture for Image Diffusion Models

Our approach to music generation is rooted in methodology developed primarily for generative modeling of images. When applying diffusion modeling to image generation, the function $f_{\theta}$ is often a large UNet [18, 21]. The UNet architecture consists of two halves, an encoder and a decoder, that typically input and output image-like feature maps in pixel space [22] or some learned latent space [23]. The encoder progressively downsamples the input to learn useful features at different resolution levels, while the decoder, which mirrors the encoder architecture and receives features from corresponding encoder layers through skip connections, progressively upsamples the features back to the input dimension. For practical use, diffusion-based image generation models are often text-conditioned, which requires augmenting the network $f_{\theta}$ to accept a text description $\bm{c}_{\text{text}} \in \mathcal{T}$, where $\mathcal{T}$ is the set of all text descriptions. This leads to the following function signature:

$f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\text{text}}): \mathcal{X} \times \mathbb{N} \times \mathcal{T} \rightarrow \mathcal{X},$   (5)

which, via the process outlined in Sec. II-A, models the desired probability distribution $p_{\theta}(\bm{x}^{(0)} \,|\, \bm{c}_{\text{text}})$. The text condition $\bm{c}_{\text{text}}$ is typically a sequence of embeddings from a large language model (LLM), or one or more embeddings from a learned embedding layer for class-conditional control. In either case, the conditioning signals $m$ (i.e., the diffusion time step) and $\bm{c}_{\text{text}}$ are usually incorporated into the UNet hidden layers via additive sinusoidal embeddings [18] and/or cross-attention [23].
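
As an illustration of how the diffusion step $m$ is typically injected, the sketch below computes the standard sinusoidal embedding popularized by [18]; the embedding dimension and the downstream use (a small learnable transformation whose output is added to UNet features) are assumptions.

```python
import math
import torch

def timestep_embedding(m, dim=256):
    """Sinusoidal embedding of the diffusion step m (sketch). `dim` is an
    illustrative choice; the result is usually passed through a learnable
    transformation and summed with the UNet's hidden features."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = m.float()[:, None] * freqs[None, :]                    # (batch, dim / 2)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)
```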

II-C Classifier-free Guidance

To improve the flexibility of text conditioning, classifier-free guidance (CFG) [24] is commonly employed. CFG simultaneously learns a conditional and an unconditional generative model in a single network, trading off conditioning strength, mode coverage, and sample quality. In practice, CFG is implemented during training by randomly replacing the conditioning information with a special null value $\bm{c}_{\emptyset}$ for a fraction of training examples. During inference, each sampling step then requires a forward pass of both $f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\text{text}})$ and $f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\emptyset})$, whose outputs are combined via a weighted average.
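
A minimal sketch of the inference-time combination is shown below, assuming one common CFG weighting convention; `f_theta`, `c_text`, and `c_null` denote the noise-prediction network and the conditional/null conditions, and the default guidance scale mirrors the value the paper later uses for global style control.

```python
def cfg_noise_estimate(f_theta, x_m, m, c_text, c_null, guidance_scale=4.0):
    """Classifier-free guidance at one sampling step (sketch): run the model with
    the text condition and with the null condition, then linearly combine the two
    noise estimates. The weighting convention is an assumption."""
    eps_cond = f_theta(x_m, m, c_text)      # conditional noise estimate
    eps_uncond = f_theta(x_m, m, c_null)    # unconditional (null-condition) estimate
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```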

II-D Adding Pixel-level Controls to Image Diffusion Models

ControlNet [14] proposed an effective method to add pixel-level (i.e., spatial) controls to large-scale pretrained text-to-image diffusion models. Let the diffusion model input/output space be images, i.e., $\mathcal{X} := \mathbb{R}^{W \times H \times D}$, where $W$, $H$, $D$ are respectively the width, height, and depth (for RGB images, $D = 3$) of an image. We denote the set of $N$ pixel-level controls as:

$\bm{C} := \{\bm{c}^{(n)} \in \mathbb{R}^{W \times H \times D_n}\}_{n=1}^{N},$   (6)

where $D_n$ is the depth specific to each $\bm{c}^{(n)}$. For each condition signal $\bm{c}^{(n)}$, every pixel $\bm{c}^{(n)}_{i,j} \in \mathbb{R}^{D_n}$, where $i \in \{1, \dots, W\}$ and $j \in \{1, \dots, H\}$, asserts an attribute on the corresponding pixel $\bm{x}^{(0)}_{i,j}$ in the output image, for example, "$\bm{x}^{(0)}_{i,j}$ is (not) part of an edge" or "the perceptual depth of $\bm{x}^{(0)}_{i,j}$". Naturally, the function to be learned, $f_{\theta}$, is revised again as:

$f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\text{text}}, \bm{C}): \mathcal{X} \times \mathbb{N} \times \mathcal{T} \times \mathcal{C} \rightarrow \mathcal{X},$   (7)

where $\mathcal{C}$ denotes the set of all possible sets of control signals. The updated $f_{\theta}$ hence implicitly models $p_{\theta}(\bm{x}^{(0)} \,|\, \bm{c}_{\text{text}}, \bm{C})$.

To promote training data efficiency, ControlNet instantiates $f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\text{text}}, \bm{C})$ by reusing the pretrained (and frozen) text-conditioned UNet and cloning its encoder half to form an adaptor branch that incorporates pixel-level control through finetuning. To gracefully bring in the information from the pixel-level control, it enters the adaptor branch through a convolution layer initialized to zeros (i.e., a zero convolution layer). Outputs from the layers of the adaptor branch are then fed back to the corresponding layers of the frozen pretrained decoder, also through zero convolution layers, to influence the final output. Uni-ControlNet [15] later augmented the adaptor branch so that a single model can be finetuned to accept multiple pixel-level controls via one adaptor branch, without the need to specify all controls at once, whereas ControlNet requires a separate adaptor branch per control.
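
The zero convolution layers are the linchpin of this design: 1x1 convolutions whose weights and biases start at zero, so the adaptor branch initially leaves the frozen model's behavior untouched. A minimal PyTorch sketch, with the 1x1 kernel size following the image-domain ControlNet and otherwise an assumption:

```python
import torch.nn as nn

def zero_conv(channels_in, channels_out):
    """A 'zero convolution' (sketch): a 1x1 convolution initialized to all zeros,
    so the adaptor branch contributes nothing at the start of finetuning and
    gradually learns to inject control information."""
    conv = nn.Conv2d(channels_in, channels_out, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv
```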

III Music ControlNet

Our Music ControlNet framework builds on the methodology of text-to-image generation with pixel-level controls, i.e., ControlNet [14] and Uni-ControlNet [15], and extends it to text-to-audio generation with time-varying controls. We formulate our controllable audio generation task, explain the links and differences to ControlNet, and detail our essential model architecture and training modifications below.

III-A Problem Formulation

Our overall goal is to learn a conditional generative model $p(\bm{w} \,|\, \bm{c}_{\text{text}}, \bm{C})$ over audio waveforms $\bm{w}$, given a global (i.e., time-independent) text control $\bm{c}_{\text{text}}$ and a set of time-varying controls $\bm{C}$. Due to our dataset, we limit $\bm{c}_{\text{text}}$ to musical genre and mood tags. Waveforms $\bm{w}$ are vectors in $\mathbb{R}^{\mathrm{T}\mathrm{f_s}}$, where $\mathrm{T}$ is the length of the audio in seconds and $\mathrm{f_s}$ is the sampling rate (i.e., number of samples per second). As $\mathrm{f_s}$ is large (typically between 16 kHz and 48 kHz), it is empirically difficult to model $p(\bm{w} \,|\, \cdot)$ directly. Hence, we adopt a common hierarchical approach that uses spectrograms as an intermediary. A spectrogram $\bm{s} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times B \times D}$ is an image-like representation of an audio signal, obtained via the Fourier transform of $\bm{w}$, where $\mathrm{f_k}$ is the frame rate (usually 50~100 frames per second), $B$ is the number of frequency bins, and $D = 1$ for mono audio. With $\bm{s}$ as the intermediary, we instead model the joint distribution $p(\bm{w}, \bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C})$, which can be factorized as:

$p(\bm{w}, \bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C}) = p(\bm{w} \,|\, \bm{s}, \bm{c}_{\text{text}}, \bm{C}) \cdot p(\bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C})$   (8)
$:= p_{\phi}(\bm{w} \,|\, \bm{s}) \cdot p_{\theta}(\bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C}),$   (9)

where $\phi$ and $\theta$ are sets of parameters to be learned. Note that this factorization assumes conditional independence between the waveform $\bm{w}$ and all control signals $\bm{c}_{\text{text}}$ and $\bm{C}$ given the spectrogram $\bm{s}$, which is reasonable if the time-varying controls in $\bm{C}$ by nature vary at a rate no faster than $\mathrm{f_k}$.

In our work, we focus on modeling spectrograms given controls, i.e., $p_{\theta}(\bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C})$, and directly apply the DiffWave vocoder [25] to model $p_{\phi}(\bm{w} \,|\, \bm{s})$. Following the text-to-image ControlNet [14] model, we leverage diffusion models [18] to learn $p_{\theta}(\bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C})$. If we set the input space $\mathcal{X} := \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times B \times D}$ and the desired output $\bm{x}^{(0)} := \bm{s}$, we can instantiate a neural network $f_{\theta}$ with a function signature identical to Eqn. (7). However, we observe two key differences between pixel-level controls for images and time-varying controls for audio/music.

First, the two leading dimensions of a spectrogram $\bm{s}$ have different semantic meanings, one being time and the other frequency, as opposed to both being spatial in an image. Second, the time-varying controls useful to creators are closely coupled with time, but can have a much more relaxed relationship with frequency, such that the second dimension of (6) cannot be restricted to $B$. For example, an intuitive control over musical dynamics may involve defining volume over time, not over frequency. A high dynamics value for one frame can correspond to a number of different profiles over the $B$ frequency bins of the corresponding spectrogram frame, e.g., a powerful bass playing a single pitch, or a rich harmony of multiple pitches, which the model has the freedom to decide. Therefore, we relax the definition of the set of $N$ control signals to:

$\bm{C} := \{\bm{c}^{(n)} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times B_n \times D_n}\}_{n=1}^{N},$   (10)

where $B_n$ is the number of classes specific to each control $\bm{c}^{(n)}$, which is not bound to $B$. With this updated definition, the correspondence between the control signals $\bm{C}$ and the output spectrogram $\bm{x}$ naturally becomes frame-wise. For example, if $\bm{c}^{(n)}$ represents the dynamics control, a frame of the control $\bm{c}^{(n)}_t \in \mathbb{R}^{1 \times 1}$, where $t \in \{1, \dots, \mathrm{T}\mathrm{f_k}\}$, then describes "the musical dynamics (intensity) of the spectrogram frame $\bm{s}_t$".

Finally, we consider time-varying controls $\bm{c}^{(n)}$ that can be directly extracted from spectrograms. Given that spectrograms are themselves computed directly from waveforms, only pairs of $(\bm{w}, \bm{c}_{\text{text}})$ are necessary for training, incurring no extra annotation overhead. Nevertheless, we note that our formulation supports manually annotated time-varying controls as well.

III-B Adding Time-varying Controls to Diffusion Models

We propose a strategy to learn the mapping between input controls and the frequency axis of output spectrograms, marking an update from the ControlNet [14] method for image modeling. As mentioned in Section II-D, ControlNet clones the encoder half of the pretrained text-to-image UNet as the adaptor branch, which uses newly attached zero convolution layers to enable pixel-level control. Let $\tilde{f}^{(l)}(\bm{x}^{(m,l-1)}, m, \bm{c}_{\text{text}}, \bm{C})$ denote the $l^{\text{th}}$ block of the adaptor branch, where $m$ is the diffusion time step, $\bm{x}^{(m,l-1)}$ contains the features of the noised image after $l-1$ blocks, and $\bm{c}_{\text{text}}$, $\bm{C}$ are the text and pixel-level controls, respectively. Considering the case $\bm{C} := \{\bm{c}^{(1)}\}$, which is consistent with past work [14], the pixel-level control is incorporated via:

$\tilde{f}^{(l)}(\bm{x}^{(m,l-1)}, m, \bm{c}_{\text{text}}, \bm{C}) := \mathcal{Z}_{\mathrm{out}}\big(f^{(l)}(\bm{x}^{(m,l-1)} + \mathcal{Z}_{\mathrm{in}}(\bm{c}^{(1)}),\, m,\, \bm{c}_{\text{text}})\big),$   (11)

where $\mathcal{Z}_{\mathrm{in}}$ and $\mathcal{Z}_{\mathrm{out}}$ are the newly attached zero convolution layers, and $f^{(l)}$ is initialized from the $l^{\text{th}}$ encoder block of the pretrained text-conditioned UNet.

In Music ControlNet, we revamp the control process to be:

$\tilde{f}^{(l)}(\bm{x}^{(m,l-1)}, m, \bm{c}_{\text{text}}, \bm{C}) := \mathcal{Z}_{\mathrm{out}}\big(f^{(l)}(\bm{x}^{(m,l-1)} + \mathcal{Z}_{\mathrm{in}}(\mathcal{M}(\bm{c}^{(1)})),\, m,\, \bm{c}_{\text{text}})\big),$   (12)

where $\mathcal{M}$ is an additional 1-hidden-layer MLP that transforms $B_1$, the number of classes for the control $\bm{c}^{(1)}$ following (10), to match the number of frequency bins $B$, and simultaneously learns the relationship between control classes and frequency bins. In the case of multiple controls, i.e., $\bm{C} = \{\bm{c}^{(n)}\}_{n=1}^{N}$, each control is processed by its own MLP $\mathcal{M}^{(n)}$ and then concatenated along the depth dimension, i.e., $D_n$, before entering the shared zero convolution layer $\mathcal{Z}_{\mathrm{in}}$.
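
To illustrate (12) and its multi-control extension, the sketch below maps each control from its $B_n$ classes to the $B$ frequency bins with its own 1-hidden-layer MLP, concatenates the projections along the depth dimension, and feeds them through a shared zero convolution. The hidden size, the SiLU activation, and the assumption that $D_n = 1$ for every control are illustrative choices, not the exact implementation.

```python
import torch
import torch.nn as nn

class ControlProjector(nn.Module):
    """Sketch of the control pathway in Eq. (12): one small MLP per control maps
    its B_n classes to the B frequency bins; the projected controls are stacked
    along the depth dimension and injected via a shared zero convolution."""

    def __init__(self, control_classes, n_freq_bins, adaptor_channels, hidden=256):
        super().__init__()
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(b_n, hidden), nn.SiLU(), nn.Linear(hidden, n_freq_bins))
            for b_n in control_classes   # e.g., [12, 1, 2] for melody, dynamics, rhythm
        ])
        self.z_in = nn.Conv2d(len(control_classes), adaptor_channels, kernel_size=1)
        nn.init.zeros_(self.z_in.weight)   # shared zero convolution Z_in
        nn.init.zeros_(self.z_in.bias)

    def forward(self, controls):
        # controls: list of tensors, each of shape (batch, T*f_k, B_n), assuming D_n = 1
        projected = [mlp(c) for mlp, c in zip(self.mlps, controls)]   # each (batch, T*f_k, B)
        stacked = torch.stack(projected, dim=1)                       # (batch, N, T*f_k, B)
        return self.z_in(stacked)   # added to the adaptor branch's input features
```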

[Figure 2]

III-C Masking Strategy to Enable Partially-specified Controls

To give creators the freedom to input any subset of the $N$ controls, Uni-ControlNet [15] proposed a CFG-like training strategy that randomly drops out each control signal $\bm{c}^{(n)}$ during training. We follow the same strategy and further assign a higher probability to keeping or dropping all controls [15], as we found that this leads to perceptually better generations. In more detail, let the index set of control signals be $\mathcal{I} = \{1, \dots, N\}$. At each training step, we select a subset $\mathcal{I}' \subseteq \mathcal{I}$ whose controls will be set to zero (i.e., dropped), and apply it to the control signals via:

$\bm{c}^{(n)} := \begin{cases} \bm{0}_{\mathrm{T}\mathrm{f_k} \times B_n \times D_n} & \forall n \in \mathcal{I}' \\ \bm{c}^{(n)} & \forall n \in \mathcal{I} \setminus \mathcal{I}'. \end{cases}$   (13)

Doing so induces $f_{\theta}(\bm{x}^{(m)}, m, \bm{c}_{\text{text}}, \bm{C})$ to learn the correspondence between any subset of the $N$ control signals and the output spectrogram.

In Music ControlNet, we further desire a model that allows the given subset of controls to be partially specified in time. Therefore, we devise a new scheme that partially masks the active controls (i.e., those indexed by $\mathcal{I} \setminus \mathcal{I}'$). Specifically, we randomly sample a pair $(t_{n,a}, t_{n,b}) \in \{1, \dots, \mathrm{T}\mathrm{f_k}\}^2$, where $t_{n,a} < t_{n,b}$, for each of the active controls, and mask them as:

$\bm{c}^{(n)}_t := \begin{cases} \bm{0}_{B_n \times D_n} & \text{if } t \in [t_{n,a}, t_{n,b}] \\ \bm{c}^{(n)}_t & \text{otherwise} \end{cases} \quad \forall n \in \mathcal{I} \setminus \mathcal{I}'$   (14)

Fig. 2 displays example instantiations of the two masking schemes detailed above. At each training step, after selecting $\mathcal{I}'$ (i.e., determining the dropped controls), we choose one of the two masking schemes uniformly at random, and then sample the timestamp pairs (i.e., the $(t_{n,a}, t_{n,b})$'s) when needed. In this way, we further employ a CFG-like training strategy to enable partially specified controls in a unified manner.
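
The sketch below summarizes the two masking schemes: (13) zeroes out every control in $\mathcal{I}'$, and, when the partial-masking scheme is chosen, (14) additionally zeroes a random time span of each active control. The specific probabilities for keeping or dropping all controls are assumptions; the text only states that they receive extra probability mass.

```python
import random

def mask_controls(controls, p_drop_all=0.1, p_keep_all=0.1, p_drop_each=0.5):
    """Training-time control masking (sketch). Each control is a tensor of shape
    (T*f_k, B_n, D_n). First choose the dropped subset I' (Eq. (13)), with extra
    mass on keeping or dropping everything; then, with probability 1/2, also mask
    a random time span of every active control (Eq. (14))."""
    N, T = len(controls), controls[0].shape[0]
    u = random.random()
    if u < p_drop_all:
        dropped = set(range(N))
    elif u < p_drop_all + p_keep_all:
        dropped = set()
    else:
        dropped = {n for n in range(N) if random.random() < p_drop_each}

    partially_mask = random.random() < 0.5   # pick one of the two schemes uniformly
    masked = []
    for n, c in enumerate(controls):
        c = c.clone()
        if n in dropped:
            c.zero_()                                    # Eq. (13): drop the whole control
        elif partially_mask:
            t_a, t_b = sorted(random.sample(range(T), 2))
            c[t_a:t_b + 1] = 0                           # Eq. (14): zero frames in [t_a, t_b]
        masked.append(c)
    return masked
```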

III-D Musical Control Signals

Here, we outline the time-varying controls $\bm{C}$ that we combine with our proposed framework to build a music generation model that is useful for creators. We propose three control signals, namely melody, dynamics, and rhythm, that are musically complementary to each other. These controls are designed such that they can be directly extracted from the target spectrogram, i.e., $\bm{s} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times B \times D}$, requiring no human annotation at all, and such that music creators can easily create their own control signals at inference time to compose music from scratch, in addition to remixing, i.e., combining musical elements from different sources, using controls extracted from existing music. Below, we briefly introduce how each control signal is obtained, what connection it has to music, and how creators can form created controls at inference time. Readers may refer to Fig. 3 for how the control signals may be visually presented to creators.

  • Melody ($\bm{c}_{\mathrm{mel}} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times 12 \times 1}$):  Following [5], we adopt a variation of the chromagram [26] to encode the most prominent musical tone over time. To do so, we compute a linear spectrogram and then rearrange the energy across the $B$ frequency bins into 12 pitch classes (or semitones, i.e., C, C-sharp, …, B-flat, B) in a frame-wise manner, i.e., independently for each $t \in \{1, \dots, \mathrm{T}\mathrm{f_k}\}$, via the Librosa chroma function [27]. To form a better proxy for melody from the raw chromagram, only the most prominent pitch class is preserved by applying an $\mathrm{argmax}$ operation that makes the chromagram frame-wise one-hot. Additionally, we apply a biquadratic high-pass filter [28] with a cut-off at middle C (261.2 Hz) before chromagram computation to avoid bass dominance, i.e., to prevent the resulting one-hot chromagram from encoding the bass notes rather than the desired melody notes (see the extraction sketch after this list). At test time, the melody control can be created by recording a simple melody, or simply by drawing the pitch contour. A desirable model should be able to turn the simple created melody control into rich, high-quality multitrack music.

  • Dynamics ($\bm{c}_{\mathrm{dyn}} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times 1 \times 1}$):  The dynamics control is obtained by summing the energy across frequency bins per time frame of a linear spectrogram and mapping the resulting values to the decibel (dB) scale, which is closely linked to loudness as perceived by humans [27]. To mitigate rapid fluctuations of the raw dynamics values due to note or percussion onsets, and to bring our dynamics control closer to perceived musical intensity, we apply a smoothing filter with a one-second context window over the frame-wise values (i.e., a Savitzky-Golay filter [29]). The dynamics control not only characterizes the loudness of notes but is also strongly correlated with important intensity-related attributes like instrumentation, harmonic texture, and rhythmic density, thanks to the natural correlation between loudness and these attributes in human-composed music. During inference, creators can simply draw a line or curve of how they want the musical intensity to vary over time as the created dynamics control.

  • Rhythm ($\bm{c}_{\mathrm{rhy}} \in \mathbb{R}^{\mathrm{T}\mathrm{f_k} \times 2 \times 1}$):  For rhythm control, we employ an in-house implementation of an RNN-based beat detector [30], trained on a separate internal dataset, that predicts whether a frame is situated on a beat, a downbeat, or neither. We then use the frame-wise beat and downbeat probabilities as the control, resulting in 2 classes per frame. The advantages of our time-varying beat/downbeat control over a global tempo input (i.e., beats per minute) are: (i) it allows creators to precisely synchronize beats/downbeats with, for example, video scene cuts or other moments of interest in the content to be paired with the generated music; and (ii) it encodes nuanced information about rhythmic feel, e.g., whether the music sounds more harmonic or rhythmic, and whether the rhythmic pattern is clear and simple or complex, which experienced music creators may want to influence in the generative process. At inference, the rhythm control can be created by time-stretching the beat/downbeat probability curves extracted from existing songs to match the desired tempo. Creators can also obtain precise beat/downbeat timestamps by feeding the beat/downbeat curves to a Hidden Markov Model (HMM) based post-filter [31, 32], and use the timestamps to shift the curves along the time axis for the synchronization purposes mentioned above. We also tried manually drawing spiked curves as the created rhythm control, but this performed worse than our final hand-drawn (i.e., created) dynamics control.
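
As referenced in the melody item above, the sketch below shows how the melody and dynamics controls might be extracted with librosa and SciPy. The filter design, chroma settings, and smoothing window only approximate the pipeline described in the text and are assumptions; the rhythm control is omitted because it requires a trained beat detector.

```python
import librosa
import numpy as np
from scipy.signal import butter, sosfilt, savgol_filter

def extract_melody_and_dynamics(wav_path, sr=22050, hop=256, n_fft=2048):
    """Sketch of melody and dynamics control extraction (assumed pipeline)."""
    y, _ = librosa.load(wav_path, sr=sr)

    # Melody: chromagram of the high-passed signal, then frame-wise argmax -> one-hot.
    sos = butter(2, 261.2, btype="highpass", fs=sr, output="sos")    # biquad high-pass at middle C
    chroma = librosa.feature.chroma_stft(y=sosfilt(sos, y), sr=sr, hop_length=hop, n_fft=n_fft)
    melody = np.zeros_like(chroma)                                   # (12, frames)
    melody[chroma.argmax(axis=0), np.arange(chroma.shape[1])] = 1.0  # keep only the top pitch class

    # Dynamics: per-frame energy in dB, smoothed over roughly one second of frames.
    power = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)) ** 2
    energy_db = librosa.power_to_db(power.sum(axis=0))
    window = int(round(sr / hop)) | 1                                # ~1 s of frames, forced odd
    dynamics = savgol_filter(energy_db, window_length=window, polyorder=2)

    return melody.T, dynamics    # (frames, 12) one-hot chroma and (frames,) dynamics curve
```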

Control signals          | Melody acc (%) | Dynamics corr micro (r, %) | Dynamics corr macro (r, %) | Beat F1 (%) | Downbeat F1 (%) | CLAP | FAD ↓
Global style only        | 8.5  | -0.7 | 0.7  | 27.8 | 7.8  | 0.28 | 1.51
Single: mel              | 58.3 | 4.4  | 3.1  | 40.2 | 12.1 | 0.28 | 1.34
Single: dyn              | 8.6  | 88.8 | 63.6 | 36.7 | 16.1 | 0.26 | 1.50
Single: rhy              | 8.6  | 25.8 | 34.6 | 69.2 | 35.4 | 0.27 | 1.17
Multi: mel + dyn         | 57.7 | 89.7 | 64.8 | 47.4 | 21.8 | 0.26 | 1.38
Multi: mel + rhy         | 59.1 | 31.6 | 36.3 | 70.0 | 38.7 | 0.26 | 1.16
Multi: dyn + rhy         | 8.7  | 89.6 | 60.9 | 72.1 | 39.9 | 0.26 | 1.12
Multi: mel + dyn + rhy   | 58.7 | 90.8 | 64.0 | 70.8 | 40.8 | 0.25 | 1.14

IV Experimental Setup

IV-A Datasets

We train our models on a dataset of ≈1800 hours of licensed instrumental music with genre and mood tags. Our dataset does not have free-form text descriptions, so we use class-conditional text control of global musical style, as done in Jukebox [1]. For evaluation, we use data from four sources: (i) an in-domain test set of 2K songs held out from our dataset; (ii) the MusicCaps dataset [2], with around 5K 10-second clips associated with free-form text descriptions; (iii) the MusicCaps+ChatGPT dataset, where we use ChatGPT [33] to convert the free-form text in MusicCaps to mood and genre tags that match our dataset via the prompt "For the lines of text below, convert each to one of the following [genres or moods] and only output the [genre or mood] per line (no bullets): [MusicCaps description]"; and (iv) a Created Controls dataset of control signals that music creators could realistically provide via manual annotation or similar.

IV-B Created Controls Dataset Details

For our Created Controls dataset, we created example melodies, dynamics annotations, and rhythm presets that we envision creators would use during music co-creation via:

  • Melody: We record our own piano performances of 10 well-known public-domain classical melodies (30 seconds each) composed by Bach, Vivaldi, Mozart, Beethoven, Schubert, Mendelssohn, Bizet, and Tchaikovsky, and crop two 6-second chunks from each, minimizing repeated musical content as much as possible, resulting in 20 melody controls.

  • Dynamics: To simulate creator-drawn dynamics curves, we draw 6-second dynamics curves as $\{\mathrm{Linear}, \mathrm{Tanh}, \mathrm{Cosine}\}$ functions, either vertically flipped or not, with scaled dynamics ranges of $\{\pm 6, \pm 9, \pm 12, \pm 15\}$ decibels around the mean value of all training examples (see the sketch at the end of this subsection). This leads to 3 × 2 × 4 = 24 created dynamics controls.

  • Rhythm: We create "rhythm presets" by selecting four songs from our in-domain test set with different rhythmic strengths and feelings, extracting their rhythm control signals, and time-stretching them using PyTorch interpolation with factors $\{0.8, 0.9, 1.0, 1.1, 1.2\}$ to create 20 rhythm controls.

Each set of created controls is then crossed with 10 genres × 10 moods to form final evaluation sets of 2.0K (melody), 2.4K (dynamics), and 2.0K (rhythm) samples. Our created controls are distinct from the controls directly extracted from mixture data during training.
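
As referenced in the dynamics item above, the sketch below enumerates the 24 created dynamics curves (3 shapes × flipped or not × 4 dB ranges). The exact parameterization of each shape and the placeholder mean value are assumptions.

```python
import numpy as np

def created_dynamics_curves(n_frames=512, mean_db=0.0):
    """Enumerate the 24 created dynamics controls (sketch): {Linear, Tanh, Cosine}
    shapes, optionally flipped, scaled to +/-{6, 9, 12, 15} dB around a mean
    dynamics value (a placeholder here)."""
    t = np.linspace(-1.0, 1.0, n_frames)
    shapes = {
        "linear": t,
        "tanh": np.tanh(3.0 * t) / np.tanh(3.0),     # saturating S-curve normalized to [-1, 1]
        "cosine": -np.cos(np.pi * (t + 1.0) / 2.0),  # half-cosine ramp from -1 to 1
    }
    curves = {}
    for name, base in shapes.items():
        for flip in (False, True):
            for rng in (6, 9, 12, 15):
                curves[(name, flip, rng)] = (-base if flip else base) * rng + mean_db
    return curves   # 3 shapes x 2 flips x 4 ranges = 24 curves
```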

IV-C Model, Training, and Inference Specifics

For our spectrogram generation model $p_{\theta}(\bm{s} \,|\, \bm{c}_{\text{text}}, \bm{C})$, we use a convolutional UNet [21] with five 2D-convolution ResNet [34] blocks with [64, 64, 128, 128, 256] feature channels per block and a stride of 2 between downsampling blocks. The UNet inputs Mel-scaled [35] spectrograms clipped to a dynamic range of 160 dB and scaled to $[-1, 1]$, computed from 22.05 kHz audio with a hop size of 256 (i.e., frame rate $\mathrm{f_k} \approx 86$ Hz), a window size of 2048, and 160 Mel bins. For our genre and mood global style control $\bm{c}_{\text{text}}$, we use learnable class-conditional embeddings of dimension 256 that are injected into the inner two ResNet blocks of the UNet via cross-attention. We use a cosine noise schedule with 1000 diffusion steps $m$, injected via sinusoidal embeddings with a learnable linear transformation summed directly with the UNet features in each block. To approximately match the output dimensions of ControlNet (512 × 512 × 3), we set our output time dimension to 512 frames, or ≈6 seconds, yielding a 512 × 160 × 1 output. We use an L1 training objective between predicted and actual added noise, and an Adam optimizer with a learning rate of $10^{-5}$ with linear warm-up and cosine decay. Due to limited data and efficiency considerations, we instantiate a relatively small model of 41 million parameters and pretrain with distributed data parallelism for 5 days on 32 A100 GPUs with a batch size of 24 per GPU.
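
For concreteness, the sketch below builds a spectrogram front-end matching the stated parameters (22.05 kHz audio, hop 256, window 2048, 160 Mel bins, a 160 dB dynamic range scaled to [-1, 1]); the dB reference level and clipping convention are assumptions.

```python
import librosa
import numpy as np

def audio_to_model_input(y, sr=22050, hop=256, win=2048, n_mels=160, dyn_range_db=160.0):
    """Mel-spectrogram front-end sketch: 160 Mel bins at ~86 frames/s, clipped to a
    160 dB dynamic range and rescaled to [-1, 1]."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win, hop_length=hop, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)    # 0 dB at the loudest bin (assumed reference)
    mel_db = np.clip(mel_db, -dyn_range_db, 0.0)     # keep a 160 dB dynamic range
    return (mel_db / dyn_range_db) * 2.0 + 1.0       # map [-160, 0] dB to [-1, 1]
```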

Given our pretrained global style control model, we finetune on the time-varying melody, dynamics, and rhythm controls. The time-varying controls enter the pretrained UNet via an adaptor branch as discussed above. We use the same loss and optimizer as in pretraining and finetune until convergence for 3 days on 8 A100 GPUs. At inference, we use 100-step DDIM[20] sampling and classifier-free guidance (CFG)[24] with a scale of 4, applied to the global style control only.
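One plausible reading of applying CFG to the global style control only is that the time-varying controls are fed to both the conditional and the null-style branch, so that guidance sharpens style adherence without discarding the controls. The sketch below illustrates this; `unet`, `null_style_emb`, and the keyword arguments are hypothetical names, not the authors' API.

```python
import torch

@torch.no_grad()
def cfg_noise_estimate(unet, x_t, t, style_emb, null_style_emb, controls, scale=4.0):
    """One denoising call with classifier-free guidance on global style only."""
    # Conditional pass: genre/mood embedding + time-varying controls.
    eps_cond = unet(x_t, t, style=style_emb, controls=controls)
    # "Unconditional" pass: null style embedding, but the same time-varying controls.
    eps_uncond = unet(x_t, t, style=null_style_emb, controls=controls)
    # Guided noise estimate used by the DDIM update.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```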

For our spectrogram-to-audio vocoder p_φ(w | s), we train a diffusion-based DiffWave[25] vocoder. We leverage an open-source package[36] and use our main training dataset, an Adam optimizer with a learning rate of 10^-5, an L1 noise prediction loss, a 50-step linear noise schedule, a hop size of 256 samples, a sampling rate of 22,050 Hz, and a batch size of 50 per GPU, training on 8 GPUs for 10 days. For inference, we adopt DDIM-like fast sampling[25] with six steps.
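For reference, the open-source DiffWave package[36] documents a one-call inference helper; a usage sketch along the following lines would vocode a generated Mel spectrogram with fast sampling. The paths are placeholders, and the exact function name and arguments should be verified against the package version used.

```python
import torch
# The lmnt-com/diffwave package documents a `predict` helper for inference.
from diffwave.inference import predict as diffwave_predict

model_dir = "/path/to/trained/diffwave"        # placeholder checkpoint directory
spectrogram = torch.load("generated_mel.pt")   # placeholder Mel spectrogram tensor

# fast_sampling enables the few-step (DDIM-like) inference schedule.
audio, sample_rate = diffwave_predict(spectrogram, model_dir, fast_sampling=True)
```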

IV-D Evaluation Metrics

We use the following metrics to evaluate time-varying controllability, adherence to global text (i.e., mood & genre tag) control, and overall audio realism. A minimal sketch of how the first three metrics can be computed follows the list.

  • Melody accuracy examines whether the frame-wise pitch classes (C, C#, …, B; 12 in total) of the input melody control match those extracted from the generation.

  • Dynamics correlation is the Pearson correlation between the frame-wise input dynamics values and the values computed from the generation. We compute two types of correlation, which we call micro and macro correlation. Micro computes r separately for each generation, while macro pools input/generation dynamics values from all generations and then computes a single r. The micro correlation examines whether relative dynamics values within a generation are respected, while the macro one checks the same property across many generations.

  • Rhythm F1 follows the standard evaluation methodology for beat/downbeat detection[37, 38]. It quantifies the alignment between the beat/downbeat timestamps estimated from the input rhythm control and those from the generation. The timestamps are estimated by applying an HMM post-filter[31] to the frame-wise (down)beat probabilities (i.e., the rhythm control signal). Following[38], a pair of input and generated (down)beat timestamps is considered aligned if they differ by less than 70 milliseconds.

  • CLAP score[39, 40] evaluates text control adherence by computing the pair-wise cosine similarity of text and audio embeddings extracted from CLAP. CLAP is a dual-encoder foundation model whose encoders respectively receive a text input and an audio input; the text and audio embedding spaces are learned via a contrastive objective[41]. To obtain the embeddings for evaluation, we feed the generated audio to the CLAP audio encoder, and set the CLAP text encoder input to “An audio of [mood] [genre] music” to accommodate our tag-based control of global musical style.

  • FAD is the Fréchet distance between the distribution of embeddings of a set of reference audio clips and that of generated clips[42]. It measures how realistic the set of generations is, taking both quality and diversity into account. To ensure comparable FAD scores, we use the Google Research FAD package[42], which employs a VGGish[43] model trained on audio classification[44] to extract embeddings. Unless otherwise specified, the reference audio for FAD is our in-domain test set.
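As referenced above, the first three metrics can be sketched as follows. This is our reconstruction rather than the authors' evaluation code; the unvoiced-frame convention and input shapes are assumptions, and the beat F-measure reuses mir_eval[38].

```python
import numpy as np
from scipy.stats import pearsonr
import mir_eval

TOLERANCE_S = 0.07   # 70 ms matching window for beats/downbeats

def melody_accuracy(ctrl_pc, gen_pc):
    """Frame-wise pitch-class (12 classes) agreement; -1 marks unvoiced frames (assumption)."""
    ctrl_pc, gen_pc = np.asarray(ctrl_pc), np.asarray(gen_pc)
    voiced = ctrl_pc >= 0
    return float(np.mean(ctrl_pc[voiced] == gen_pc[voiced]))

def dynamics_correlation(ctrl_curves, gen_curves):
    """Micro: mean per-generation Pearson r. Macro: one r over all pooled frames."""
    micro = float(np.mean([pearsonr(c, g)[0] for c, g in zip(ctrl_curves, gen_curves)]))
    macro = float(pearsonr(np.concatenate(ctrl_curves), np.concatenate(gen_curves))[0])
    return micro, macro

def rhythm_f1(ctrl_beat_times, gen_beat_times):
    """F-measure between (down)beat timestamps, in seconds, with a 70 ms tolerance."""
    return mir_eval.beat.f_measure(np.asarray(ctrl_beat_times),
                                   np.asarray(gen_beat_times),
                                   f_measure_threshold=TOLERANCE_S)
```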

TABLE II: Generations with a single time-varying control, comparing controls extracted from the in-domain test set vs. our created controls. Dashes mark metrics whose corresponding control was not applied.

| Control  | Control source | Melody acc (%) | Dyn. corr micro (r, %) | Dyn. corr macro (r, %) | Rhythm F1 beat (%) | Rhythm F1 downbeat (%) | CLAP | FAD ↓ |
|----------|----------------|----------------|------------------------|------------------------|--------------------|------------------------|------|-------|
| Melody   | Extracted      | 58.3           | –                      | –                      | –                  | –                      | 0.28 | 1.34  |
| Melody   | Created        | 78.2           | –                      | –                      | –                  | –                      | 0.27 | 1.81  |
| Dynamics | Extracted      | –              | 88.8                   | 63.6                   | –                  | –                      | 0.26 | 1.50  |
| Dynamics | Created        | –              | 98.5                   | 93.2                   | –                  | –                      | 0.26 | 2.18  |
| Rhythm   | Extracted      | –              | –                      | –                      | 69.2               | 35.4                   | 0.26 | 1.17  |
| Rhythm   | Created        | –              | –                      | –                      | 88.6               | 45.2                   | 0.26 | 2.93  |
TABLE III: Generations with fully- vs. partially-specified created controls. For partial controls, the time-varying metrics are computed within the specified spans.

| Control  | Control source & span | Melody acc (%) | Dyn. corr micro (r, %) | Dyn. corr macro (r, %) | Rhythm F1 beat (%) | Rhythm F1 downbeat (%) | CLAP | FAD ↓ |
|----------|-----------------------|----------------|------------------------|------------------------|--------------------|------------------------|------|-------|
| Melody   | Created, full         | 78.2           | –                      | –                      | –                  | –                      | 0.27 | 1.81  |
| Melody   | Created, partial      | 74.3           | –                      | –                      | –                  | –                      | 0.27 | 1.66  |
| Dynamics | Created, full         | –              | 98.5                   | 93.2                   | –                  | –                      | 0.26 | 2.18  |
| Dynamics | Created, partial      | –              | 88.6                   | 89.0                   | –                  | –                      | 0.27 | 1.52  |
| Rhythm   | Created, full         | –              | –                      | –                      | 88.6               | 45.2                   | 0.26 | 2.93  |
| Rhythm   | Created, partial      | –              | –                      | –                      | 80.1               | 34.8                   | 0.26 | 2.60  |

V Evaluation and Discussion

We conduct a comprehensive evaluation of our proposed Music ControlNet framework. Specifically, we perform quantitative studies of (i) single vs. multiple time-varying extracted controls, (ii) extracted vs. created controls, (iii) fully- vs. partially-specified created controls, (iv) extrapolating generation duration beyond the training duration (i.e., 6 seconds), and (v) benchmarking against the 1.5 billion-parameter MusicGen model with melody control. In all experiments, a single fine-tuned model is used with different inference configurations. The duration of generation is 12 or 24 seconds in experiment (iv), 10 seconds in experiment (v) to be consistent with the MusicCaps[2] benchmark, and 6 seconds in all other experiments. We leverage the fully convolutional nature of our UNet backbone to generate music that is longer than what is seen during training. We conclude with an in-depth qualitative analysis of generation examples.

V-A Single & Multiple Extracted Controls

We evaluate generation performance by applying different combinations of controls at inference time, using single or multiple control signals extracted from our in-domain test set. The results are shown in Table I. First, we compare generations using global style only (i.e., genre and mood tags) and those with single time-varying controls (rows 2 to 4). When the corresponding controls are enforced, we observe much higher melody accuracy, dynamics correlations, and rhythm F1s, which indicates that our proposed control injection mechanism (see Sec. III-B) affords effective time-varying controllability. Interestingly, we find in rows 3 and 4 that the dynamics and rhythm metrics are higher than with global style control only (1st row) even when the corresponding controls are excluded. We hypothesize that this is because our rhythm and dynamics controls are naturally correlated.

Second, focusing on generations with multiple controls (last 4 rows in Table I), we find that the time-varying controllability metrics remain largely the same compared to the single-control scenarios. This shows that our model learns to respond to multiple simultaneous controls well despite the added complexity. However, as more time-varying controls are enforced, text control adherence (CLAP score) degrades mildly, while overall audio realism (FAD) is not negatively impacted.

V-B From Extracted to Created Controls

To empower creators to generate music with their own ideas, we evaluate single-control generations using created controls from our Created Controls dataset. The comparison with extracted controls is displayed in Table II and yields several interesting insights. First, and perhaps unexpectedly, we find that across all three control signals, all time-varying controllability metrics actually improve when using created controls. This demonstrates our model's generalizability to out-of-domain control inputs.

Second, we find that global style control adherence (CLAP score) is largely unaffected, while FAD appears to degrade. The degradation in FAD is multifaceted. On one hand, the created controls naturally lead to music that is distributionally different from the in-domain test set, so we cannot expect even desirable generations to score a low FAD. On the other hand, perceptually, we do find that generations with created controls are more often less musically interesting. This is particularly true for created melody and dynamics controls, where the model may copy the melody with a single instrument over a constant background chord, or match the dynamics using monotonous bass or sound effects. In practice, however, we believe this is not an issue, as creators can request a batch of generations and select the best one.

V-C From Fully- to Partially-specified Controls

We evaluate generation quality using partially-specified created control signals (made possible by the masking scheme in Sec. III-C) and compare against fully-specified created controls in Table III. For partially-specified cases, for each sample, we specify the control over a random 1.0 to 4.5-second span out of the full 6-second duration. The melody, dynamics, and rhythm metrics are computed only within the partially-specified spans, while CLAP and FAD still take the full generated audio as input. Overall, we find that partial control somewhat degrades time-varying controllability compared to the fully-specified created control scenarios, but it remains strong and mostly better than using full extracted controls (cf. rows marked Extracted in Table II). Global style control adherence (CLAP) is unaffected. Overall quality (FAD) improves, suggesting that specifying less control lets the generations match the training distribution more closely. We also found that the coexistence of controlled and uncontrolled spans did not lead to pronounced incoherence issues.
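A small sketch of how such partially-specified spans could be sampled is shown below; the frame rate and names are assumptions consistent with Sec. IV-C, and the resulting mask would be used to blank out the control signal outside the chosen span as in the masking scheme of Sec. III-C.

```python
import numpy as np

FRAME_RATE = 86          # ~86 Hz control frame rate (Sec. IV-C)

def random_partial_span_mask(n_frames=6 * FRAME_RATE, min_s=1.0, max_s=4.5, rng=None):
    """Returns a {0, 1} mask that is 1 inside one randomly placed controlled span."""
    rng = rng or np.random.default_rng()
    span = int(rng.uniform(min_s, max_s) * FRAME_RATE)       # 1.0 to 4.5 seconds
    start = int(rng.integers(0, n_frames - span + 1))
    mask = np.zeros(n_frames, dtype=np.float32)
    mask[start:start + span] = 1.0
    return mask

# Example: zero out a [D, T] melody control outside the sampled span.
# melody_ctrl_masked = melody_ctrl * random_partial_span_mask()[None, :]
```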

V-D Extrapolating Duration of Generation

TABLE IV: Extrapolating generation duration beyond the 6-second training duration, using created melody controls.

| Length | Melody acc (%) | CLAP | FAD ↓ |
|--------|----------------|------|-------|
| 6 sec  | 78.2           | 0.27 | 1.81  |
| 12 sec | 81.0           | 0.32 | 2.11  |
| 24 sec | 82.8           | 0.33 | 2.54  |
TABLE V: Benchmarking against MusicGen on melody control (10-second generations). CLAP_tag uses the ChatGPT-converted genre & mood tags as text input; CLAP_text uses the original free-form MusicCaps captions. Text-only rows use no melody control and serve as a shared baseline.

(a) Extracted melody controls

| Control             | Model    | Melody acc (%) | CLAP_tag | CLAP_text | FAD_MCaps ↓ | FAD_ours ↓ |
|---------------------|----------|----------------|----------|-----------|-------------|------------|
| Text only           | Ours     | –              | 0.33     | 0.20      | 10.5        | 2.5        |
| Text only           | MusicGen | –              | 0.32     | 0.28      | 4.6         | 3.8        |
| Melody (full)       | Ours     | 47.1           | 0.33     | 0.22      | 10.8        | 2.5        |
| Melody (full)       | MusicGen | 41.3           | 0.34     | 0.29      | 5.7         | 2.5        |
| Melody (1/2 prompt) | Ours     | 46.7           | 0.33     | 0.21      | 10.9        | 2.5        |
| Melody (1/2 prompt) | MusicGen | 42.1           | 0.34     | 0.29      | 5.7         | 2.3        |

(b) Created melody controls

| Control             | Model    | Melody acc (%) | CLAP_tag | CLAP_text | FAD_MCaps ↓ | FAD_ours ↓ |
|---------------------|----------|----------------|----------|-----------|-------------|------------|
| Melody (full)       | Ours     | 82.6           | 0.33     | 0.19      | 11.2        | 2.0        |
| Melody (full)       | MusicGen | 55.2           | 0.34     | 0.28      | 6.2         | 2.8        |
| Melody (1/2 prompt) | Ours     | 80.8           | 0.33     | 0.20      | 11.1        | 1.9        |
| Melody (1/2 prompt) | MusicGen | 56.8           | 0.34     | 0.28      | 6.1         | 2.3        |

The 6-second duration of our model can be restrictive for some real-world use cases. We therefore capitalize on the inherent length-extrapolation ability of our fully convolutional model backbone and experiment with 12- and 24-second generations (i.e., 2x and 4x the training duration) using created melody controls. The evaluation results are in Table IV. We observe that both time-varying controllability and text control adherence are retained, but overall audio realism, measured by FAD, somewhat degrades. We verify this degradation via listening and note that the background noise level noticeably increases as we extrapolate duration.

V-E Benchmarking with MusicGen on Melody Control


We compare our model trained with melody, dynamics, and rhythm controls to the 1.5B-parameter MusicGen[5] model trained only with melody control. We use the MusicGen model in three scenarios: (i) text-only generation, where we do not pass in melodies, (ii) full melody control, where we pass in melodies that are as long as the generation, and (iii) 1/2-prompt melody control, where the melodies passed in are half the generation length. For our model, these scenarios correspond to omitted, fully-specified, and partially-specified melody control, respectively.
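For reference, the MusicGen side of this comparison can be run roughly as follows with the audiocraft library. This is a sketch based on the library's published interface; the checkpoint name, file paths, and variable names are our assumptions rather than the exact setup used here.

```python
import torchaudio
from audiocraft.models import MusicGen

# Load the melody-conditioned MusicGen checkpoint (1.5B parameters).
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=10)  # 10-second outputs, matching the protocol above

descriptions = ["An audio of happy jazz music"]      # tag-style text prompt
melody, sr = torchaudio.load("melody_control.wav")   # placeholder melody audio, [C, T]

# (i) text-only generation
wav_text_only = model.generate(descriptions)

# (ii) full melody control: melody as long as the generation
wav_full = model.generate_with_chroma(descriptions, melody[None], sr)

# (iii) 1/2-prompt melody control: pass only the first half of the melody
half = melody[..., : melody.shape[-1] // 2]
wav_half = model.generate_with_chroma(descriptions, half[None], sr)
```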

As MusicGen supports free-form text control while our model does not, we use both the MusicCaps and MusicCaps+ChatGPT datasets in this experiment. Both datasets contain the same audio, but in MusicCaps+ChatGPT the text descriptions have been converted into genre & mood tags by ChatGPT. The ChatGPT-converted tags are then used in two ways: as the global style input to our model, and as text input when computing CLAP scores. That is, we have two versions of CLAP when comparing our model to MusicGen: CLAP_text, which measures CLAP with (original free text, generated audio) tuples, and CLAP_tag (i.e., the CLAP metric used in previous experiments), which only allows converted tags as text input to both MusicGen (written as text, e.g., “An audio of happy jazz music”) and our model, and measures CLAP with (converted tags, generated audio) tuples. We also compute two versions of FAD, one using MusicCaps as the reference set (FAD_MCaps) and the other using our in-domain test set as the reference (FAD_ours). We generate 10-second outputs to be consistent with the MusicCaps dataset and evaluation protocol.

We consider both extracted and created melody controls in this comparison. As shown in Table V, our proposed model responds more precisely to the melody control, particularly on created melodies, where it is as much as 49% relatively more faithful to the control. In terms of text control adherence, when the text input is restricted to the converted mood & genre tags (i.e., the CLAP_tag metric), our model is comparable to MusicGen. On overall audio realism, as our model is much smaller than MusicGen and trained on a much more restricted domain of data, it is unsurprising that it scores a worse FAD against MusicCaps recordings. Moreover, we note that many examples in the MusicCaps dataset are in fact low-quality audio recordings and/or contain vocals, which our model never sees during training; this may render FAD_MCaps biased against our model. Finally, we note that when the reference set is our in-domain test set (i.e., FAD_ours), we are competitive with or somewhat better than MusicGen.


V-F Qualitative Analysis of Generations

In Fig. 3, we show generation outputs with each of the proposed controls, i.e., melody, dynamics, or rhythm, either extracted or created. Concentrating first on the extracted controls (Fig. 3, left half), all three control signals are closely followed by our model despite their different dimensions and relationships to the spectrogram. Moving on to the created controls (Fig. 3, right half), the controls are almost perfectly reflected despite some of them (i.e., melody & dynamics) being out-of-domain relative to the training data. Moreover, our approach is able to exercise musical creativity even though the created controls are much simpler than extracted ones. For example, as visible in the output spectrograms given melody or dynamics controls, our model generates music with varying texture and rhythmic patterns, rather than simply replicating the monophonic melody or changing the volume of a single note to match the increasing dynamics.

Fig. 4 displays generations using multiple created controls, specifically with a) full melody & dynamics controls simultaneously enforced, and b) all three controls with partially-specified spans, which simulate the creator intent: “I want the music to start with my signature melody, and have it intensify at the end with beats synchronized to my video scene cuts to engage my audience.” Example a) verifies the composability of created controls (as opposed to extracted ones, which was examined in Table I), as both controls are respected by the model. Example b) demonstrates effective control even when control signals are partially specified, and the capability to generate cohesive music (i.e., the output spectrogram contains no visible seams) when both controlled and uncontrolled spans are present.

VI Related Work

VI-A Text-to-music Generation

Music ControlNet builds on a recent body of work on text-to-music generation, where the goal is to generate music audio conditioned on text descriptions or categories[1, 45, 2, 3, 4, 46]. This line of research is bifurcated into two broad methodological branches, which build on advances in natural language processing and computer vision, respectively: (i) using LLMs to model tokens from learned audio codecs as proposed in[47, 48], and (ii) using (latent) diffusion to model image-like spectrograms. We explore diffusion to leverage the strong inductive biases developed for spatial control.

VI-B Time-varying Controls for Music Generation

Our approach is related to generating music audio from time-varying controls. A contemporaneous work is[49], which focuses on a similar goal to ours but is built on pretrained large language models (LLMs) instead of diffusion models. Work on style transfer includes methods to convert musical recordings in one style to another while preserving the underlying symbolic music[8, 9, 10, 11]. Other work explores directly synthesizing symbolic music (e.g., MIDI) into audio[6, 7]. Both approaches require training individual models per style rather than leveraging text control for style, and need complete musical inputs rather than the simpler controls we explore here. More recently, [50, 16, 51] generate music in broad styles with time-varying control but target tasks with stronger conditions, such as musical accompaniment or variation generation, which are different applications from ours. Another body of research[52, 53, 54, 55] explores time-varying controls for symbolic-domain music generation, i.e., modeling sheet music or MIDI events. The controls considered in these works are of coarser time scale, e.g., at the measure or phrase level, while our approach offers precise control down to the frame level.

VI-C Unconditional Music Generation

Our work on controllable music audio generation builds on earlier work on unconditional generative modeling of audio. Early approaches explored directly modeling audio waveforms[56, 57, 58]. More recent work[47, 48, 59, 60] favors hierarchical approaches like those we consider here.

VII Conclusions

We proposed Music ControlNet, a framework that enables creators to harness music generation with multiple precise, time-varying controls. We demonstrate our framework via melody, dynamics, and rhythm control signals, which are all basic elements of music and complement each other well. We find that our framework and control signals not only enable any combination of controls, fully- or partially-specified in time, but also generalize well to controls we envision creators would employ.

Our work paves a number of promising avenues for future research. First, beyond melody, dynamics, and rhythm controls, several additional musical features could be employed, such as chord estimation for harmony control, multi-track pitch transcription, instrument classification, or even more abstract controls like emotion and tension. Second, as the set of musical controls becomes large, generating control presets from text, speech, or video inputs could make controllable music generation systems approachable to an even wider range of content creators. Last but not least, addressing the domain gap between extracted and created controls via, e.g., adversarial approaches[61] could further enhance the musical quality of generations under created controls.

VIII Ethics Statement

Music generation is poised to upend longstanding norms around how music is created and by whom.On the one hand, this presents an opportunity to increase the accessibility of musical expression, but on the other hand, existing musicians may be forced to compete against generated music.While we acknowledge our work carries some risk, we sharply focus on improving control methods so as to directly offer musicians more creative agency during the generation process.Other potential risks surround the inclusion of singing voice, accidental imitation of artists without their consent, and other unforeseen ethical issues, so we use licensed instrumental music for training and melodies extracted from our training data or public domain melodies we recorded ourselves for inference. For evaluation, we do use the MusicCaps dataset[2] as it is standard in recent text-to-music generation literature.

IX Acknowledgements

Thank you to Ge Zhu, Juan-Pablo Caceres, Zhiyao Duan, and Nicholas J. Bryan for sharing their high-fidelity vocoder used for the demo video (citation coming soon).

References

  • [1] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv:2005.00341, 2020.
  • [2] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “MusicLM: Generating music from text,” arXiv:2301.11325, 2023.
  • [3] Q. Huang, D. S. Park, T. Wang, T. I. Denk, A. Ly, N. Chen, Z. Zhang, Z. Zhang, J. Yu, C. Frank et al., “Noise2Music: Text-conditioned music generation with diffusion models,” arXiv:2302.03917, 2023.
  • [4] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” in ICML, 2023.
  • [5] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” arXiv:2306.05284, 2023.
  • [6] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen et al., “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in ICLR, 2019.
  • [7] C. Hawthorne, I. Simon, A. Roberts, N. Zeghidour, J. Gardner, E. Manilow, and J. Engel, “Multi-instrument music synthesis with spectrogram diffusion,” in ISMIR, 2022.
  • [8] N. Mor, L. Wolf, A. Polyak, and Y. Taigman, “A universal music translation network,” in ICLR, 2019.
  • [9] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, “TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) pipeline for musical timbre transfer,” in ICLR, 2019.
  • [10] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” in ICLR, 2020.
  • [11] A. Caillon and P. Esling, “RAVE: A variational autoencoder for fast and high-quality neural audio synthesis,” arXiv:2111.05011, 2021.
  • [12] Y. Wu, E. Manilow, Y. Deng, R. Swavely, K. Kastner, T. Cooijmans, A. Courville et al., “MIDI-DDSP: Detailed control of musical performance via hierarchical modeling,” in ICLR, 2022.
  • [13] C. J. Steinmetz, N. J. Bryan, and J. D. Reiss, “Style transfer of audio effects with differentiable signal processing,” JAES, 2022.
  • [14] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in ICCV, 2023.
  • [15] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-ControlNet: All-in-one control to text-to-image diffusion models,” arXiv:2305.16322, 2023.
  • [16] C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling, A. Agostinelli, M. Verzetti et al., “SingSong: Generating musical accompaniments from singing,” arXiv:2301.12662, 2023.
  • [17] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in ICML, 2015.
  • [18] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in NeurIPS, 2020.
  • [19] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2013.
  • [20] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” in ICLR, 2020.
  • [21] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [22] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, G. Lopes et al., “Photorealistic text-to-image diffusion models with deep language understanding,” NeurIPS, 2022.
  • [23] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in CVPR, 2022.
  • [24] J. Ho and T. Salimans, “Classifier-free diffusion guidance,” in NeurIPS Workshop on Deep Gen. Models and Downstream Applications, 2021.
  • [25] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” in ICLR, 2021.
  • [26] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
  • [27] B. McFee et al., “librosa/librosa: 0.10.1,” Aug. 2023. [Online]. Available: https://doi.org/10.5281/zenodo.8252662
  • [28] Y.-Y. Yang and Contributors, “TorchAudio: Building blocks for audio and speech processing,” arXiv:2110.15018, 2021.
  • [29] P. Virtanen and SciPy 1.0 Contributors, “SciPy 1.0: Fundamental algorithms for scientific computing in Python,” Nature Methods, 2020.
  • [30] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, “Madmom: A new Python audio and music signal processing library,” in ACM International Conference on Multimedia, 2016.
  • [31] F. Krebs, S. Böck, and G. Widmer, “An efficient state-space model for joint tempo and meter tracking,” in ISMIR, 2015.
  • [32] S. Böck, F. Krebs, and G. Widmer, “Joint beat and downbeat tracking with recurrent neural networks,” in ISMIR, 2016.
  • [33] J. Schulman, B. Zoph, C. Kim, J. Hilton, J. Menick, J. Weng et al., “Introducing ChatGPT,” OpenAI Blog, 2022.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [35] S. S. Stevens, J. Volkmann, and E. B. Newman, “A scale for the measurement of the psychological magnitude pitch,” JASA, 1937.
  • [36] “DiffWave,” https://github.com/lmnt-com/diffwave, 2023.
  • [37] M. E. Davies, N. Degara, and M. D. Plumbley, “Evaluation methods for musical audio beat tracking algorithms,” Queen Mary University of London Tech. Rep. C4DM-TR-09-06, 2009.
  • [38] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, D. P. Ellis, and C. C. Raffel, “mir_eval: A transparent implementation of common MIR metrics,” in ISMIR, 2014.
  • [39] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in ICASSP, 2023.
  • [40] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov, “HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection,” in ICASSP, 2022.
  • [41] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, 2018.
  • [42] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” arXiv:1812.08466, 2018.
  • [43] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “CNN architectures for large-scale audio classification,” in ICASSP, 2017.
  • [44] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017.
  • [45] S. Forsgren and H. Martiros, “Riffusion: Stable diffusion for real-time music generation,” 2022. [Online]. Available: https://riffusion.com/about
  • [46] K. Chen, Y. Wu, H. Liu, M. Nezhurina, T. Berg-Kirkpatrick, and S. Dubnov, “MusicLDM: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” arXiv:2308.01546, 2023.
  • [47] A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” NeurIPS, 2017.
  • [48] S. Dieleman, A. van den Oord, and K. Simonyan, “The challenge of realistic music generation: Modelling raw audio at scale,” NeurIPS, 2018.
  • [49] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language modeling,” arXiv:2310.17162, 2023.
  • [50] Y.-K. Wu, C.-Y. Chiu, and Y.-H. Yang, “JukeDrummer: Conditional beat-aware audio-domain drum accompaniment generation via transformer VQ-VAE,” in ISMIR, 2022.
  • [51] H. F. Garcia, P. Seetharaman, R. Kumar, and B. Pardo, “VampNet: Music generation via masked acoustic token modeling,” in ISMIR, 2023.
  • [52] K. Chen, C.-i. Wang, T. Berg-Kirkpatrick, and S. Dubnov, “Music SketchNet: Controllable music generation via factorized representations of pitch and rhythm,” in ISMIR, 2020.
  • [53] H. H. Tan and D. Herremans, “Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling,” in ISMIR, 2020.
  • [54] S. Dai, Z. Jin, C. Gomes, and R. Dannenberg, “Controllable deep melody generation via hierarchical music structure representation,” in ISMIR, 2021.
  • [55] S.-L. Wu and Y.-H. Yang, “MuseMorphose: Full-song and fine-grained piano music style transfer with one Transformer VAE,” IEEE/ACM TASLP, 2023.
  • [56] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
  • [57] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart et al., “Efficient neural audio synthesis,” in ICML, 2018.
  • [58] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” in ICLR, 2019.
  • [59] C. Hawthorne, A. Jaegle, C. Cangea, S. Borgeaud, C. Nash, M. Malinowski, S. Dieleman et al., “General-purpose, long-context autoregressive modeling with Perceiver AR,” in ICML, 2022.
  • [60] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov, O. Pietquin, M. Sharifi, D. Roblek, O. Teboul et al., “AudioLM: A language modeling approach to audio generation,” IEEE/ACM TASLP, 2023.
  • [61] D. Kim, Y. Kim, W. Kang, and I.-C. Moon, “Refining generative process with discriminator guidance in score-based diffusion models,” in ICML, 2023.