Contents

Jukebox
Curated samples
Motivation and prior work
Approach
Compressing music to discrete codes
Generating codes using transformers
Dataset
Artist and genre conditioning
Lyrics conditioning
Limitations
Future directions
Minidisc FAQ
Questions about NetMD

Jukebox

Curated samples

Motivation and prior work

Automatic music generation dates back more than half a century. A prominent approach is to generate music symbolically in the form of a piano roll, which specifies the timing, pitch, velocity, and instrument of each note to be played. This has led to impressive results like producing Bach chorales, polyphonic music with multiple instruments, as well as minute-long musical pieces.

But symbolic generators have limitations: they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music. A different approach [1] is to model music directly as raw audio. Generating music at the audio level is challenging since the sequences are very long. A typical 4-minute song at CD quality (44.1 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high-level semantics of music, a model would have to deal with extremely long-range dependencies.

One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.

We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now in raw audio, our models must learn to tackle high diversity as well as very long-range structure, and the raw audio domain is particularly unforgiving of errors in short-, medium-, or long-term timing.

Approach

Compressing music to discrete codes

Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and it shows impressive results at generating high-fidelity images.

We draw inspiration from VQ-VAE-2 and apply their approach to music. We modify their architecture as follows:

To alleviate codebook collapse common to VQ-VAE models, we use random restarts where we randomly reset a codebook vector to one of the encoded hidden states whenever its usage falls below a threshold.
To maximize the use of the upper levels, we use separate decoders and independently reconstruct the input from the codes of each level.
To allow the model to reconstruct higher frequencies easily, we add a spectral loss that penalizes the norm of the difference of input and reconstructed spectrograms.
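The first modification, random restarts, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `restart_dead_codes`, the form of the usage statistic, and the threshold value are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
import random

def restart_dead_codes(codebook, usage, encoded_states, threshold, rng=random):
    """Reset rarely used codebook vectors to randomly chosen encoder outputs.

    codebook: list of code vectors (lists of floats), modified in place.
    usage: one usage statistic per code (assumed here to be a running
        average of how often the code was selected).
    encoded_states: encoder outputs from the current batch.
    threshold: hypothetical cutoff below which a code counts as unused.
    Returns the indices of the codes that were reset.
    """
    reset_indices = []
    for index, count in enumerate(usage):
        if count < threshold:
            # Copy a random encoded hidden state into the unused slot.
            codebook[index] = list(rng.choice(encoded_states))
            reset_indices.append(index)
    return reset_indices

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
usage = [10.0, 0.01, 7.5]            # the second code is almost never used
batch = [[0.2, -0.1], [0.9, 1.3]]    # stand-in encoder outputs
reset = restart_dead_codes(codebook, usage, batch, threshold=1.0,
                           rng=random.Random(0))
```

After the call, only the second code has been replaced, and its new value is one of the vectors in `batch`, so it sits where the encoder is actually producing outputs.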

We use three levels in our VQ-VAE, which compress the 44.1 kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and the audio sounds noticeably noisier at the more heavily compressed levels. However, it retains essential information about the pitch, timbre, and volume of the audio.

Generating codes using transformers

Next, we train the prior models whose goal is to learn the distribution of music codes encoded by VQ-VAE and to generate music in this compressed discrete space. Like the VQ-VAE, we have three levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on the codes from the level above.

The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality.

We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively.
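The context durations quoted above follow from the compression factors and the sample rate. The short check below assumes a sample rate of 44.1 kHz and takes the 8x, 32x, and 128x factors as the number of audio samples per code; the names are illustrative only.

```python
SAMPLE_RATE_HZ = 44_100      # assumed: the 44.1 kHz rate given in the Dataset section
CONTEXT_CODES = 8192         # context length of each prior, in codes
SAMPLES_PER_CODE = {"top": 128, "middle": 32, "bottom": 8}

def context_seconds(samples_per_code, context=CONTEXT_CODES,
                    sample_rate=SAMPLE_RATE_HZ):
    """Seconds of raw audio covered by one context window of codes."""
    return context * samples_per_code / sample_rate

durations = {level: context_seconds(factor)
             for level, factor in SAMPLES_PER_CODE.items()}

for level, seconds in durations.items():
    print(f"{level}: {seconds:.2f} s")
```

The results come out near 23.8, 5.9, and 1.5 seconds, matching the approximate 24, 6, and 1.5 seconds stated in the text.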

Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.

Dataset

To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki. The metadata includes artist, album genre, and year of the songs, along with common moods or playlist keywords associated with each song. We train on 32-bit, 44.1 kHz raw audio, and perform data augmentation by randomly downmixing the right and left channels to produce mono audio.
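The text does not say exactly how the two channels are mixed. One plausible reading, shown here purely as an assumption, is a random convex combination of left and right drawn once per training example. The function name `random_downmix` is hypothetical, and plain lists of floats stand in for audio arrays.

```python
import random

def random_downmix(left, right, rng=random):
    """Mix two channels into mono with a random weight (assumed scheme)."""
    if len(left) != len(right):
        raise ValueError("channels must have the same number of samples")
    weight = rng.random()  # share given to the left channel, in [0, 1)
    return [weight * l + (1.0 - weight) * r for l, r in zip(left, right)]

left = [1.0, 0.0]
right = [0.0, 1.0]
mono = random_downmix(left, right, rng=random.Random(1))
```

Because the weights sum to one, each mono sample stays within the range spanned by the two input channels, so the augmentation does not change the overall level.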

Artist and genre conditioning

The top-level transformer is trained on the task of predicting compressed audio tokens. We can provide additional information, such as the artist and genre for each song. This has two advantages: first, it reduces the entropy of the audio prediction, so the model is able to achieve better quality in any particular style; second, at generation time, we are able to steer the model to generate in a style of our choosing.

This t-SNE below shows how the model learns, in an unsupervised way, to cluster similar artists and genres close together, and also makes some surprising associations like Jennifer Lopez being so close to Dolly Parton!

Lyrics conditioning

In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appears. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.

To match audio portions to their corresponding lyrics, we begin with a simple heuristic that aligns the characters of the lyrics to linearly span the duration of each song, and pass a fixed-size window of characters centered around the current segment during training. While this simple strategy of linear alignment worked surprisingly well, we found that it fails for certain genres with fast lyrics, such as hip hop. To address this, we use Spleeter to extract vocals from each song and run NUS AutoLyricsAlign on the extracted vocals to obtain precise word-level alignments of the lyrics. We chose a large enough window so that the actual lyrics have a high probability of being inside the window.
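The linear-alignment heuristic can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the function name `lyric_window`, its parameters, and the way the window is clamped at the start and end of the song are all hypothetical.

```python
def lyric_window(lyrics, song_seconds, seg_start, seg_end, window_chars):
    """Return the lyric characters assumed to accompany an audio segment.

    Characters are spread evenly over the song's duration, and a fixed-size
    window is centered on the character that lines up with the segment's
    midpoint. The window is shifted inward at the song's edges (an assumption).
    """
    if song_seconds <= 0:
        raise ValueError("song_seconds must be positive")
    midpoint = (seg_start + seg_end) / 2.0
    center = int(len(lyrics) * midpoint / song_seconds)
    start = center - window_chars // 2
    # Keep the window inside the lyrics; short lyrics are returned whole.
    start = max(0, min(start, len(lyrics) - window_chars))
    return lyrics[start:start + window_chars]

lyrics = "abcdefghijklmnopqrstuvwxyz"   # 26 characters over a 26-second "song"
middle = lyric_window(lyrics, 26.0, 12.0, 14.0, window_chars=6)
opening = lyric_window(lyrics, 26.0, 0.0, 2.0, window_chars=6)
```

A fixed window like this only works when the singing is spread roughly evenly through the song, which is consistent with the failure on fast-lyric genres noted above.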

To attend to the lyrics, we add an encoder to produce a representation for the lyrics, and add attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder. After training, the model learns a more precise alignment.
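The query, key, and value roles described above can be illustrated with a single step of scaled dot-product attention, the standard Transformer formulation. The text does not specify the exact attention variant used, so treat this as a generic sketch; `attend` is a hypothetical helper and the vectors are toy values.

```python
import math

def attend(query, keys, values):
    """One music-side query attends over lyric-side keys and values.

    Returns the weighted sum of the values and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * value[d] for w, value in zip(weights, values))
               for d in range(dim)]
    return context, weights

# Two lyric positions; the query is much closer to the first key.
context, weights = attend([1.0, 0.0],
                          keys=[[10.0, 0.0], [0.0, 10.0]],
                          values=[[1.0, 0.0], [0.0, 1.0]])
```

Here almost all of the weight lands on the first lyric position, which is the sense in which attention weights can act as a learned alignment between audio and text.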

Limitations

While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music.

For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render one minute of audio through our models, and thus they cannot yet be used in interactive applications. Using techniques that distill the model into a parallel sampler can significantly speed up sampling. Finally, we currently train on English lyrics and mostly Western music, but in the future we hope to include songs from other languages and parts of the world.

Future directions

Our audio team is continuing to work on generating audio samples conditioned on different kinds of priming information. In particular, we’ve seen early success conditioning on MIDI files and stem files. Here’s an example of a raw audio sample conditioned on MIDI tokens. We hope this will improve the musicality of samples (in the way conditioning on lyrics improved the singing), and this would also be a way of giving musicians more control over the generations. We expect human and model collaborations to be an increasingly exciting creative space. If you’re excited to work on these problems with us, we’re hiring.

As generative modeling across various domains continues to advance, we are also conducting research into issues like bias and intellectual property rights, and are engaging with people who work in the domains where we develop tools. To better understand future implications for the music community, we shared Jukebox with an initial set of 10 musicians from various genres to discuss their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations. We are connecting with the wider creative community as we think generative work across text, images, and audio will continue to improve. If you’re interested in being a creative collaborator to help us build useful tools or new works of art in these domains, please let us know!

Minidisc FAQ

Questions about NetMD

1. What is “NetMD”?
NetMD is an extension of the MiniDisc format that allows compressed ATRAC data to be transferred directly from a PC to a MiniDisc over a USB interface. Audio data is transferred faster than real time (up to 32x for LP4 audio). The NetMD extension was first introduced in mid-2001 for the Sony MZ-N1 portable recorder.

2. What do you mean by “extension of the format”?
NetMD equipment requires matching software to control it and to transfer data. By standardizing the PC<->MD USB protocol, Sony has made NetMD equipment compatible with NetMD programs from different software vendors.

3. What NetMD software packages exist?
- RealOne w/NetMD Plugin (RealAudio, for Windows). RealAudio has announced a NetMD plugin for its streaming/MP3 player. The plugin can be downloaded together with the free version of RealOne by following these steps:
  - In the “Tools” menu, click “Check for Update”
  - In the “AudioUpdate” window, click “Devices”
  - Select “Sony Music Devices Plug-in” and click “Install”
  - When installation finishes, click “Ok” to restart RealOne Player
- OpenMG Jukebox 2.2 (Sony, for Windows). This is the main program that Sony ships with its NetMD equipment. It is also bundled with Sony's players (MemoryStick and Network Walkman). The program lets you download compressed audio to a NetMD device, as well as title and reorder tracks. OpenMG Jukebox enforces SDMI's check-in/check-out concept. Only three copies can be transferred to MiniDisc; after that, the program requires one of the copies to be removed from a MiniDisc.
- Simple Burner (Sony, for Windows). The program ships with American NetMD units and can also be downloaded separately. It simplifies transferring tracks from an audio CD in the PC's drive to a MiniDisc. There is no SDMI check-in/check-out here.
- Beat Jam (JustSystem Inc., for Windows). Ships with Panasonic NetMD units.
- Open/NMD Project (for Unix, Linux, and MacOS). Under development; it currently supports titling and renaming tracks from the command line.
- Muria (Kenwood, for Windows). Ships with the Kenwood MDX-J9 NetMD.

5. What limitations does NetMD impose?
- Protected tracks. Audio tracks transferred to MiniDisc with OpenMG Jukebox or BeatJam are marked as “protected” and cannot be deleted or divided by most MiniDisc units. Only NetMD-capable units with the appropriate software can do this. (Unfortunately, this reduces the flexibility of discs with NetMD tracks to the level of flash-based MP3 players.) Tracks transferred with Simple Burner have no such restriction and behave like ordinary MD tracks.

- Transfer only from PC to MD. Audio can be transferred in one direction only (PC->MD). This restriction is apparently related to copyright. Transferring audio from MD to PC is therefore not possible. If you disagree with this, you can sign the corresponding petition.

- SP-mode quality is not available. Users can create SP, LP2, and LP4 tracks on a MiniDisc with OpenMG Jukebox, but audio imported into OpenMG Jukebox (from CD or MP3 sources) is converted and stored on the PC only as LP2 or LP4 ATRAC3 files. So even if these tracks are later transferred to MD in SP mode, their quality will not exceed LP quality. Simple Burner can transfer tracks only in LP2 and LP4 modes.

6. What is known about the information sent over the USB connection?
NetMD has not yet been fully reverse-engineered. Here is what is known so far:
- Elements of the protocol are known. The Open/NMD project documents the protocol as it is investigated.
- Encrypted PCM transfer for SP-mode tracks. Experiments by Lanz Birch (“Ланц Бирч” in the Russian source) suggest that when NetMD software creates SP-mode tracks on a MiniDisc, it unpacks the PC's data [always compressed as LP2 or LP4] into a standard 16-bit, 44.1 kHz PCM audio stream, then encrypts it and transfers it. The fact that the transfer is a PCM stream is encouraging, because it is theoretically possible to transfer audio at true SP-mode quality if the software were taught to do so. Current NetMD recorders support only 1.6x transfer for SP-mode tracks, although 12 Mbit/s USB 1.1 would allow transfers at up to 8x. The 4x CD->MD dubbing technology in the Sony MXD-D40 could be used for this purpose.
The PCM data is encrypted in transit because Sony is trying to prevent illegal access to “closed” ATRAC3 content that can be bought from online stores. Breaking the PCM cipher would let hackers easily obtain “open” copies of songs.
- LP2/LP4 data is transferred in the clear. As Birch's experiments showed, when LP2/LP4 tracks are recorded, the data is sent in blocks of 4096 bytes whose contents match the files on the PC.
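As a small illustration of the last point, the sketch below splits a byte string into 4096-byte blocks. It shows only the block size reported above and does not implement any part of the NetMD USB protocol. The function name is hypothetical, and letting the final block be shorter than 4096 bytes is an assumption, since the text does not say how a trailing partial block is handled.

```python
BLOCK_SIZE = 4096  # block size reported in the experiments described above

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

# 10240 bytes standing in for the contents of an LP2/LP4 file on the PC.
payload = bytes(range(256)) * 40
blocks = split_into_blocks(payload)
block_lengths = [len(block) for block in blocks]
```

Joining the blocks back together reproduces the original bytes, which mirrors the observation that the transferred content matches the files on the PC.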

7. Is there a way to transfer MP3s to MD without using OpenMG Jukebox?
Here we follow the advice of Dino Inglese (“Дино Инглез” in the Russian source), who used Simple Burner (you can read his original forum post, which is quite entertaining).
Drawbacks: you will need Nero and its Nero Imagedrive feature, or something similar for creating a virtual disc image.
The whole MP3->MD transfer takes five steps.
1. Open Nero, choose Audio CD from the options, and drag in all the MP3 files you want to record. Nero is not too picky about formats and bit rates, so you are unlikely to go wrong.
2. Write, or “burn”, the CD to the hard disk (not to a blank disc). By default Nero names the file “image.nrg”.
3. Use Nero Imagedrive (supplied with Nero) to mount the .nrg file (the CD image) you just created. Say it becomes drive F.

I am no expert, but these three steps took me 2 minutes or less for an image the size of an ordinary audio CD (roughly 10 MP3 tracks). The machine was a 1 GHz PIII; a powerful processor helps speed up the MP3->PCM conversion.

4. Select the virtual CD drive F in Simple Burner and transfer the files to the MiniDisc.
5. When you are done, delete the .nrg file.

If your computer is fast, Simple Burner's CD -> ATRAC conversion is done in memory on the fly, without loading the hard disk.

Advantages of this method

You are using reliable, proven programs
There is no check on the number of copies
Unlike with OpenMG, no files are left on the hard disk
You can delete and edit the transferred tracks on the MD without connecting to a PC
This method is faster and does not put much load on the hard disk

8. Will transfers get faster with a move to USB 2.0?
Today's transfer rates to MiniDisc use less than 20% of the peak bandwidth of USB 1.1 (12 Mbit/s).
- transferring LP4 audio at 32x requires only 2.1 Mbit/s (66 kbit/s * 32)
- transferring LP2 audio at 16x also requires 2.1 Mbit/s (132 kbit/s * 16)
- transferring SP audio at 1.6x uses about 2.26 Mbit/s (1411.2 kbit/s * 1.6). (One conclusion to draw: the driver or the interface may be capping NetMD USB transfers at roughly 2.5 Mbit/s, which severely limits transfer speed in SP mode. USB 2.0 would certainly help here, but might it not be better to fix the problems in USB 1.1?)
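The figures above can be reproduced with a few lines of arithmetic. The bit rates and speed multipliers come from the list; the variable names are illustrative, and the calculation compares against the nominal 12 Mbit/s signalling rate without accounting for USB protocol overhead.

```python
USB_1_1_PEAK_KBPS = 12_000.0   # nominal USB 1.1 full-speed rate

# mode: (audio bit rate in kbit/s, transfer speed relative to real time)
MODES = {
    "LP4": (66.0, 32.0),
    "LP2": (132.0, 16.0),
    "SP": (1411.2, 1.6),
}

rates = {mode: bitrate * speed for mode, (bitrate, speed) in MODES.items()}
shares = {mode: rate / USB_1_1_PEAK_KBPS for mode, rate in rates.items()}

# Fastest SP transfer the nominal bus rate could carry, ignoring overhead.
max_sp_speed = USB_1_1_PEAK_KBPS / MODES["SP"][0]

for mode in MODES:
    print(f"{mode}: {rates[mode] / 1000:.2f} Mbit/s, "
          f"{shares[mode]:.1%} of USB 1.1 peak")
```

All three modes stay under 20% of the nominal rate, and the nominal rate would carry uncompressed PCM at roughly 8.5x real time, in line with the "up to 8x" estimate given in question 6.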
