Timon Harz

November 28, 2024

Nvidia’s new AI audio model can synthesize sounds that have never existed

What does a screaming saxophone sound like? The Fugatto model has an answer...

Anyone keeping up with AI research is likely familiar with generative models that can create speech or melodic music simply from text prompts. Nvidia’s newly unveiled "Fugatto" model aims to push this boundary further. Using advanced synthetic training methods and inference-level combination techniques, Fugatto promises to "transform any mix of music, voices, and sounds," including the creation of entirely new, never-before-heard sounds.

Fugatto: A Swiss Army Knife for Sound

Although Fugatto is not yet available for public testing, a sample-filled website offers a glimpse into its capabilities. The model can adjust a variety of distinct audio traits, producing everything from saxophones barking to voices speaking underwater, and even ambulance sirens harmonizing in a choir. While the results can be hit-or-miss, the wide array of possibilities lends weight to Nvidia's description of Fugatto as "a Swiss Army knife for sound."

The Challenge of Crafting Meaningful Audio Datasets

In a research paper, over a dozen Nvidia researchers discuss the complexities of creating a training dataset that can "reveal meaningful relationships between audio and language." While traditional language models often infer how to process various instructions from text data, understanding and generalizing audio descriptions and traits proves much more challenging without explicit guidance.

To that end, the researchers start by using an LLM to generate a Python script that can create a large number of template-based and free-form instructions describing different audio "personas" (e.g., "standard, young-crowd, thirty-somethings, professional"). They then generate a set of both absolute (e.g., "synthesize a happy voice") and relative (e.g., "increase the happiness of this voice") instructions that can be applied to those personas.
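The paper does not publish the generation script itself, but the general pattern of expanding persona and trait lists through instruction templates is easy to picture. The Python sketch below is a hypothetical illustration of that idea; the persona names, trait words, and templates are placeholders, not the ones Nvidia actually used.

```python
import itertools
import random

# Hypothetical persona and trait lists; the paper describes these only by example.
PERSONAS = ["standard", "young-crowd", "thirty-somethings", "professional"]
TRAITS = ["happy", "sad", "angry", "calm"]

# Absolute instructions describe a target trait outright; relative ones ask
# for a change in degree, mirroring the two instruction families in the text.
ABSOLUTE_TEMPLATES = [
    "synthesize a {trait} voice in a {persona} style",
    "generate {persona} speech that sounds {trait}",
]
RELATIVE_TEMPLATES = [
    "make this {persona} voice sound more {trait}",
    "make this {persona} voice sound slightly less {trait}",
]

def build_instructions() -> list[str]:
    """Expand every (persona, trait) pair through both template families."""
    instructions = []
    for persona, trait in itertools.product(PERSONAS, TRAITS):
        for template in ABSOLUTE_TEMPLATES + RELATIVE_TEMPLATES:
            instructions.append(template.format(persona=persona, trait=trait))
    random.shuffle(instructions)
    return instructions

if __name__ == "__main__":
    for line in build_instructions()[:5]:
        print(line)
```

Free-form instructions would come from the LLM itself rather than from fixed templates, but the template expansion above conveys how a small persona list can fan out into a large instruction set.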

The open-source audio datasets that form the foundation of Fugatto typically don't include these trait measurements by default. To overcome this, the researchers leverage existing audio understanding models to generate "synthetic captions" for their training clips based on specific prompts. These captions provide natural language descriptions that can automatically quantify traits like gender, emotion, and speech quality. Additionally, audio processing tools are used to analyze and quantify more technical aspects of the clips, such as "fundamental frequency variance" or "reverb."
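As an illustration of what automated trait quantification can look like, the sketch below uses the off-the-shelf librosa library to estimate the fundamental-frequency variance of a clip. This is only an assumed stand-in for whichever audio processing tools the researchers actually ran; the function name and pitch range are hypothetical.

```python
import librosa
import numpy as np

def f0_variance(path: str) -> float:
    """Estimate the variance of a clip's fundamental frequency.

    Uses librosa's pYIN pitch tracker; unvoiced frames come back as NaN
    and are ignored when computing the variance.
    """
    y, sr = librosa.load(path, sr=22050)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    return float(np.nanvar(f0))

# Example: attach a quantified pitch-variance trait to a training clip.
# clip_annotations = {"f0_variance": f0_variance("clip_0001.wav")}
```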

For relational comparisons, the researchers use datasets where one factor remains constant while another varies, such as different emotional interpretations of the same text or various instruments playing identical notes. By analyzing these samples across a large dataset, the model learns to identify audio characteristics associated with specific traits, like the qualities that typically define "happier" speech, or the differences between the sounds of a saxophone and a flute.
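A minimal sketch of that pairing step might look like the following, assuming each clip has already been tagged with a transcript and a numeric emotion score by an upstream captioning stage; the field names and scores here are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Invented annotations: the same transcript read with different emotional
# intensity, each scored by an upstream captioning/understanding model.
clips = [
    {"id": "clip_a", "text": "hello there", "happiness": 0.2},
    {"id": "clip_b", "text": "hello there", "happiness": 0.8},
    {"id": "clip_c", "text": "good morning", "happiness": 0.5},
]

def relative_pairs(clips):
    """Yield (source, target, instruction) triples for clips sharing a transcript."""
    by_text = defaultdict(list)
    for clip in clips:
        by_text[clip["text"]].append(clip)
    for group in by_text.values():
        for a, b in combinations(group, 2):
            if a["happiness"] == b["happiness"]:
                continue
            lo, hi = sorted((a, b), key=lambda c: c["happiness"])
            yield lo["id"], hi["id"], "increase the happiness of this voice"
            yield hi["id"], lo["id"], "decrease the happiness of this voice"

for src, dst, instruction in relative_pairs(clips):
    print(f"{src} -> {dst}: {instruction}")
```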

After processing a wide range of open-source audio collections, the researchers compiled a highly annotated dataset of 20 million samples, representing at least 50,000 hours of audio. This dataset was then used to train a 2.5-billion-parameter model on a set of 32 Nvidia Tensor Core GPUs, which began to show reliable results across various audio quality tests.

It's All in the Mix

In addition to its training methods, Nvidia highlights Fugatto's "ComposableART" system (the "ART" standing for "Audio Representation Transformation"). When given a text and/or audio prompt, this system uses "conditional guidance" to "independently control and generate (unseen) combinations of instructions and tasks," enabling it to produce "highly customizable audio outputs beyond the scope of the training data." Essentially, it can mix various traits from its training set to create entirely new, never-before-heard sounds.
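Nvidia has not released the ComposableART implementation, but the underlying idea of weighting several conditional predictions against an unconditional one resembles classifier-free guidance extended to multiple instructions. The numpy sketch below illustrates that general mechanism under that assumption; the model interface and the toy stand-in model are hypothetical.

```python
import numpy as np

def composed_guidance(model, x, instructions, weights, guidance_scale=3.0):
    """Blend the conditional "directions" implied by several instructions.

    `model(x, cond)` is a hypothetical predictor (e.g., a denoiser or vector
    field); cond=None yields the unconditional prediction. Each instruction's
    deviation from the unconditional output is scaled by its own weight, in
    the spirit of classifier-free guidance applied to multiple conditions.
    """
    uncond = model(x, cond=None)
    combined = uncond.copy()
    for cond, weight in zip(instructions, weights):
        combined += guidance_scale * weight * (model(x, cond=cond) - uncond)
    return combined

# Toy stand-in model so the sketch runs end to end.
def toy_model(x, cond=None):
    rng = np.random.default_rng(abs(hash(cond)) % (2**32))
    return 0.9 * x + rng.normal(scale=0.01, size=x.shape)

x = np.zeros(8)
print(composed_guidance(toy_model, x,
                        instructions=["violin", "laughing baby"],
                        weights=[0.6, 0.4]))
```

Raising one weight while lowering another pulls the output toward that instruction, which is the kind of knob that makes "unseen combinations" of traits possible at inference time.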

While the complex mathematics in the paper—such as the "weighted combination of vector fields between instructions, frame indices, and models"—may be difficult to fully grasp, the results speak for themselves. Examples on the project’s webpage and in an Nvidia trailer show how ComposableART can generate sounds like a violin "sounding like a laughing baby," a banjo "playing in front of gentle rainfall," or "factory machinery screaming in metallic agony." While some of these examples are more convincing than others, the ability of Fugatto to generate such unique combinations demonstrates its impressive capacity to mix and characterize diverse audio data from a range of open-source datasets.

One of the most intriguing aspects of Fugatto is how it treats each audio trait as a tunable continuum, rather than a simple binary. For example, when blending the sounds of an acoustic guitar and running water, the result can vary significantly depending on whether the guitar or the water is given more weight in Fugatto's interpolated mix. Nvidia also mentions the ability to adjust a French accent to be more or less pronounced, or to alter the "degree of sorrow" in a spoken clip.
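In the simplest reading, such a continuum is just a tunable interpolation weight applied to the conditioning for each trait. The short sketch below illustrates that reading with plain linear interpolation between two made-up condition embeddings; it is a conceptual toy, not Fugatto's actual interpolation scheme.

```python
import numpy as np

def blend_conditions(guitar_emb: np.ndarray, water_emb: np.ndarray,
                     guitar_weight: float) -> np.ndarray:
    """Linearly interpolate two (made-up) condition embeddings.

    guitar_weight = 1.0 means pure acoustic guitar conditioning, 0.0 means
    pure running water, and anything in between traces a continuum of mixes.
    """
    return guitar_weight * guitar_emb + (1.0 - guitar_weight) * water_emb

guitar_emb = np.random.default_rng(0).normal(size=16)
water_emb = np.random.default_rng(1).normal(size=16)
for w in (0.25, 0.5, 0.75):
    print(w, blend_conditions(guitar_emb, water_emb, w)[:3])
```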

In addition to tuning and combining various audio traits, Fugatto can also perform tasks seen in previous models, such as altering the emotion in spoken text or isolating vocal tracks in music. It can also detect individual notes in MIDI music and replace them with different vocal performances, or identify the beat of a song and add effects like drums, barking dogs, or ticking clocks that sync with the rhythm.

While the researchers view Fugatto as just the first step "towards a future where unsupervised multitask learning emerges from data and model scale," Nvidia is already highlighting potential use cases, from song prototyping and dynamically changing video game scores to international ad targeting. However, Nvidia was quick to emphasize that models like Fugatto should be seen as a new tool for audio artists, rather than a replacement for their creative abilities.

"The history of music is also a history of technology," said Ido Zmishlany, producer/songwriter and Nvidia Inception participant, in a blog post. "The electric guitar gave the world rock and roll. When the sampler came along, hip-hop was born. With AI, we're writing the next chapter of music. We now have a new instrument, a new tool for making music—and that’s incredibly exciting."

Press contact

Timon Harz

oneboardhq@outlook.com
