Audiobox TTA-RAG

Audio samples for the paper "Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation". arXiv

We propose Audiobox TTA-RAG, a novel retrieval-augmented TTA approach based on Audiobox, a conditional flow-matching audio generation model. Unlike the vanilla Audiobox TTA solution, which generates audio conditioned on text alone, we augment the conditioning input with retrieved audio samples that provide additional acoustic information for generating the target audio.

The proposed Audiobox TTA-RAG conditions on both the input text and the audio samples retrieved with that text. Below we showcase outputs from a model trained with the top-3 retrieved audio samples. For each input text, we show the ground truth audio (Ground Truth), the output of the TTA baseline (Audiobox TTA), the retrieved audio samples used during inference (Retrieved Audio 1-3), and the output of the proposed Audiobox TTA-RAG (Audiobox TTA-RAG).
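The retrieval step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the text and the candidate audio clips are already embedded in a shared vector space and ranks clips by cosine similarity, which this page does not confirm. The names `cosine_similarity`, `retrieve_top_k`, `build_conditioning`, and the toy `audio_bank` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    text_embedding: Sequence[float],
    audio_bank: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k audio ids most similar to the text embedding, best first."""
    scored = [
        (audio_id, cosine_similarity(text_embedding, embedding))
        for audio_id, embedding in audio_bank.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def build_conditioning(
    text: str,
    text_embedding: Sequence[float],
    audio_bank: Dict[str, Sequence[float]],
    k: int = 3,
) -> Dict[str, object]:
    """Bundle the text with its top-k retrieved audio ids as conditioning context."""
    retrieved = retrieve_top_k(text_embedding, audio_bank, k)
    return {
        "text": text,
        "retrieved_audio_ids": [audio_id for audio_id, _ in retrieved],
    }


# Toy embeddings (made up for illustration only).
audio_bank = {
    "heartbeat_a": [1.0, 0.0, 0.0],
    "heartbeat_b": [0.9, 0.1, 0.0],
    "bell": [0.0, 1.0, 0.0],
    "zither": [0.0, 0.0, 1.0],
    "pulse": [0.8, 0.0, 0.2],
}
conditioning = build_conditioning(
    "rhythmic heartbeats", [1.0, 0.0, 0.0], audio_bank, k=3
)
```

In this sketch the generator would receive `conditioning["text"]` together with the audio referenced by `conditioning["retrieved_audio_ids"]`; how the actual model encodes and consumes the retrieved audio is described in the paper, not here.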

We show audio samples from three evaluation sets: the zero-shot set, the few-shot set, and the AudioCaps test set.

Zero-shot evaluation set samples

Text	Ground Truth	Audiobox TTA (baseline)	Retrieved Audio 1	Retrieved Audio 2	Retrieved Audio 3	Audiobox TTA-RAG (proposed)
The audio primarily consists of rhythmic heartbeats and the sound of a heartbeat.
The audio primarily consists of a 'jingle bell' sound that's bright and ringing.
The audio primarily consists of music and a 'pizzicato' sound effect.
In this audio, you can hear the sounds of a ukulele and some background noise.
The audio primarily consists of a single note being played on a zither.

Few-shot evaluation set samples

Text	Ground Truth	Audiobox TTA (baseline)	Retrieved Audio 1	Retrieved Audio 2	Retrieved Audio 3	Audiobox TTA-RAG (proposed)
The audio contains sounds of a camera shutter, mechanical sounds, and a repetitive ticking sound.
The audio contains the sound of a zipper being zipped, some background noise, and the sound of a tap.
The audio primarily consists of the sounds of bird vocalizations and a hooting sound.
The audio primarily consists of rhythmic footsteps, background noise, panting, and sharp, loud bursts of sounds.
The audio features music and sound effects, including an explosion and a whooshing sound, all happening against a backdrop of video game sounds and a sound effect.

AudioCaps test set samples

Text	Ground Truth	Audiobox TTA (baseline)	Retrieved Audio 1	Retrieved Audio 2	Retrieved Audio 3	Audiobox TTA-RAG (proposed)
birds chirping and water dripping with some banging in the background
an electronic signal followed by compressed air releasing then an electronic bell playing as a train runs on tracks in the background
rain falling followed by fabric rustling and footsteps shuffling then a vehicle door opening and closing as plastic crinkles
an animal heavily breathing then snorting followed by footsteps on a hard surface and a camera muffling
various insects and bugs are chirping with a rodent breathing sound in the background

Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation

Mu Yang¹, Bowen Shi², Matthew Le², Wei-Ning Hsu², Andros Tjandra²

¹Center for Robust Speech Systems (CRSS), University of Texas at Dallas, USA

²Meta AI, USA
