1 Introduction

Text-conditioned generative models have revolutionized the field of content creation, enabling the generation of high-quality natural images (Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2024; Haji-Ali et al., 2024a), vivid videos (Ho et al., 2022; Villegas et al., 2022; Wang et al., 2024d; Qiu et al., 2024; Menapace et al., 2024), and intricate 3D shapes (Cheng et al., 2023). The domain of audio synthesis has undergone comparable advancement (Huang et al., 2023b, a; Liu et al., 2023a; Xue et al., 2024; Guan et al., 2024; Saito et al., 2024; Niu et al., 2024; Yang et al., 2023a; Evans et al., 2024a; Liu et al., 2024b; Wang et al., 2024e; Guo et al., 2024), with three broad areas of study: speech, music and ambient sounds. The success in these domains rests on two key pillars: (i) the availability of high-quality large-scale datasets with text annotations, and (ii) the development of scalable generative models (Ho et al., 2020; Song et al., 2021). The objective of this work is to improve audio generation quality by scaling ambient audio generators across both the data and model axes.

In the field of audio synthesis, ambient audio generation emerges as a critical domain. Unlike speech and music, ambient sound generation is particularly challenging due to the lack of extensive, well-annotated datasets (Kim et al., 2019; Drossos et al., 2020). Attempts to curate ambient audio from online videos have largely failed, primarily because speech and music dominate such videos. For instance, in AudioSet (Gemmeke et al., 2017), the largest available audio dataset sourced from online videos, \(99\%\) of the clips contain speech or music. Previous efforts to filter ambient audio from similar datasets relied on expensive classifiers applied to the video or audio content, which made compiling a large-scale dataset impractical given the high filtering rate. In this work, we propose a simple yet scalable filtering approach that leverages existing automatic video transcription to identify segments with ambient sounds. This method is more efficient and more feasible, as it eliminates the need to download audio or video content before filtering. Through this approach, we built AutoReCap-XL, a dataset containing 47 million ambient audio clips sourced from existing video datasets, a 75-fold increase over the largest previously available datasets.

Another challenge in compiling large-scale text-audio datasets is providing accurate textual descriptions. For visual modalities, such as images and videos (Xue et al., 2022; Miech et al., 2019), researchers often relied on the raw description and metadata to train strong visual-text models including reliable captioners (Chen et al., 2024b). For ambient sounds, however, the task is substantially more challenging as accompanying raw text tends to describe visual information or convey feelings, rather than detailing the audio content. Moreover, human-captioned audio datasets are limited, containing fewer than 51k text-audio pairs in total. This significantly impacts the training of current captioning models, making them more susceptible to overfitting. To address this, we introduce AutoCap, a high-quality audio captioner that leverages visual cues to enhance captioning.

AutoCap makes two changes to existing captioners. First, it refines the commonly used encoder-decoder design based on a pretrained BART (Lewis et al., 2020) by introducing a Q-Former (Li et al., 2023) module, which learns an intermediate representation that better aligns the encoded audio with the original BART token representation. Second, we propose to remedy the data scarcity problem by using metadata and visual cues to aid the captioning process. Specifically, we augment the encoder inputs with a set of descriptive textual metadata, including the audio title and a caption derived from the visual modality. This dual-input approach improves over baselines on AudioCaps (Kim et al., 2019), marking a 3.2% improvement in CIDEr score. Using AutoCap, we provide textual descriptions for AutoReCap-XL and demonstrate the benefits of scaling audio generative models with synthetic captions.

Another axis for scaling generative models is model size (Peebles & Xie, 2023). While scaling diffusion backbones has shown significant benefits in image and video generation, ambient audio generation has shown poor scaling behavior. For instance, AudioLDM2 (Liu et al., 2024c) reported worse metrics for its largest model than for the smaller one. Similarly, EzAudio (Hai et al., 2024) achieved only marginal improvements when scaling model size. In this work, we introduce GenAu, a scalable transformer-based architecture that achieves significant improvements over state-of-the-art models. We recognize that audio sequences grow quickly with duration, yet contain many silent and redundant segments. Therefore, an efficient architecture that can handle such properties is needed. In particular, we employ a transformer architecture in the denoising backbone, where we modify the FIT transformer (Chen & Li, 2023) to generate audio in the latent space. On the AudioCaps dataset, GenAu achieves significant improvements over baselines, with \(11.1\%\) higher Inception Score, \(4.7\%\) better FAD, and \(13.5\%\) improvement in CLAP score, demonstrating superior audio-text alignment and generation quality. Moreover, GenAu shows promising scaling properties, with consistent improvements across all metrics as model size increases.

In summary, this work presents significant contributions in three areas: (i) AutoCap, a novel audio captioner tailored towards the annotation of data at a large scale that uses visual cues and audio metadata to improve accuracy and robustness; (ii) AutoReCap-XL, a large-scale ambient audio dataset comprising 47M audio clips paired with synthetic captions, 75 times larger than available datasets; and (iii) GenAu, a novel audio generator based on a scalable transformer architecture specifically adapted to the audio domain, achieving significant improvements over the previous state of the art.

2 Related Work

Automatic Audio Captioning (AAC). The goal of AAC is to produce language descriptions for given audio content. Recent AAC methods (Deshmukh et al., 2024a; Gontier et al., 2021; Wu et al., 2024; Zhang et al., 2025; Sridhar et al., 2024; Kadlcik et al., 2023; Cousin et al., 2023; Labbé et al., 2023; Xu et al., 2024b; Zhang et al., 2024; Ghosh et al., 2024a; Chen et al., 2025; Rho et al., 2025) employ encoder-decoder transformer architectures, where an encoder receiving the audio signal produces a representation that is used by the decoder to produce the output caption. WavCaps (Mei et al., 2024a) employs an HTSAT (Chen et al., 2022) audio encoder and uses a pretrained BART (Lewis et al., 2020) as the decoder. Similarly, EnCLAP (Kim et al., 2024c) uses a pretrained BART and improves on the audio representation. EnCLAP++ (Kim et al., 2024b) optimized the encoder selection of EnCLAP to further improve performance. Recognizing the limited data, CoNeTTE (Labbé et al., 2024) proposes to train a lightweight vanilla transformer decoder (Vaswani et al., 2017) instead. Other work explored augmentation to counter data scarcity (Kim et al., 2022; Labbé et al., 2024; Ye et al., 2022). Recent work (Liu et al., 2023b; Sun et al., 2024; Yuan et al., 2025) proposed to leverage visual information to address sound ambiguities, reporting improvements. More recent methods use audio large language models for zero-shot captioning (Kong et al., 2024; Deshmukh et al., 2023a; Ghosh et al., 2024b). Our method uses audio metadata and visual information as additional signals and leverages a lightweight Q-Former (Li et al., 2023) model to improve accuracy.

Text-conditioned audio generation. Current state-of-the-art text-to-audio generation methods widely adopt diffusion models (Yang et al., 2023b; Kreuk et al., 2023; Liu et al., 2023a, 2024c; Huang et al., 2023a; Ghosal et al., 2023; Evans et al., 2024a; Vyas et al., 2023; Hai et al., 2024; Evans et al., 2024b). AudioLDM 1 & 2 (Liu et al., 2023a, 2024c) make use of a latent diffusion model and employ a UNet as the diffusion backbone. Recently, StableAudio Open (Evans et al., 2024b) introduced a 1.32B model that uses a DiT (Peebles & Xie, 2023) to generate variable-length audio clips at 48kHz. Recent work also explored controllable audio generation (Shi et al., 2023; Xu et al., 2024a; Melechovsky et al., 2024; Paissan et al., 2024; Zhang et al., 2023b; Liang et al., 2024; Liu et al., 2024a), visual-conditioned audio generation (Wang et al., 2024e; Mei et al., 2024b; Wang et al., 2024a), and more recently joint audio-video generation (Tang et al., 2024b, 2023; Xing et al., 2024; Hayakawa et al., 2025; Tian et al., 2025; Vahdati et al., 2024; Chen et al., 2024a; Kim et al., 2024a; Wang et al., 2024b; Mao et al., 2024; Haji-Ali et al., 2024b). In this work, we propose a transformer architecture design that shows strong scalability properties.

Text-Audio Datasets. The performance of text-audio models (Zhu et al., 2024; Li et al., 2024; Deshmukh et al., 2023a; Mahfuz et al., 2023; Deshmukh et al., 2024b; Shu et al., 2023; Elizalde et al., 2024; Liu et al., 2024d; Tang et al., 2024a; Gong et al., 2024; Cheng et al., 2024; Zhang et al., 2023a) is currently hindered by the lack of high-quality large-scale paired audio-text data of ambient sounds. Existing human-captioned datasets (AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020)) have in total only 51k audio-text pairs. Another challenge is the limited availability of audio clips from sound-only platforms. LAION-Audio (Wu et al., 2023a) relied on numerous audio platforms such as BBC Sound Effects (BBC Sound Effects, 2024), Freesound (Font et al., 2013), and SoundBible (SoundBible, 2024) to form a dataset consisting of 630k audio samples with highly noisy raw descriptions. Chen et al. (2020) attempted to extract audio clips from videos by employing classifiers to detect ambient audio, speech, and music. To annotate such clips, WavCaps (Mei et al., 2024a) proposes a filtering procedure based on ChatGPT (Achiam et al., 2023) to collect 400k audio clips and weakly caption them based on the noisy descriptions alone. Several subsequent works (Majumder et al., 2024; Sun et al., 2024) adopted similar strategies of using large language models to augment captions. While weak captioning improves downstream metrics, it is suboptimal because it fails to incorporate the audio signal itself. In this work, we introduce an efficient dataset collection pipeline that relies on video datasets to extract ambient audio clips and automatic captioners to provide textual descriptions. We collect 47M audio clips, marking the largest available text-audio dataset.

Fig. 1

(Left) Overview of AutoCap. We employ a frozen HTSAT (Chen et al., 2022) encoder to produce a fine-grained audio representation of 1024 tokens. We then employ a Q-Former (Li et al., 2023) module to produce 256 tokens. These tokens, along with audio CLAP embeddings (Wu et al., 2023a) and 64 tokens of pertinent metadata, are processed by a pretrained BART to generate the final caption. (Right) Overview of GenAu. Following latent diffusion models, we use a frozen 1D-VAE to convert a Mel-spectrogram into a sequence of patch tokens, which are then divided into groups. We then apply a series of N FIT blocks (Chen & Li, 2023). Each block processes the patch tokens using ‘local’ attention layers. ‘Read’ and ‘write’ layers, implemented as cross-attention, facilitate information transfer between input patch tokens and learnable latent tokens. Finally, ‘global’ attention layers on latent tokens facilitate global communication across groups.

3 Method

In this section, we describe our approach to high-quality text-to-audio generation, starting with audio captioning using AutoCap in section 3.1, followed by data collection in section 3.2 and ambient audio generation with GenAu in section 3.3.

3.1 Automatic Audio Captioning

Recent state-of-the-art methods (Labbé et al., 2024; Kim et al., 2024c) generally employ an encoder-decoder transformer design where a pretrained audio encoder passes the audio representation to a pretrained language model serving as the decoder. This language model (e.g., BART) is typically finetuned to adapt to the audio representation. However, due to the distribution mismatch between the pretraining data of the language model and the audio embeddings produced by the encoder, the decoder suffers from catastrophic forgetting. Furthermore, audio is an inherently ambiguous modality, as many events can produce similar sound effects, a phenomenon often leveraged in animation, where soundscapes are artificially constructed. Audio clips from many sources, however, are still commonly associated with metadata that might be relevant for captioning, such as raw user descriptions or related modalities (e.g., accompanying visual information). Motivated by these observations, we propose AutoCap, an audio captioning model that employs an intermediate audio representation to connect the pretrained encoder and decoder and uses metadata to aid with the captioning. Figure 1 (left) presents an overview of AutoCap.

Fig. 2

Audio data collection pipeline. We employ online video speech transcripts to identify audio segments without subtitles, which typically correspond to clips lacking speech or music. These are processed by AutoCap to generate captions. As a post-filtering technique for ambient audio selection, we retain only clips whose captions lack music and speech keywords.

We consider a dataset of audio-caption pairs \(\langle \textbf{a}, \textbf{y}\rangle \) and corresponding metadata represented as a set of token sequences \(\{\textbf{m}_{j}\}_{j=1}^{j=M}\). Inspired by state-of-the-art AAC methods (Mei et al., 2024a; Labbé et al., 2024; Kim et al., 2024c), we employ an encoder-decoder architecture. We first compute a global feature representation of the audio:

$$\begin{aligned} \textbf{x}_\textrm{clap}=\mathcal {P}_\textrm{clap}(\mathcal {E}_\textrm{clap}(\textbf{a})), \end{aligned}$$
(1)

where \(\mathcal {P}_\textrm{clap}\) is a learnable projection layer and \(\mathcal {E}_\textrm{clap}\) is the audio encoder of a pretrained CLAP model (Wu et al., 2023a). We also compute local features as:

$$\begin{aligned} \textbf{x}_\textrm{audio}=\mathcal {Q}(\mathcal {E}_\textrm{a}(\textbf{a})), \end{aligned}$$
(2)

where \(\mathcal {Q}\) is a Q-Former (Li et al., 2023) and \(\mathcal {E}_\textrm{a}\) is a pretrained HTSAT (Chen et al., 2022) audio encoder that produces a time-aligned representation (1024 tokens) following Mei et al. (2024a). The Q-Former learns 256 latent tokens, which serve as queries in cross-attention layers over the input features, thereby outputting 256 tokens. Metadata sequences \(\textbf{m}_{i}\) are then embedded using the embedding layer of a pretrained BART to obtain embedding sequences \(\textbf{x}_{\textrm{meta}_{i}}\). For our experiments, we use video titles and captions as the metadata. We represent the input audio and metadata as the following input sequence:

$$\begin{aligned} \begin{aligned} \textbf{x}&= \textbf{x}_\textrm{clap}~\texttt {[boa]}~\textbf{x}_\textrm{audio}~\texttt {[eoa]}~\texttt {[bom]}_{1}~\textbf{x}_{\textrm{meta}_{1}} \\&\quad ~\texttt {[eom]}_{1}~...~\texttt {[bom]}_{M}~\textbf{x}_{\textrm{meta}_{M}}~\texttt {[eom]}_{M} \end{aligned} \text {,} \end{aligned}$$
(3)

where \(\texttt {[boa]}\) and \(\texttt {[eoa]}\) mark the beginning and end of the audio sequence embeddings \(\textbf{x}_\textrm{audio}\), and \(\texttt {[bom]}_{i}\) and \(\texttt {[eom]}_{i}\) mark the beginning and end of the metadata embeddings \(\textbf{x}_{\textrm{meta}_{i}}\). The input sequence is passed to a pretrained BART (Lewis et al., 2020) \(\mathcal {D}_\textrm{t}\) to predict a caption as \(\hat{\textbf{y}}= \mathcal {D}_\textrm{t}(\textbf{x})\).
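The layout of the input sequence in Eq. (3) can be illustrated with a small symbolic sketch. It only tracks token positions, not embeddings; the function name, the metadata lengths, and the `[eom]` label used here for the closing metadata delimiter are assumptions made for readability.

```python
from typing import List


def build_input_layout(num_audio_tokens: int, metadata_lengths: List[int]) -> List[str]:
    """Return one label per position of the decoder input sequence of Eq. (3).

    Real inputs would be embedding vectors; labels are used here only to
    make the ordering and the delimiter placement explicit.
    """
    layout = ["x_clap", "[boa]"]  # global CLAP feature, then audio start delimiter
    layout += [f"x_audio[{i}]" for i in range(num_audio_tokens)]
    layout.append("[eoa]")  # audio end delimiter
    for j, length in enumerate(metadata_lengths, start=1):
        layout.append(f"[bom]_{j}")  # start of the j-th metadata sequence
        layout += [f"x_meta_{j}[{i}]" for i in range(length)]
        layout.append(f"[eom]_{j}")  # end of the j-th metadata sequence (assumed label)
    return layout


# 256 Q-Former tokens as in the text; the two metadata lengths are hypothetical.
layout = build_input_layout(num_audio_tokens=256, metadata_lengths=[12, 20])
print(len(layout), layout[:3], layout[-1])
```

With two metadata sequences (for instance a title and a visual caption), the sequence holds one CLAP token, 256 audio tokens, the metadata tokens, and six delimiters.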

Training. We train our model using a standard cross-entropy loss over next-token predictions. To avoid degrading the quality of the pretrained BART and audio encoder models, we adopt a two-stage training procedure. In Stage 1, both the audio encoders and the BART model are kept frozen, thus allowing the Q-Former, projection layers, and newly introduced delimiter tokens to align to the pretrained BART representation. In this stage, we pretrain the model using a larger dataset of weakly-labeled audio clips. In Stage 2, we unfreeze all BART parameters apart from the embedding layer and finetune the model on the AudioCaps dataset at a lower learning rate to bring the captioning style closer to that of human annotators. This training strategy leverages the larger, weakly-labeled dataset while minimizing the knowledge drift in the pretrained BART. The use of a Q-Former to learn an intermediate representation is pivotal for such a strategy.
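The two-stage schedule can be summarized as a freeze plan. The sketch below is illustrative only: the parameter-group names are hypothetical, the learning rates are those reported in Section 4.1, and it assumes that the audio encoders stay frozen and that the Q-Former, projection layers, and delimiter tokens remain trainable in Stage 2, which the text does not state explicitly.

```python
from typing import Dict, Tuple

# Hypothetical parameter-group names used only for this illustration.
GROUPS: Tuple[str, ...] = (
    "audio_encoders",
    "q_former",
    "projection_layers",
    "delimiter_tokens",
    "bart_embeddings",
    "bart_transformer",
)


def stage_plan(stage: int) -> Dict[str, object]:
    """Return the learning rate and the trainable flag of each group for a stage."""
    newly_introduced = {"q_former", "projection_layers", "delimiter_tokens"}
    if stage == 1:
        # Stage 1: encoders and BART frozen; only the new modules are aligned.
        trainable, learning_rate = newly_introduced, 1e-4
    elif stage == 2:
        # Stage 2: BART is unfrozen except for its embedding layer.
        # Keeping the new modules trainable here is an assumption.
        trainable, learning_rate = newly_introduced | {"bart_transformer"}, 1e-5
    else:
        raise ValueError("stage must be 1 or 2")
    return {
        "learning_rate": learning_rate,
        "trainable": {group: group in trainable for group in GROUPS},
    }


for stage in (1, 2):
    print(stage, stage_plan(stage))
```

In a deep learning framework, such a plan would typically be applied by toggling gradient computation per parameter group before building the optimizer.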

Fig. 3

Scaling analysis of model size (left) and data with synthetic captions (right) reveals consistent improvements in FD and IS.

3.2 Data Collection and Re-captioning Pipeline

Generative models in the image and video domains have shown benefits from increased quantities of data and improved quality of captions. In the audio domain, however, the major human-annotated audio-text datasets, namely AudioCaps (Kim et al., 2019) and Clotho (Drossos et al., 2020), provide only 51k audio clips combined. Previous methods attempted to extract additional ambient audio clips from existing video datasets using pretrained audio classifiers, but a high filtering rate marked this method impractical. Instead, we found that automatic transcripts offer reliable information about the segments containing ambient sounds. In particular, we propose to select the parts of the videos that contain no automatic transcription, suggesting the absence of speech and music. Such an approach offers specific advantages over using pretrained classifiers. Automatic transcripts, readily available for most online videos, eliminate the need to download and process video and audio data before filtering. Additionally, as these transcripts provide precise time-aligned information, they facilitate the extraction of more segments per video. Subsequently, we leverage our AutoCap model to provide textual descriptions of the extracted audio clips. Despite the effectiveness of this method in collecting ambient sounds, some clips still inadvertently contain music or speech due to transcription errors, particularly with speech in less common languages. We address this by analyzing captions and filtering out clips with keywords related to speech or music. Finally, we filter all audio-text pairs with CLAP similarity less than 0.1.
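The selection and post-filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used to build the dataset: the function names, the minimum segment length, and the keyword list are hypothetical, while the CLAP similarity threshold of 0.1 is taken from the text.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def untranscribed_segments(
    transcript: List[Segment], duration: float, min_length: float = 3.0
) -> List[Segment]:
    """Return the parts of a video not covered by any transcript segment."""
    gaps: List[Segment] = []
    cursor = 0.0  # end of the transcribed region seen so far
    for start, end in sorted(transcript):
        if start - cursor >= min_length:
            gaps.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping transcript segments
    if duration - cursor >= min_length:
        gaps.append((cursor, duration))
    return gaps


# Hypothetical keyword list; the actual list is not given in the text.
SPEECH_MUSIC_KEYWORDS = {"speech", "speaks", "speaking", "talking", "music", "singing", "song"}


def keep_clip(caption: str, clap_similarity: float, threshold: float = 0.1) -> bool:
    """Keep a clip if its caption has no speech/music keyword and CLAP similarity is high enough."""
    words = {word.strip(".,!?").lower() for word in caption.split()}
    return words.isdisjoint(SPEECH_MUSIC_KEYWORDS) and clap_similarity >= threshold


gaps = untranscribed_segments([(5.0, 12.0), (10.0, 20.0), (35.0, 40.0)], duration=60.0)
print(gaps)
print(keep_clip("A dog barks while wind blows", 0.35))
print(keep_clip("A man speaking over music.", 0.40))
```

Because the first function needs only transcript timestamps and the video duration, candidate segments can be chosen before any audio or video is downloaded.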

We follow this process to extract 466k audio-text pairs from AudioSet (Gemmeke et al., 2017) and VGGSound (Chen et al., 2020). Additionally, we recaption datasets without visual content such as Freesound, BBC Sound Effects, and SoundBible. To provide metadata, we employ the captioning model of Chen et al. (2024b) to extract a caption whenever video content is available and pass an empty text otherwise. In total, we form AutoReCap, a large-scale dataset comprising 761,113 audio-text pairs with precise captions. As an additional contribution, we introduce AutoReCap-XL, in which we scale our approach by analyzing four additional large-scale video datasets (Lee et al., 2021; Xue et al., 2022; Zellers et al., 2022; Nagrani et al., 2022) with a total of 71M videos and 715.4k hours. After filtering, we collect and re-caption 47M ambient audio clips spanning 123.5k hours from 20.3M different videos, forming by far the largest available dataset of audio with paired captions. Figure 2 summarizes our data collection pipeline, and Sec. A in the Appendix presents more details about the dataset collection and processing, as well as dataset statistics.

Table 1 AutoCap results on AudioCaps test split for various models. AS: AudioSet, AC: AudioCaps, WC: WavCaps, CL: Clotho, MA: Multi-Annotator Captioned Soundscapes, \(AS_A\): ambient audio subset from AudioSet.

3.3 Scalable Text-to-Audio Generation

We design our audio generation pipeline, GenAu, as a latent diffusion model. Figure 1 (right) shows an overview of our proposed model. In the following section, we describe in detail the structure of our latent variational autoencoder (VAE) and the latent diffusion model.

Latent VAE. Directly modeling waveforms is complex due to the high data dimensionality of audio signals. Instead, we replace the waveform with a Mel-spectrogram representation and use a VAE to further reduce its dimensionality, following prior work (Melechovsky et al., 2024; Huang et al., 2023b). Once generated, Mel-spectrograms can be decoded back to a waveform through a vocoder (Lee et al., 2023). However, commonly used 2D autoencoder designs (Liu et al., 2023a, 2024c; Melechovsky et al., 2024) are not well suited to Mel-spectrograms, because the spacing between Mel channels is non-linear. In other words, since the Mel bins are spaced logarithmically, shifting a 2D CNN kernel along the frequency dimension mixes narrow low-frequency filters with very wide high-frequency filters, thus breaking the translation-invariance property of the convolutional filter. We instead opt for a 1D-VAE design based on 1D convolutions similar to Huang et al. (2023a). We train the VAE following Esser et al. (2021).
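The non-uniform spacing of Mel channels can be made concrete with a short calculation. The sketch below assumes the common HTK-style Mel formula and a hypothetical configuration of 64 bins up to 8 kHz (the Nyquist frequency of 16 kHz audio); real Mel filterbanks use overlapping triangular filters, so the band edges here are only a proxy for filter width.

```python
import math
from typing import List


def hz_to_mel(frequency_hz: float) -> float:
    """HTK-style Hz-to-Mel conversion (an assumed, commonly used formula)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bins: int, f_min: float, f_max: float) -> List[float]:
    """Edges (in Hz) of num_bins bands spaced uniformly on the Mel scale."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / num_bins
    return [mel_to_hz(low + i * step) for i in range(num_bins + 1)]


edges = mel_band_edges(num_bins=64, f_min=0.0, f_max=8000.0)
widths = [upper - lower for lower, upper in zip(edges, edges[1:])]
print(f"lowest band: {widths[0]:.1f} Hz wide, highest band: {widths[-1]:.1f} Hz wide")
```

Under these assumptions the highest band is roughly an order of magnitude wider in Hz than the lowest one, so a kernel shifted along the frequency axis covers very different frequency ranges at different positions.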

Latent diffusion model. Following the latent diffusion paradigm, we generate audio by training a diffusion model in the latent space of the 1D-VAE. Transformer-based diffusion models currently attain state-of-the-art performance in audio generation (Huang et al., 2023a). However, both UNet and transformer-based baselines exhibited limited performance gains with increasing model size (Liu et al., 2024c; Hai et al., 2024). We observe that ambient audio often contains extensive silent and redundant segments, which may explain the poor scalability of UNet and DiT-based methods, as they distribute computation uniformly across the input. Therefore, we propose to use a more dynamic transformer architecture as a diffusion backbone (Chen & Li, 2023; Menapace et al., 2024). In particular, we adopt the FIT architecture of Menapace et al. (2024), which was originally proposed to work in the pixel space, and revise it for the latent space of the audio modality.

Given a 1D input \(\textbf{x}\), we follow the approach of Menapace et al. (2024) by first applying a projection operation to every p consecutive latent features to produce a sequence of input patch tokens, where p indicates the patch size. We then apply a sequence of FIT blocks to the input patches, where each block divides patch tokens into contiguous groups of a predefined size. A set of local self-attention layers is then applied separately to each group to avoid the quadratic computational complexity of attention computation. Unlike the video domain (Menapace et al., 2024), where the high input dimensionality makes the local layers excessively expensive, we found them to be beneficial for audio generation. This is because videos are typically represented in video diffusion models with a much larger number of tokens than audio in audio diffusion models, making the “local” self-attention layers that operate on the input patch tokens far more expensive in the video modality than in the audio modality. To further reduce the amount of computation while maintaining long-range interaction, each block considers a small set of latent tokens. First, a read operation implemented as a cross-attention layer transfers information from the patches to the latent tokens. Later, a series of global self-attention operations are applied to the latent tokens, allowing information sharing between different groups. Finally, a write operation implemented as a cross-attention layer transfers information from the latent tokens back to the patches. Due to the reduced number of latent tokens when performing the global self-attention, computational requirements of the model are reduced with respect to a vanilla transformer design (Vaswani et al., 2017). Such a design is particularly suited for the audio modality, which contains mostly silent or redundant parts. Unlike DiT and UNet-based methods (Ronneberger et al., 2015; Peebles & Xie, 2023), which allocate computation resources uniformly across input tokens, the FIT architecture selectively focuses on the more informative parts, dedicating more compute to these parts as the model size scales.
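The computational argument can be illustrated by counting query-key pairs for one layer of each type. This is a rough sketch under stated assumptions: read and write are assumed to operate within each group, each group is assumed to own the same number of latent tokens, and the sequence length and latent count below are hypothetical (only the group size of 32 appears in our experiments).

```python
from typing import Dict


def full_attention_pairs(num_patches: int) -> int:
    """Query-key pairs of one vanilla self-attention layer over all patches."""
    return num_patches ** 2


def fit_attention_pairs(num_patches: int, group_size: int, latents_per_group: int) -> Dict[str, int]:
    """Query-key pairs for one local, read, global, and write layer of a FIT-style block."""
    if num_patches % group_size != 0:
        raise ValueError("num_patches must be divisible by group_size")
    num_groups = num_patches // group_size
    local = num_groups * group_size ** 2  # self-attention inside each group
    read = num_groups * latents_per_group * group_size  # latents attend to their group's patches
    write = num_groups * group_size * latents_per_group  # patches attend to their group's latents
    global_ = (num_groups * latents_per_group) ** 2  # self-attention over all latent tokens
    return {
        "local": local,
        "read": read,
        "global": global_,
        "write": write,
        "total": local + read + global_ + write,
    }


# Hypothetical sequence length and latent count; group size 32 as in the experiments.
fit = fit_attention_pairs(num_patches=256, group_size=32, latents_per_group=8)
print(fit, full_attention_pairs(256))
```

In this toy configuration one layer of each type together costs a quarter of a single full self-attention layer, and the gap widens as the sequence grows because only the global term is quadratic in the (much smaller) number of latent tokens.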

To condition the generation on an input prompt, we use a pretrained FLAN-T5 model (Chung et al., 2024) and a CLAP (Wu et al., 2023a) text encoder to produce their respective embeddings \(e_{\textrm{FLAN}}\) and \(e_{\textrm{CLAP}}\), following prior work of Liu et al. (2024c), which we concatenate with the diffusion timestep \(t\) to form the input conditioning signal \(c\). We insert an additional cross-attention operation inside each FIT block immediately before the ‘read’ operation that makes latent tokens attend to the conditioning. Moreover, we use conditioning on dataset ID to adapt the generation style to different datasets. We perform such conditioning by adding a learnable embedding of the dataset ID to the context alongside the timestep embedding. We train the model using the epsilon prediction objective and follow a linear noise scheduler.
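For reference, the epsilon prediction objective with a linear noise schedule can be written in its standard form. The symbols \(\textbf{z}_0\) (clean VAE latent), \(\beta_t\), \(\bar{\alpha}_t\), \(T\), and \(\epsilon_\theta\) are notation assumed here rather than defined in the text, and the schedule endpoints \(\beta_1\) and \(\beta_T\) are left unspecified:

$$\begin{aligned} \beta_t&= \beta_1 + \frac{t-1}{T-1}\,(\beta_T - \beta_1), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s), \\ \textbf{z}_t&= \sqrt{\bar{\alpha}_t}\,\textbf{z}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim \mathcal {N}(\textbf{0}, \textbf{I}), \\ \mathcal {L}&= \mathbb {E}_{\textbf{z}_0, \boldsymbol{\epsilon}, t}\left[ \Vert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta (\textbf{z}_t, c)\Vert _2^2 \right] , \end{aligned}$$

where the conditioning signal \(c\) already contains the timestep \(t\) together with the text embeddings.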

4 Experiments

Table 2 AutoCap ablation study on AudioCaps

In section 4.1, we evaluate AutoCap quantitatively. We then demonstrate the capabilities of GenAu in section 4.2 and discuss scaling trends with respect to data and model size. We also provide qualitative comparisons on the Website.

4.1 Automatic Audio Captioning

Training dataset and details. We train AutoCap in two stages. During Stage 1, we pretrain on a large weakly labeled dataset of 634,208 audio clips, constructed from AudioSet, Freesound, BBC Sound Effects, SoundBible, AudioCaps, and Clotho. We use ground truth captions from the AudioCaps and Clotho datasets, WavCaps captions for Freesound, SoundBible, and BBC Sound Effects, and handcrafted captions built from a template over the ground truth class labels for AudioSet. As metadata, we use the title provided with each clip, and pre-compute video captions using a pretrained Panda70M model (Chen et al., 2024b) or pass an empty string when the video modality is unavailable. We pretrain the model for 20 epochs with a learning rate of 1e-4, while keeping the audio encoder and pretrained BART frozen. In Stage 2, we fine-tune the model for 20 epochs on AudioCaps using a learning rate of 1e-5. We use 10-second clips at 32 kHz for all experiments.

Baselines. We compare with V-ACT (Liu et al., 2023b), BART-tags (Gontier et al., 2021), AL-MixGEN (Kim et al., 2022), ENCLAP (Kim et al., 2024c), HTSAT-BART (Xu et al., 2024b), CNext-trans (Labbé et al., 2024) and GAMA (Ghosh et al., 2024b). Among these, ENCLAP and CNext-trans have the best performance. ENCLAP benefits from a stronger audio encoder and a CLAP representation. CNext-trans trains a lightweight transformer instead of fine-tuning a pretrained language model to reduce overfitting.

Metrics and evaluation. We report results using the established BLEU1 and BLEU4 (Papineni et al., 2002), ROUGE (Lin, 2004), Meteor (Lavie & Agarwal, 2007), CIDEr (Vedantam et al., 2015), and SPIDEr (Liu et al., 2017) metrics. We evaluate our method on the AudioCaps test split using the last checkpoint of our trained model. We follow the same evaluation pipeline as baselines and include their reported results, except for GAMA which we evaluate using their released checkpoint. Metrics unavailable in these publications are excluded from our analysis.
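As a reminder of how two of these metrics relate, SPIDEr (Liu et al., 2017) is the average of SPICE and CIDEr, so a change in either component shifts SPIDEr by half that amount:

$$\begin{aligned} \textrm{SPIDEr} = \tfrac{1}{2}\left( \textrm{SPICE} + \textrm{CIDEr}\right) . \end{aligned}$$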

Results. Tab. 1 shows that our method outperforms baselines on all metrics, achieving notable improvements in CIDEr (83.2) and BLEU1 (73.1) scores. Notably, even without metadata (i.e., audio only), AutoCap surpasses baselines in most metrics. We found that incorporating metadata significantly enhances CIDEr but slightly reduces SPICE. This trade-off likely results from the additional descriptive detail brought by the metadata, which, while enriching the content, introduces noise that may compromise the model’s semantic precision. In addition, AudioCaps is labeled based on audio alone. Thus, the evaluation penalizes the description of information that cannot be deduced with certainty from the audio modality only, such as the specific type of object producing the sound. Qualitatively, our captions are more detailed and temporally accurate than those of baselines. ENCLAP-Large often misses key details. CNext-trans, while accurate, often produces short captions that lack details. We include qualitative comparisons in the Website and Appendix. Moreover, AutoCap is four times faster than ENCLAP, producing a caption for a 10-second clip in 0.28 seconds, compared to 1.12 seconds for ENCLAP. Furthermore, we observe consistent improvements when pretraining on weakly-labeled data, validating the effectiveness of our training strategy in benefiting from larger, weakly-labeled datasets.

Ablations. In Tab. 2, we ablate model design choices. Using the CLAP embedding brings a 2.5-point increase in CIDEr. Omitting Stage 2 training, which involves finetuning BART (Lewis et al., 2020), results in performance degradation, likely due to the necessity of adapting BART’s decoder to the sentence structure typical of AudioCaps. A more severe degradation in performance is observed when Stage 1 is not performed. In this setting, the model directly finetunes the pretrained BART without resolving the misalignment in the representations between the pretrained encoder and BART. This can lead to catastrophic forgetting in the language model, resulting in a significant performance degradation. Finally, finetuning the BART word embeddings in Stage 2 reduces performance.

Table 3 Comprehensive evaluation of GenAu. (a) Comparison with previous work. (b) Key ablations on an out-of-distribution dataset. (c) User preference study on various GenAu variants.

4.2 Text-to-Audio Generation

Training dataset and details. We follow baselines (Liu et al., 2024c; Huang et al., 2023a; Majumder et al., 2024) and train on 10-second clips at 16kHz resolution. We use a patch size of 1 and a group size of 32. We use the LAMB optimizer (You et al., 2020) with a learning rate of 5e-3. We train for 220k steps and choose the checkpoint with the highest IS.

Baselines. We compare with TANGO 1 & 2 (Ghosal et al., 2023), AudioLDM 1 & 2 (Liu et al., 2023a, 2024c), and Make-An-Audio 1 & 2 (Huang et al., 2023b, a). Both AudioLDM and Make-An-Audio train a UNet-based latent diffusion model (Rombach et al., 2022) on a Mel-spectrogram representation, regarding it as a single-channel image, and use a pretrained CLAP encoder to condition the generation on an input prompt. TANGO proposed to use FLAN-T5 (Chung et al., 2024) as the text encoder and reported significant improvements. AudioLDM-2 and Make-An-Audio-2 proposed a dual-encoder strategy combining a T5 (Raffel et al., 2022) and a CLAP encoder. Make-An-Audio-2 additionally proposes to use a 1D VAE representation and to employ a DiT as the diffusion backbone. Recently, Tango-2 proposed to use instruction fine-tuning on a synthetic dataset to enhance temporal understanding. In our experiments, we focus on text-conditioned natural audio generation.

Metrics. We compare the performance of our method with baselines using the standard Fréchet Distance (FD), Inception Score (IS), and CLAP score on the AudioSet test split. There is little consistency between baselines when computing the metrics. Some prior work reported the Fréchet distance using the VGGish network (Hershey et al., 2017), denoted as FAD (Kilgour et al., 2019), while others use PANNs (Kong et al., 2019). Additionally, to compute the CLAP score, some prior work (Liu et al., 2024c) used CLAP from LAION, which we denote as \(\text {CLAP}_{LAION}\) (Wu et al., 2023b), while others (Majumder et al., 2024; Huang et al., 2023b, a) used CLAP from Microsoft (Elizalde et al., 2023), which we denote as \(\text {CLAP}_{MS}\). Furthermore, some prior work (Liu et al., 2023a, 2024c) used CLAP re-ranking with 3 samples for computing the metrics. Due to such inconsistencies in evaluation pipelines and varying results for the same baselines reported in different studies, we recompute all metrics using the official checkpoints to ensure consistent comparisons. We follow the evaluation protocol of AudioLDM (Liu et al., 2023a) without CLAP re-ranking and use the AudioLDM evaluation package to compute the metrics. We use a DDIM sampler with a CFG scale of 3.5 and 200 sampling steps for all baselines. In addition, to prevent biasing the evaluation towards the training data, we run our ablations on the Bigsoundbank split from WavText5k (Deshmukh et al., 2023b), which serves as an out-of-distribution evaluation for our models. Finally, to further validate our results, we run user studies. Details about the user study can be found in the Appendix.
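The role of the CFG scale during sampling can be sketched as follows. This is a minimal illustration of the widely used classifier-free guidance combination, in which a scale of 1 recovers the conditional prediction; the function name is hypothetical, and the exact convention used by each baseline's sampler is an assumption.

```python
from typing import List


def classifier_free_guidance(
    eps_uncond: List[float], eps_cond: List[float], scale: float
) -> List[float]:
    """Combine unconditional and conditional noise predictions element-wise.

    Uses eps = eps_uncond + scale * (eps_cond - eps_uncond), so scale=0 gives
    the unconditional prediction and scale=1 the conditional one.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy two-dimensional predictions; 3.5 is the scale used in the evaluation.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=3.5)
print(guided)
```

Scales above 1 extrapolate away from the unconditional prediction, which typically strengthens prompt adherence at some cost in sample diversity.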

Comparison with baselines. In Tab. 3a, we report evaluation results. When trained at a model size (1.25B vs 937M parameters) and data scale (811k vs 1M samples) similar to the state-of-the-art method Make-An-Audio-2, GenAu achieves superior performance in most metrics, improving IS by \(22.65\%\), FAD by \(4.7\%\), and \(\text {CLAP}_{LAION}\) by \(13.5\%\). Using the full 10-second subset of AutoReCap-XL further enhances the results, with improvements of over \(31.1\%\) in IS and \(26.7\%\) in CLAP score. Additionally, to isolate the impact of model architecture from data quality, we conduct a user study in Tab. 3c (backbone). We train GenAu-S (493M params) with captions generated through HTSAT-BART from WavCaps (Mei et al., 2024a). Despite similar caption quality, a smaller data size (811k vs 1M), and a smaller model size (493M vs 937M params), GenAu-S is still consistently preferred over Make-An-Audio-2 (MAD-2).
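
The percentages above mix metrics where higher is better (IS, CLAP score) with ones where lower is better (FD, FAD). The following is a minimal sketch of how such figures can be computed, assuming they are relative changes with respect to the baseline value; the text does not state the exact formula. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, ours: float, higher_is_better: bool) -> float:
    """Return the improvement of `ours` over `baseline`, in percent.

    Assumption: improvement is the relative change with respect to the
    baseline, with the sign flipped for lower-is-better metrics so that
    a positive result always means "ours is better".
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative change")
    change = (ours - baseline) / abs(baseline)
    if not higher_is_better:
        # For distances such as FD or FAD, a decrease is an improvement.
        change = -change
    return 100.0 * change
```

With purely illustrative placeholder values (not taken from Tab. 3a), `relative_improvement(10.0, 12.0, higher_is_better=True)` and `relative_improvement(2.0, 1.5, higher_is_better=False)` both yield positive percentages, so improvements on the two kinds of metric can be reported side by side.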

How does GenAu scale with synthetic data? To study this, we train GenAu-S (493M params) for 50k steps, keeping AudioCaps and Clotho fixed in the training data and varying the amount of synthetic data from AutoReCap. As reported in Fig. 3 (right), increasing synthetic training data consistently improves both IS and FD. Similar improvements are evident in Tab. 3b (data scale), where increasing the dataset size significantly boosts all metrics, improving IS by \(56.3\%\). User studies in Tab. 3c (data scale) further support these findings: a model trained with AutoReCap is consistently favored over a model trained only on AudioCaps. Finally, we report in Tab. 3a the scaling performance on AutoReCap-XL (47M). To ensure a fair comparison, we train on the 10-second subset (19.7M) and observe improvements of 12.3% in FD and 11.6% in CLAP score compared to training with AutoReCap (811k).
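
The data-scaling setup described above (a fixed human-annotated base plus a varying amount of synthetic data) can be sketched as follows. This is an illustrative sketch only: the function name `build_training_mix`, the seed, and the choice of nested subsets are assumptions, and the text does not specify how the synthetic subsets were drawn.

```python
import random


def build_training_mix(base_clips, synthetic_clips, num_synthetic, seed=0):
    """Return all base clips plus `num_synthetic` synthetic clips.

    Assumption: subsets are nested. The synthetic pool is shuffled once
    with a fixed seed and a prefix is taken, so every larger mix contains
    every smaller one and the runs differ only in the added data.
    """
    if not 0 <= num_synthetic <= len(synthetic_clips):
        raise ValueError("num_synthetic must be between 0 and the pool size")
    pool = list(synthetic_clips)
    random.Random(seed).shuffle(pool)  # same seed gives the same order every call
    return list(base_clips) + pool[:num_synthetic]
```

Calling the function with increasing `num_synthetic` and the same `seed` produces a sequence of training sets in which the base clips (here standing in for AudioCaps and Clotho) are always present, which keeps the comparison across data scales controlled.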

Is caption quality important? In Tab. 3b (caption quality), we compare GenAu-S against the same model trained with captions generated by the WavCaps (Mei et al., 2024a) captioning model. We observe gains across all metrics, confirming the importance of caption quality and the improvements brought by our high-quality captioner AutoCap. Interestingly, expanding data size with lower-quality captions offers no significant gains over training on AudioCaps alone, consistent with Liu et al. (2024c).

Does GenAu benefit from model size scaling? Similar to data scaling, increasing model size consistently enhances performance. As shown in Fig. 3 (left), larger models achieve better FD and IS. This is further confirmed in Tab. 3b (model scale), where GenAu-L (1.25B) outperforms GenAu-S (493M) across all metrics, with an almost 20.5% increase in IS. The user study in Tab. 3c shows a strong preference for GenAu-L over GenAu-S. Unlike previous methods, which reported diminishing returns with model scaling (Hai et al., 2024; Liu et al., 2024c), GenAu continues to improve as model size scales.

How do FITs compare with other diffusion backbones? We evaluate the impact of the diffusion backbone by replacing FIT with a UNet (Ronneberger et al., 2015) or a DiT (Peebles and Xie, 2023). In Tab. 3b (backbone), we observe that GenAu with FIT outperforms these alternatives across all metrics. We infer that the FIT architecture, with its read and write operations, allocates compute more efficiently to the key segments of the input, making it well suited to ambient audio clips, which often include silent or redundant parts.
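
To make the read and write operations concrete, the following is a heavily simplified conceptual sketch and not the GenAu or FIT implementation. It assumes single-head attention with no learned projections, normalization, MLPs, or token grouping, and all function names (`cross_attend`, `read_process_write`) are hypothetical. It only illustrates the pattern of a small set of latent tokens reading from many data tokens, being processed globally, and writing back.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query becomes a softmax-weighted average of the key/value vectors."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


def add(xs, ys):
    """Element-wise residual connection between two lists of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def read_process_write(data_tokens, latent_tokens):
    # Read: the few latent tokens gather information from the many data tokens.
    latents = add(latent_tokens, cross_attend(latent_tokens, data_tokens))
    # Process: global self-attention runs only over the latent tokens.
    latents = add(latents, cross_attend(latents, latents))
    # Write: the data tokens are updated from the processed latents.
    data = add(data_tokens, cross_attend(data_tokens, latents))
    return data, latents
```

With N data tokens and L latent tokens, the read and write steps each compute N·L attention scores and the global step L², compared with N² for full self-attention over the data tokens. When L is much smaller than N, this is one way compute can be concentrated on a compact summary of the input rather than spent uniformly on silent or redundant segments.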

5 Conclusion

We take a holistic approach to improve the quality of existing audio generators. Starting by addressing the scarcity of large-scale captioned audio datasets, we propose a scalable and efficient dataset collection pipeline. We then build AutoCap, a strong audio captioner that leverages audio metadata, and use it to annotate a dataset of 47M audio clips tailored for large-scale audio generation. Finally, we build a latent diffusion model based on a scalable transformer architecture and train it on our re-captioned dataset to obtain GenAu, a high-quality open-source model for audio generation. Our approach opens up possibilities for extending GenAu to other domains, such as speech and music. Additionally, AutoReCap-XL can serve as a joint text-audio-video dataset and enable novel applications such as joint text-to-audio-video generation.

Limitations and future work. AutoCap was fine-tuned on AudioCaps, featuring 4,892 unique words, which limits the diversity of our generated captions. Consequently, GenAu may face challenges in accurately generating audio for detailed prompts. Additionally, while AutoReCap is extensive in size, it has only been validated for audio generation. We leave broader analysis on more tasks for future work.

6 Supplementary information

Please refer to the supplementary website for qualitative samples of the dataset, audio captioner and generator.