Llasa: A Scalable Single-Model Approach to Speech Synthesis

An Efficient Language Model for More Natural Speech Synthesis: Llasa Focuses on Scaling
The rapid advances in large language models (LLMs) such as GPT and Llama have demonstrated the effectiveness of scaling compute in both training and inference: more compute enables more complex tasks and higher accuracy. In text-to-speech (TTS), however, most current state-of-the-art systems are multi-stage pipelines that chain separate models, for example a diffusion model applied after the LLM. With several models in play, it is difficult to decide which one to scale at training or inference time.
New research introduces Llasa, a speech synthesis framework that removes this complexity. Llasa combines a single-layer vector quantizer (VQ) codec with a single Transformer, aligning the architecture with standard LLMs such as Llama and making it straightforward to scale compute.
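To make the single-model formulation concrete, here is a minimal sketch of the idea: text tokens and speech codec tokens share one vocabulary, and one decoder-only Transformer is trained with plain next-token prediction. The class name, token layout, and hyperparameters below are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a single-model TTS formulation (illustrative;
# the class, token layout, and hyperparameters are assumptions, not
# the released Llasa code). Text tokens and speech codec tokens share
# one vocabulary; the model does plain next-token prediction.

import torch
import torch.nn as nn

class SingleStageTTS(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512,
                 n_layers: int = 8, max_len: int = 4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # joint text+speech vocab
        self.pos = nn.Embedding(max_len, d_model)       # learned positions
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) = [text ids ... <speech_start> codec ids ...]
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.backbone(x, mask=causal.to(tokens.device))
        return self.head(h)  # next-token logits over the joint vocabulary
```

Because the whole system is one causal language model, scaling it works the same way as scaling any Llama-style model: grow the width, depth, and training data, with no separate acoustic or diffusion stage to tune.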
Scaling Training and Inference Performance
The research examines how compute scaling affects speech synthesis at both training and inference time. Experiments with Llasa show that scaling training compute consistently improves the naturalness of synthesized speech and enables more complex and accurate prosody patterns. Prosody, the rhythm, stress, and intonation of speech, is crucial to how natural speech sounds.
For inference-time scaling, the research uses speech understanding models as verifiers during the search: the model generates multiple candidates, and a verifier selects among them. The results suggest that scaling inference compute shifts the sampling distribution toward the preferences of the chosen verifier, yielding improvements in emotional expressiveness, voice consistency, and content accuracy.
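As an illustration of what verifier-guided search can look like, the sketch below implements simple best-of-N sampling. The `model.sample` and `verifier.score` interfaces are hypothetical placeholders; the paper's actual verifiers and search procedure may be more elaborate.

```python
# Hedged sketch of inference-time scaling via best-of-N search.
# `model.sample` and `verifier.score` are hypothetical interfaces;
# the paper uses speech understanding models as verifiers, and its
# exact search strategy may differ from this simple variant.

def best_of_n(prompt_tokens, model, verifier, n=16):
    """Sample n candidate speech-token sequences, return the verifier's favorite."""
    candidates = [model.sample(prompt_tokens) for _ in range(n)]
    scores = [verifier.score(prompt_tokens, cand) for cand in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx]
```

Spending more inference compute here simply means raising n; which qualities improve depends on the verifier used for scoring, matching the observation that sampling shifts toward the verifier's preferences.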
Simple Design for Increased Efficiency
Llasa's simple design, a single-layer VQ codec paired with a single Transformer, scales more efficiently than multi-stage systems. With one model, there is no need to scale and optimize several components separately, which simplifies training and lets compute be allocated where it helps most.
Public Availability of Code and Checkpoints
To encourage further research and development in this area, the developers of Llasa have released the checkpoints and training code for their TTS models (1B, 3B, and 8B) as well as the X-Codec-2.0 codec model on GitHub and Hugging Face. Other researchers can therefore build on the results and run their own experiments, which supports the transparency and reproducibility of the work.
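For orientation, the snippet below shows one plausible way to pull a released checkpoint from Hugging Face. It assumes the checkpoint loads as a standard causal LM via transformers; the exact tokenizer and codec decoding steps should be checked against the published model cards.

```python
# Hedged sketch: fetching a released Llasa checkpoint from Hugging Face.
# Assumes the checkpoint loads as a standard causal LM; the repository
# layout, tokenizer, and codec decoding step are assumptions to verify
# against the model cards.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HKUSTAudio/Llasa-3B"  # 1B and 8B variants are also published
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Generated token ids would then be mapped to codec tokens and decoded
# to a waveform with the companion codec model (X-Codec-2.0).
```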
Conclusion
Llasa is a promising approach to speech synthesis centered on scaling compute in both training and inference. The results show that this scaling improves the naturalness, emotional expressiveness, and accuracy of synthesized speech. The framework's simple design and the public release of code and checkpoints should help this line of work develop and spread further.