
Meta Spirit LM integrates speech and text in a new multimodal GenAI model

Presented in a recent paper, Spirit LM enables the creation of pipelines that mix spoken and written text by integrating speech and text into the same multimodal model. According to Meta, their new approach, based on interleaving text and speech tokens, circumvents the limitations inherent in previous solutions that use separate pipelines for speech and text.

Meta’s new model is based on a pre-trained 7B text-only language model (Llama 2) that is extended to include speech. To this end, the model is continually trained on both text and speech units.

Speech and text sequences are concatenated into a single stream of tokens and trained with a word-level interleaving method using a small, automatically curated parallel speech-text corpus.

According to Meta, Spirit LM brings together the semantic capabilities you expect from text-based LLMs with the expressive capabilities of voice models. However, as we’ll explain later, Spirit LM’s performance in text-only mode is currently slightly lower than Llama 2’s.

The usual approach to extending LLMs to support speech input and output, Meta researchers explain, is to build a pipeline in which speech is first transcribed into text using automatic speech recognition (ASR), the text is fed into an LLM, and the LLM’s output is finally converted back to speech using text-to-speech (TTS). This is the approach taken by GPT-4o and Hume’s EVI 2, which also claims to be able to generate an emotionally inflected voice. However, the Meta researchers say:

With such pipelines, the modeling and generation of expressive speech is constrained outside of the linguistic model, leading to expressively poor generation.
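
To make the limitation concrete, the following is a minimal sketch of such a cascaded pipeline. The asr, llm, and tts functions are hypothetical stubs rather than a real API; the point is simply that expressive cues in the input audio are discarded at the ASR step and cannot be recovered at the TTS step.

```python
# Hypothetical sketch of a cascaded voice pipeline of the kind described above.
# asr(), llm() and tts() are illustrative stubs, not a real API.

def asr(audio: bytes) -> str:
    """Stub: transcribe speech to plain text (prosody and emotion are lost here)."""
    return "hello there"

def llm(prompt: str) -> str:
    """Stub: text-only language model producing a text reply."""
    return f"you said: {prompt}"

def tts(text: str) -> bytes:
    """Stub: synthesise speech from plain text, with no access to the original prosody."""
    return text.encode("utf-8")

def cascaded_voice_assistant(input_audio: bytes) -> bytes:
    transcript = asr(input_audio)   # speech -> text
    reply_text = llm(transcript)    # text -> text
    return tts(reply_text)          # text -> speech
```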

Spirit LM is instead trained on a mix of text-only sequences, speech-only sequences, and interleaved sequences. Speech is converted into tokens that represent phonetic units (HuBERT) as well as pitch and style units. This allows for the creation of interleaved training sequences by randomly switching from text to speech modality at word boundaries.
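
The snippet below is a rough sketch of this word-level interleaving idea, assuming word-aligned text and speech tokens are already available; the [TEXT]/[SPEECH] markers and token names are illustrative and do not reflect Spirit LM’s actual vocabulary.

```python
import random

# Rough sketch of word-level interleaving: each word carries both its text tokens
# and its aligned speech tokens (HuBERT phonetic units, plus pitch and style units
# in the expressive variant). Marker and token names are illustrative only.

def interleave(words, p_switch=0.3, seed=0):
    """Build one training sequence, randomly switching modality at word boundaries."""
    rng = random.Random(seed)
    modality = rng.choice(["text", "speech"])
    sequence = [f"[{modality.upper()}]"]
    for word in words:
        if rng.random() < p_switch:  # possibly switch modality at this word boundary
            modality = "speech" if modality == "text" else "text"
            sequence.append(f"[{modality.upper()}]")
        sequence.extend(word[f"{modality}_tokens"])
    return sequence

# Toy word-aligned corpus entry: each word carries both token views.
words = [
    {"text_tokens": ["the"], "speech_tokens": ["hu_12", "hu_87", "pitch_3", "style_1"]},
    {"text_tokens": ["cat"], "speech_tokens": ["hu_44", "hu_02", "pitch_5", "style_1"]},
    {"text_tokens": ["sat"], "speech_tokens": ["hu_31", "hu_76", "pitch_2", "style_1"]},
]

print(interleave(words))
```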

%IMAGE1%

One of the key findings of Meta’s research is that Spirit LM can learn new tasks, similar to text-based LLMs, and is able to preserve the sentiment of text and voice prompts. This latter claim is based on a new benchmark introduced by Meta researchers, called Speech-Text Sentiment Preservation, which involves generating a voice or text sequence of tokens and checking whether it preserves the sentiment of the prompt, pre-classified as displaying a positive, negative or neutral sentiment.
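
As a rough illustration of how such a check could be scored, the sketch below classifies the sentiment of each generated continuation and compares it with the pre-classified sentiment of the prompt. The generate and classify_sentiment functions are hypothetical stand-ins, not Meta’s actual benchmark code.

```python
# Illustrative sketch of a sentiment-preservation check: generate a continuation
# for a prompt with a known sentiment label, classify the output, and count how
# often the sentiment matches. generate() and classify_sentiment() stand in for
# the model under test and an off-the-shelf sentiment classifier.

def generate(prompt: str) -> str:
    """Stub: the model's (speech or text) continuation of the prompt."""
    return prompt + " and everything worked out wonderfully."

def classify_sentiment(sequence: str) -> str:
    """Stub: returns 'positive', 'negative' or 'neutral' for a generated sequence."""
    return "positive" if "wonderfully" in sequence else "neutral"

def sentiment_preservation_rate(prompts):
    """prompts: list of (prompt, gold_sentiment) pairs, pre-classified by annotators."""
    preserved = sum(
        classify_sentiment(generate(prompt)) == gold
        for prompt, gold in prompts
    )
    return preserved / len(prompts)

print(sentiment_preservation_rate([("What a lovely morning,", "positive")]))
```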

As mentioned, according to the researchers themselves, Spirit LM does not perform as well as the base Llama 2 model on text-only prompts, a limitation they hope to address by refining the training. Another avenue of evolution for Spirit LM is adopting a larger model as a base, which could lead to further improvements in performance.

Finally, Spirit LM is a foundational model and therefore does not include any provisions to protect it against misuse, such as generating fake news, spam, or impersonating specific speakers. Likewise, Spirit LM is only trained on English and does not cover the variety of accents and dialects of underrepresented groups.

Spirit LM is available in two versions. The base version only uses phonetic speech units (HuBERT), while the expressive version also uses pitch and style units. The model is available on GitHub along with its weights, but the license only permits non-commercial use.