Unmute Encoder
A speaker embedding encoder trained to replicate Kyutai's unreleased "unmute encoder". This model extracts speaker embeddings from audio for use with Kyutai's Moshi TTS system.
Model Description
The encoder is built on top of Kyutai's Mimi neural audio codec:
- Mimi Encoder: Frozen Mimi encoder extracts latent audio representations
- MLP Projector: Trainable MLP head projects Mimi's latents to the target embedding space
- Output: Speaker embeddings of shape [512, 125] (512 channels, 125 time steps for 10s audio)
Audio (24kHz, 10s) -> Mimi Encoder -> Latent [512, T] -> MLP Projector -> Embedding [512, 125]
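The 125 time steps follow from Mimi's 12.5 Hz frame rate applied to 10 s of 24 kHz audio. A minimal sketch of that arithmetic is below; the helper name `num_frames` is hypothetical, and rounding partial frames up is an assumption, not something this model card states.

```python
SAMPLE_RATE_HZ = 24_000   # input sample rate stated in this card
FRAME_RATE_HZ = 12.5      # Mimi's latent frame rate


def num_frames(duration_s: float) -> int:
    """Number of latent time steps for a clip of the given duration."""
    samples = int(duration_s * SAMPLE_RATE_HZ)
    samples_per_frame = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples
    # Assumption: a trailing partial frame is rounded up to a full frame.
    return -(-samples // samples_per_frame)


print(num_frames(10.0))  # 125
```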
Usage
from src.models.mimi import MimiEncoder
# Load the encoder
encoder = MimiEncoder.from_pretrained(
    model_name="jspaulsen/unmute-encoder",
    device="cuda",
    num_codebooks=32,
)
# Create embedding from audio tensor [1, 1, T] at 24kHz
output = encoder(audio_tensor)
embedding = output.embedding # [1, 512, 125]
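The snippet above leaves `audio_tensor` undefined. One possible way to build it, assuming the waveform is already sampled at 24 kHz and that the encoder expects a 10 s mono clip (inferred from the shapes quoted in this card, not confirmed), is sketched below. The helper `fit_to_length` is hypothetical and not part of this repository.

```python
import torch
import torch.nn.functional as F

SAMPLE_RATE_HZ = 24_000
TARGET_SECONDS = 10  # assumption: the encoder is fed 10 s clips


def fit_to_length(
    wav: torch.Tensor,
    num_samples: int = SAMPLE_RATE_HZ * TARGET_SECONDS,
) -> torch.Tensor:
    """Downmix to mono, then crop or zero-pad to num_samples.

    Accepts a waveform of shape [T] or [channels, T] and
    returns a tensor of shape [1, 1, num_samples].
    """
    if wav.dim() == 1:
        wav = wav.unsqueeze(0)               # [T] -> [1, T]
    wav = wav.mean(dim=0, keepdim=True)      # [C, T] -> [1, T]
    wav = wav[:, :num_samples]               # crop clips that are too long
    missing = num_samples - wav.shape[-1]
    if missing > 0:
        wav = F.pad(wav, (0, missing))       # zero-pad clips that are too short
    return wav.unsqueeze(0)                  # [1, 1, num_samples]


# Example with random data standing in for a real recording.
audio_tensor = fit_to_length(torch.randn(2, 300_000))
```

Move the result to the same device as the encoder (for example `audio_tensor.to("cuda")`) before calling it.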
Training
Trained using supervised learning with a hybrid loss (L1 + cosine similarity) against speaker embeddings from kyutai/tts-voices.
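A hybrid L1 plus cosine-similarity loss can be sketched as follows. This is an illustration under stated assumptions, not the training code for this model: the function name `hybrid_loss`, the `cosine_weight` parameter, and the choice to take the cosine similarity along the channel dimension are all hypothetical.

```python
import torch
import torch.nn.functional as F


def hybrid_loss(
    pred: torch.Tensor,
    target: torch.Tensor,
    cosine_weight: float = 1.0,  # assumption: relative weighting is not given in the card
) -> torch.Tensor:
    """L1 loss plus (1 - cosine similarity), for embeddings of shape [batch, 512, T]."""
    l1 = F.l1_loss(pred, target)
    # Assumption: compare the 512-dim channel vectors at each time step.
    cos = F.cosine_similarity(pred, target, dim=1)   # [batch, T]
    cos_loss = (1.0 - cos).mean()
    return l1 + cosine_weight * cos_loss


# Example: predicted and reference embeddings for a batch of two clips.
pred = torch.randn(2, 512, 125)
target = torch.randn(2, 512, 125)
loss = hybrid_loss(pred, target)
```

The L1 term penalizes differences in magnitude, while the cosine term penalizes differences in direction regardless of scale.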
Training Details
- Global step: 500
- Epoch: 83.33333333333333
- Best metric: 0.17919586598873138
Acknowledgments
- Kyutai for releasing the Moshi TTS models and speaker embeddings
Model tree for jspaulsen/unmute-encoder
Base model: kyutai/tts-1.6b-en_fr