This model was released on 2025-07-22 and added to Hugging Face Transformers on 2026-02-19.
Higgs Audio V2
Overview
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI. The model was pretrained on over 10 million hours of audio data and a diverse set of text data. Despite having no post-training or fine-tuning, Higgs Audio V2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.
Model Architecture:

- Developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as AudioVerse. The in-house understanding model is fine-tuned on top of Higgs Audio v1 Understanding, which adopts the “understanding variant” of the architecture.
- Trained a unified audio tokenizer from scratch that captures both semantic and acoustic features.
- Proposed the DualFFN architecture, which enhances the LLM’s ability to model acoustic tokens with minimal computational overhead (the snippet below shows how these audio-specific pieces surface in the configuration).
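These audio-specific components surface directly in the model configuration. A minimal sketch of inspecting them; the printed values are the documented defaults of HiggsAudioV2Config (see the configuration reference below):

from transformers import HiggsAudioV2Config

config = HiggsAudioV2Config()
# Codec side: number of codebooks and entries per codebook used by the audio tokenizer
print(config.num_codebooks, config.codebook_size)  # 8 1024
# Text side: Llama-style decoder dimensions
print(config.num_hidden_layers, config.hidden_size)  # 28 3072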
Usage
All of the snippets below mirror the integration tests in test_higgs_audio.py, ensuring the doc stays in sync with the officially supported workflows.
Single-speaker smart voice
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_single_speaker_smart_voice.wav")

Multi-speaker smart voice
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

system_message = """You are an AI assistant designed to convert text into speech.
If the user's message includes a [SPEAKER*] tag, do not read out the tag and generate speech for the following text, using the specified voice.
If no speaker tag is present, select a suitable voice on your own."""

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": system_message}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0: feminine"},
            {"type": "text", "text": "SPEAKER1: masculine"},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker_smart_voice.wav")

Zero-shot voice cloning
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_zero_shot_voice_cloning.wav")

Multi-speaker voice cloning
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

user_message = """[SPEAKER0] I can't believe you did that without even asking me first!
[SPEAKER1] Oh, come on! It wasn't a big deal, and I knew you would overreact like this.
[SPEAKER0] Overreact? You made a decision that affects both of us without even considering my opinion!
[SPEAKER1] Because I didn't have time to sit around waiting for you to make up your mind! Someone had to act."""

conversation = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [
            {"type": "text", "text": "Audio is recorded from a quiet room."},
            {"type": "text", "text": "SPEAKER0:"},
            {"type": "audio", "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/guess_age_gender.wav"},
            {"type": "text", "text": "SPEAKER1:"},
            {"type": "audio", "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": user_message}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_multi_speaker_voice_cloning.wav")

Batched inference
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

conversation1 = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.",
            }
        ],
    },
]

conversation2 = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " It's super important to assess fairly the fact that our former model is over. And this is not a question of adjustment. This is not the same world, 2024, 2025. And on top of that, we are making the same mistakes, on top of the key elements I mentioned. We are over-regulating and under-investing. So just if, in the two to three years to come, if we follow our classical agenda, we will be out of the market. I have no doubts.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/macron.wav",
            }
        ],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Hey, here is a clone from the given voice."}],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000, do_sample=False)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, ["output_batched_1.wav", "output_batched_2.wav"])

Training
By default, the model does not load the text language modeling head to save memory (~1.5GiB reduction), as it's not required for generation. However, when training the model, you need the text head to compute loss on text tokens. To enable it, set use_text_head=True when instantiating the model (see example below).
from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration

model_id = "eustlb/higgs-audio-v2-generation-3B-base"
processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto", use_text_head=True)

conversation1 = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year.",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav",
            }
        ],
    },
]

conversation2 = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "Generate audio following instruction."}],
    },
    {
        "role": "scene",
        "content": [{"type": "text", "text": "Audio is recorded from a quiet room."}],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": " I would imagine so. A wand with a dragon heartstring core is capable of dazzling magic, and the bond between you and your wand should only grow stronger. Do not be surprised at your new wand's ability to perceive your intentions, particularly in a moment of need",
            }
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/broom_salesman.wav",
            }
        ],
    },
]

inputs = processor.apply_chat_template(
    [conversation1, conversation2],
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    sampling_rate=24000,
    return_tensors="pt",
    output_labels=True,
).to(model.device)

outputs = model(**inputs)
outputs.loss.backward()

This model was contributed by Shuai Zheng and Eustache Le Bihan. The original code can be found here.
HiggsAudioV2Config
class transformers.HiggsAudioV2Config
< source >( vocab_size = 128256 hidden_size = 3072 intermediate_size = 8192 num_hidden_layers = 28 num_attention_heads = 24 num_key_value_heads = 8 hidden_act = 'silu' max_position_embeddings = 2048 initializer_range = 0.02 rms_norm_eps = 1e-05 use_cache = True pad_token_id = 128001 bos_token_id = 1 eos_token_id = 128009 pretraining_tp = 1 tie_word_embeddings = False rope_parameters = {'factor': 32.0, 'rope_theta': 500000.0, 'high_freq_factor': 0.5, 'low_freq_factor': 0.125, 'original_max_position_embeddings': 1024, 'rope_type': 'llama3'} attention_bias = False attention_dropout = 0.0 mlp_bias = False head_dim = 128 num_codebooks = 8 codebook_size = 1024 audio_token_id = 128016 audio_bos_token_id = 128013 audio_delay_token_id = 128014 audio_stream_bos_id = 1024 audio_stream_eos_id = 1025 **kwargs )
Parameters
- vocab_size (int, optional, defaults to 128256) — Vocabulary size of the HiggsAudioV2 model. Defines the number of different tokens that can be represented by the inputs_ids passed when calling HiggsAudioV2Model.
- hidden_size (int, optional, defaults to 3072) — Dimension of the hidden representations.
- intermediate_size (int, optional, defaults to 8192) — Dimension of the MLP representations.
- num_hidden_layers (int, optional, defaults to 28) — Number of hidden layers in the Transformer decoder.
- num_attention_heads (int, optional, defaults to 24) — Number of attention heads for each attention layer in the Transformer decoder.
- num_key_value_heads (int, optional, defaults to 8) — This is the number of key_value heads that should be used to implement Grouped Query Attention. If num_key_value_heads=num_attention_heads, the model will use Multi Head Attention (MHA), if num_key_value_heads=1 the model will use Multi Query Attention (MQA), otherwise GQA is used. When converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed by meanpooling all the original heads within that group. For more details, check out this paper. If it is not specified, will default to num_attention_heads.
- hidden_act (str or function, optional, defaults to "silu") — The non-linear activation function (function or string) in the decoder.
- max_position_embeddings (int, optional, defaults to 2048) — The maximum sequence length that this model might ever be used with.
- initializer_range (float, optional, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- rms_norm_eps (float, optional, defaults to 1e-05) — The epsilon used by the rms normalization layers.
- use_cache (bool, optional, defaults to True) — Whether or not the model should return the last key/values attentions (not used by all models). Only relevant if config.is_decoder=True.
- pad_token_id (int, optional, defaults to 128001) — Padding token id.
- bos_token_id (int, optional, defaults to 1) — Beginning of stream token id.
- eos_token_id (int, optional, defaults to 128009) — End of stream token id.
- pretraining_tp (int, optional, defaults to 1) — Experimental feature. Tensor parallelism rank used during pretraining. Please refer to this document to understand more about it. This value is necessary to ensure exact reproducibility of the pretraining results. Please refer to this issue.
- tie_word_embeddings (bool, optional, defaults to False) — Whether to tie weight embeddings.
- rope_parameters (RopeParameters, optional) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for rope_theta and optionally parameters used for scaling in case you want to use RoPE with longer max_position_embeddings.
- attention_bias (bool, optional, defaults to False) — Whether to use a bias in the query, key, value and output projection layers during self-attention.
- attention_dropout (float, optional, defaults to 0.0) — The dropout ratio for the attention probabilities.
- mlp_bias (bool, optional, defaults to False) — Whether to use a bias in up_proj, down_proj and gate_proj layers in the MLP layers.
- head_dim (int, optional, defaults to 128) — The attention head dimension. If None, it will default to hidden_size // num_attention_heads.
- num_codebooks (int, optional, defaults to 8) — Number of codebooks used in the underlying codec model responsible for tokenizing the audio.
- codebook_size (int, optional, defaults to 1024) — Size of the codebook used in the underlying codec model for audio tokenization.
- audio_token_id (int, optional, defaults to 128016) — The token ID used to represent audio output in the text sequence.
- audio_bos_token_id (int, optional, defaults to 128013) — The token ID for the beginning-of-sequence token for audio output.
- audio_delay_token_id (int, optional, defaults to 128014) — The token ID used for the audio delay pattern in multi-codebook generation.
- audio_stream_bos_id (int, optional, defaults to 1024) — The ID for the beginning-of-stream token in audio sequences.
- audio_stream_eos_id (int, optional, defaults to 1025) — The ID for the end-of-stream token in audio sequences.
This is the configuration class to store the configuration of a HiggsAudioV2Model. It is used to instantiate a HiggsAudioV2 model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a configuration similar to that of the 3B model, e.g. bosonai/higgs-audio-v2-generation-3B-base.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import HiggsAudioV2Model, HiggsAudioV2Config
>>> # Initializing a HiggsAudioV2 style configuration
>>> configuration = HiggsAudioV2Config()
>>> # Initializing a model from the configuration
>>> model = HiggsAudioV2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
HiggsAudioV2Processor
class transformers.HiggsAudioV2Processor
< source >( feature_extractor tokenizer audio_tokenizer chat_template = None audio_token = '<|AUDIO_OUT|>' audio_bos_token = '<|audio_out_bos|>' audio_eos_token = '<|audio_eos|>' audio_delay_token = '<|reserved_special_token_6|>' audio_stream_bos_id = 1024 audio_stream_eos_id = 1025 )
Parameters
- feature_extractor (DacFeatureExtractor) — An instance of DacFeatureExtractor. The feature extractor is a required input.
- tokenizer (AutoTokenizer) — An instance of AutoTokenizer. The tokenizer is a required input.
- audio_tokenizer (HiggsAudioV2TokenizerModel) — An instance of HiggsAudioV2TokenizerModel. The audio tokenizer is a required input.
- chat_template (str, optional) — A template string for chat formatting when combining text and audio interactions.
- audio_token (str, optional, defaults to "<|AUDIO_OUT|>") — The token used to represent audio output in the text sequence.
- audio_bos_token (str, optional, defaults to "<|audio_out_bos|>") — The beginning-of-sequence token for audio output.
- audio_eos_token (str, optional, defaults to "<|audio_eos|>") — The end-of-sequence token for audio output.
- audio_delay_token (str, optional, defaults to "<|reserved_special_token_6|>") — The token used for the audio delay pattern in multi-codebook generation.
- audio_stream_bos_id (int, optional, defaults to 1024) — The ID for the beginning-of-stream token in audio sequences.
- audio_stream_eos_id (int, optional, defaults to 1025) — The ID for the end-of-stream token in audio sequences.
Constructs a Higgs Audio processor which wraps a DacFeatureExtractor, an AutoTokenizer, and a HiggsAudioV2TokenizerModel into a single processor. It inherits the audio feature extraction, tokenization, and audio encode/decode functionalities. See call() and decode() for more information.
__call__
< source >( text: str | list[str] | list[list[str]] | None = None audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor'], NoneType] = None output_labels: bool | None = False **kwargs: typing_extensions.Unpack[transformers.models.higgs_audio_v2.processing_higgs_audio_v2.HiggsAudioV2ProcessorKwargs] )
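The processor is usually driven through apply_chat_template, as shown in the Usage section above. As a minimal sketch, it can also be called directly on raw text; the exact keys of the returned dict are an assumption based on the signature above:

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("eustlb/higgs-audio-v2-generation-3B-base")

# Tokenize a plain text prompt; reference audio could be passed via the `audio` argument.
inputs = processor(text="Generate audio following instruction.", return_tensors="pt")
print(inputs.keys())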
HiggsAudioV2Model
class transformers.HiggsAudioV2Model
< source >( config: HiggsAudioV2Config )
Parameters
- config (HiggsAudioV2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The bare Higgs Audio V2 Model outputting raw hidden-states without any specific head on top.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None audio_input_ids: torch.LongTensor | None = None attention_mask: torch.LongTensor | None = None audio_input_ids_mask: torch.BoolTensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None cache_position: torch.LongTensor | None = None use_cache: bool | None = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → ~models.modeling_outputs.BaseModelOutputWithPast
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- audio_input_ids (torch.LongTensor of shape (batch_size, num_audio_frames, num_codebooks), optional) — Indices of audio codebook tokens. Indices can be obtained using HiggsAudioV2TokenizerModel.encode().
- attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- audio_input_ids_mask (torch.BoolTensor of shape (batch_size, num_audio_frames), optional) — Indicates which audio frames in audio_input_ids are valid.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
Returns
~models.modeling_outputs.BaseModelOutputWithPast
Usual decoder outputs with the placeholder positions already substituted by their corresponding audio embeddings.
The HiggsAudioV2Model forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, HiggsAudioV2Model
>>> import torch
>>> device = "cuda" if torch.cuda.is_available() else "cpu"
>>> processor = AutoProcessor.from_pretrained("eustlb/higgs-audio-v2-generation-3B-base", device_map=device)
>>> model = HiggsAudioV2Model.from_pretrained("eustlb/higgs-audio-v2-generation-3B-base", device_map=device)
>>> conversation = [
... {
... "role": "system",
... "content": [
... {
... "type": "text",
... "text": "Generate audio following instruction."
... }
... ]
... },
... {
... "role": "scene",
... "content": [
... {
... "type": "text",
... "text": "Audio is recorded from a quiet room."
... }
... ]
... },
... {
... "role": "user",
... "content": [
... {
... "type": "text",
... "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year."
... }
... ]
... },
... {
... "role": "assistant",
... "content": [
... {
... "type": "audio",
... "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav"
... }
... ]
... },
... {
... "role": "user",
... "content": [
... {
... "type": "text",
... "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years."
... }
... ]
... }
... ]
>>> inputs = processor.apply_chat_template(conversation, return_dict=True, tokenize=True, sampling_rate=24000, return_tensors="pt")
>>> inputs = inputs.to(model.device)
>>> outputs = model(**inputs)
HiggsAudioV2ForConditionalGeneration
class transformers.HiggsAudioV2ForConditionalGeneration
< source >( config: HiggsAudioV2Config use_text_head: bool = False )
Parameters
- config (HiggsAudioV2Config) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
- use_text_head (bool, optional, defaults to False) — Whether to use a text language modeling head. This head is not required for generation, but is needed to compute the text loss when training.
The Higgs Audio V2 model, a Llama-like auto-regressive transformer model with the DualFFN architecture.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and behavior.
forward
< source >( input_ids: torch.LongTensor | None = None attention_mask: torch.BoolTensor | None = None audio_input_ids: torch.LongTensor | None = None audio_input_ids_mask: torch.LongTensor | None = None position_ids: torch.LongTensor | None = None past_key_values: transformers.cache_utils.Cache | None = None inputs_embeds: torch.FloatTensor | None = None labels: torch.LongTensor | None = None audio_labels: torch.LongTensor | None = None use_cache: bool | None = None cache_position: torch.LongTensor | None = None logits_to_keep: int | torch.Tensor = 0 **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → ~models.modeling_outputs.CausalLMOutputWithPast
Parameters
- input_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of input sequence tokens in the vocabulary. Padding will be ignored by default. Indices can be obtained using AutoTokenizer. See PreTrainedTokenizer.encode() and PreTrainedTokenizer.call() for details.
- attention_mask (torch.BoolTensor of shape (batch_size, sequence_length), optional) — Mask to avoid performing attention on padding token indices. Mask values selected in [0, 1]:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- audio_input_ids (torch.LongTensor of shape (batch_size, num_audio_frames, num_codebooks), optional) — Indices of audio codebook tokens. Indices can be obtained using HiggsAudioV2TokenizerModel.encode().
- audio_input_ids_mask (torch.BoolTensor of shape (batch_size, num_audio_frames), optional) — Indicates which audio frames in audio_input_ids are valid.
- position_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — Indices of positions of each input sequence tokens in the position embeddings. Selected in the range [0, config.n_positions - 1].
- past_key_values (~cache_utils.Cache, optional) — Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention blocks) that can be used to speed up sequential decoding. This typically consists in the past_key_values returned by the model at a previous stage of decoding, when use_cache=True or config.use_cache=True. Only a Cache instance is allowed as input, see our kv cache guide. If no past_key_values are passed, DynamicCache will be initialized by default. The model will output the same cache format that is fed as input. If past_key_values are used, the user is expected to input only unprocessed input_ids (those that don't have their past key value states given to this model) of shape (batch_size, unprocessed_length) instead of all input_ids of shape (batch_size, sequence_length).
- inputs_embeds (torch.FloatTensor of shape (batch_size, sequence_length, hidden_size), optional) — Optionally, instead of passing input_ids you can choose to directly pass an embedded representation. This is useful if you want more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix.
- labels (torch.LongTensor of shape (batch_size, sequence_length), optional) — Labels for computing the masked language modeling loss. Indices should either be in [0, ..., config.vocab_size] or -100 (see the input_ids docstring). Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.vocab_size].
- audio_labels (torch.LongTensor of shape (batch_size, num_audio_frames, num_codebooks), optional) — Labels for the audio codebook tokens for computing the masked language modeling loss. Indices should be in [0, ..., config.codebook_size]. Tokens with indices set to -100 are ignored (masked); the loss is only computed for the tokens with labels in [0, ..., config.codebook_size]. Can be obtained by passing output_labels=True when calling HiggsAudioV2Processor.
- use_cache (bool, optional) — If set to True, past_key_values key value states are returned and can be used to speed up decoding (see past_key_values).
- cache_position (torch.LongTensor of shape (sequence_length), optional) — Indices depicting the position of the input sequence tokens in the sequence. Contrarily to position_ids, this tensor is not affected by padding. It is used to update the cache in the correct position and to infer the complete sequence length.
- logits_to_keep (Union[int, torch.Tensor], optional, defaults to 0) — If an int, compute logits for the last logits_to_keep tokens. If 0, calculate logits for all input_ids (special case). Only last token logits are needed for generation, and calculating them only for that token can save memory, which becomes pretty significant for long sequences or large vocabulary size. If a torch.Tensor, must be 1D corresponding to the indices to keep in the sequence length dimension. This is useful when using packed tensor format (single dimension for batch and sequence length).
Returns
~models.modeling_outputs.CausalLMOutputWithPast
A ~models.modeling_outputs.CausalLMOutputWithPast containing the logits, loss (if labels are provided),
and other outputs from the model.
The HiggsAudioV2ForConditionalGeneration forward method overrides the __call__ special method.
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, HiggsAudioV2ForConditionalGeneration
>>> model_id = "eustlb/higgs-audio-v2-generation-3B-base"
>>> processor = AutoProcessor.from_pretrained(model_id, device_map="auto")
>>> model = HiggsAudioV2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
>>> conversation = [
... {
... "role": "system",
... "content": [
... {
... "type": "text",
... "text": "Generate audio following instruction."
... }
... ]
... },
... {
... "role": "scene",
... "content": [
... {
... "type": "text",
... "text": "Audio is recorded from a quiet room."
... }
... ]
... },
... {
... "role": "user",
... "content": [
... {
... "type": "text",
... "text": "It was the night before my birthday. Hooray! It's almost here! It may not be a holiday, but it's the best day of the year."
... }
... ]
... },
... {
... "role": "assistant",
... "content": [
... {
... "type": "audio",
... "url": "https://huggingface.co/datasets/eustlb/dummy-audio-samples-higgs/resolve/main/belinda.wav"
... }
... ]
... },
... {
... "role": "user",
... "content": [
... {
... "type": "text",
... "text": "The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years."
... }
... ]
... }
... ]
>>> inputs = processor.apply_chat_template(conversation, return_dict=True, tokenize=True, sampling_rate=24000, return_tensors="pt")
>>> inputs = inputs.to(model.device)
>>> outputs = model(**inputs)
generate
< source >( inputs: torch.Tensor | None = None generation_config: transformers.generation.configuration_utils.GenerationConfig | None = None logits_processor: transformers.generation.logits_process.LogitsProcessorList | None = None stopping_criteria: transformers.generation.stopping_criteria.StoppingCriteriaList | None = None prefix_allowed_tokens_fn: collections.abc.Callable[[int, torch.Tensor], list[int]] | None = None synced_gpus: bool | None = None assistant_model: typing.Optional[ForwardRef('PreTrainedModel')] = None streamer: typing.Optional[ForwardRef('BaseStreamer')] = None negative_prompt_ids: torch.Tensor | None = None negative_prompt_attention_mask: torch.Tensor | None = None custom_generate: str | collections.abc.Callable | None = None **kwargs ) → ModelOutput or torch.LongTensor
Parameters
- inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None the method initializes it with bos_token_id and a batch size of 1. For decoder-only models inputs should be in the format of input_ids. For encoder-decoder models inputs can represent any of input_ids, input_values, input_features, or pixel_values.
- generation_config (GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
- logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config an error is thrown. This feature is intended for advanced users.
- stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complements the default stopping criteria built from arguments and a generation config. If a stopping criteria is passed that is already created with the arguments or a generation config an error is thrown. If your stopping criteria depends on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate. This feature is intended for advanced users.
- prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], list[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens inputs_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
- synced_gpus (bool, optional) — Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True if using FullyShardedDataParallel or DeepSpeed ZeRO Stage 3 with multiple GPUs to avoid deadlocking if one GPU finishes generating before other GPUs. Otherwise, defaults to False.
- assistant_model (PreTrainedModel, optional) — An assistant model that can be used to accelerate generation. The assistant model must have the exact same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model is much faster than running generation with the model you're calling generate from. As such, the assistant model should be much smaller.
- streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.
- negative_prompt_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — The negative prompt needed for some processors such as CFG. The batch size must match the input batch size. This is an experimental feature, subject to breaking API changes in future versions.
- negative_prompt_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Attention mask for negative_prompt_ids.
- custom_generate (str or Callable, optional) — One of the following:
  - str (Hugging Face Hub repository name): runs the custom generate function defined at custom_generate/generate.py in that repository instead of the standard generate method. The repository fully replaces the generation logic, and the return type may differ.
  - str (local repository path): same as above but from a local path, trust_remote_code not required.
  - Callable: generate will perform the usual input preparation steps, then call the provided callable to run the decoding loop. For more information, see the docs.
- kwargs (dict[str, Any], optional) — Ad hoc parametrization of generation_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True) or a torch.LongTensor.
If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible
ModelOutput types are:
If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible
ModelOutput types are:
Generates sequences of token ids for models with a language modeling head.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the model's default generation configuration. You can override any generation_config attribute by passing the corresponding parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True). For an overview of generation strategies and code examples, check out the following guide.
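As a short sketch, generation parameters can be overridden directly in the call, reusing inputs prepared with apply_chat_template as in the Usage section (the sampling values below are illustrative assumptions, not recommended defaults):

# `inputs` prepared with processor.apply_chat_template(...) as shown in the Usage section
outputs = model.generate(
    **inputs,
    max_new_tokens=1000,  # upper bound on the number of generated audio tokens
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # illustrative value
)
decoded = processor.batch_decode(outputs)
processor.save_audio(decoded, "output_sampled.wav")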