# Audio Spectrogram Transformer

## 概要

Audio Spectrogram Transformerモデルは、[AST: Audio Spectrogram Transformer](https://huggingface.co/papers/2104.01778)という論文でYuan Gong、Yu-An Chung、James Glassによって提案されました。これは、音声を画像（スペクトログラム）に変換することで、音声に[Vision Transformer](vit)を適用します。このモデルは音声分類において最先端の結果を得ています。

論文の要旨は以下の通りです：

*過去10年間で、畳み込みニューラルネットワーク（CNN）は、音声スペクトログラムから対応するラベルへの直接的なマッピングを学習することを目指す、エンドツーエンドの音声分類モデルの主要な構成要素として広く採用されてきました。長距離のグローバルなコンテキストをより良く捉えるため、最近の傾向として、CNNの上にセルフアテンション機構を追加し、CNN-アテンションハイブリッドモデルを形成することがあります。しかし、CNNへの依存が必要かどうか、そして純粋にアテンションに基づくニューラルネットワークだけで音声分類において良いパフォーマンスを得ることができるかどうかは明らかではありません。本論文では、これらの問いに答えるため、音声分類用では最初の畳み込みなしで純粋にアテンションベースのモデルであるAudio Spectrogram Transformer（AST）を紹介します。我々はASTを様々なオーディオ分類ベンチマークで評価し、AudioSetで0.485 mAP、ESC-50で95.6%の正解率、Speech Commands V2で98.1%の正解率という新たな最先端の結果を達成しました。*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/audio_spectogram_transformer_architecture.png"
alt="drawing" width="600"/>

 Audio Spectrogram Transformerのアーキテクチャ。元論文より抜粋。

このモデルは[nielsr](https://huggingface.co/nielsr)より提供されました。
オリジナルのコードは[こちら](https://github.com/YuanGongND/ast)で見ることができます。

## 使用上のヒント

- 独自のデータセットでAudio Spectrogram Transformer（AST）をファインチューニングする場合、入力の正規化（入力の平均を0、標準偏差を0.5にすること）処理することが推奨されます。[ASTFeatureExtractor](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor)はこれを処理します。デフォルトではAudioSetの平均と標準偏差を使用していることに注意してください。著者が下流のデータセットの統計をどのように計算しているかは、[`ast/src/get_norm_stats.py`](https://github.com/YuanGongND/ast/blob/master/src/get_norm_stats.py)で確認することができます。
- ASTは低い学習率が必要であり 著者は[PSLA論文](https://huggingface.co/papers/2102.01243)で提案されたCNNモデルに比べて10倍小さい学習率を使用しています）、素早く収束するため、タスクに適した学習率と学習率スケジューラーを探すことをお勧めします。

## 参考資料

Audio Spectrogram Transformerの使用を開始するのに役立つ公式のHugging Faceおよびコミュニティ（🌎で示されている）の参考資料の一覧です。

- ASTを用いた音声分類の推論を説明するノートブックは[こちら](https://github.com/NielsRogge/Transformers-Tutorials/tree/master/AST)で見ることができます。
- [ASTForAudioClassification](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTForAudioClassification)は、この[例示スクリプト](https://github.com/huggingface/transformers/tree/main/examples/pytorch/audio-classification)と[ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb)によってサポートされています。
- こちらも参照：[音声分類タスク](../tasks/audio_classification)。

ここに参考資料を提出したい場合は、気兼ねなくPull Requestを開いてください。私たちはそれをレビューいたします！参考資料は、既存のものを複製するのではなく、何か新しいことを示すことが理想的です。

## ASTConfig[[transformers.ASTConfig]]

#### transformers.ASTConfig[[transformers.ASTConfig]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/configuration_audio_spectrogram_transformer.py#L24)

This is the configuration class to store the configuration of a Audio Spectrogram TransformerModel. It is used to instantiate a Audio Spectrogram Transformer
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593)

Configuration objects inherit from [PreTrainedConfig](/docs/transformers/v5.8.0/ja/main_classes/configuration#transformers.PreTrainedConfig) and can be used to control the model outputs. Read the
documentation from [PreTrainedConfig](/docs/transformers/v5.8.0/ja/main_classes/configuration#transformers.PreTrainedConfig) for more information.

Example:

```python
>>> from transformers import ASTConfig, ASTModel

>>> # Initializing a AST MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
>>> configuration = ASTConfig()

>>> # Initializing a model (with random weights) from the MIT/ast-finetuned-audioset-10-10-0.4593 style configuration
>>> model = ASTModel(configuration)

>>> # Accessing the model configuration
>>> configuration = model.config
```

**Parameters:**

hidden_size (`int`, *optional*, defaults to `768`) : Dimension of the hidden representations.

num_hidden_layers (`int`, *optional*, defaults to `12`) : Number of hidden layers in the Transformer decoder.

num_attention_heads (`int`, *optional*, defaults to `12`) : Number of attention heads for each attention layer in the Transformer decoder.

intermediate_size (`int`, *optional*, defaults to `3072`) : Dimension of the MLP representations.

hidden_act (`str`, *optional*, defaults to `gelu`) : The non-linear activation function (function or string) in the decoder. For example, `"gelu"`, `"relu"`, `"silu"`, etc.

hidden_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.

attention_probs_dropout_prob (`Union[float, int]`, *optional*, defaults to `0.0`) : The dropout ratio for the attention probabilities.

initializer_range (`float`, *optional*, defaults to `0.02`) : The standard deviation of the truncated_normal_initializer for initializing all weight matrices.

layer_norm_eps (`float`, *optional*, defaults to `1e-12`) : The epsilon used by the layer normalization layers.

patch_size (`Union[int, list[int], tuple[int, int]]`, *optional*, defaults to `16`) : The size (resolution) of each patch.

qkv_bias (`bool`, *optional*, defaults to `True`) : Whether to add a bias to the queries, keys and values.

frequency_stride (`int`, *optional*, defaults to 10) : Frequency stride to use when patchifying the spectrograms.

time_stride (`int`, *optional*, defaults to 10) : Temporal stride to use when patchifying the spectrograms.

max_length (`int`, *optional*, defaults to 1024) : Temporal dimension of the spectrograms.

num_mel_bins (`int`, *optional*, defaults to `128`) : Number of mel features used per input frame. Should correspond to the value used in the `AutoFeatureExtractor` class.

## ASTFeatureExtractor[[transformers.ASTFeatureExtractor]]

#### transformers.ASTFeatureExtractor[[transformers.ASTFeatureExtractor]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py#L36)

Constructs a Audio Spectrogram Transformer (AST) feature extractor.

This feature extractor inherits from [SequenceFeatureExtractor](/docs/transformers/v5.8.0/ja/main_classes/feature_extractor#transformers.SequenceFeatureExtractor) which contains
most of the main methods. Users should refer to this superclass for more information regarding those methods.

This class extracts mel-filter bank features from raw speech using TorchAudio if installed or using numpy
otherwise, pads/truncates them to a fixed length and normalizes them using a mean and standard deviation.

__call__transformers.ASTFeatureExtractor.__call__https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py#L158[{"name": "raw_speech", "val": ": numpy.ndarray | list[float] | list[numpy.ndarray] | list[list[float]]"}, {"name": "sampling_rate", "val": ": int | None = None"}, {"name": "return_tensors", "val": ": str | transformers.utils.generic.TensorType | None = None"}, {"name": "**kwargs", "val": ""}]- **raw_speech** (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`) --
  The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float
  values, a list of numpy arrays or a list of list of float values. Must be mono channel audio, not
  stereo, i.e. single float per timestep.
- **sampling_rate** (`int`, *optional*) --
  The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass
  `sampling_rate` at the forward call to prevent silent errors.
- **return_tensors** (`str` or [TensorType](/docs/transformers/v5.8.0/ja/internal/file_utils#transformers.TensorType), *optional*) --
  If set, will return tensors instead of list of python integers. Acceptable values are:

  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return Numpy `np.ndarray` objects.0

Main method to featurize and prepare for the model one or several sequence(s).

**Parameters:**

feature_size (`int`, *optional*, defaults to 1) : The feature dimension of the extracted features.

sampling_rate (`int`, *optional*, defaults to 16000) : The sampling rate at which the audio files should be digitalized expressed in hertz (Hz).

num_mel_bins (`int`, *optional*, defaults to 128) : Number of Mel-frequency bins.

max_length (`int`, *optional*, defaults to 1024) : Maximum length to which to pad/truncate the extracted features.

do_normalize (`bool`, *optional*, defaults to `True`) : Whether or not to normalize the log-Mel features using `mean` and `std`.

mean (`float`, *optional*, defaults to -4.2677393) : The mean value used to normalize the log-Mel features. Uses the AudioSet mean by default.

std (`float`, *optional*, defaults to 4.5689974) : The standard deviation value used to normalize the log-Mel features. Uses the AudioSet standard deviation by default.

return_attention_mask (`bool`, *optional*, defaults to `False`) : Whether or not [__call__()](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor.__call__) should return `attention_mask`.

## ASTModel[[transformers.ASTModel]]

#### transformers.ASTModel[[transformers.ASTModel]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py#L336)

The bare Audio Spectrogram Transformer Model outputting raw hidden-states without any specific head on top.

This model inherits from [PreTrainedModel](/docs/transformers/v5.8.0/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.ASTModel.forwardhttps://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py#L352[{"name": "input_values", "val": ": torch.Tensor | None = None"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **input_values** (`torch.FloatTensor` of shape `(batch_size, max_length, num_mel_bins)`) --
  Float values mel features extracted from the raw audio waveform. Raw audio waveform can be obtained by
  loading a `.flac` or `.wav` audio file into an array of type `list[float]`, a `numpy.ndarray` or a
  `torch.Tensor`, *e.g.* via the torchcodec library (`pip install torchcodec`) or the soundfile library
  (`pip install soundfile`).
  To prepare the array into `input_features`, the [AutoFeatureExtractor](/docs/transformers/v5.8.0/ja/model_doc/auto#transformers.AutoFeatureExtractor) should be used for extracting the
  mel features, padding and conversion into a tensor of type `torch.FloatTensor`.
  See [__call__()](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor.__call__)0[BaseModelOutputWithPooling](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)`A [BaseModelOutputWithPooling](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) and inputs.
The [ASTModel](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTModel) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **last_hidden_state** (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) -- Sequence of hidden-states at the output of the last layer of the model.
- **pooler_output** (`torch.FloatTensor` of shape `(batch_size, hidden_size)`) -- Last layer hidden-state of the first token of the sequence (classification token) after further processing
  through the layers used for the auxiliary pretraining task. E.g. for BERT-family of models, this returns
  the classification token after processing through a linear layer and a tanh activation function. The linear
  layer weights are trained from the next sentence prediction (classification) objective during pretraining.
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

**Parameters:**

config ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.8.0/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

`[BaseModelOutputWithPooling](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or `tuple(torch.FloatTensor)``

A [BaseModelOutputWithPooling](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.BaseModelOutputWithPooling) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) and inputs.

## ASTForAudioClassification[[transformers.ASTForAudioClassification]]

#### transformers.ASTForAudioClassification[[transformers.ASTForAudioClassification]]

[Source](https://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py#L403)

Audio Spectrogram Transformer model with an audio classification head on top (a linear layer on top of the pooled
output) e.g. for datasets like AudioSet, Speech Commands v2.

This model inherits from [PreTrainedModel](/docs/transformers/v5.8.0/ja/main_classes/model#transformers.PreTrainedModel). Check the superclass documentation for the generic methods the
library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)

This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
and behavior.

forwardtransformers.ASTForAudioClassification.forwardhttps://github.com/huggingface/transformers/blob/v5.8.0/src/transformers/models/audio_spectrogram_transformer/modeling_audio_spectrogram_transformer.py#L416[{"name": "input_values", "val": ": torch.Tensor | None = None"}, {"name": "labels", "val": ": torch.Tensor | None = None"}, {"name": "**kwargs", "val": ": typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs]"}]- **input_values** (`torch.FloatTensor` of shape `(batch_size, max_length, num_mel_bins)`) --
  Float values mel features extracted from the raw audio waveform. Raw audio waveform can be obtained by
  loading a `.flac` or `.wav` audio file into an array of type `list[float]`, a `numpy.ndarray` or a `torch.Tensor`, *e.g.* via
  the torchcodec library (`pip install torchcodec`) or the soundfile library (`pip install soundfile`).
  To prepare the array into `input_features`, the [AutoFeatureExtractor](/docs/transformers/v5.8.0/ja/model_doc/auto#transformers.AutoFeatureExtractor) should be used for extracting the
  mel features, padding and conversion into a tensor of type `torch.FloatTensor`.
  See [__call__()](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor.__call__)
- **labels** (`torch.LongTensor` of shape `(batch_size,)`, *optional*) --
  Labels for computing the audio classification/regression loss. Indices should be in `[0, ...,
  config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
  `config.num_labels > 1` a classification loss is computed (Cross-Entropy).0[SequenceClassifierOutput](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) or `tuple(torch.FloatTensor)`A [SequenceClassifierOutput](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) and inputs.
The [ASTForAudioClassification](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTForAudioClassification) forward method, overrides the `__call__` special method.

Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
instance afterwards instead of this since the former takes care of running the pre and post processing steps while
the latter silently ignores them.

- **loss** (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) -- Classification (or regression if config.num_labels==1) loss.
- **logits** (`torch.FloatTensor` of shape `(batch_size, config.num_labels)`) -- Classification (or regression if config.num_labels==1) scores (before SoftMax).
- **hidden_states** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) -- Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
  one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

  Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- **attentions** (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) -- Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
  sequence_length)`.

  Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
  heads.

Example:

```python
>>> from transformers import AutoFeatureExtractor, ASTForAudioClassification
>>> from datasets import load_dataset
>>> import torch

>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
>>> dataset = dataset.sort("id")
>>> sampling_rate = dataset.features["audio"].sampling_rate

>>> feature_extractor = AutoFeatureExtractor.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")
>>> model = ASTForAudioClassification.from_pretrained("MIT/ast-finetuned-audioset-10-10-0.4593")

>>> # audio file is decoded on the fly
>>> inputs = feature_extractor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")

>>> with torch.no_grad():
...     logits = model(**inputs).logits

>>> predicted_class_ids = torch.argmax(logits, dim=-1).item()
>>> predicted_label = model.config.id2label[predicted_class_ids]
>>> predicted_label
...

>>> # compute loss - target_label is e.g. "down"
>>> target_label = model.config.id2label[0]
>>> inputs["labels"] = torch.tensor([model.config.label2id[target_label]])
>>> loss = model(**inputs).loss
>>> round(loss.item(), 2)
...
```

**Parameters:**

config ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) : Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the [from_pretrained()](/docs/transformers/v5.8.0/ja/main_classes/model#transformers.PreTrainedModel.from_pretrained) method to load the model weights.

**Returns:**

`[SequenceClassifierOutput](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) or `tuple(torch.FloatTensor)``

A [SequenceClassifierOutput](/docs/transformers/v5.8.0/ja/main_classes/output#transformers.modeling_outputs.SequenceClassifierOutput) or a tuple of
`torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various
elements depending on the configuration ([ASTConfig](/docs/transformers/v5.8.0/ja/model_doc/audio-spectrogram-transformer#transformers.ASTConfig)) and inputs.