AudioSR-MLX-8bit

MLX port of AudioSR, a multi-step latent-diffusion audio super-resolution model, quantized to INT8 (weight-only) for on-device inference on Apple Silicon. It upsamples mono input at any sample rate to 48 kHz using a 200-step DDIM solver (v-prediction + classifier-free guidance) over an AudioLDM-style VAE latent, then decodes via a HiFi-GAN vocoder. AudioSR is the teacher model from which FlashSR was distilled: FlashSR runs the same architecture in 1 step instead of 200, at the cost of some quality.
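As a rough illustration of the sampler described above, the sketch below shows one deterministic (η = 0) DDIM update under v-prediction, plus the classifier-free guidance blend. It is not the port's code: the function names are hypothetical, plain Python lists stand in for latent tensors, and a variance-preserving schedule (σ² = 1 − α²) is assumed.

```python
import math

def cfg_combine(v_cond, v_uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def ddim_step_v(x_t, v, alpha_t, alpha_prev):
    """One DDIM update (eta = 0) for a v-prediction model.

    alpha_t / alpha_prev are the signal coefficients at the current and next
    timestep (assumed variance-preserving, so sigma = sqrt(1 - alpha**2)).
    """
    sigma_t = math.sqrt(1.0 - alpha_t ** 2)
    sigma_prev = math.sqrt(1.0 - alpha_prev ** 2)
    out = []
    for x, vi in zip(x_t, v):
        x0 = alpha_t * x - sigma_t * vi    # predicted clean latent
        eps = sigma_t * x + alpha_t * vi   # predicted noise
        out.append(alpha_prev * x0 + sigma_prev * eps)
    return out
```

In a full sampler, each of the 200 steps would run the UNet twice (conditional and unconditional), blend the two outputs with `cfg_combine`, and pass the result to `ddim_step_v`.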

Model

| Property | Value |
| --- | --- |
| Total parameters | 672 M (VAE 223 M + UNet 258 M + Vocoder 190 M) |
| Diffusion | 200-step DDIM, v-prediction, cosine schedule, η=0 |
| Classifier-free guidance | scale 3.5 (cond + uncond passes per step) |
| Quantization | INT8 weight-only, group size 64, mode mlx_affine_flat |
| Format | MLX safetensors (single combined bundle) |
| Sample rate | 48 kHz mono out (any-rate mono in) |
| Frame length | 10.24 s (491,520 samples) per forward pass |
| Bundle size | 684 MB on disk |
| Source | haoheliu/audiosr_basic |
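To make "INT8 weight-only, group size 64" concrete, here is a minimal sketch of generic group-wise affine quantization over a flattened weight list. It illustrates the general scheme only: the helper names are hypothetical, and the rounding and storage layout used by MLX for this bundle may differ.

```python
def quantize_group(values, bits=8):
    """Map one group of floats to integer codes plus a (scale, bias) pair."""
    levels = (1 << bits) - 1            # 255 for INT8
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    if scale == 0.0:                    # constant group: any non-zero scale works
        scale = 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, bias):
    """Affine reconstruction: w ~= scale * q + bias."""
    return [scale * q + bias for q in codes]

def quantize_flat(weights, group_size=64, bits=8):
    """Quantize a flat weight list in consecutive groups of `group_size`."""
    if len(weights) % group_size:
        raise ValueError("weight count must be a multiple of the group size")
    return [
        quantize_group(weights[i:i + group_size], bits)
        for i in range(0, len(weights), group_size)
    ]
```

Each group stores its own scale and bias, so the reconstruction error within a group is bounded by half that group's scale.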

Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | 684 MB | INT8-quantized VAE + UNet + HiFi-GAN weights |
| config.json | ~80 KB | Sub-model configs + quantization metadata + per-tensor shape table for dequant-on-load |

The three sub-models share one safetensors file, distinguished by the key prefixes vae.*, ldm.*, and voc.*. config.quantized_shapes records each tensor's pre-flatten shape, so that conv weights can be dequantized with mx.dequantize and restored to their original shape at load time.
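The loading logic this implies can be sketched as follows: split the combined key space by prefix, then restore each flattened tensor to its recorded shape. This is an assumption-laden illustration, not the port's loader. The helper names and example keys are hypothetical, the shape table is assumed to map a tensor key to a list of dimensions, and nested Python lists stand in for what would be an array reshape in practice.

```python
from math import prod

def split_by_prefix(tensors, names=("vae", "ldm", "voc")):
    """Route 'vae.x', 'ldm.y', 'voc.z' keys into one dict per sub-model."""
    groups = {name: {} for name in names}
    for key, value in tensors.items():
        head, _, rest = key.partition(".")
        if head not in groups or not rest:
            raise KeyError(f"unexpected key: {key}")
        groups[head][rest] = value
    return groups

def unflatten(flat, shape):
    """Rebuild row-major nested lists of `shape` from a flat list."""
    if prod(shape) != len(flat):
        raise ValueError("shape does not match element count")
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])
    return [unflatten(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]
```

For example, a dequantized conv weight stored flat with a recorded shape of `[out, in, k]` would be passed through `unflatten` (or an array reshape) before being assigned to its layer.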

Performance (Apple Silicon, M-series, 10.24 s @ 48 kHz)

| Metric | Value |
| --- | --- |
| Real-time factor (wall / audio) | 1.87 |
| Load time | 0.06 s |
| SNR vs FP16 reference | +31.8 dB |
| Peak amplitude | 0.500 (renormalised per upstream) |
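For readers who want to reproduce these two metrics on their own hardware, the sketch below shows the usual definitions. The function names are hypothetical, and the exact SNR procedure behind the figure in the table is not stated in this card, so the second function is a conventional definition rather than the author's script.

```python
import math

def real_time_factor(wall_seconds, audio_seconds):
    """Wall-clock time divided by audio duration; values above 1 are slower than real time."""
    return wall_seconds / audio_seconds

def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB of `estimate` measured against `reference`."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)
```

Under this definition, a real-time factor of 1.87 on a 10.24 s frame corresponds to about 19.1 s of wall-clock time per frame.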

INT8 is the transparent-quality variant: choose it when you need maximum fidelity to the upstream PyTorch reference and can accept twice the on-disk size of INT4. For most deployments, INT4 is the recommended variant.

Usage

from huggingface_hub import snapshot_download
import mlx.core as mx
import numpy as np
import scipy.io.wavfile as wf
from scipy.signal import resample_poly

from audiosr import AudioSR

# Download the quantized bundle (model.safetensors + config.json).
bundle = snapshot_download("aufklarer/AudioSR-MLX-8bit")
# See https://github.com/soniqo/speech-swift for production usage.
model = AudioSR(bundle)

# Read 16-bit PCM and scale to [-1, 1]; the model expects mono input.
sr, audio = wf.read("lr.wav")
audio = audio.astype(np.float32) / 32767.0
if audio.ndim == 2:
    audio = audio.mean(axis=1)  # downmix multi-channel input to mono
audio_48 = resample_poly(audio, 48000, sr).astype(np.float32)

# Run the 200-step sampler, then force evaluation of the lazy MLX graph.
hr = model(mx.array(audio_48), steps=200, cfg_scale=3.5, seed=42)
mx.eval(hr)
wf.write("hr.wav", 48000, (np.clip(np.array(hr), -1, 1) * 32767).astype(np.int16))
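The model processes 10.24 s (491,520 samples) per forward pass, and this card does not say whether the model call splits longer inputs itself. If it does not, one option is to cut the 48 kHz signal into fixed-size frames and zero-pad the last one. The helper below is a hypothetical sketch that only computes the frame boundaries and the padding each frame needs.

```python
FRAME = 491_520  # 10.24 s at 48 kHz

def frame_spans(n_samples, frame=FRAME):
    """Return (start, stop, pad) triples that cover n_samples in fixed-size frames.

    `pad` is the number of zeros to append so the slice reaches `frame` samples.
    """
    spans = []
    for start in range(0, n_samples, frame):
        stop = min(start + frame, n_samples)
        spans.append((start, stop, frame - (stop - start)))
    return spans
```

Each slice `audio_48[start:stop]` would be padded with `pad` zeros, upsampled, and trimmed back to `stop - start` samples before the outputs are concatenated. Hard frame boundaries may produce audible seams, so overlapping frames with a crossfade could be preferable in practice.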

Source

License

CC-BY-NC 4.0 — inherited from upstream AudioSR weights. Non-commercial use only.
