mazesmazes/tiny-audio

@@ -1,267 +1,82 @@
 ---
-license: mit
-language:
-- en
-datasets:
-- speechbrain/LoquaciousSet
-base_model:
-- zai-org/GLM-ASR-Nano-2512
-- Qwen/Qwen3-0.6B
-pipeline_tag: automatic-speech-recognition
-tags:
-- asr
-- speech-recognition
-- audio
-- qwen
-- glm-asr
 library_name: transformers
 ---
-# Tiny Audio
-A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
-## Quick Start
-```python
-from transformers import pipeline
-pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-result = pipe("audio.wav")
-print(result["text"])
-```
-## Usage Examples
-### Basic Transcription
-```python
-from transformers import pipeline
-pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-# From file
-result = pipe("audio.wav")
-print(result["text"])
-# From URL
-result = pipe("https://example.com/audio.mp3")
-# From numpy array (must be 16kHz)
-import numpy as np
-audio = np.random.randn(16000).astype(np.float32)  # 1 second
-result = pipe(audio)
-```
-### Batch Processing
-```python
-# Process multiple files
-files = ["audio1.wav", "audio2.wav", "audio3.wav"]
-results = pipe(files, batch_size=4)
-for r in results:
-    print(r["text"])
-```
-### Word-Level Timestamps
-```python
-result = pipe("audio.wav", return_timestamps="word")
-# Returns:
-# {
-#   "text": "hello world",
-#   "chunks": [
-#     {"text": "hello", "timestamp": (0.0, 0.5)},
-#     {"text": "world", "timestamp": (0.6, 1.0)}
-#   ]
-# }
-```
-### Streaming Inference
-```python
-from tiny_audio import ASRModel, ASRProcessor
-import torch
-model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
-processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
-# Load and process audio
-import librosa
-audio, sr = librosa.load("audio.wav", sr=16000)
-inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-# Stream tokens
-for token in model.generate_streaming(inputs["input_features"]):
-    print(token, end="", flush=True)
-```
-### Using with torch directly
-```python
-from tiny_audio import ASRModel, ASRProcessor
-import torch
-import librosa
-# Load model and processor
-model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
-processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
-# Load audio (16kHz)
-audio, sr = librosa.load("audio.wav", sr=16000)
-# Process
-inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-# Generate
-with torch.no_grad():
-    output = model.generate(
-        input_features=inputs["input_features"],
-        attention_mask=inputs["attention_mask"],
-        max_new_tokens=256
-    )
-# Decode
-text = processor.batch_decode(output, skip_special_tokens=True)[0]
-print(text)
-```
-### GPU Inference
-```python
-import torch
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model="mazesmazes/tiny-audio",
-    trust_remote_code=True,
-    device="cuda"  # or device=0
-)
-```
-### Half Precision
-```python
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model="mazesmazes/tiny-audio",
-    trust_remote_code=True,
-    torch_dtype=torch.float16,
-    device="cuda"
-)
-```
-## Architecture
-```
-Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
-```
-Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
-| Component | Model | Parameters | Status |
-|-----------|-------|------------|--------|
-| Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
-| Projector | 2-layer MLP | ~12M | Trained |
-| Language Model | Qwen3-0.6B | ~600M | Frozen |
-### How It Works
-1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
-2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
-3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
-The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
-## Model Specifications
-| Specification | Value |
-|---------------|-------|
-| Input | Audio (16kHz mono) |
-| Output | Text transcription |
-| Max Audio Length | ~30 seconds (limited by encoder) |
-| Vocabulary | Qwen3 tokenizer |
-| Languages | English only |
-| Generation | Greedy decoding (num_beams=1, do_sample=False) |
-## Training Details
-| | |
-|---|---|
-| **Dataset** | LoquaciousSet (25,000 hours) |
-| **Hardware** | Single NVIDIA A40 |
-| **Time** | ~24 hours |
-| **Cost** | ~$12 |
-| **Optimizer** | AdamW |
-| **Learning Rate** | 1e-4 |
-| **Batch Size** | 4 |
-| **Steps** | 50,000 |
-## Limitations
-- **English only**: Not trained on other languages
-- **Sample rate**: Expects 16kHz audio (other rates resampled automatically)
-- **Audio length**: Best for clips under 30 seconds
-- **Accuracy**: May degrade on:
-  - Heavily accented speech
-  - Noisy or low-quality audio
-  - Domain-specific terminology
-  - Overlapping speakers
-- **No punctuation**: Output is lowercase without punctuation by default
-## Requirements
-```
-transformers>=4.40.0
-torch>=2.0.0
-torchaudio>=2.0.0
-```
-Optional for streaming:
-```
-librosa
-soundfile
-```
-## Files
-| File | Description |
-|------|-------------|
-| `config.json` | Model configuration |
-| `model.safetensors` | Projector weights (~48MB) |
-| `preprocessor_config.json` | Audio preprocessing config |
-| `tokenizer.json` | Tokenizer |
-| `tokenizer_config.json` | Tokenizer config |
-| `special_tokens_map.json` | Special tokens |
-Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
-## Citation
-If you use this model, please cite:
-```bibtex
-@misc{tinyaudio2024,
-  author = {Alex Kroman},
-  title = {Tiny Audio: Minimal ASR Training},
-  year = {2024},
-  publisher = {GitHub},
-  url = {https://github.com/alexkroman/tiny-audio}
-}
-```
-## Links
-- [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
-- [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
-- [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
-## Acknowledgments
-- [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
-- [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
-- [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
-## License
-MIT
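
The removed card states that the projector shortens the audio sequence by frame stacking, with `output_len = (input_len - 5) // 5 + 1`. The sketch below checks that formula against a plain-Python stacking routine. It assumes the `5` is both the window size and the stride, and that an incomplete trailing window is dropped. The function names `stacked_length` and `stack_frames` are hypothetical and are not taken from the tiny-audio codebase.

```python
def stacked_length(input_len: int, k: int = 5) -> int:
    """Output length from the removed card's formula (window k, stride k)."""
    # Python floor division already yields 0 for inputs shorter than k;
    # max() just makes that explicit.
    return max(0, (input_len - k) // k + 1)


def stack_frames(frames: list[list[float]], k: int = 5) -> list[list[float]]:
    """Concatenate each run of k consecutive frames; drop the incomplete tail."""
    n = stacked_length(len(frames), k)
    return [sum(frames[i * k:(i + 1) * k], []) for i in range(n)]


# 23 toy frames of dimension 2 (real encoder frames are much wider).
frames = [[float(t), float(t) + 0.5] for t in range(23)]
stacked = stack_frames(frames)

# 23 frames -> 4 stacked frames, each 5 * 2 = 10 values wide.
print(len(stacked), len(stacked[0]))
```

Under these assumptions, the last three of the 23 input frames do not fill a window and are discarded, which is why the formula subtracts before dividing.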

 ---
 library_name: transformers
+tags:
+- generated_from_trainer
+model-index:
+- name: tiny-audio
+  results: []
 ---
+<!-- This model card has been generated automatically according to the information the Trainer had access to. You
+should probably proofread and complete it, then remove this comment. -->
+# tiny-audio
+This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
+It achieves the following results on the evaluation set:
+- Loss: 0.4587
+## Model description
+More information needed
+## Intended uses & limitations
+More information needed
+## Training and evaluation data
+More information needed
+## Training procedure
+### Training hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 0.001
+- train_batch_size: 14
+- eval_batch_size: 14
+- seed: 42
+- gradient_accumulation_steps: 4
+- total_train_batch_size: 56
+- optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+- lr_scheduler_type: polynomial
+- lr_scheduler_warmup_steps: 500
+- num_epochs: 1
+- label_smoothing_factor: 0.1
+### Training results
+| Training Loss | Epoch  | Step  | Validation Loss |
+|:-------------:|:------:|:-----:|:---------------:|
+| 2.1737        | 0.0418 | 1000  | 0.4878          |
+| 2.1091        | 0.0836 | 2000  | 0.4777          |
+| 2.0988        | 0.1254 | 3000  | 0.4728          |
+| 2.0590        | 0.1672 | 4000  | 0.4705          |
+| 2.0484        | 0.2090 | 5000  | 0.4689          |
+| 2.0637        | 0.2508 | 6000  | 0.4670          |
+| 2.0505        | 0.2926 | 7000  | 0.4659          |
+| 2.0550        | 0.3344 | 8000  | 0.4650          |
+| 2.0516        | 0.3762 | 9000  | 0.4641          |
+| 2.0530        | 0.4180 | 10000 | 0.4634          |
+| 2.0301        | 0.4598 | 11000 | 0.4628          |
+| 2.0608        | 0.5016 | 12000 | 0.4623          |
+| 2.0428        | 0.5434 | 13000 | 0.4621          |
+| 2.0248        | 0.5852 | 14000 | 0.4620          |
+| 2.0525        | 0.6270 | 15000 | 0.4612          |
+| 2.0281        | 0.6688 | 16000 | 0.4609          |
+| 2.0338        | 0.7106 | 17000 | 0.4600          |
+| 2.0492        | 0.7524 | 18000 | 0.4605          |
+| 2.0261        | 0.7942 | 19000 | 0.4598          |
+| 2.0084        | 0.8360 | 20000 | 0.4593          |
+| 2.0236        | 0.8778 | 21000 | 0.4590          |
+| 2.0205        | 0.9196 | 22000 | 0.4590          |
+| 2.0063        | 0.9614 | 23000 | 0.4587          |
+### Framework versions
+- Transformers 5.0.0.dev0
+- Pytorch 2.8.0+cu128
+- Datasets 3.6.0
+- Tokenizers 0.22.2
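
The hyperparameters and results table on the added side can be cross-checked with a little arithmetic. The sketch below uses only numbers from the diff. It assumes a single training device (the card does not say how many were used) and that `Step` counts optimizer updates, so the derived figures are rough estimates, not reported values.

```python
# Values copied from the "+" side of the diff.
train_batch_size = 14
gradient_accumulation_steps = 4
total_train_batch_size = 56

# Assumption: one device, so there is no data-parallel multiplier.
num_devices = 1
effective = train_batch_size * gradient_accumulation_steps * num_devices
assert effective == total_train_batch_size  # 14 * 4 * 1 == 56

# Last row of the results table: step 23000 at epoch 0.9614.
# The epoch column is rounded to four decimals, so this is approximate.
step, epoch = 23000, 0.9614
steps_per_epoch = step / epoch
examples_per_epoch = steps_per_epoch * total_train_batch_size

# Validation loss at the first and last logged evaluations.
first_val, last_val = 0.4878, 0.4587
relative_drop = (first_val - last_val) / first_val

print(f"steps per epoch    ~ {steps_per_epoch:,.0f}")
print(f"examples per epoch ~ {examples_per_epoch:,.0f}")
print(f"validation loss drop: {relative_drop:.1%}")
```

With these inputs the estimate comes to roughly 23,900 steps per epoch and about 1.34 million training examples, and the validation loss falls by about 6% between step 1000 and step 23000.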