mazesmazes committed on
Commit bbe4853 · verified · 1 Parent(s): 68bc8e1

Model save

Files changed (1):
  1. README.md +78 -263
README.md CHANGED
@@ -1,267 +1,82 @@
  ---
- license: mit
- language:
- - en
- datasets:
- - speechbrain/LoquaciousSet
- base_model:
- - zai-org/GLM-ASR-Nano-2512
- - Qwen/Qwen3-0.6B
- pipeline_tag: automatic-speech-recognition
- tags:
- - asr
- - speech-recognition
- - audio
- - qwen
- - glm-asr
  library_name: transformers
  ---

- # Tiny Audio
-
- A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio)—a minimal, hackable ASR framework.
-
- ## Quick Start
-
- ```python
- from transformers import pipeline
-
- pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
- result = pipe("audio.wav")
- print(result["text"])
- ```
-
- ## Usage Examples
-
- ### Basic Transcription
-
- ```python
- from transformers import pipeline
-
- pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
-
- # From file
- result = pipe("audio.wav")
- print(result["text"])
-
- # From URL
- result = pipe("https://example.com/audio.mp3")
-
- # From numpy array (must be 16kHz)
- import numpy as np
- audio = np.random.randn(16000).astype(np.float32)  # 1 second
- result = pipe(audio)
- ```
-
- ### Batch Processing
-
- ```python
- # Process multiple files
- files = ["audio1.wav", "audio2.wav", "audio3.wav"]
- results = pipe(files, batch_size=4)
- for r in results:
-     print(r["text"])
- ```
-
- ### Word-Level Timestamps
-
- ```python
- result = pipe("audio.wav", return_timestamps="word")
- # Returns:
- # {
- #     "text": "hello world",
- #     "chunks": [
- #         {"text": "hello", "timestamp": (0.0, 0.5)},
- #         {"text": "world", "timestamp": (0.6, 1.0)}
- #     ]
- # }
- ```
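The `chunks` structure documented above is plain Python data, so downstream code can consume it directly. A small illustrative sketch (the sample `result` below simply mirrors the documented shape; it is not real model output) derives per-word durations:

```python
# Sample result mirroring the documented return shape (values are
# illustrative, not real model output).
result = {
    "text": "hello world",
    "chunks": [
        {"text": "hello", "timestamp": (0.0, 0.5)},
        {"text": "world", "timestamp": (0.6, 1.0)},
    ],
}

# Each chunk carries a (start, end) pair, so word durations fall out directly.
durations = {c["text"]: c["timestamp"][1] - c["timestamp"][0] for c in result["chunks"]}
print(durations)  # {'hello': 0.5, 'world': 0.4}
```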
-
- ### Streaming Inference
-
- ```python
- from tiny_audio import ASRModel, ASRProcessor
- import torch
-
- model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
- processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
-
- # Load and process audio
- import librosa
- audio, sr = librosa.load("audio.wav", sr=16000)
- inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
- # Stream tokens
- for token in model.generate_streaming(inputs["input_features"]):
-     print(token, end="", flush=True)
- ```
-
- ### Using with torch directly
-
- ```python
- from tiny_audio import ASRModel, ASRProcessor
- import torch
- import librosa
-
- # Load model and processor
- model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
- processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
-
- # Load audio (16kHz)
- audio, sr = librosa.load("audio.wav", sr=16000)
-
- # Process
- inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
-
- # Generate
- with torch.no_grad():
-     output = model.generate(
-         input_features=inputs["input_features"],
-         attention_mask=inputs["attention_mask"],
-         max_new_tokens=256
-     )
-
- # Decode
- text = processor.batch_decode(output, skip_special_tokens=True)[0]
- print(text)
- ```
-
- ### GPU Inference
-
- ```python
- import torch
-
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model="mazesmazes/tiny-audio",
-     trust_remote_code=True,
-     device="cuda"  # or device=0
- )
- ```
-
- ### Half Precision
-
- ```python
- pipe = pipeline(
-     "automatic-speech-recognition",
-     model="mazesmazes/tiny-audio",
-     trust_remote_code=True,
-     torch_dtype=torch.float16,
-     device="cuda"
- )
- ```
-
- ## Architecture
-
- ```
- Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
- ```
-
- Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
-
- | Component | Model | Parameters | Status |
- |-----------|-------|------------|--------|
- | Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
- | Projector | 2-layer MLP | ~12M | Trained |
- | Language Model | Qwen3-0.6B | ~600M | Frozen |
-
- ### How It Works
-
- 1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
- 2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
- 3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
-
- The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
-
- ## Model Specifications
-
- | Specification | Value |
- |---------------|-------|
- | Input | Audio (16kHz mono) |
- | Output | Text transcription |
- | Max Audio Length | ~30 seconds (limited by encoder) |
- | Vocabulary | Qwen3 tokenizer |
- | Languages | English only |
- | Generation | Greedy decoding (num_beams=1, do_sample=False) |
-
- ## Training Details
-
- | | |
- |---|---|
- | **Dataset** | LoquaciousSet (25,000 hours) |
- | **Hardware** | Single NVIDIA A40 |
- | **Time** | ~24 hours |
- | **Cost** | ~$12 |
- | **Optimizer** | AdamW |
- | **Learning Rate** | 1e-4 |
- | **Batch Size** | 4 |
- | **Steps** | 50,000 |
-
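The length formula quoted above can be read as a non-overlapping window of 5 encoder frames advanced with stride 5. A minimal sketch of that reading (the helper name is illustrative, not taken from the codebase):

```python
# Illustrative sketch: frame stacking behaves like a window of 5 frames
# with stride 5, which yields output_len = (input_len - 5) // 5 + 1.
def stacked_length(input_len: int, window: int = 5, stride: int = 5) -> int:
    return (input_len - window) // stride + 1

# 100 encoder frames collapse to 20 projected embeddings.
print(stacked_length(100))  # 20
```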
- ## Limitations
-
- - **English only**: Not trained on other languages
- - **Sample rate**: Expects 16kHz audio (other rates resampled automatically)
- - **Audio length**: Best for clips under 30 seconds
- - **Accuracy**: May degrade on:
-   - Heavily accented speech
-   - Noisy or low-quality audio
-   - Domain-specific terminology
-   - Overlapping speakers
- - **No punctuation**: Output is lowercase without punctuation by default
-
- ## Requirements
-
- ```
- transformers>=4.40.0
- torch>=2.0.0
- torchaudio>=2.0.0
- ```
-
- Optional for streaming:
- ```
- librosa
- soundfile
- ```
-
- ## Files
-
- | File | Description |
- |------|-------------|
- | `config.json` | Model configuration |
- | `model.safetensors` | Projector weights (~48MB) |
- | `preprocessor_config.json` | Audio preprocessing config |
- | `tokenizer.json` | Tokenizer |
- | `tokenizer_config.json` | Tokenizer config |
- | `special_tokens_map.json` | Special tokens |
-
- Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
-
- ## Citation
-
- If you use this model, please cite:
-
- ```bibtex
- @misc{tinyaudio2024,
-   author = {Alex Kroman},
-   title = {Tiny Audio: Minimal ASR Training},
-   year = {2024},
-   publisher = {GitHub},
-   url = {https://github.com/alexkroman/tiny-audio}
- }
- ```
-
- ## Links
-
- - [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
- - [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
- - [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
-
- ## Acknowledgments
-
- - [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
- - [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
- - [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
-
- ## License
-
- MIT

  ---
  library_name: transformers
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: tiny-audio
+   results: []
  ---

+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # tiny-audio
+
+ This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.4587
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 0.001
+ - train_batch_size: 14
+ - eval_batch_size: 14
+ - seed: 42
+ - gradient_accumulation_steps: 4
+ - total_train_batch_size: 56
+ - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
+ - lr_scheduler_type: polynomial
+ - lr_scheduler_warmup_steps: 500
+ - num_epochs: 1
+ - label_smoothing_factor: 0.1
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss |
+ |:-------------:|:------:|:-----:|:---------------:|
+ | 2.1737 | 0.0418 | 1000 | 0.4878 |
+ | 2.1091 | 0.0836 | 2000 | 0.4777 |
+ | 2.0988 | 0.1254 | 3000 | 0.4728 |
+ | 2.0590 | 0.1672 | 4000 | 0.4705 |
+ | 2.0484 | 0.2090 | 5000 | 0.4689 |
+ | 2.0637 | 0.2508 | 6000 | 0.4670 |
+ | 2.0505 | 0.2926 | 7000 | 0.4659 |
+ | 2.0550 | 0.3344 | 8000 | 0.4650 |
+ | 2.0516 | 0.3762 | 9000 | 0.4641 |
+ | 2.0530 | 0.4180 | 10000 | 0.4634 |
+ | 2.0301 | 0.4598 | 11000 | 0.4628 |
+ | 2.0608 | 0.5016 | 12000 | 0.4623 |
+ | 2.0428 | 0.5434 | 13000 | 0.4621 |
+ | 2.0248 | 0.5852 | 14000 | 0.4620 |
+ | 2.0525 | 0.6270 | 15000 | 0.4612 |
+ | 2.0281 | 0.6688 | 16000 | 0.4609 |
+ | 2.0338 | 0.7106 | 17000 | 0.4600 |
+ | 2.0492 | 0.7524 | 18000 | 0.4605 |
+ | 2.0261 | 0.7942 | 19000 | 0.4598 |
+ | 2.0084 | 0.8360 | 20000 | 0.4593 |
+ | 2.0236 | 0.8778 | 21000 | 0.4590 |
+ | 2.0205 | 0.9196 | 22000 | 0.4590 |
+ | 2.0063 | 0.9614 | 23000 | 0.4587 |
+
+ ### Framework versions
+
+ - Transformers 5.0.0.dev0
+ - Pytorch 2.8.0+cu128
+ - Datasets 3.6.0
+ - Tokenizers 0.22.2
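As a quick cross-check, the hyperparameters and results reported in this card are internally consistent; the arithmetic below uses only values copied from the card:

```python
# Values copied from the hyperparameter list above.
train_batch_size = 14
gradient_accumulation_steps = 4

# Effective examples per optimizer step matches the reported
# total_train_batch_size of 56.
total_train_batch_size = train_batch_size * gradient_accumulation_steps
print(total_train_batch_size)  # 56

# The results table logs step 1000 at epoch 0.0418, implying roughly
# 23,923 optimizer steps per epoch of this dataset.
steps_per_epoch = 1000 / 0.0418
print(round(steps_per_epoch))  # 23923
```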