mazesmazes committed
Commit 0e60196 · verified · 1 Parent(s): bbe4853

Update custom model files, README, and requirements

Files changed (3):
  1. .gitattributes +0 -1
  2. README.md +263 -78
  3. asr_pipeline.py +164 -54
.gitattributes CHANGED
@@ -1,4 +1,3 @@
  *.safetensors filter=lfs diff=lfs merge=lfs -text
  *.bin filter=lfs diff=lfs merge=lfs -text
  tokenizer_config.json -filter -diff -merge text
- tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED
@@ -1,82 +1,267 @@
  ---
- library_name: transformers
  tags:
- - generated_from_trainer
- model-index:
- - name: tiny-audio
-   results: []
  ---
  
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
- # tiny-audio
-
- This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.4587
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 0.001
- - train_batch_size: 14
- - eval_batch_size: 14
- - seed: 42
- - gradient_accumulation_steps: 4
- - total_train_batch_size: 56
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: polynomial
- - lr_scheduler_warmup_steps: 500
- - num_epochs: 1
- - label_smoothing_factor: 0.1
-
- ### Training results
-
- | Training Loss | Epoch  | Step  | Validation Loss |
- |:-------------:|:------:|:-----:|:---------------:|
- | 2.1737        | 0.0418 | 1000  | 0.4878          |
- | 2.1091        | 0.0836 | 2000  | 0.4777          |
- | 2.0988        | 0.1254 | 3000  | 0.4728          |
- | 2.0590        | 0.1672 | 4000  | 0.4705          |
- | 2.0484        | 0.2090 | 5000  | 0.4689          |
- | 2.0637        | 0.2508 | 6000  | 0.4670          |
- | 2.0505        | 0.2926 | 7000  | 0.4659          |
- | 2.0550        | 0.3344 | 8000  | 0.4650          |
- | 2.0516        | 0.3762 | 9000  | 0.4641          |
- | 2.0530        | 0.4180 | 10000 | 0.4634          |
- | 2.0301        | 0.4598 | 11000 | 0.4628          |
- | 2.0608        | 0.5016 | 12000 | 0.4623          |
- | 2.0428        | 0.5434 | 13000 | 0.4621          |
- | 2.0248        | 0.5852 | 14000 | 0.4620          |
- | 2.0525        | 0.6270 | 15000 | 0.4612          |
- | 2.0281        | 0.6688 | 16000 | 0.4609          |
- | 2.0338        | 0.7106 | 17000 | 0.4600          |
- | 2.0492        | 0.7524 | 18000 | 0.4605          |
- | 2.0261        | 0.7942 | 19000 | 0.4598          |
- | 2.0084        | 0.8360 | 20000 | 0.4593          |
- | 2.0236        | 0.8778 | 21000 | 0.4590          |
- | 2.0205        | 0.9196 | 22000 | 0.4590          |
- | 2.0063        | 0.9614 | 23000 | 0.4587          |
-
-
- ### Framework versions
-
- - Transformers 5.0.0.dev0
- - Pytorch 2.8.0+cu128
- - Datasets 3.6.0
- - Tokenizers 0.22.2
  ---
+ license: mit
+ language:
+ - en
+ datasets:
+ - speechbrain/LoquaciousSet
+ base_model:
+ - zai-org/GLM-ASR-Nano-2512
+ - Qwen/Qwen3-0.6B
+ pipeline_tag: automatic-speech-recognition
  tags:
+ - asr
+ - speech-recognition
+ - audio
+ - qwen
+ - glm-asr
+ library_name: transformers
  ---
  
+ # Tiny Audio
+ 
+ A speech recognition model trained in 24 hours on a single GPU for ~$12. Built with [Tiny Audio](https://github.com/alexkroman/tiny-audio), a minimal, hackable ASR framework.
+ 
+ ## Quick Start
+ 
+ ```python
+ from transformers import pipeline
+ 
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+ result = pipe("audio.wav")
+ print(result["text"])
+ ```
+ 
+ ## Usage Examples
+ 
+ ### Basic Transcription
+ 
+ ```python
+ from transformers import pipeline
+ 
+ pipe = pipeline("automatic-speech-recognition", model="mazesmazes/tiny-audio", trust_remote_code=True)
+ 
+ # From file
+ result = pipe("audio.wav")
+ print(result["text"])
+ 
+ # From URL
+ result = pipe("https://example.com/audio.mp3")
+ 
+ # From numpy array (must be 16kHz)
+ import numpy as np
+ audio = np.random.randn(16000).astype(np.float32)  # 1 second
+ result = pipe(audio)
+ ```
+ 
+ ### Batch Processing
+ 
+ ```python
+ # Process multiple files
+ files = ["audio1.wav", "audio2.wav", "audio3.wav"]
+ results = pipe(files, batch_size=4)
+ for r in results:
+     print(r["text"])
+ ```
+ 
+ ### Word-Level Timestamps
+ 
+ ```python
+ result = pipe("audio.wav", return_timestamps="word")
+ # Returns:
+ # {
+ #     "text": "hello world",
+ #     "chunks": [
+ #         {"text": "hello", "timestamp": (0.0, 0.5)},
+ #         {"text": "world", "timestamp": (0.6, 1.0)}
+ #     ]
+ # }
+ ```
+ 
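The chunk format above is all you need to build subtitles. A minimal sketch (not part of the model's API) that turns those chunks into SRT text, reusing the `pipe` object from the earlier examples:

```python
# Convert word-level chunks into SRT subtitle text.
# Assumes the chunk format shown above: {"text": ..., "timestamp": (start, end)} in seconds.
def to_srt(chunks):
    def fmt(t):
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02}:{int(m):02}:{int(s):02},{int((s % 1) * 1000):03}"

    entries = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        entries.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text']}\n")
    return "\n".join(entries)

result = pipe("audio.wav", return_timestamps="word")
print(to_srt(result["chunks"]))
```
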
+ ### Streaming Inference
+ 
+ ```python
+ from tiny_audio import ASRModel, ASRProcessor
+ import torch
+ 
+ model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+ processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+ 
+ # Load and process audio
+ import librosa
+ audio, sr = librosa.load("audio.wav", sr=16000)
+ inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+ 
+ # Stream tokens
+ for token in model.generate_streaming(inputs["input_features"]):
+     print(token, end="", flush=True)
+ ```
+ 
+ ### Using with torch directly
+ 
+ ```python
+ from tiny_audio import ASRModel, ASRProcessor
+ import torch
+ import librosa
+ 
+ # Load model and processor
+ model = ASRModel.from_pretrained("mazesmazes/tiny-audio")
+ processor = ASRProcessor.from_pretrained("mazesmazes/tiny-audio")
+ 
+ # Load audio (16kHz)
+ audio, sr = librosa.load("audio.wav", sr=16000)
+ 
+ # Process
+ inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
+ 
+ # Generate
+ with torch.no_grad():
+     output = model.generate(
+         input_features=inputs["input_features"],
+         attention_mask=inputs["attention_mask"],
+         max_new_tokens=256
+     )
+ 
+ # Decode
+ text = processor.batch_decode(output, skip_special_tokens=True)[0]
+ print(text)
+ ```
+ 
+ ### GPU Inference
+ 
+ ```python
+ import torch
+ from transformers import pipeline
+ 
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="mazesmazes/tiny-audio",
+     trust_remote_code=True,
+     device="cuda"  # or device=0
+ )
+ ```
+ 
+ ### Half Precision
+ 
+ ```python
+ pipe = pipeline(
+     "automatic-speech-recognition",
+     model="mazesmazes/tiny-audio",
+     trust_remote_code=True,
+     torch_dtype=torch.float16,
+     device="cuda"
+ )
+ ```
+ 
+ ## Architecture
+ 
+ ```
+ Audio (16kHz) → GLM-ASR Encoder (frozen) → MLP Projector (trained) → Qwen3 (frozen) → Text
+ ```
+ 
+ Only the projector is trained (~12M params). The encoder and decoder remain frozen, leveraging their pretrained knowledge.
+ 
+ | Component | Model | Parameters | Status |
+ |-----------|-------|------------|--------|
+ | Audio Encoder | GLM-ASR-Nano-2512 | ~600M | Frozen |
+ | Projector | 2-layer MLP | ~12M | Trained |
+ | Language Model | Qwen3-0.6B | ~600M | Frozen |
+ 
+ ### How It Works
+ 
+ 1. **Audio Encoder**: GLM-ASR converts 16kHz audio into frame-level embeddings (768-dim)
+ 2. **Projector**: A 2-layer MLP with frame stacking bridges the audio and text embedding spaces
+ 3. **Language Model**: Qwen3 generates text autoregressively, conditioned on the projected audio
+ 
+ The projector reduces sequence length via frame stacking: `output_len = (input_len - 5) // 5 + 1`
+ 
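That formula corresponds to stacking 5 consecutive encoder frames with a stride of 5. A rough, illustrative sketch of such a projector follows; the hidden sizes and layer layout here are assumptions for illustration, not the repo's actual implementation:

```python
import torch
import torch.nn as nn

class StackingProjector(nn.Module):
    """Illustrative frame-stacking projector: concatenate 5 encoder frames,
    then map the stacked vector into the LM embedding space with a 2-layer MLP."""

    def __init__(self, enc_dim=768, lm_dim=1024, stack=5):
        super().__init__()
        self.stack = stack
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim * stack, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, frames):  # frames: (batch, input_len, enc_dim)
        # unfold over time -> (batch, output_len, enc_dim, stack),
        # where output_len = (input_len - stack) // stack + 1
        stacked = frames.unfold(1, self.stack, self.stack).flatten(2)
        return self.mlp(stacked)

x = torch.randn(1, 100, 768)         # 100 encoder frames
print(StackingProjector()(x).shape)  # torch.Size([1, 20, 1024]); (100 - 5) // 5 + 1 = 20
```
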
+ ## Model Specifications
+ 
+ | Specification | Value |
+ |---------------|-------|
+ | Input | Audio (16kHz mono) |
+ | Output | Text transcription |
+ | Max Audio Length | ~30 seconds (limited by encoder) |
+ | Vocabulary | Qwen3 tokenizer |
+ | Languages | English only |
+ | Generation | Greedy decoding (num_beams=1, do_sample=False) |
+ 
+ ## Training Details
+ 
+ | Setting | Value |
+ |---------|-------|
+ | **Dataset** | LoquaciousSet (25,000 hours) |
+ | **Hardware** | Single NVIDIA A40 |
+ | **Time** | ~24 hours |
+ | **Cost** | ~$12 |
+ | **Optimizer** | AdamW |
+ | **Learning Rate** | 1e-4 |
+ | **Batch Size** | 4 |
+ | **Steps** | 50,000 |
+ 
+ ## Limitations
+ 
+ - **English only**: Not trained on other languages
+ - **Sample rate**: Expects 16kHz audio (other rates are resampled automatically; see the resampling sketch below)
+ - **Audio length**: Best for clips under 30 seconds
+ - **Accuracy**: May degrade on:
+   - Heavily accented speech
+   - Noisy or low-quality audio
+   - Domain-specific terminology
+   - Overlapping speakers
+ - **No punctuation**: Output is lowercase without punctuation by default
+ 
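For the sample-rate point above: if your audio is not 16kHz, you can also resample it yourself before calling the pipeline. A minimal sketch with torchaudio (the filename is a placeholder):

```python
import torchaudio

waveform, sr = torchaudio.load("audio_44k.wav")  # (channels, samples), any sample rate
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
audio = waveform.mean(dim=0).numpy()  # mono float32 at 16kHz
result = pipe(audio)
```
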
+ ## Requirements
+ 
+ ```
+ transformers>=4.40.0
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ ```
+ 
+ Optional for streaming:
+ ```
+ librosa
+ soundfile
+ ```
+ 
+ ## Files
+ 
+ | File | Description |
+ |------|-------------|
+ | `config.json` | Model configuration |
+ | `model.safetensors` | Projector weights (~48MB) |
+ | `preprocessor_config.json` | Audio preprocessing config |
+ | `tokenizer.json` | Tokenizer |
+ | `tokenizer_config.json` | Tokenizer config |
+ | `special_tokens_map.json` | Special tokens |
+ 
+ Note: Only the projector weights are stored. The encoder (GLM-ASR) and decoder (Qwen3) are loaded from their respective HuggingFace repos.
+ 
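A quick, optional way to confirm what is actually stored in this repo (and that the full encoder/decoder weights are not) is to list its files with `huggingface_hub`:

```python
from huggingface_hub import list_repo_files

# The base encoder (GLM-ASR) and decoder (Qwen3) are pulled from their own repos
# at load time; this repo should only contain the projector and config files.
for name in list_repo_files("mazesmazes/tiny-audio"):
    print(name)
```
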
+ ## Citation
+ 
+ If you use this model, please cite:
+ 
+ ```bibtex
+ @misc{tinyaudio2024,
+   author = {Alex Kroman},
+   title = {Tiny Audio: Minimal ASR Training},
+   year = {2024},
+   publisher = {GitHub},
+   url = {https://github.com/alexkroman/tiny-audio}
+ }
+ ```
+ 
+ ## Links
+ 
+ - [GitHub Repository](https://github.com/alexkroman/tiny-audio) - Train your own model
+ - [Free 3.5-hour Course](https://github.com/alexkroman/tiny-audio/blob/main/docs/course/0-course-overview.md) - Learn ASR from scratch
+ - [Live Demo](https://huggingface.co/spaces/mazesmazes/tiny-audio) - Try it in your browser
+ 
+ ## Acknowledgments
+ 
+ - [GLM-ASR](https://huggingface.co/zai-org/GLM-ASR-Nano-2512) for the audio encoder
+ - [Qwen3](https://huggingface.co/Qwen/Qwen3-0.6B) for the language model
+ - [LoquaciousSet](https://huggingface.co/datasets/speechbrain/LoquaciousSet) for training data
+ 
+ ## License
+ 
+ MIT
asr_pipeline.py CHANGED
@@ -1,6 +1,7 @@
  """ASR pipeline for audio-to-text transcription with optional timestamps and diarization."""
  
  import re
  from pathlib import Path
  from typing import Any
  
@@ -23,8 +24,135 @@ def _get_device() -> str:
      return "cpu"
  
  
  class ForcedAligner:
-     """Lazy-loaded forced aligner for word-level timestamps using torchaudio wav2vec2."""
  
      _bundle = None
      _model = None
@@ -44,7 +172,8 @@ class ForcedAligner:
          if cls._model is None:
              import torchaudio
  
-             cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
              cls._model = cls._bundle.get_model().to(device)
              cls._model.eval()
              cls._labels = cls._bundle.get_labels()
@@ -57,28 +186,29 @@ class ForcedAligner:
          audio: np.ndarray,
          text: str,
          sample_rate: int = 16000,
-         _language: str = "eng",
          _batch_size: int = 16,
      ) -> list[dict]:
          """Align transcript to audio and return word-level timestamps.
  
          Args:
              audio: Audio waveform as numpy array
              text: Transcript text to align
              sample_rate: Audio sample rate (default 16000)
-             _language: ISO-639-3 language code (default "eng" for English, unused)
-             _batch_size: Batch size for alignment model (unused)
  
          Returns:
              List of dicts with 'word', 'start', 'end' keys
          """
          import torchaudio
-         from torchaudio.functional import forced_align, merge_tokens
  
          device = _get_device()
          model, labels, dictionary = cls.get_instance(device)
  
-         # Convert audio to tensor (copy to ensure array is writable)
          if isinstance(audio, np.ndarray):
              waveform = torch.from_numpy(audio.copy()).float()
          else:
@@ -88,7 +218,7 @@ class ForcedAligner:
          if waveform.dim() == 1:
              waveform = waveform.unsqueeze(0)
  
-         # Resample if needed (wav2vec2 expects 16kHz)
          if sample_rate != cls._bundle.sample_rate:
              waveform = torchaudio.functional.resample(
                  waveform, sample_rate, cls._bundle.sample_rate
@@ -103,67 +233,47 @@ class ForcedAligner:
  
          emission = emissions[0].cpu()
  
-         # Normalize text: uppercase, keep only valid characters
          transcript = text.upper()
-         # Build tokens from transcript
          tokens = []
          for char in transcript:
              if char in dictionary:
                  tokens.append(dictionary[char])
              elif char == " ":
-                 tokens.append(dictionary.get("|", dictionary.get(" ", 0)))
  
          if not tokens:
              return []
  
-         targets = torch.tensor([tokens], dtype=torch.int32)
-
-         # Run forced alignment
-         # Note: forced_align is deprecated in torchaudio 2.6+ and will be removed in 2.9 (late 2025)
-         # No official replacement announced yet. See https://github.com/pytorch/audio/issues/3902
-         aligned_tokens, scores = forced_align(emission.unsqueeze(0), targets, blank=0)
  
-         # Use torchaudio's merge_tokens to get token spans (removes blanks and merges repeats)
-         token_spans = merge_tokens(aligned_tokens[0], scores[0])
  
-         # Convert frame indices to time (model stride is 320 samples at 16kHz = 20ms)
-         frame_duration = 320 / cls._bundle.sample_rate
  
-         # Group token spans into words based on pipe separator
          words = text.split()
          word_timestamps = []
-         current_word_start = None
-         current_word_end = None
-         word_idx = 0
-
-         for span in token_spans:
-             token_char = labels[span.token]
-             if token_char == "|":  # Word separator
-                 if current_word_start is not None and word_idx < len(words):
-                     word_timestamps.append(
-                         {
-                             "word": words[word_idx],
-                             "start": current_word_start * frame_duration,
-                             "end": current_word_end * frame_duration,
-                         }
-                     )
-                     word_idx += 1
-                 current_word_start = None
-                 current_word_end = None
-             else:
-                 if current_word_start is None:
-                     current_word_start = span.start
-                 current_word_end = span.end
-
-         # Don't forget the last word
-         if current_word_start is not None and word_idx < len(words):
-             word_timestamps.append(
-                 {
-                     "word": words[word_idx],
-                     "start": current_word_start * frame_duration,
-                     "end": current_word_end * frame_duration,
-                 }
-             )
  
          return word_timestamps
  
  """ASR pipeline for audio-to-text transcription with optional timestamps and diarization."""
  
  import re
+ from dataclasses import dataclass
  from pathlib import Path
  from typing import Any
  
      return "cpu"
  
  
+ @dataclass
+ class _AlignPoint:
+     """A point in the alignment path."""
+ 
+     token_index: int
+     time_index: int
+     score: float
+ 
+ 
+ @dataclass
+ class _AlignSegment:
+     """An aligned character/word segment."""
+ 
+     label: str
+     start: int
+     end: int
+     score: float
+ 
+     @property
+     def length(self):
+         return self.end - self.start
+ 
+ 
+ def _get_trellis(emission: torch.Tensor, tokens: list[int], blank_id: int = 0) -> torch.Tensor:
+     """Build dynamic programming trellis for CTC alignment.
+ 
+     Based on WhisperX's alignment algorithm for improved accuracy.
+     """
+     num_frame = emission.size(0)
+     num_tokens = len(tokens)
+ 
+     trellis = torch.zeros((num_frame, num_tokens))
+     trellis[1:, 0] = torch.cumsum(emission[1:, blank_id], 0)
+     trellis[0, 1:] = -float("inf")
+     trellis[-num_tokens + 1 :, 0] = float("inf")
+ 
+     for t in range(num_frame - 1):
+         trellis[t + 1, 1:] = torch.maximum(
+             # Score for staying at the same token
+             trellis[t, 1:] + emission[t, blank_id],
+             # Score for changing to the next token
+             trellis[t, :-1] + emission[t, tokens[1:]],
+         )
+     return trellis
+ 
+ 
+ def _backtrack(
+     trellis: torch.Tensor,
+     emission: torch.Tensor,
+     tokens: list[int],
+     blank_id: int = 0,
+ ) -> list[_AlignPoint]:
+     """Backtrack through trellis to find optimal alignment path."""
+     t, j = trellis.size(0) - 1, trellis.size(1) - 1
+ 
+     path = [_AlignPoint(j, t, emission[t, blank_id].exp().item())]
+     while j > 0:
+         assert t > 0
+ 
+         p_stay = emission[t - 1, blank_id]
+         p_change = emission[t - 1, tokens[j]]
+ 
+         stayed = trellis[t - 1, j] + p_stay
+         changed = trellis[t - 1, j - 1] + p_change
+ 
+         t -= 1
+         if changed > stayed:
+             j -= 1
+ 
+         prob = (p_change if changed > stayed else p_stay).exp().item()
+         path.append(_AlignPoint(j, t, prob))
+ 
+     while t > 0:
+         prob = emission[t - 1, blank_id].exp().item()
+         path.append(_AlignPoint(j, t - 1, prob))
+         t -= 1
+ 
+     return path[::-1]
+ 
+ 
+ def _merge_repeats(path: list[_AlignPoint], transcript: str) -> list[_AlignSegment]:
+     """Merge repeated tokens into character segments."""
+     i1, i2 = 0, 0
+     segments = []
+     while i1 < len(path):
+         while i2 < len(path) and path[i1].token_index == path[i2].token_index:
+             i2 += 1
+         score = sum(path[k].score for k in range(i1, i2)) / (i2 - i1)
+         segments.append(
+             _AlignSegment(
+                 transcript[path[i1].token_index],
+                 path[i1].time_index,
+                 path[i2 - 1].time_index + 1,
+                 score,
+             )
+         )
+         i1 = i2
+     return segments
+ 
+ 
+ def _merge_words(segments: list[_AlignSegment], separator: str = "|") -> list[_AlignSegment]:
+     """Merge character segments into word segments."""
+     words = []
+     i1, i2 = 0, 0
+     while i1 < len(segments):
+         if i2 >= len(segments) or segments[i2].label == separator:
+             if i1 != i2:
+                 segs = segments[i1:i2]
+                 word = "".join([seg.label for seg in segs])
+                 total_length = sum(seg.length for seg in segs)
+                 score = (
+                     sum(seg.score * seg.length for seg in segs) / total_length
+                     if total_length > 0
+                     else 0
+                 )
+                 words.append(_AlignSegment(word, segments[i1].start, segments[i2 - 1].end, score))
+             i1 = i2 + 1
+             i2 = i1
+         else:
+             i2 += 1
+     return words
+ 
+ 
  class ForcedAligner:
+     """Forced aligner for word-level timestamps using wav2vec2.
+ 
+     Uses WhisperX-style dynamic programming alignment for improved accuracy
+     over simple CTC greedy alignment.
+     """
  
      _bundle = None
      _model = None
  
          if cls._model is None:
              import torchaudio
  
+             # Use LARGE model for better accuracy (same as WhisperX recommendation)
+             cls._bundle = torchaudio.pipelines.WAV2VEC2_ASR_LARGE_960H
              cls._model = cls._bundle.get_model().to(device)
              cls._model.eval()
              cls._labels = cls._bundle.get_labels()
  
          audio: np.ndarray,
          text: str,
          sample_rate: int = 16000,
+         _language: str = "en",
          _batch_size: int = 16,
      ) -> list[dict]:
          """Align transcript to audio and return word-level timestamps.
  
+         Uses WhisperX-style dynamic programming for improved alignment accuracy.
+ 
          Args:
              audio: Audio waveform as numpy array
              text: Transcript text to align
              sample_rate: Audio sample rate (default 16000)
+             _language: Language code (unused, English only)
+             _batch_size: Batch size (unused)
  
          Returns:
              List of dicts with 'word', 'start', 'end' keys
          """
          import torchaudio
  
          device = _get_device()
          model, labels, dictionary = cls.get_instance(device)
  
+         # Convert audio to tensor
          if isinstance(audio, np.ndarray):
              waveform = torch.from_numpy(audio.copy()).float()
          else:
  
          if waveform.dim() == 1:
              waveform = waveform.unsqueeze(0)
  
+         # Resample if needed
          if sample_rate != cls._bundle.sample_rate:
              waveform = torchaudio.functional.resample(
                  waveform, sample_rate, cls._bundle.sample_rate
  
  
          emission = emissions[0].cpu()
  
+         # Normalize text and build token sequence
          transcript = text.upper()
          tokens = []
+         clean_transcript = ""
+ 
          for char in transcript:
              if char in dictionary:
                  tokens.append(dictionary[char])
+                 clean_transcript += char
              elif char == " ":
+                 sep_token = dictionary.get("|", dictionary.get(" ", 0))
+                 tokens.append(sep_token)
+                 clean_transcript += "|"
  
          if not tokens:
              return []
  
+         # Build trellis and find optimal path (WhisperX-style DP alignment)
+         trellis = _get_trellis(emission, tokens, blank_id=0)
+         path = _backtrack(trellis, emission, tokens, blank_id=0)
  
+         # Merge into character segments, then word segments
+         char_segments = _merge_repeats(path, clean_transcript)
+         word_segments = _merge_words(char_segments, separator="|")
  
+         # Convert frame indices to time
+         frame_duration = 320 / cls._bundle.sample_rate  # 20ms per frame
  
+         # Build output with original words
          words = text.split()
          word_timestamps = []
+ 
+         for i, seg in enumerate(word_segments):
+             if i < len(words):
+                 word_timestamps.append(
+                     {
+                         "word": words[i],
+                         "start": seg.start * frame_duration,
+                         "end": seg.end * frame_duration,
+                     }
+                 )
  
          return word_timestamps
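For reference, a sketch of how the aligner in this hunk might be exercised on its own. It assumes `align` is exposed as a classmethod, as the `cls` usage above suggests; in normal use the pipeline calls it internally when `return_timestamps="word"` is requested:

```python
import numpy as np

# Hypothetical direct call to the aligner defined above (normally invoked by the pipeline).
audio = np.random.randn(16000 * 2).astype(np.float32)  # 2 seconds of 16kHz audio
words = ForcedAligner.align(audio, "hello world", sample_rate=16000)
# -> [{"word": "hello", "start": ..., "end": ...}, {"word": "world", ...}]
```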