We present TranscribeAI, an open-source, fully offline automatic speech recognition (ASR) system with integrated speaker diarization capabilities. The system operates entirely on local hardware without requiring API keys, cloud services, or internet connectivity, addressing critical privacy and cost concerns in audio transcription workflows. TranscribeAI employs a dual-engine architecture combining faster-whisper for CPU-based inference and mlx-whisper for Apple Silicon GPU acceleration, achieving 2–5x performance improvements on supported hardware. The system supports 99+ languages with automatic language detection, provides speaker identification through MFCC feature extraction and agglomerative clustering, and exports results in multiple formats (SRT, TXT, DOCX). We describe the system architecture, processing pipeline, and design decisions that enable production-quality transcription in a zero-cost, privacy-preserving local environment.
Automatic speech recognition (ASR) has made remarkable advances in recent years, driven by large-scale transformer-based models such as OpenAI's Whisper. However, most production ASR systems rely on cloud-based APIs, introducing concerns around data privacy, recurring costs, internet dependency, and latency. For organizations handling sensitive audio data—such as legal proceedings, medical dictation, financial meetings, or journalistic interviews—transmitting audio to third-party servers presents unacceptable risks.
TranscribeAI addresses these challenges by providing a complete, locally executed transcription system that requires no cloud connectivity. The system combines state-of-the-art Whisper models with speaker diarization, offering a practical solution for users who need accurate, multi-speaker transcription without compromising data sovereignty.
The key contributions of this work are:

- A dual-engine ASR architecture that selects between faster-whisper (CPU, CTranslate2) and mlx-whisper (Apple Silicon, Metal GPU) based on the host hardware, yielding 2–5x speedups on supported devices.
- A lightweight speaker diarization pipeline built on MFCC features and agglomerative clustering, avoiding large pretrained neural diarization models.
- A fully offline, zero-cost processing pipeline supporting 99+ languages with automatic language detection and multi-format export (SRT, TXT, DOCX).
- An open-source reference implementation with both a web interface and a command-line interface.
OpenAI's Whisper (Radford et al., 2022) demonstrated that large-scale weakly supervised pretraining on 680,000 hours of multilingual audio data can produce robust ASR models. The original Whisper models, while powerful, require significant computational resources and are primarily accessed through cloud APIs.
The faster-whisper project (Klein, 2023) reimplemented Whisper on top of CTranslate2, achieving up to 4x faster inference with comparable accuracy. For Apple Silicon devices, mlx-whisper (Apple MLX Team, 2024) leverages the Metal Performance Shaders framework to run GPU-accelerated inference natively on M-series chips.
Speaker diarization—the task of determining "who spoke when"—has traditionally relied on systems like pyannote.audio (Bredin et al., 2020) which require pretrained neural models and significant dependencies. TranscribeAI takes a lightweight approach using classical signal processing (MFCC features) combined with agglomerative clustering from scikit-learn, reducing dependency complexity while maintaining practical accuracy for common use cases.
TranscribeAI follows a modular architecture organized into four primary layers: audio input processing, speech recognition engine, speaker attribution, and output formatting.
The system implements a hardware-aware engine selection strategy. On Apple Silicon Macs (M1/M2/M3/M4), the system utilizes mlx-whisper, which leverages the Metal GPU framework for 2–5x faster inference compared to CPU-based processing. On all other platforms (Intel Macs, Linux, Windows), the system defaults to faster-whisper with CTranslate2 optimization.
| Platform | Engine | Acceleration |
|---|---|---|
| Apple Silicon (M1/M2/M3/M4) | mlx-whisper | Metal GPU, 2–5x speedup |
| macOS Intel | faster-whisper | CTranslate2 (CPU) |
| Linux | faster-whisper | CTranslate2 (CPU/CUDA) |
| Windows | faster-whisper | CTranslate2 (CPU) |
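In practice the selection reduces to a platform check at startup. A minimal sketch follows; the function name and exact checks are illustrative, not the project's actual code:

```python
import platform
import sys

def select_engine() -> str:
    """Pick the ASR engine for the current host.

    macOS on arm64 (Apple Silicon) gets the Metal-accelerated
    mlx-whisper backend; every other platform falls back to
    faster-whisper's CTranslate2 runtime.
    """
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx-whisper"
    return "faster-whisper"
```

Centralizing the check keeps the rest of the pipeline engine-agnostic: downstream stages only see timestamped segments, regardless of which backend produced them.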
The transcription workflow follows a five-stage pipeline:
Stage 1 — Audio Upload: The system accepts multiple audio and video formats including MP3, MP4, WAV, M4A, OGG, FLAC, and WEBM. Input files are normalized and preprocessed using pydub and librosa for consistent downstream processing.
Stage 2 — Model Loading: Whisper models are loaded from local cache. On first use, models are downloaded once and stored persistently, enabling fully offline operation thereafter. Users select from five model sizes based on their speed/accuracy requirements.
Stage 3 — Whisper Transcription: The selected engine processes the audio through the Whisper model, producing timestamped text segments with word-level confidence scores. Voice activity detection (VAD) using Silero VAD removes silence segments to improve processing efficiency and output quality.
Stage 4 — Speaker Identification: For multi-speaker audio, the system extracts Mel-frequency cepstral coefficient (MFCC) features from each speech segment using librosa. These features are then clustered using agglomerative clustering (scikit-learn) to assign speaker labels (Speaker 1, Speaker 2, etc.) to each transcribed segment.
Stage 5 — Format Export: The final transcription with speaker labels and timestamps is exported in the user's chosen format: SRT (subtitles), TXT (plain text), or DOCX (formatted document with speaker annotations).
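Stage 5's SRT output uses the standard `HH:MM:SS,mmm` subtitle timestamp. The formatting logic can be sketched as follows, assuming segments arrive as `(start, end, speaker, text)` tuples (the tuple layout is an assumption, not the project's internal representation):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start, end, speaker, text) tuples as numbered SRT blocks."""
    blocks = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    return "\n".join(blocks)
```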
| Model | Parameters | Download Size | Speed Profile | Recommended Use |
|---|---|---|---|---|
| tiny | 39M | ~75 MB | Fastest | Quick drafts, low-resource devices |
| base | 74M | ~145 MB | Very Fast | General-purpose, balanced |
| small | 244M | ~465 MB | Balanced | Recommended default |
| medium | 769M | ~1.5 GB | Slower | High accuracy needs |
| large-v3 | 1,550M | ~2.9 GB | Slowest | Maximum accuracy |
Speaker diarization in TranscribeAI employs a lightweight, dependency-minimal approach designed for offline operation without requiring large pretrained neural diarization models.
For each speech segment produced by the Whisper model, the system extracts 13-dimensional MFCC feature vectors using librosa. MFCCs summarize the spectral envelope of speech, which reflects speaker-dependent characteristics such as vocal-tract shape and timbre, making them a compact basis for distinguishing speakers.
The extracted MFCC features are aggregated per segment and fed into an agglomerative clustering algorithm (scikit-learn's AgglomerativeClustering). This bottom-up hierarchical approach iteratively merges the most similar segments until the target number of speakers is reached. Users can specify the expected number of speakers, or the system applies a default configuration.
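The clustering step can be sketched with scikit-learn on synthetic per-segment embeddings; in the real pipeline these vectors would come from the MFCC stage, and the well-separated centroids below are artificial:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two synthetic "speakers": 13-dim per-segment embeddings around distinct centroids
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(5, 13))
speaker_b = rng.normal(loc=5.0, scale=0.1, size=(5, 13))
embeddings = np.vstack([speaker_a, speaker_b])

# Bottom-up merging until the requested number of speakers remains
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
```

Each transcribed segment then inherits the cluster label of its embedding, which the export stage renders as "Speaker 1", "Speaker 2", and so on.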
This approach trades some accuracy compared to neural diarization systems (e.g., pyannote.audio) for significantly reduced complexity, zero additional model downloads, and fully offline operation.
| Component | Technology | Purpose |
|---|---|---|
| Backend | Flask (Python 3.10+) | Web server and API routing |
| ASR Engine (CPU) | faster-whisper / CTranslate2 | Optimized CPU inference |
| ASR Engine (GPU) | mlx-whisper / Metal | Apple Silicon GPU acceleration |
| Audio Processing | librosa, pydub, numpy | Feature extraction, format conversion |
| Speaker Clustering | scikit-learn | Agglomerative clustering |
| VAD | Silero VAD | Voice activity detection |
| Document Export | python-docx | DOCX generation with formatting |
| Frontend | Vanilla HTML/CSS/JS | Zero-dependency web UI |
TranscribeAI provides two interaction modes:
Web Interface: A professional dark-themed web UI accessible at http://localhost:8080, featuring drag-and-drop file upload, real-time progress visualization across all five pipeline stages, an integrated audio player, and search functionality within transcription results.
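A Flask upload endpoint behind such a UI might look like the following sketch; the `/upload` route, response shape, and job-id scheme are illustrative assumptions, not TranscribeAI's actual API:

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    """Accept an audio file and return a job id the UI can poll for progress."""
    f = request.files.get("file")
    if f is None or f.filename == "":
        return jsonify(error="no file provided"), 400
    job_id = uuid.uuid4().hex
    # A real handler would save the file and kick off the five-stage pipeline
    return jsonify(job_id=job_id, filename=f.filename), 202
```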
Command-Line Interface: A CLI tool for batch processing and scripting integration:
python3 transcribe_cli.py audio.mp3 --language id --model medium --speakers 3
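The flag surface of that invocation can be reproduced with a standard argparse sketch; the defaults and help strings are assumptions, and only the flags shown above come from the source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argument surface mirroring the CLI invocation above."""
    p = argparse.ArgumentParser(prog="transcribe_cli.py")
    p.add_argument("audio", help="input audio/video file")
    p.add_argument("--language", default=None,
                   help="ISO language code; omit for automatic detection")
    p.add_argument("--model", default="small",
                   choices=["tiny", "base", "small", "medium", "large-v3"])
    p.add_argument("--speakers", type=int, default=None,
                   help="expected number of speakers for diarization")
    return p

args = build_parser().parse_args(
    ["audio.mp3", "--language", "id", "--model", "medium", "--speakers", "3"]
)
```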
Models are automatically downloaded on first use and cached locally in the user's home directory. A dedicated model management script allows pre-downloading models for air-gapped environments:
python3 download_models.py small medium large-v3
TranscribeAI was designed with privacy as a first-class requirement. The system guarantees:

- No audio, transcripts, or metadata ever leave the local machine; processing involves no network calls.
- No API keys, accounts, or subscriptions are required, and there are no recurring costs.
- Models are downloaded once (or pre-loaded for air-gapped hosts) and cached locally; all subsequent operation is fully offline.
- All intermediate and final outputs are stored only on local disk, under the user's control.
These properties make TranscribeAI suitable for sensitive domains including legal transcription, medical dictation, confidential business meetings, and journalistic source protection.
TranscribeAI supports all major desktop operating systems:

- macOS on Apple Silicon (M1/M2/M3/M4), with Metal GPU acceleration via mlx-whisper
- macOS on Intel, via faster-whisper (CPU)
- Linux, via faster-whisper (CPU or CUDA)
- Windows, via faster-whisper (CPU)
TranscribeAI demonstrates that production-quality speech transcription with speaker diarization can be achieved entirely on local hardware, without cloud dependencies or recurring costs. The dual-engine architecture ensures optimal performance across hardware platforms, while the MFCC-based diarization approach provides practical speaker identification without heavy neural model dependencies.
Future directions include:
The complete source code is available at https://github.com/romizone/transcribeAI under the MIT license.