We present TranscribeAI, an open-source, fully offline automatic speech recognition (ASR) system with integrated speaker diarization capabilities. The system operates entirely on local hardware without requiring API keys, cloud services, or internet connectivity, addressing critical privacy and cost concerns in audio transcription workflows. TranscribeAI employs a dual-engine architecture combining faster-whisper for CPU-based inference and mlx-whisper for Apple Silicon GPU acceleration, achieving 2–5x performance improvements on supported hardware. The system supports 99+ languages with automatic language detection, provides speaker identification through MFCC feature extraction and agglomerative clustering, and exports results in multiple formats (SRT, TXT, DOCX). We describe the system architecture, processing pipeline, and design decisions that enable production-quality transcription in a zero-cost, privacy-preserving local environment.
Automatic speech recognition (ASR) has made remarkable advances in recent years, driven by large-scale transformer-based models such as OpenAI's Whisper. However, most production ASR systems rely on cloud-based APIs, introducing concerns around data privacy, recurring costs, internet dependency, and latency. For organizations handling sensitive audio data—such as legal proceedings, medical dictation, financial meetings, or journalistic interviews—transmitting audio to third-party servers presents unacceptable risks.
TranscribeAI addresses these challenges by providing a complete, locally executed transcription system that requires no cloud connectivity. The system combines state-of-the-art Whisper models with speaker diarization, offering a practical solution for users who need accurate, multi-speaker transcription without compromising data sovereignty.
The key contributions of this work are:

- A dual-engine ASR architecture that selects between faster-whisper (CPU, CTranslate2) and mlx-whisper (Apple Silicon, Metal GPU) based on the host hardware, yielding 2–5x speedups on supported devices.
- A lightweight speaker diarization pipeline built on MFCC features and agglomerative clustering, avoiding large pretrained neural diarization models.
- A fully offline, zero-cost processing pipeline supporting 99+ languages with automatic language detection and multi-format export (SRT, TXT, DOCX).
- An open-source reference implementation with both a web interface and a command-line interface.
OpenAI's Whisper (Radford et al., 2022) demonstrated that large-scale weakly supervised pretraining on 680,000 hours of multilingual audio data can produce robust ASR models. The original Whisper models, while powerful, require significant computational resources and are primarily accessed through cloud APIs.
The faster-whisper project (Klein, 2023) reimplemented Whisper on top of CTranslate2, achieving up to 4x faster inference with comparable accuracy. For Apple Silicon devices, mlx-whisper (Apple MLX Team, 2024) leverages the Metal Performance Shaders framework to run GPU-accelerated inference natively on M-series chips.
Speaker diarization—the task of determining "who spoke when"—has traditionally relied on systems like pyannote.audio (Bredin et al., 2020) which require pretrained neural models and significant dependencies. TranscribeAI takes a lightweight approach using classical signal processing (MFCC features) combined with agglomerative clustering from scikit-learn, reducing dependency complexity while maintaining practical accuracy for common use cases.
TranscribeAI follows a modular architecture organized into four primary layers: audio input processing, speech recognition engine, speaker attribution, and output formatting.
The system implements a hardware-aware engine selection strategy. On Apple Silicon Macs (M1/M2/M3/M4), the system utilizes mlx-whisper, which leverages the Metal GPU framework for 2–5x faster inference compared to CPU-based processing. On all other platforms (Intel Macs, Linux, Windows), the system defaults to faster-whisper with CTranslate2 optimization.
| Platform | Engine | Acceleration |
|---|---|---|
| Apple Silicon (M1/M2/M3/M4) | mlx-whisper | Metal GPU, 2–5x speedup |
| macOS Intel | faster-whisper | CTranslate2 (CPU) |
| Linux | faster-whisper | CTranslate2 (CPU/CUDA) |
| Windows | faster-whisper | CTranslate2 (CPU) |
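In practice the selection reduces to a platform check at startup. A minimal sketch follows; the function name and exact checks are illustrative, not the project's actual code:

```python
import platform
import sys

def select_engine() -> str:
    """Pick the ASR engine for the current host.

    macOS on arm64 (Apple Silicon) gets the Metal-accelerated
    mlx-whisper backend; every other platform falls back to
    faster-whisper's CTranslate2 runtime.
    """
    if sys.platform == "darwin" and platform.machine() == "arm64":
        return "mlx-whisper"
    return "faster-whisper"
```

Centralizing the check keeps the rest of the pipeline engine-agnostic: downstream stages only see timestamped segments, regardless of which backend produced them.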
The transcription workflow follows a five-stage pipeline:
Stage 1 — Audio Upload: The system accepts multiple audio and video formats including MP3, MP4, WAV, M4A, OGG, FLAC, and WEBM. Input files are normalized and preprocessed using pydub and librosa for consistent downstream processing.
Stage 2 — Model Loading: Whisper models are loaded from local cache. On first use, models are downloaded once and stored persistently, enabling fully offline operation thereafter. Users select from five model sizes based on their speed/accuracy requirements.
Stage 3 — Whisper Transcription: The selected engine processes the audio through the Whisper model, producing timestamped text segments with word-level confidence scores. Voice activity detection (VAD) using Silero VAD removes silence segments to improve processing efficiency and output quality.
Stage 4 — Speaker Identification: For multi-speaker audio, the system extracts Mel-frequency cepstral coefficient (MFCC) features from each speech segment using librosa. These features are then clustered using agglomerative clustering (scikit-learn) to assign speaker labels (Speaker 1, Speaker 2, etc.) to each transcribed segment.
Stage 5 — Format Export: The final transcription with speaker labels and timestamps is exported in the user's chosen format: SRT (subtitles), TXT (plain text), or DOCX (formatted document with speaker annotations).
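Stage 5's SRT output uses the standard `HH:MM:SS,mmm` subtitle timestamp. The formatting logic can be sketched as follows, assuming segments arrive as `(start, end, speaker, text)` tuples (the tuple layout is an assumption, not the project's internal representation):

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render (start, end, speaker, text) tuples as numbered SRT blocks."""
    blocks = []
    for i, (start, end, speaker, text) in enumerate(segments, 1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    return "\n".join(blocks)
```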
| Model | Parameters | Download Size | Speed Profile | Recommended Use |
|---|---|---|---|---|
| tiny | 39M | ~75 MB | Fastest | Quick drafts, low-resource devices |
| base | 74M | ~145 MB | Very Fast | General-purpose, balanced |
| small | 244M | ~465 MB | Balanced | Recommended default |
| medium | 769M | ~1.5 GB | Slower | High accuracy needs |
| large-v3 | 1,550M | ~2.9 GB | Slowest | Maximum accuracy |
Speaker diarization in TranscribeAI employs a lightweight, dependency-minimal approach designed for offline operation without requiring large pretrained neural diarization models.
For each speech segment produced by the Whisper model, the system extracts 13-dimensional MFCC feature vectors using librosa. MFCCs summarize the spectral envelope of speech, which reflects speaker-dependent characteristics such as vocal-tract shape and timbre, making them a compact basis for distinguishing speakers.
The extracted MFCC features are aggregated per segment and fed into an agglomerative clustering algorithm (scikit-learn's AgglomerativeClustering). This bottom-up hierarchical approach iteratively merges the most similar segments until the target number of speakers is reached. Users can specify the expected number of speakers, or the system applies a default configuration.
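The clustering step can be sketched with scikit-learn on synthetic per-segment embeddings; in the real pipeline these vectors would come from the MFCC stage, and the well-separated centroids below are artificial:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
# Two synthetic "speakers": 13-dim per-segment embeddings around distinct centroids
speaker_a = rng.normal(loc=0.0, scale=0.1, size=(5, 13))
speaker_b = rng.normal(loc=5.0, scale=0.1, size=(5, 13))
embeddings = np.vstack([speaker_a, speaker_b])

# Bottom-up merging until the requested number of speakers remains
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
```

Each transcribed segment then inherits the cluster label of its embedding, which the export stage renders as "Speaker 1", "Speaker 2", and so on.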
This approach trades some accuracy compared to neural diarization systems (e.g., pyannote.audio) for significantly reduced complexity, zero additional model downloads, and fully offline operation.
| Component | Technology | Purpose |
|---|---|---|
| Backend | Flask (Python 3.10+) | Web server and API routing |
| ASR Engine (CPU) | faster-whisper / CTranslate2 | Optimized CPU inference |
| ASR Engine (GPU) | mlx-whisper / Metal | Apple Silicon GPU acceleration |
| Audio Processing | librosa, pydub, numpy | Feature extraction, format conversion |
| Speaker Clustering | scikit-learn | Agglomerative clustering |
| VAD | Silero VAD | Voice activity detection |
| Document Export | python-docx | DOCX generation with formatting |
| Frontend | Vanilla HTML/CSS/JS | Zero-dependency web UI |
TranscribeAI provides two interaction modes:
Web Interface: A professional dark-themed web UI accessible at http://localhost:8080, featuring drag-and-drop file upload, real-time progress visualization across all five pipeline stages, an integrated audio player, and search functionality within transcription results.
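A Flask upload endpoint behind such a UI might look like the following sketch; the `/upload` route, response shape, and job-id scheme are illustrative assumptions, not TranscribeAI's actual API:

```python
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    """Accept an audio file and return a job id the UI can poll for progress."""
    f = request.files.get("file")
    if f is None or f.filename == "":
        return jsonify(error="no file provided"), 400
    job_id = uuid.uuid4().hex
    # A real handler would save the file and kick off the five-stage pipeline
    return jsonify(job_id=job_id, filename=f.filename), 202
```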
Command-Line Interface: A CLI tool for batch processing and scripting integration:
python3 transcribe_cli.py audio.mp3 --language id --model medium --speakers 3
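The flag surface of that invocation can be reproduced with a standard argparse sketch; the defaults and help strings are assumptions, and only the flags shown above come from the source:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Argument surface mirroring the CLI invocation above."""
    p = argparse.ArgumentParser(prog="transcribe_cli.py")
    p.add_argument("audio", help="input audio/video file")
    p.add_argument("--language", default=None,
                   help="ISO language code; omit for automatic detection")
    p.add_argument("--model", default="small",
                   choices=["tiny", "base", "small", "medium", "large-v3"])
    p.add_argument("--speakers", type=int, default=None,
                   help="expected number of speakers for diarization")
    return p

args = build_parser().parse_args(
    ["audio.mp3", "--language", "id", "--model", "medium", "--speakers", "3"]
)
```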
Models are automatically downloaded on first use and cached locally in the user's home directory. A dedicated model management script allows pre-downloading models for air-gapped environments:
python3 download_models.py small medium large-v3
TranscribeAI was designed with privacy as a first-class requirement. The system guarantees:

- No audio, transcripts, or metadata ever leave the local machine; processing involves no network calls.
- No API keys, accounts, or subscriptions are required, and there are no recurring costs.
- Models are downloaded once (or pre-loaded for air-gapped hosts) and cached locally; all subsequent operation is fully offline.
- All intermediate and final outputs are stored only on local disk, under the user's control.
These properties make TranscribeAI suitable for sensitive domains including legal transcription, medical dictation, confidential business meetings, and journalistic source protection.
TranscribeAI supports all major desktop operating systems:

- macOS on Apple Silicon (M1/M2/M3/M4), with Metal GPU acceleration via mlx-whisper
- macOS on Intel, via faster-whisper (CPU)
- Linux, via faster-whisper (CPU or CUDA)
- Windows, via faster-whisper (CPU)
TranscribeAI demonstrates that production-quality speech transcription with speaker diarization can be achieved entirely on local hardware, without cloud dependencies or recurring costs. The dual-engine architecture ensures optimal performance across hardware platforms, while the MFCC-based diarization approach provides practical speaker identification without heavy neural model dependencies.
Future directions include:
The complete source code is available at https://github.com/romizone/transcribeAI under the MIT license.