Transkrip: A Privacy-First Cross-Platform Desktop Application for Local AI-Powered Audio and Video Transcription

Abstract

We present Transkrip, an open-source cross-platform desktop application for local, privacy-first transcription of audio and video files. Transkrip combines a modern Electron-based shell with a React 19 / TypeScript 6 renderer and integrates whisper.cpp as its on-device automatic speech recognition (ASR) engine. By performing every stage of the pipeline — file ingestion, format conversion, speech recognition, and document export — entirely on the user's own machine, Transkrip eliminates the privacy, cost, and connectivity concerns that typically accompany cloud-based transcription services. The application supports common audio and video containers (MP3, WAV, M4A, MP4, MOV), provides a persistent local history backed by SQLite via better-sqlite3, and exports results to .txt, .docx, and .pdf. We describe the motivation, system architecture, technology choices, and practical deployment considerations that enable Transkrip to deliver a native desktop experience on macOS, Windows, and Linux while retaining a strict no-cloud guarantee.

1. Introduction

Modern automatic speech recognition systems, driven by large transformer-based models, have achieved near human-level accuracy across dozens of languages. Yet the vast majority of production transcription workflows remain tied to cloud APIs, requiring users to upload potentially sensitive audio to remote servers in exchange for convenience. For journalists protecting sources, researchers handling interview recordings, medical or legal professionals working with confidential material, and enterprises with data-residency obligations, cloud-based transcription is often unacceptable.

Transkrip addresses this gap with a polished, cross-platform desktop application that performs every stage of transcription locally. Built on Electron 41 with a React 19 and Tailwind CSS 4 interface, Transkrip wraps the highly optimized whisper.cpp engine into a friendly user experience that requires no cloud accounts, no API keys, and — after an initial model download — no network connectivity at all.

2. Related Work

OpenAI's Whisper (Radford et al., 2022) established a robust multilingual ASR baseline by training on 680,000 hours of weakly supervised audio. Several derivative projects have optimized Whisper for local inference: faster-whisper (Guillaumie, 2023) reimplements the model on top of CTranslate2 for efficient CPU and GPU execution, mlx-whisper (Apple MLX Team, 2024) exploits Apple Silicon GPUs via the MLX framework, and whisper.cpp (Gerganov, 2022) provides a dependency-light C/C++ port that runs efficiently across desktop platforms using integer quantization and SIMD acceleration.

Prior offline transcription tools either target power users through command-line interfaces or ship as heavy Python applications with complex dependency chains. Transkrip differs by providing a first-class native desktop UX on top of whisper.cpp, coupled with a renderer built on modern web technologies. It trades the Python ecosystem's flexibility for a smaller footprint, faster startup, and easier installation for end users.

3. System Architecture

Transkrip adopts the standard Electron two-process model. The renderer process hosts the React application and handles all user interactions, while the main process manages file system access, spawns the ASR subprocess, and owns the SQLite database. A preload script exposes a narrow, typed IPC surface between the two, following the principle of least privilege.
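To make the idea of a narrow, typed IPC surface concrete, the following is a minimal sketch rather than Transkrip's actual preload code. The channel names (`transcribe:start`, `history:list`) and payload shapes are hypothetical. The `invoke` function is injected so that the sketch runs without Electron; in a real preload script it would be Electron's `ipcRenderer.invoke`, and the returned object would be passed to `contextBridge.exposeInMainWorld`.

```typescript
// Hypothetical payload shapes; the paper does not specify them.
interface TranscribeRequest { filePath: string; model: string }
interface SessionSummary { id: number; sourceFile: string; language: string }

// Every channel the renderer may call, with its argument and result types.
interface IpcContract {
  "transcribe:start": { args: TranscribeRequest; result: SessionSummary };
  "history:list": { args: void; result: SessionSummary[] };
}

type IpcChannel = keyof IpcContract;

// Same shape as Electron's ipcRenderer.invoke, injected to keep the sketch testable.
type InvokeFn = (channel: string, ...args: unknown[]) => Promise<unknown>;

// Builds the small API object that a preload script could expose to the
// renderer. The renderer only sees these two functions, never raw IPC.
function createApi(invoke: InvokeFn) {
  const call = <C extends IpcChannel>(
    channel: C,
    args: IpcContract[C]["args"],
  ): Promise<IpcContract[C]["result"]> =>
    invoke(channel, args) as Promise<IpcContract[C]["result"]>;

  return {
    startTranscription: (req: TranscribeRequest) => call("transcribe:start", req),
    listHistory: () => call("history:list", undefined),
  };
}
```

Because the contract is a single interface, adding a channel without declaring its argument and result types becomes a compile-time error, which is one way to keep the exposed surface small.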

3.1 Process Separation

Table 1: Responsibilities of the Electron main and renderer processes
Process	Responsibilities	Key Modules
Renderer (React 19 / TS)	UI, drag-and-drop, settings, history view, progress updates	`UploadZone`, `Settings`, `HistoryList`
Preload	Typed IPC bridge, sandboxed API exposure	`preload.ts`
Main (Electron 41)	File I/O, subprocess management, DB, lifecycle	`main.ts`, `whisper.ts`, `database.ts`

3.2 Processing Pipeline

Stage 1 — Ingestion: Users drop audio or video files into the upload zone. Supported containers include MP3, WAV, M4A, MP4, and MOV. The renderer forwards file paths through IPC to the main process.
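A first-pass filter on dropped files could look like the sketch below. It only checks the file extension against the containers listed above; the function names are illustrative, and the real decision about whether a file is decodable would still rest with ffmpeg in the next stage.

```typescript
// Containers named in the paper, matched case-insensitively by extension.
const SUPPORTED_EXTENSIONS = new Set(["mp3", "wav", "m4a", "mp4", "mov"]);

// Returns true when the path ends in one of the supported extensions.
function isSupportedMedia(filePath: string): boolean {
  // Take the last path component, accepting both "/" and "\" separators.
  const baseName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = baseName.lastIndexOf(".");
  if (dot <= 0) return false; // no extension, or a dotfile such as ".mp3"
  return SUPPORTED_EXTENSIONS.has(baseName.slice(dot + 1).toLowerCase());
}

// Splits a batch of dropped paths into those to forward over IPC and those
// to report back to the user as unsupported.
function partitionDropped(paths: string[]): { accepted: string[]; rejected: string[] } {
  const accepted: string[] = [];
  const rejected: string[] = [];
  for (const p of paths) {
    (isSupportedMedia(p) ? accepted : rejected).push(p);
  }
  return { accepted, rejected };
}
```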

Stage 2 — Normalization: The main process invokes ffmpeg as a subprocess to decode the input and resample it to 16 kHz mono PCM, the canonical format expected by whisper.cpp.
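As an illustration of this step, the sketch below builds an ffmpeg argument list that produces 16 kHz, mono, 16-bit PCM in a WAV container and runs it as a child process. The flags used (`-i`, `-vn`, `-ac`, `-ar`, `-c:a pcm_s16le`, `-y`) are standard ffmpeg options, but the function names, the default binary name, and the choice of a WAV output file are assumptions, not details taken from Transkrip's source.

```typescript
import { spawn } from "node:child_process";

// Builds the ffmpeg arguments for converting any supported input into
// 16 kHz mono 16-bit PCM.
function buildFfmpegArgs(inputPath: string, outputPath: string): string[] {
  return [
    "-y",                // overwrite the output file if it already exists
    "-i", inputPath,     // input container (MP3, WAV, M4A, MP4, MOV, ...)
    "-vn",               // ignore any video stream
    "-ac", "1",          // downmix to a single (mono) channel
    "-ar", "16000",      // resample to 16 kHz
    "-c:a", "pcm_s16le", // signed 16-bit little-endian PCM
    outputPath,
  ];
}

// Runs ffmpeg and resolves when it exits successfully. Arguments are passed
// as an array, so file names are never interpreted by a shell.
function normalizeAudio(
  inputPath: string,
  outputPath: string,
  ffmpegBin: string = "ffmpeg", // assumed to be resolvable on PATH
): Promise<void> {
  return new Promise((resolve, reject) => {
    const child = spawn(ffmpegBin, buildFfmpegArgs(inputPath, outputPath), {
      stdio: "ignore",
    });
    child.on("error", reject); // e.g. the binary was not found
    child.on("close", (code) => {
      if (code === 0) resolve();
      else reject(new Error(`ffmpeg exited with code ${code}`));
    });
  });
}
```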

Stage 3 — Inference: A whisper.cpp child process runs the selected model against the normalized audio, streaming progress updates back over stdout. The renderer receives these events via IPC and updates the progress UI in real time.
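One way to turn the child process's output stream into progress events is sketched below. The line format it matches (`progress = NN%`) is an assumption about what the whisper.cpp command-line tool prints when progress reporting is enabled, and should be checked against the output of the version actually shipped. The parser buffers partial lines because stream chunks do not necessarily end on line boundaries.

```typescript
// Assumed progress line format, e.g. "... progress =  42%".
const PROGRESS_PATTERN = /progress\s*=\s*(\d{1,3})%/;

// Returns a function that accepts raw text chunks from the child process
// and calls onProgress once for every complete progress line seen.
function createProgressParser(onProgress: (percent: number) => void) {
  let pending = ""; // trailing partial line carried over between chunks

  return (chunk: string): void => {
    pending += chunk;
    const lines = pending.split(/\r?\n/);
    pending = lines.pop() ?? ""; // last element is incomplete until a newline arrives

    for (const line of lines) {
      const match = PROGRESS_PATTERN.exec(line);
      if (match && match[1] !== undefined) {
        onProgress(Math.min(100, Number(match[1])));
      }
    }
  };
}
```

In the main process, the returned function would be attached to the child's output stream (for example via a `data` listener) and each percentage forwarded to the renderer over IPC.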

Stage 4 — Segment Assembly: Timestamped segments produced by Whisper are concatenated into a single document. Metadata (language, model size, duration, source filename) is captured alongside the transcript.
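The assembly step can be sketched as a pure function over the segment list. The field names below are illustrative, and deriving the duration from the end of the last segment is a simplifying assumption; an implementation could equally read the duration from the audio file itself.

```typescript
interface TranscriptSegment { startMs: number; endMs: number; text: string }

interface TranscriptMetadata {
  language: string;
  modelSize: string;
  durationMs: number;
  sourceFile: string;
}

interface TranscriptDocument {
  text: string;
  segments: TranscriptSegment[];
  metadata: TranscriptMetadata;
}

// Orders segments by start time, joins their text into one document, and
// attaches the metadata captured alongside the transcript.
function assembleTranscript(
  segments: TranscriptSegment[],
  info: { language: string; modelSize: string; sourceFile: string },
): TranscriptDocument {
  const ordered = [...segments].sort((a, b) => a.startMs - b.startMs);
  const text = ordered
    .map((s) => s.text.trim())
    .filter((t) => t.length > 0) // drop empty or whitespace-only segments
    .join(" ");
  // Assumption: the transcript's duration is the end of the latest segment.
  const durationMs = ordered.reduce((max, s) => Math.max(max, s.endMs), 0);
  return { text, segments: ordered, metadata: { ...info, durationMs } };
}
```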

Stage 5 — Persistence and Export: The completed session is persisted to SQLite via better-sqlite3. Users can then export to .txt (plain text), .docx (via the docx library), or .pdf (via jspdf), all generated locally without any server round-trip.
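For the plain-text export, a minimal sketch is shown below. Whether Transkrip's .txt output includes a header line or per-segment timestamps is not stated in this paper; both are assumptions made here to show how the timestamped segments could be rendered without any server involvement.

```typescript
interface TimedSegment { startMs: number; endMs: number; text: string }

// Formats a millisecond offset as HH:MM:SS.mmm; negative values clamp to zero.
function formatTimestamp(ms: number): string {
  const total = Math.max(0, Math.floor(ms));
  const hours = Math.floor(total / 3600000);
  const minutes = Math.floor((total % 3600000) / 60000);
  const seconds = Math.floor((total % 60000) / 1000);
  const millis = total % 1000;
  const pad = (n: number, width: number) => String(n).padStart(width, "0");
  return `${pad(hours, 2)}:${pad(minutes, 2)}:${pad(seconds, 2)}.${pad(millis, 3)}`;
}

// Renders a transcript as plain text: a header line, a blank line, then one
// line per segment prefixed with its time range.
function renderTxt(sourceFile: string, segments: TimedSegment[]): string {
  const lines = segments.map(
    (s) => `[${formatTimestamp(s.startMs)} - ${formatTimestamp(s.endMs)}] ${s.text.trim()}`,
  );
  return [`Transcript of ${sourceFile}`, "", ...lines].join("\n") + "\n";
}
```

The resulting string can be written to disk by the main process or handed to the renderer for download.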

4. Technology Stack

Table 2: Core technology stack of Transkrip
Layer	Technology	Role
Desktop shell	Electron 41	Cross-platform native window, process model, packaging
Renderer framework	React 19, TypeScript 6	Component model, type safety
Build tooling	Vite 8	Dev server, bundling, HMR
Styling	Tailwind CSS 4	Utility-first responsive UI
ASR engine	whisper.cpp	On-device Whisper inference
Audio normalization	ffmpeg	Decoding and 16 kHz PCM conversion
Persistence	better-sqlite3	Local history database
Document export	docx, jspdf, file-saver	TXT / DOCX / PDF generation
Packaging	electron-builder	Installers for macOS, Windows, Linux

5. Installation and Distribution

Transkrip is distributed through GitHub Releases as pre-built installers for macOS, Windows, and Linux. System dependencies (ffmpeg and whisper.cpp) are installed via the platform's standard package manager. On macOS, for example, both are available as Homebrew formulae.

Once the dependencies are present, installing Transkrip is a single-click operation. On first launch, the application verifies the presence of the required binaries and guides the user through any missing steps.
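A first-launch dependency check of this kind could be sketched as a search of the directories on `PATH`. This is an illustration under stated assumptions, not Transkrip's actual check: the binary name used for whisper.cpp in the usage note (`whisper-cli`) varies between versions and package managers, and the executable test is injectable so the logic can be exercised without touching the file system.

```typescript
import { accessSync, constants } from "node:fs";
import { delimiter, join } from "node:path";

type ExecutableCheck = (candidate: string) => boolean;

// Default check: the file exists and is executable by the current user.
const isExecutable: ExecutableCheck = (candidate) => {
  try {
    accessSync(candidate, constants.X_OK);
    return true;
  } catch {
    return false;
  }
};

// Returns the first matching path for `binary` on PATH, or null if absent.
function findOnPath(
  binary: string,
  pathEnv: string = process.env.PATH ?? "",
  check: ExecutableCheck = isExecutable,
  extensions: string[] = process.platform === "win32" ? [".exe", ""] : [""],
): string | null {
  for (const dir of pathEnv.split(delimiter)) {
    if (dir === "") continue; // skip empty PATH entries
    for (const ext of extensions) {
      const candidate = join(dir, binary + ext);
      if (check(candidate)) return candidate;
    }
  }
  return null;
}

// Lists the required binaries that could not be located.
function missingDependencies(
  binaries: string[],
  pathEnv?: string,
  check?: ExecutableCheck,
  extensions?: string[],
): string[] {
  return binaries.filter((b) => findOnPath(b, pathEnv, check, extensions) === null);
}
```

Calling `missingDependencies(["ffmpeg", "whisper-cli"])` at startup would return the names the setup guide needs to cover; an empty array means both were found.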

6. Privacy and Security Considerations

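Transkrip was designed from the ground up with privacy as a non-negotiable property. Specifically, the pipeline described above implies the following guarantees: audio, transcripts, and metadata never leave the user's machine, because decoding, inference, history storage, and export all run locally; no cloud account or API key is required; and after the initial model download, no network connectivity is needed.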

These properties make Transkrip appropriate for legal transcription, medical dictation, confidential interviews, and any scenario where audio content is subject to confidentiality or regulatory constraints.

7. Platform Compatibility

8. Use Cases

9. Conclusion and Future Work

Transkrip demonstrates that privacy-first, production-quality transcription can be packaged as a polished desktop application accessible to non-technical users. By combining Electron, React, TypeScript, and whisper.cpp, the project delivers a zero-cloud workflow without sacrificing the ergonomics users expect from modern software.

References

Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356.
Gerganov, G. (2022). whisper.cpp: Port of OpenAI's Whisper model in C/C++. GitHub Repository. https://github.com/ggerganov/whisper.cpp
Guillaumie, G. (2023). faster-whisper: Faster Whisper transcription with CTranslate2. GitHub Repository. https://github.com/SYSTRAN/faster-whisper
Apple MLX Team (2024). mlx-whisper: Whisper inference on Apple Silicon using MLX. GitHub Repository. https://github.com/ml-explore/mlx-examples
Electron Maintainers (2024). Electron: Build cross-platform desktop apps with JavaScript, HTML, and CSS. GitHub Repository. https://github.com/electron/electron
FFmpeg Developers (2024). FFmpeg: A complete, cross-platform solution to record, convert and stream audio and video. https://ffmpeg.org
Meta Open Source (2024). React 19. React Documentation. https://react.dev