Open Chatbot: A Privacy-First AI Document Processing Chatbot with Zero File Leakage

Abstract

We present Open Chatbot, a privacy-first AI chatbot designed to process sensitive documents without transmitting original files to cloud AI providers. Built on a Privacy by Design architecture, the system extracts text locally on the user's server using specialized processing engines (pdftotext, Mammoth, SheetJS, Tesseract OCR) and transmits only plain-text JSON payloads, capped at 30KB per file, to AI inference endpoints. This three-stage pipeline (local extraction, text-only transmission, immediate cleanup) ensures that no binary files, document metadata, or structural information leaves the user's infrastructure. The system supports multiple AI providers (DeepSeek, OpenAI, Anthropic), handles diverse file formats including PDF, DOCX, XLSX, images with OCR, and 20+ source code formats, and features streaming responses, LaTeX rendering, syntax highlighting, and multi-session chat history. We describe the architecture, security model, and design decisions that enable privacy-preserving AI document interaction suitable for processing confidential financial, legal, and medical documents.

1. Introduction

The rapid adoption of AI-powered chatbots such as ChatGPT, Claude, and Gemini has transformed how organizations interact with documents. Users routinely upload financial reports, legal contracts, medical records, and proprietary research to these services for summarization, analysis, and question-answering. However, this convenience creates a fundamental tension: the original files—often containing highly sensitive information—are transmitted in full to third-party servers, where they may be stored, logged with metadata, or potentially used for model training.

Consider the standard upload flow: when a user attaches a 5MB PDF to ChatGPT, the entire binary file is sent to OpenAI's servers. The document structure, embedded fonts, metadata (author, creation date, revision history), and complete content are all transmitted and stored alongside the user's account information. For organizations bound by data protection regulations (GDPR, HIPAA, banking secrecy laws), this presents unacceptable compliance risks.

Open Chatbot addresses this problem through a fundamental architectural principle: files never leave the user's infrastructure. Instead of transmitting documents to AI providers, the system processes files locally, extracts only plain text content, and sends compact text-only payloads to AI inference endpoints. The key contributions of this work are the three-stage privacy pipeline (Section 3.1), the multi-provider AI architecture (Section 3.2), and client-side API key management (Section 4.1).

2. Related Work

The challenge of privacy-preserving AI interaction has been approached from several directions. Retrieval-Augmented Generation (RAG) systems (Lewis et al., 2020) store document embeddings in vector databases and retrieve relevant chunks during inference, but still require document content to be processed and stored in some form. Fully local LLM solutions (e.g., llama.cpp, Ollama) eliminate cloud dependency entirely but sacrifice the quality advantages of large-scale commercial models.

Existing privacy-focused chatbot solutions typically fall into two categories: fully local systems that run smaller models on-device, or enterprise API platforms with contractual data protection guarantees. Open Chatbot occupies a middle ground: it leverages the capabilities of large commercial models (GPT-4o, Claude Sonnet 4.5) while ensuring that only extracted text, not original files, reaches these services. This approach retains state-of-the-art AI quality while limiting what leaves the user's infrastructure to extracted text.

The Vercel AI SDK (Vercel, 2024) provides the streaming infrastructure that enables real-time response delivery. Combined with specialized extraction tools (pdftotext, Mammoth, SheetJS), the system achieves comprehensive format coverage without relying on cloud-based document processing services.

3. System Architecture

Open Chatbot follows a modular, three-stage architecture designed around the principle that document content should be minimally exposed during AI interaction. The system comprises a local document processing engine, a text-only API gateway, and a reactive client interface.

3.1 Three-Stage Privacy Pipeline

The core privacy guarantee is implemented through a strict three-stage pipeline:

Stage 1 (Local Processing): Files uploaded by the user are received by the Next.js API route (/api/upload/route.ts) and processed entirely on the local server. Each file type is handled by a specialized extraction engine: pdftotext for PDF documents (with a Tesseract OCR fallback for scanned pages), Mammoth and word-extractor for DOCX/DOC files, SheetJS for spreadsheets (XLSX/XLS/CSV), and Tesseract OCR for scanned images (PNG, JPG, BMP, TIFF, WebP). The extraction output is plain text, with no binary data, embedded objects, or metadata preserved.

Stage 2 (Text-Only Transmission): The extracted text is formatted as a standard JSON payload conforming to the AI provider's chat API format. Each file's content is capped at 30KB before transmission, which bounds the data exposure even for large documents. The API request contains only a message of the form {"role": "user", "content": "extracted text..."}, with no file attachments, no binary encodings, and no metadata.

Stage 3 (Immediate Cleanup): After text extraction completes, references to the original file buffer are dropped in JavaScript finally blocks, so the memory becomes eligible for garbage collection whether extraction succeeds or fails. No temporary files are written to disk, and no uploaded documents are persisted on the server. The file exists only transiently in memory during extraction.
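The three stages can be illustrated with a minimal sketch. It is not the project's actual code: the names capText, processUpload, UploadedFile, and MAX_BYTES_PER_FILE are hypothetical, the extractor is an injected stand-in for the format-specific engines, and the cap is assumed to mean 30 * 1024 bytes of UTF-8 (the paper does not state whether the limit counts bytes or characters).

```typescript
// Hypothetical names throughout; the cap is assumed to be 30 * 1024 UTF-8 bytes.
const MAX_BYTES_PER_FILE = 30 * 1024;

interface ChatMessage {
  role: "user";
  content: string;
}

// Stand-in for a format-specific engine (pdftotext, Mammoth, SheetJS, ...).
type Extractor = (data: Uint8Array) => string;

// Holder whose `data` field is the only reference to the uploaded bytes.
interface UploadedFile {
  name: string;
  data: Uint8Array | null;
}

/** Stage 2 helper: trim text to at most maxBytes of UTF-8 without splitting a character. */
function capText(text: string, maxBytes: number = MAX_BYTES_PER_FILE): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // If the first dropped byte is a continuation byte (10xxxxxx), the cut falls
  // inside a multi-byte character, so back up to that character's first byte.
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

/** Runs the three stages for one file and returns the text-only message. */
function processUpload(file: UploadedFile, extract: Extractor): ChatMessage {
  try {
    if (file.data === null) throw new Error(`No data for ${file.name}`);
    const text = extract(file.data); // Stage 1: local extraction
    return { role: "user", content: capText(text) }; // Stage 2: capped, text-only payload
  } finally {
    file.data = null; // Stage 3: drop the buffer reference on success or failure
  }
}
```

Placing the cleanup in finally means the buffer reference is dropped even when an extractor throws, which is the property Stage 3 depends on.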

3.2 Multi-Provider AI Architecture

The system supports three AI provider families through a unified streaming interface built on Vercel AI SDK 6's streamText function:

Table 1: Supported AI providers and models
Provider	Models	Max Tokens	Characteristics
DeepSeek	Chat, Reasoner	8K–16K	Cost-effective, strong reasoning
OpenAI	GPT-4o, GPT-4o Mini, GPT-4.1	16K–32K	Highest versatility, broad capabilities
Anthropic	Claude Sonnet 4.5, Claude Haiku 4.5	8K–16K	Strong analysis, safety-focused

Each provider requires its own API key, which is stored in the browser's localStorage and never persisted on the application server. The chat API route (/api/chat/route.ts) dynamically selects the appropriate SDK provider based on the user's configuration, so users can switch between models without code changes.
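Dynamic provider selection of this kind can be sketched as a lookup in a factory registry. The sketch below is an assumption about the shape of such code, not the contents of /api/chat/route.ts: ChatConfig, ModelFactory, and selectModel are hypothetical names, and the factories are left generic so the example runs without the SDK. In a real route each factory would construct the corresponding Vercel AI SDK provider model and the result would be passed to streamText.

```typescript
// Provider families from Table 1. The factory-registry shape is an assumption.
type ProviderId = "deepseek" | "openai" | "anthropic";

interface ChatConfig {
  provider: ProviderId;
  model: string;
  apiKey: string;
}

// Builds whatever model handle the streaming layer expects for one provider.
type ModelFactory<T> = (apiKey: string, model: string) => T;

function selectModel<T>(
  config: ChatConfig,
  factories: Record<ProviderId, ModelFactory<T>>,
): T {
  if (config.apiKey.trim() === "") {
    throw new Error(`Missing API key for provider "${config.provider}"`);
  }
  // Request bodies are untrusted at runtime, so check for an own property
  // rather than trusting the static type of `config.provider`.
  if (!Object.prototype.hasOwnProperty.call(factories, config.provider)) {
    throw new Error(`Unsupported provider: ${String(config.provider)}`);
  }
  return factories[config.provider](config.apiKey, config.model);
}
```

Keeping the registry separate from the selection logic means adding a provider only requires registering one more factory.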

3.3 Document Processing Engine

The file processing subsystem (lib/file-processor.ts) implements format-specific extraction strategies, summarized in Table 2.
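Format-specific dispatch can be sketched as a lookup from file extension to extraction engine. The mapping below copies the extensions and engines listed in Table 2. The identifiers (Engine, ENGINE_BY_EXTENSION, pickEngine) are hypothetical, and only the text and code extensions that Table 2 names explicitly are included.

```typescript
// Engine names mirror Table 2; the identifiers themselves are hypothetical.
type Engine =
  | "pdftotext"
  | "mammoth"
  | "word-extractor"
  | "sheetjs"
  | "tesseract"
  | "utf8";

const ENGINE_BY_EXTENSION = new Map<string, Engine>([
  [".pdf", "pdftotext"], // Table 2 also lists a Tesseract OCR fallback for scanned pages
  [".docx", "mammoth"],
  [".doc", "word-extractor"],
  [".xlsx", "sheetjs"],
  [".xls", "sheetjs"],
  [".csv", "sheetjs"],
  [".png", "tesseract"],
  [".jpg", "tesseract"],
  [".bmp", "tesseract"],
  [".tiff", "tesseract"],
  [".webp", "tesseract"],
  [".txt", "utf8"],
  [".md", "utf8"],
  [".json", "utf8"],
  [".xml", "utf8"],
  [".html", "utf8"],
  [".py", "utf8"],
  [".js", "utf8"],
  [".ts", "utf8"],
]);

/** Returns the engine for a filename, or undefined for unsupported formats. */
function pickEngine(filename: string): Engine | undefined {
  const dot = filename.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".env"
  return ENGINE_BY_EXTENSION.get(filename.slice(dot).toLowerCase());
}
```

Returning undefined for unknown extensions lets the caller reject unsupported uploads before any extraction work begins.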

4. Security Model

Open Chatbot's security model is built on the principle of minimal data exposure. Table 3 summarizes the security guarantees through a comparison with traditional AI document upload workflows.

4.1 Client-Side Key Management

API keys for each provider are stored exclusively in the browser's localStorage and are sent only as part of authenticated API requests to the respective AI provider. Because the application server keeps no copy of users' API credentials, no centralized key database exists as an attack surface.
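A minimal sketch of per-provider key storage follows. It is illustrative only: the storage key format and the function names are hypothetical, and the store is passed in as a parameter so the code can run outside a browser. In the browser, window.localStorage satisfies the KeyValueStore interface as written.

```typescript
type ProviderId = "deepseek" | "openai" | "anthropic";

// The subset of the Web Storage API this sketch relies on.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

// Hypothetical naming scheme: one storage entry per provider.
function storageKey(provider: ProviderId): string {
  return `open-chatbot:api-key:${provider}`;
}

function saveApiKey(store: KeyValueStore, provider: ProviderId, apiKey: string): void {
  store.setItem(storageKey(provider), apiKey.trim());
}

function loadApiKey(store: KeyValueStore, provider: ProviderId): string | null {
  return store.getItem(storageKey(provider));
}

function clearApiKey(store: KeyValueStore, provider: ProviderId): void {
  store.removeItem(storageKey(provider));
}
```

In a browser the call would look like saveApiKey(window.localStorage, "openai", key). Keeping one entry per provider lets a user remove a single credential without affecting the others.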

4.2 Stateless Server Architecture

The Next.js API routes are stateless. No session data, uploaded files, or chat histories are persisted on the server. Each request is processed independently, and all file buffers are released immediately after extraction. Because the server retains no record of previous interactions, it accumulates no data that could later be exposed.

5. Implementation Details

5.1 Technology Stack

Table 4 lists the core technology stack.

5.2 User Interface

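The chat interface is designed as a responsive single-page application with several advanced features, including streaming responses, LaTeX rendering, syntax highlighting, and multi-session chat history.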

5.3 Streaming Architecture

Table 2: Supported file formats and extraction methods
Format	Extensions	Extraction Engine	Notes
PDF	.pdf	pdftotext + Tesseract OCR	Text + OCR fallback for scanned pages (max 20 pages)
Word	.docx, .doc	Mammoth, word-extractor	DOCX via Mammoth, legacy DOC via word-extractor
Spreadsheet	.xlsx, .xls, .csv	SheetJS (xlsx)	Converted to structured text tables
Images	.png, .jpg, .bmp, .tiff, .webp	Tesseract OCR	Optical character recognition
Text & Code	.txt, .md, .json, .xml, .html, .py, .js, .ts, etc.	Direct UTF-8 read	20+ source code formats supported

Table 3: Security comparison — traditional upload vs. Open Chatbot
Aspect	Traditional AI Upload	Open Chatbot
Original file sent to cloud	Yes (full binary)	No (never)
Document metadata transmitted	Yes (author, dates, revisions)	No (stripped during extraction)
Binary data exposure	Complete file (MBs)	Zero (text only)
Data size to cloud	Full file size	Max 30KB per file
Used for model training	Possibly (varies by provider)	Typically no (subject to each provider's API terms)
Data retention control	Minimal (provider-dependent)	Full for original files (user-controlled)
Regulatory compliance	Complex (DPA required)	Simplified (no file transfer; extracted text remains subject to applicable rules)

Table 4: Core technology stack
Layer	Technology	Purpose
Framework	Next.js 16 (App Router, Turbopack)	Full-stack React framework with API routes
UI Library	React 19, Tailwind CSS 4, Radix UI	Responsive, accessible component library
AI Integration	Vercel AI SDK 6 (`streamText`)	Unified streaming interface for multi-provider
PDF Processing	pdftotext, Tesseract	Text extraction and OCR for scanned documents
Word Processing	Mammoth, word-extractor	DOCX/DOC text extraction
Spreadsheet Processing	SheetJS (xlsx)	Excel/CSV to structured text
Image OCR	Tesseract.js	Optical character recognition for images
Math Rendering	KaTeX, remark-math, rehype-katex	LaTeX formula rendering in responses
Code Display	react-syntax-highlighter (Prism)	Syntax-highlighted code blocks
Language	TypeScript 5 (strict mode)	Type-safe development

Response delivery uses Vercel AI SDK 6's streamText function, which establishes a Server-Sent Events (SSE) connection between the client and the Next.js API route. The API route, in turn, streams responses from the selected AI provider. This double-streaming architecture ensures that users see response tokens as they are generated, providing a responsive experience even for long AI outputs.
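The relay step can be illustrated with generic SSE framing. This is a sketch under stated assumptions rather than the SDK's wire protocol, which may differ: toSseFrame and SseTokenParser are hypothetical names, and each token is assumed to travel as a JSON-encoded string in a single data: field.

```typescript
/** Server side: wrap one provider token as an SSE frame. */
function toSseFrame(token: string): string {
  // JSON-encoding escapes newlines, so a token can never contain the blank
  // line that terminates a frame.
  return `data: ${JSON.stringify(token)}\n\n`;
}

/** Client side: reassemble tokens from network chunks that may split frames. */
class SseTokenParser {
  private buffer = "";

  /** Accepts the next chunk and returns every token completed so far. */
  push(chunk: string): string[] {
    this.buffer += chunk;
    const tokens: string[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const frame = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      if (frame.startsWith("data: ")) {
        tokens.push(JSON.parse(frame.slice("data: ".length)) as string);
      }
      boundary = this.buffer.indexOf("\n\n");
    }
    return tokens;
  }
}
```

Because the parser buffers incomplete frames, the client can render tokens as soon as each frame is complete, regardless of how the network splits the stream.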

6. Deployment Options

Open Chatbot supports three deployment configurations, each with different trade-offs between convenience and feature completeness:

6.1 Self-Hosted (Recommended)

Self-hosting is the recommended deployment for maximum privacy and full feature support, including OCR capabilities.

This configuration supports all file formats including OCR for scanned PDFs and images, as Tesseract and poppler-utils can be installed on the host system.

6.2 Docker

A Dockerfile is provided with all OCR dependencies (Tesseract, Poppler) pre-installed.

6.3 Vercel (Serverless)

Vercel allows quick deployment without infrastructure management. Note that OCR features are limited in serverless environments due to binary dependency constraints.

7. Project Structure

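The codebase follows Next.js 16 App Router conventions with a clear separation between API routes, UI components, state management, and processing logic. For example, the upload and chat endpoints live in /api/upload/route.ts and /api/chat/route.ts, and document extraction is implemented in lib/file-processor.ts.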

8. Conclusion and Future Work

Open Chatbot demonstrates that it is possible to leverage state-of-the-art commercial AI models for document analysis while maintaining strict privacy guarantees over sensitive files. The three-stage privacy pipeline—local extraction, text-only transmission, immediate cleanup—provides a practical and auditable approach to the fundamental tension between AI capability and data confidentiality.

The system's multi-provider architecture ensures that users are not locked into a single AI vendor, while client-side key management eliminates centralized credential exposure. The comprehensive file format support, combined with streaming responses, LaTeX rendering, and syntax highlighting, delivers a production-quality user experience comparable to commercial AI chatbots.

References

Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33.
Vercel (2024). Vercel AI SDK: Build AI-powered applications with React and Next.js. GitHub Repository. https://github.com/vercel/ai
OpenAI (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
Anthropic (2024). The Claude Model Family. Anthropic Technical Documentation. https://docs.anthropic.com
DeepSeek AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.
Smith, R. (2007). An Overview of the Tesseract OCR Engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition, pp. 629-633.
Williamson, M. (2023). Mammoth: Convert Word documents (.docx) to HTML and Markdown. GitHub Repository. https://github.com/mwilliamson/mammoth.js
SheetJS Team (2024). SheetJS: Spreadsheet Data Toolkit. GitHub Repository. https://github.com/SheetJS/sheetjs
European Parliament (2016). General Data Protection Regulation (GDPR). Regulation (EU) 2016/679.
U.S. Department of Health and Human Services (1996). Health Insurance Portability and Accountability Act (HIPAA). Public Law 104-191.