Open Chatbot: A Privacy-First AI Document Processing Chatbot with Zero File Leakage

Romi Nur Ismanto
Independent AI Research Lab, Jakarta, Indonesia
rominur@gmail.com
February 2026

Abstract

We present Open Chatbot, a privacy-first AI chatbot designed to process sensitive documents without transmitting the original files to cloud AI providers. Built on a Privacy by Design architecture, the system extracts text locally on the user's server using specialized processing engines (pdftotext, Mammoth, SheetJS, Tesseract OCR) and transmits only plain-text JSON chunks, capped at 30KB per file, to AI inference endpoints. This three-stage pipeline (local extraction, text-only transmission, immediate cleanup) ensures that no binary files, document metadata, or structural information leaves the user's infrastructure. The system supports multiple AI providers (DeepSeek, OpenAI, Anthropic), handles diverse file formats including PDF, DOCX, XLSX, images with OCR, and 20+ source code formats, and features streaming responses, LaTeX rendering, syntax highlighting, and multi-session chat history. We describe the architecture, security model, and design decisions that enable privacy-preserving AI document interaction suitable for processing confidential financial, legal, and medical documents.

Keywords: privacy-preserving AI, document processing, zero file leakage, chatbot, local extraction, multi-provider LLM, Next.js, Vercel AI SDK, OCR, confidential data

1. Introduction

The rapid adoption of AI-powered chatbots such as ChatGPT, Claude, and Gemini has transformed how organizations interact with documents. Users routinely upload financial reports, legal contracts, medical records, and proprietary research to these services for summarization, analysis, and question-answering. However, this convenience creates a fundamental tension: the original files—often containing highly sensitive information—are transmitted in full to third-party servers, where they may be stored, logged with metadata, or potentially used for model training.

Consider the standard upload flow: when a user attaches a 5MB PDF to ChatGPT, the entire binary file is sent to OpenAI's servers. The document structure, embedded fonts, metadata (author, creation date, revision history), and complete content are all transmitted and stored alongside the user's account information. For organizations bound by data protection regulations (GDPR, HIPAA, banking secrecy laws), this presents unacceptable compliance risks.

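Open Chatbot addresses this problem through a fundamental architectural principle: files never leave your infrastructure. Instead of transmitting documents to AI providers, the system processes files locally, extracts only plain text content, and sends compact text-only payloads to AI inference endpoints. The key contributions of this work are the three-stage privacy pipeline (Section 3.1), the multi-provider AI architecture (Section 3.2), the format-specific document processing engine (Section 3.3), and a security model based on client-side key management and a stateless server (Section 4).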

2. Related Work

The challenge of privacy-preserving AI interaction has been approached from several directions. Retrieval-Augmented Generation (RAG) systems (Lewis et al., 2020) store document embeddings in vector databases and retrieve relevant chunks during inference, but still require document content to be processed and stored in some form. Fully local LLM solutions (e.g., llama.cpp, Ollama) eliminate cloud dependency entirely but sacrifice the quality advantages of large-scale commercial models.

Existing privacy-focused chatbot solutions typically fall into two categories: fully local systems that run smaller models on-device, or enterprise API platforms with contractual data protection guarantees. Open Chatbot takes a middle path: it uses large commercial models (GPT-4o, Claude Sonnet 4.5) while ensuring that only extracted text, not the original files, reaches these services. The extracted text is still visible to the selected provider, so the approach narrows what is disclosed (no binaries, metadata, or document structure) rather than eliminating disclosure altogether.

The Vercel AI SDK (Vercel, 2024) provides the streaming infrastructure that enables real-time response delivery. Combined with specialized extraction tools (pdftotext, Mammoth, SheetJS, Tesseract), the system covers the formats listed in Section 3.3 without relying on cloud-based document processing services.

3. System Architecture

Open Chatbot follows a modular, three-stage architecture designed around the principle that document content should be minimally exposed during AI interaction. The system comprises a local document processing engine, a text-only API gateway, and a reactive client interface.

3.1 Three-Stage Privacy Pipeline

The core privacy guarantee is implemented through a strict three-stage pipeline:

File Upload (Local Server) → Text Extraction (Local Processing) → Text-Only JSON to AI Provider → Immediate File Cleanup

Stage 1 (Local Processing): Files uploaded by the user are received by the Next.js API route (/api/upload/route.ts) and processed entirely on the local server. Each file type is handled by a specialized extraction engine: pdftotext for PDF documents, Mammoth and word-extractor for DOCX/DOC files, SheetJS for spreadsheets (XLSX/XLS/CSV), and Tesseract OCR for scanned images (PNG, JPG, BMP, TIFF, WebP). The extraction output is pure text content with no binary data, embedded objects, or metadata preserved.

Stage 2 (Text-Only Transmission): The extracted text is formatted as a standard JSON payload conforming to the AI provider's chat API format. Each file's content is capped at 30KB before transmission, so the data exposed per file is bounded even for large documents. The API request contains only messages of the form {"role": "user", "content": "extracted text..."}, with no file attachments, no binary encodings, and no metadata.
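The 30KB cap in Stage 2 can be sketched as follows. This is a minimal illustration rather than the project's actual code: the names MAX_BYTES_PER_FILE, capUtf8, and buildUserMessage are hypothetical, and the sketch assumes that 30KB means 30 × 1024 bytes of UTF-8 and that truncation keeps the beginning of the document.

```typescript
// Hypothetical constant: the paper states a 30KB cap; 30 * 1024 bytes is an assumption.
const MAX_BYTES_PER_FILE = 30 * 1024;

interface ChatMessage {
  role: "user";
  content: string;
}

// Truncate text to at most maxBytes of UTF-8 without splitting a multi-byte character.
function capUtf8(text: string, maxBytes: number): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // bytes[end] is the first byte that would be dropped. If it is a continuation
  // byte (10xxxxxx), the cut falls inside a character, so back up to its lead byte.
  while (end > 0 && (bytes[end]! & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}

// Build the text-only chat message: role and content, nothing else.
function buildUserMessage(extractedText: string): ChatMessage {
  return { role: "user", content: capUtf8(extractedText, MAX_BYTES_PER_FILE) };
}
```

Capping by bytes rather than by characters keeps the bound meaningful for non-ASCII documents, where a single character can occupy up to four bytes.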

Stage 3 (Immediate Cleanup): After text extraction completes, references to the original file buffer are dropped in JavaScript finally blocks, so the memory becomes eligible for garbage collection whether extraction succeeds or fails. No temporary files are written to disk, and the server keeps no persistent copy of uploaded documents. The file exists only transiently in memory during the extraction process.
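The cleanup step in Stage 3 might look like the sketch below. The PendingUpload holder, the Extractor type, and extractThenRelease are hypothetical names introduced for illustration; the only behavior taken from the text is that the buffer reference is dropped in a finally block.

```typescript
// Hypothetical holder for an uploaded file's bytes while it is being processed.
interface PendingUpload {
  name: string;
  data: Uint8Array | null;
}

// Any format-specific extraction routine that turns bytes into plain text.
type Extractor = (data: Uint8Array) => Promise<string>;

async function extractThenRelease(
  upload: PendingUpload,
  extract: Extractor,
): Promise<string> {
  try {
    if (upload.data === null) {
      throw new Error(`buffer for ${upload.name} was already released`);
    }
    return await extract(upload.data);
  } finally {
    // Runs on success and on failure: drop the holder's reference so the
    // garbage collector can reclaim the bytes once nothing else points at them.
    upload.data = null;
  }
}
```

JavaScript offers no way to free memory explicitly, so the sketch can only guarantee that the holder no longer references the bytes; reclamation itself is up to the runtime.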

3.2 Multi-Provider AI Architecture

The system supports three AI provider families through a unified streaming interface built on Vercel AI SDK 6's streamText function:

Table 1: Supported AI providers and models
Provider | Models | Max Tokens | Characteristics
DeepSeek | Chat, Reasoner | 8K–16K | Cost-effective, strong reasoning
OpenAI | GPT-4o, GPT-4o Mini, GPT-4.1 | 16K–32K | Highest versatility, broad capabilities
Anthropic | Claude Sonnet 4.5, Claude Haiku 4.5 | 8K–16K | Strong analysis, safety-focused

Each provider requires its own API key, which is stored only in the browser's localStorage and is never persisted on the application server. The chat API route (/api/chat/route.ts) dynamically selects the appropriate SDK provider based on the user's configuration, enabling switching between models without code changes. Because that route makes the call to the provider, the key for the selected provider accompanies each chat request and is used only for the duration of that request.
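Dynamic provider selection of this kind is commonly implemented as a lookup table of factories. The sketch below illustrates the idea under stated assumptions: ModelHandle, makeFactory, PROVIDERS, and selectModel are hypothetical names, and the stub factories stand in for whatever SDK constructors the route actually uses (for example createOpenAI from @ai-sdk/openai, which is an assumption about this project's dependencies).

```typescript
type ProviderId = "deepseek" | "openai" | "anthropic";

// Stand-in for the model object an SDK provider would return.
interface ModelHandle {
  provider: ProviderId;
  modelId: string;
}

// A factory takes the user's API key and returns a function that resolves model IDs.
type ProviderFactory = (apiKey: string) => (modelId: string) => ModelHandle;

// Stub factory (hypothetical): a real one would wrap the provider's SDK constructor.
const makeFactory = (provider: ProviderId): ProviderFactory =>
  (apiKey) => {
    if (apiKey.trim() === "") throw new Error(`missing API key for ${provider}`);
    return (modelId) => ({ provider, modelId });
  };

const PROVIDERS: Record<ProviderId, ProviderFactory> = {
  deepseek: makeFactory("deepseek"),
  openai: makeFactory("openai"),
  anthropic: makeFactory("anthropic"),
};

// Settings as they might arrive with each chat request.
interface ChatSettings {
  provider: ProviderId;
  modelId: string;
  apiKey: string;
}

function selectModel(settings: ChatSettings): ModelHandle {
  return PROVIDERS[settings.provider](settings.apiKey)(settings.modelId);
}
```

With a table like this, supporting another provider means adding one entry; the request-handling code does not change.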

3.3 Document Processing Engine

The file processing subsystem (lib/file-processor.ts) implements format-specific extraction strategies:

Table 2: Supported file formats and extraction methods
Format | Extensions | Extraction Engine | Notes
PDF | .pdf | pdftotext + Tesseract OCR | Text + OCR fallback for scanned pages (max 20 pages)
Word | .docx, .doc | Mammoth, word-extractor | DOCX via Mammoth, legacy DOC via word-extractor
Spreadsheet | .xlsx, .xls, .csv | SheetJS (xlsx) | Converted to structured text tables
Images | .png, .jpg, .bmp, .tiff, .webp | Tesseract OCR | Optical character recognition
Text & Code | .txt, .md, .json, .xml, .html, .py, .js, .ts, etc. | Direct UTF-8 read | 20+ source code formats supported
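The format-to-engine mapping in Table 2 can be expressed as a simple extension lookup. The sketch below is illustrative only: the Engine labels, ENGINE_BY_EXTENSION, and engineFor are hypothetical names, the table includes only the extensions the paper lists explicitly, and the OCR fallback for scanned PDF pages is not modeled.

```typescript
// Labels for the extraction engines named in Table 2 (the labels themselves are hypothetical).
type Engine =
  | "pdftotext"
  | "mammoth"
  | "word-extractor"
  | "sheetjs"
  | "tesseract"
  | "utf8";

const ENGINE_BY_EXTENSION: Record<string, Engine> = {
  ".pdf": "pdftotext", // OCR fallback for scanned pages is not modeled here
  ".docx": "mammoth",
  ".doc": "word-extractor",
  ".xlsx": "sheetjs",
  ".xls": "sheetjs",
  ".csv": "sheetjs",
  ".png": "tesseract",
  ".jpg": "tesseract",
  ".bmp": "tesseract",
  ".tiff": "tesseract",
  ".webp": "tesseract",
  ".txt": "utf8",
  ".md": "utf8",
  ".json": "utf8",
  ".xml": "utf8",
  ".html": "utf8",
  ".py": "utf8",
  ".js": "utf8",
  ".ts": "utf8",
};

// Return the engine for a file name, or null when the extension is not supported.
function engineFor(fileName: string): Engine | null {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return null;
  const extension = fileName.slice(dot).toLowerCase();
  return ENGINE_BY_EXTENSION[extension] ?? null;
}
```

Returning null for unknown extensions lets the upload route reject unsupported files before any bytes are handed to an extraction engine.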

4. Security Model

Open Chatbot's security model is built on the principle of minimal data exposure. We formalize the security guarantees through a comparison with traditional AI document upload workflows:

Table 3: Security comparison — traditional upload vs. Open Chatbot
Aspect | Traditional AI Upload | Open Chatbot
Original file sent to cloud | Yes, full binary | No, never
Document metadata transmitted | Yes: author, dates, revisions | No, stripped during extraction
Binary data exposure | Complete file (MBs) | Zero, text only
Data size to cloud | Full file size | Max 30KB per file
Used for model training | Possibly (varies by provider) | Typically no, subject to each provider's API terms
Data retention control | Minimal, provider-dependent | Full, user-controlled
Regulatory compliance | Complex, DPA required | Simplified, no file transfer

4.1 Client-Side Key Management

API keys for each provider are stored only in the browser's localStorage. They reach the application backend only as part of the requests that the backend forwards to the respective AI provider, and they are not written to disk or to any server-side store. As a result, no centralized key database exists as an attack surface, and the server process handles a key only while serving the request that carries it.
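Client-side key storage along these lines could be sketched as below. The storage key naming scheme and the helper names (saveApiKey, loadApiKey, clearApiKey, createMemoryStore) are assumptions made for illustration; the KeyValueStore interface is a subset of the Web Storage API, so window.localStorage satisfies it in a browser.

```typescript
// Subset of the Web Storage API, so the sketch also runs outside a browser.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
  removeItem(key: string): void;
}

type ProviderName = "deepseek" | "openai" | "anthropic";

// Hypothetical naming scheme for the per-provider storage entries.
const storageKey = (provider: ProviderName): string =>
  `open-chatbot:api-key:${provider}`;

function saveApiKey(store: KeyValueStore, provider: ProviderName, apiKey: string): void {
  store.setItem(storageKey(provider), apiKey.trim());
}

function loadApiKey(store: KeyValueStore, provider: ProviderName): string | null {
  return store.getItem(storageKey(provider));
}

function clearApiKey(store: KeyValueStore, provider: ProviderName): void {
  store.removeItem(storageKey(provider));
}

// In-memory stand-in for window.localStorage, useful for tests.
function createMemoryStore(): KeyValueStore {
  const items = new Map<string, string>();
  return {
    getItem: (key) => items.get(key) ?? null,
    setItem: (key, value) => { items.set(key, value); },
    removeItem: (key) => { items.delete(key); },
  };
}
```

In the browser, the same helpers would be called with window.localStorage as the store, for example saveApiKey(window.localStorage, "openai", key).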

4.2 Stateless Server Architecture

The Next.js API routes operate in a completely stateless manner. No session data, uploaded files, or chat histories are persisted on the server. Each request is processed independently, and all file buffers are released immediately after extraction. The server maintains zero knowledge of previous interactions, providing natural protection against data accumulation risks.

5. Implementation Details

5.1 Technology Stack

Table 4: Core technology stack
Layer | Technology | Purpose
Framework | Next.js 16 (App Router, Turbopack) | Full-stack React framework with API routes
UI Library | React 19, Tailwind CSS 4, Radix UI | Responsive, accessible component library
AI Integration | Vercel AI SDK 6 (streamText) | Unified streaming interface for multi-provider
PDF Processing | pdftotext, Tesseract | Text extraction and OCR for scanned documents
Word Processing | Mammoth, word-extractor | DOCX/DOC text extraction
Spreadsheet Processing | SheetJS (xlsx) | Excel/CSV to structured text
Image OCR | Tesseract.js | Optical character recognition for images
Math Rendering | KaTeX, remark-math, rehype-katex | LaTeX formula rendering in responses
Code Display | react-syntax-highlighter (Prism) | Syntax-highlighted code blocks
Language | TypeScript 5 (strict mode) | Type-safe development

5.2 User Interface

The chat interface is designed as a responsive single-page application. Its features include streaming responses, LaTeX rendering, syntax highlighting, and multi-session chat history.

5.3 Streaming Architecture

Response delivery uses Vercel AI SDK 6's streamText function. The Next.js API route streams tokens from the selected AI provider and relays them to the client over a Server-Sent Events (SSE) connection. This two-hop streaming design lets users see response tokens as they are generated, providing a responsive experience even for long AI outputs.

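// Simplified streaming flow inside the chat route handler
import { streamText } from "ai";

const result = streamText({
  model: selectedProvider(modelId),        // provider chosen from the user's settings
  messages: conversationHistory,           // prior turns plus extracted document text
  maxOutputTokens: modelConfig.maxTokens,  // this option was named maxTokens before AI SDK 5
});
// Stream the result to the client; before AI SDK 5 this was toDataStreamResponse()
return result.toUIMessageStreamResponse();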
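On the receiving side, SSE framing can be handled with a small incremental parser. The sketch below covers only the generic framing (events separated by a blank line, payload carried in data: lines); the parseSse name is hypothetical, and the structure of the payload inside each event is defined by the SDK and is not modeled here.

```typescript
interface SseParseResult {
  events: string[]; // data payload of each complete event, in arrival order
  rest: string;     // trailing partial event, to be prepended to the next chunk
}

// Parse whatever complete events the buffer contains and return the leftover text.
// Simplification: a "\r\n" pair split across two chunks is not handled.
function parseSse(buffer: string): SseParseResult {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  // The last block has not been terminated by a blank line yet.
  const rest = blocks.pop() ?? "";
  const events: string[] = [];
  for (const block of blocks) {
    const dataLines = block
      .split("\n")
      .filter((line) => line.startsWith("data:")) // skips comments and other fields
      .map((line) => line.slice(5).replace(/^ /, "")); // drop "data:" and one optional space
    if (dataLines.length > 0) events.push(dataLines.join("\n"));
  }
  return { events, rest };
}
```

A caller would keep the returned rest string and prepend it to the next network chunk, so that events split across chunks are reassembled before being rendered.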

6. Deployment Options

Open Chatbot supports three deployment configurations, each with different trade-offs between convenience and feature completeness:

6.1 Self-Hosted (Recommended)

Self-hosting is the recommended deployment for maximum privacy and full feature support, including OCR capabilities:

git clone https://github.com/romizone/chatbot-next.git
cd chatbot-next
npm install
npm run build
npm start

This configuration supports all file formats including OCR for scanned PDFs and images, as Tesseract and poppler-utils can be installed on the host system.

6.2 Docker

A Dockerfile is provided with all OCR dependencies (Tesseract, Poppler) pre-installed:

docker build -t open-chatbot .
docker run -p 3000:3000 open-chatbot

6.3 Vercel (Serverless)

This option provides quick deployment without infrastructure management. Note that OCR features are limited in serverless environments due to binary dependency constraints:

npx vercel

7. Project Structure

The codebase follows Next.js 16 App Router conventions with a clear separation between API routes, UI components, state management, and processing logic:

src/
├── app/
│   ├── api/
│   │   ├── chat/route.ts        # Multi-provider streaming endpoint
│   │   └── upload/route.ts      # Local file processing endpoint
│   ├── layout.tsx               # Root layout with providers
│   └── page.tsx                 # Entry point
├── components/chat/
│   ├── chat-page.tsx            # Main chat container
│   ├── chat-area.tsx            # Message display with markdown
│   ├── chat-input.tsx           # Input with file attachment
│   ├── markdown-renderer.tsx    # KaTeX + syntax highlighting
│   ├── settings-dialog.tsx      # Provider/model configuration
│   └── sidebar.tsx              # Session management
├── hooks/
│   └── use-chat-store.ts        # Zustand/localStorage state
└── lib/
    ├── file-processor.ts        # Document extraction engine
    └── constants.ts             # Model configs, providers

8. Conclusion and Future Work

Open Chatbot demonstrates that it is possible to leverage state-of-the-art commercial AI models for document analysis while maintaining strict privacy guarantees over sensitive files. The three-stage privacy pipeline—local extraction, text-only transmission, immediate cleanup—provides a practical and auditable approach to the fundamental tension between AI capability and data confidentiality.

The system's multi-provider architecture ensures that users are not locked into a single AI vendor, while client-side key management eliminates centralized credential exposure. The comprehensive file format support, combined with streaming responses, LaTeX rendering, and syntax highlighting, delivers a production-quality user experience comparable to commercial AI chatbots.

Future directions include:

The complete source code is available at https://github.com/romizone/chatbot-next under the MIT license.

References

  1. Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. Advances in Neural Information Processing Systems, 33.
  2. Vercel (2024). Vercel AI SDK: Build AI-powered applications with React and Next.js. GitHub Repository. https://github.com/vercel/ai
  3. OpenAI (2023). GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
  4. Anthropic (2024). The Claude Model Family. Anthropic Technical Documentation. https://docs.anthropic.com
  5. DeepSeek AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv preprint arXiv:2405.04434.
  6. Smith, R. (2007). An Overview of the Tesseract OCR Engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition, pp. 629-633.
  7. Williamson, M. (2023). Mammoth: Convert Word documents (.docx) to HTML and Markdown. GitHub Repository. https://github.com/mwilliamson/mammoth.js
  8. SheetJS Team (2024). SheetJS: Spreadsheet Data Toolkit. GitHub Repository. https://github.com/SheetJS/sheetjs
  9. European Parliament (2016). General Data Protection Regulation (GDPR). Regulation (EU) 2016/679.
  10. U.S. Department of Health and Human Services (1996). Health Insurance Portability and Accountability Act (HIPAA). Public Law 104-191.